ASA Connect


Mathematical Coupling - Is it possible to overcome it?

  • 1.  Mathematical Coupling - Is it possible to overcome it?

    Posted 04-16-2018 14:00
    Hi everyone,

    While working on a project, I ran into the issue of mathematical coupling and am wondering if there is any way to overcome it.

    For this project, I am fitting a linear regression model like the one below:

        log(Abundance_1/Abundance_2) = beta0 + beta1*Abundance_1 + beta2*Abundance_2 + beta3*Environmental + error 

    where Abundance_1 and Abundance_2 are yearly measures of fish abundance in regions 1 and 2, respectively, and Environmental is an environmental variable measured yearly across both regions.   

    For this model, mathematical coupling arises because the outcome variable log(Abundance_1/Abundance_2) is a mathematical function of the predictors Abundance_1 and Abundance_2. 

    After reading the article "Misuses of correlation and regression analyses in orthodontic research: The problem of mathematical coupling" by Yu-Kang Tu and co-authors, I understand that mathematical coupling between the outcome log(CPUE_BC/CPUE_US) and the predictors CPUE_BC and CPUE_US (these are Abundance_1 and Abundance_2 in the model above) obscures the relationship between log(CPUE_BC/CPUE_US) and the environmental variable.  This is because the variance of the outcome is almost completely explained by CPUE_BC and CPUE_US, leaving little or no variance to be explained by the environmental variable, whose relationship with log(CPUE_BC/CPUE_US) is of specific interest.   
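    To make the coupling concrete, here is a small R simulation sketch. Everything in it is made up for illustration (30 years, lognormal abundances, an environmental effect of 0.3 on the log-ratio); none of it comes from the actual fish data. It compares the coupled model with a model that uses the environmental variable alone.

```r
# Hypothetical simulated data: nothing here comes from the real study.
set.seed(1)
n_years <- 30
env    <- rnorm(n_years)                                   # environmental variable
log_a2 <- rnorm(n_years, mean = 3.2, sd = 0.4)             # log abundance, region 2
log_a1 <- log_a2 + 0.3 * env + rnorm(n_years, sd = 0.3)    # region 1 shifts with env
a1 <- exp(log_a1)
a2 <- exp(log_a2)
y  <- log(a1 / a2)                                         # coupled outcome

fit_coupled <- lm(y ~ a1 + a2 + env)   # outcome's own components as predictors
fit_env     <- lm(y ~ env)             # environmental variable only

r2 <- c(coupled  = summary(fit_coupled)$r.squared,
        env_only = summary(fit_env)$r.squared)
print(round(r2, 3))

# Estimate, standard error, t and p for env under each model
print(coef(summary(fit_coupled))["env", ])
print(coef(summary(fit_env))["env", ])
```

    The coupled fit can never have a lower R-squared than the env-only fit, because the models are nested. Its env coefficient is estimated conditional on a1 and a2, so it answers a different question from the one of interest.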

    What I don't understand yet is whether there is a principled way to estimate the relationship between log(CPUE_BC/CPUE_US) and the environmental variable while guarding against the ill effects of mathematical coupling. 

    Any ideas or references you can share that would help me estimate this relationship are greatly appreciated. 

    Thanks very much,

    Isabella



     

    ------------------------------
    Isabella R. Ghement, Ph.D.
    Ghement Statistical Consulting Company Ltd.
    E-mail: isabella@ghement.ca
    Tel: 604-767-1250
    ------------------------------


  • 2.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-17-2018 08:07

    Idea: include Abundance_1 or Abundance_2, but not both, as predictors.

     

    Except for the logarithmic relationship on the left, your model looks a lot like the situation where the model is W = beta0 + beta1*X1 + beta2*X2 + error, but the outcome is W = Y - X1. I remember reading a paper about this situation some 20 years ago. Sadly, I don't remember the paper's author or the journal.
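    A quick R sketch of the W = Y - X1 situation, using simulated data with made-up coefficients. Because least squares is linear in the outcome, subtracting X1 from Y lowers the fitted X1 slope by exactly 1 and leaves the other coefficients unchanged.

```r
# Hypothetical data; the coefficients 1, 0.5 and 2 are arbitrary.
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 2 * x2 + rnorm(n)
w  <- y - x1                      # outcome that already contains a predictor

fit_y <- lm(y ~ x1 + x2)
fit_w <- lm(w ~ x1 + x2)
print(rbind(Y = coef(fit_y), W = coef(fit_w)))

# Difference in coefficients: zero for the intercept and x2, minus one for x1
diff_coef <- coef(fit_w) - coef(fit_y)
print(round(diff_coef, 10))
```

    So the W model carries no information beyond the Y model; the apparent x1 effect on W is partly manufactured by the subtraction.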


    Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.





  • 3.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-17-2018 08:09
    Hi Isabella:  An interesting question.  May I have a full reference for the article that started your journey?  Thank you.  Lawrence Lessner

    ------------------------------
    Lawrence Lessner
    Research Scientist and Retired
    Institute of Health and the Environment, SUNY, Albany
    ------------------------------



  • 4.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-17-2018 09:28
    In this situation, I might pause and reconsider the motivation for running the regression as specified.  There's a simple model for log(Abundance_1/Abundance_2) as a function of Abundance_1 and Abundance_2 that doesn't require estimating any parameters and explains all of its variance, namely: the observed log-ratio of Abundance_1 and Abundance_2.  So there must be some other motivation for trying to estimate the proposed, mis-specified model that has Abundance_1 and Abundance_2 entering additively.  Perhaps this motivation could be addressed with a different technique altogether, such as generalized additive models or quantile regression.

    ------------------------------
    Andrew McDavid
    Biostatistics and Computational Biology
    University of Rochester
    ------------------------------



  • 5.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-17-2018 09:57
    Edited by Edward Cashin 04-17-2018 09:57
    Hello.  I'm also trying to understand the motivation for including (a different form of) the outcome information on the right hand side.  What would be unsatisfactory about the model if you treated the log abundance ratio as depending only on the environment plus noise?

    log(Abundance_1/Abundance_2) = beta0 + beta3*Environmental + error

    ------------------------------
    Edward Cashin
    Research Scientist II
    ------------------------------



  • 6.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-17-2018 10:15
    Unless I misunderstand your model, it seems to me you have at least three problems:

    1. Two of your regressors (Abundance_1 and Abundance_2) are not true independent variables, since the dependent variable is computed from them.
    2. If both Abundance_1 and Abundance_2 are measured variables, then there are errors associated with them.
    3. The ratio between the two Abundances is a difficult dependent variable.  For example, even if both have normally distributed errors, the ratio does not.  You will then have to approximate its error by Taylor series expansion.
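    As an illustration of the Taylor series point in item 3, here is a hedged R sketch. It assumes independent normal errors and made-up means and standard deviations, and compares the simulated standard deviation of log(Abundance_1/Abundance_2) with the first-order (delta method) approximation sqrt((s1/m1)^2 + (s2/m2)^2).

```r
# Hypothetical means and SDs; errors are assumed independent and normal.
set.seed(3)
n  <- 1e5
m1 <- 33; s1 <- 3       # Abundance_1: mean and SD
m2 <- 25; s2 <- 2.5     # Abundance_2: mean and SD

a1 <- rnorm(n, m1, s1)
a2 <- rnorm(n, m2, s2)
log_ratio <- log(a1 / a2)

sd_sim   <- sd(log_ratio)                        # simulated SD
sd_delta <- sqrt((s1 / m1)^2 + (s2 / m2)^2)      # first-order Taylor approximation
print(c(simulated = sd_sim, delta_method = sd_delta))
```

    The approximation is only reasonable when the coefficients of variation are small, as they are here (about 0.1). With noisier abundances the ratio becomes heavy-tailed and the first-order formula degrades.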

    Simple multiple regression (even with appropriate weightings, if you could find such) does not work in such cases.  You will first have to consider regression with errors in both dependent and independent variables.  In your case, that will be hard to do.  Is there a possible reparameterization of your model?  Can you devise a dependent variable for your purpose that does not contain any of the three independent variables? 

    I am afraid I cannot suggest a solution but point out the problems with your model.  Maybe some other statistician can.  Even in that case, I would be careful about the problems I pointed out in the background.

    Ajit K. Thakur, Ph.D.
    Retired Statistician





  • 7.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-17-2018 10:37
    Hello, Isabella.  What about thinking about your project in ODE terms?  I'm drawing somewhat on ideas from system dynamics (see Andy Ford's Modeling the Environment, for example).

    For example, you might have two stocks of fish (state variables, if you prefer): the number of fish in BC waters and the number of fish in US waters. Speaking simplistically, each has a birth rate, each may have an immigration rate, each has a death rate, each has a catch rate, and each may have an emigration rate.  Rate in this case is measured in fish per unit time--perhaps per year, given your problem description.

    It's easier to see when drawing a diagram, and you may determine that some of these rates are negligibly small in the case of the fish you're considering.  You may also think of additional stocks and rates, but there is a limit to how complex a model you might be able to fit with the data you may have.

    It's possible that some of the parameters controlling these rates are common to species, it's possible that others are common to region, and it's possible that some are common to both.

    Now the outcome you're seeking to understand is a function of the underlying generative fish population model you've created.
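    A minimal sketch of the two-stock idea in base R, using a simple Euler loop. All rates, capacities, and starting stocks are hypothetical placeholders, not estimates; in practice an ODE solver such as ode() from the deSolve package would replace the loop.

```r
# Hypothetical per-year rates for the two stocks (BC first, US second).
p <- list(birth    = c(bc = 0.60, us = 0.55),
          death    = c(0.30, 0.30),
          catch    = c(0.20, 0.15),
          capacity = c(1000, 800),
          move_bc_to_us = 0.05,
          move_us_to_bc = 0.03)

# Net rate of change (fish per year) for each stock
derivs <- function(n, p) {
  growth <- p$birth * n * (1 - n / p$capacity)    # density-dependent births
  losses <- (p$death + p$catch) * n               # deaths plus catch
  bc_to_us <- p$move_bc_to_us * n[1]
  us_to_bc <- p$move_us_to_bc * n[2]
  growth - losses + c(us_to_bc - bc_to_us, bc_to_us - us_to_bc)
}

dt       <- 0.01
years    <- 30
per_year <- round(1 / dt)
n   <- c(bc = 100, us = 100)                      # starting stocks
out <- matrix(NA_real_, nrow = years + 1, ncol = 2,
              dimnames = list(NULL, c("bc", "us")))
out[1, ] <- n
for (i in seq_len(years * per_year)) {
  n <- n + dt * derivs(n, p)                      # Euler step
  if (i %% per_year == 0) out[i / per_year + 1, ] <- n
}

# The regression outcome is now a derived quantity of the population model
log_ratio <- log(out[, "bc"] / out[, "us"])
print(round(log_ratio, 3))
```

    An environmental variable could enter by letting one of the rates (for example birth) depend on it year by year; fitting then means choosing the rate parameters so the simulated stocks track the observed abundances.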

    I can think of several ways to model and draw inferences from such a model.  Vensim Pro or Vensim DSS would make it easy to create such a model by drawing it and then filling in some of the equations.  I know Vensim DSS has the ability to do Powell optimization of such a model, and I think Vensim Pro can, as well.  Except for the optimization part, Vensim PLE should be able to model your situation.

    You could also consider modeling this in the Stan language (using rstan, if you're an R user). That's probably a bit harder to model, but it lets you do a Bayesian multilevel model in which you can model the parameters in BC and in the US as related but not identical. 

    GNU MCSim has many of the attributes of Stan.  Its sampler isn't as advanced, but I find the language more expressive for describing ODEs.

    There's a whole set of R packages in the deSolve family that can do much of this, too, although I don't know if they can do the multilevel modeling and Bayesian inference. 

    Bill   

    ------------------------------
    Bill Harris
    Data & Analytics Consultant
    Snohomish County PUD
    ------------------------------



  • 8.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-18-2018 12:48

    What are ODE terms?







  • 9.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-18-2018 19:16
    Sorry: ODE = ordinary differential equations.  I'm proposing more of what some would call a state variable approach to the problem than a static regression analysis.

    ------------------------------
    Bill Harris
    Data & Analytics Consultant
    Snohomish County PUD
    ------------------------------



  • 10.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-17-2018 16:52

    There is another paper by Tu and Gilthorpe that you might want to have a look at:
    Tu YK and Gilthorpe MS (2007). Revisiting the relation between change and initial value: a review and evaluation. Stat Med 26(2): 443-457.

    Mathematical coupling arises when people are looking at change in a variable from baseline to some time t, while accounting for the baseline level of the variable.
    For example, in an RCT of an antihypertensive medication you would have blood pressure at baseline, BP0, and at 6 months, BP1, and a treatment indicator TREAT.
    Mathematical coupling arises when you regress the change in BP on the baseline value:
    BP1 - BP0 = b0 + b1*BP0 + b2*TREAT + error.
    Apparently people used to do stuff like this (and probably still do).
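    A small R simulation sketch of this effect, with made-up blood pressure values. There is no treatment and no true change at all; baseline and follow-up are the same true value plus independent measurement error. The change still correlates negatively with baseline, whereas Oldham's suggestion of relating change to the mean of the two measurements does not show this artefact when the two variances are equal.

```r
# Hypothetical values: true BP ~ N(150, 10), measurement error SD 8.
set.seed(5)
n       <- 1e4
true_bp <- rnorm(n, 150, 10)          # stable true blood pressure
bp0     <- true_bp + rnorm(n, sd = 8) # baseline measurement
bp1     <- true_bp + rnorm(n, sd = 8) # follow-up measurement, no real change

r_coupled <- cor(bp1 - bp0, bp0)               # change versus baseline
r_oldham  <- cor(bp1 - bp0, (bp0 + bp1) / 2)   # change versus mean (Oldham)
print(round(c(change_vs_baseline = r_coupled,
              change_vs_mean     = r_oldham), 3))
```

    Under these assumptions the expected change-versus-baseline correlation is -64/sqrt(128*164), roughly -0.44, purely because BP0 appears on both sides.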

    It also arises in studies of agreement between two different measures.  Problems in studies like these led to the paper by Bland and Altman, which was itself strongly influenced by the more general work of Oldham.

    Bland JM and Altman DG (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1(8476): 307-310.
    Oldham PD (1962). A note on the analysis of repeated measurements of the same subjects. J Chronic Dis 15: 969-977.



    ------------------------------
    Kieran McCaul
    Research Associate Professor
    University of Western Australia
    ------------------------------



  • 11.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-19-2018 12:07
    Edited by Ralph O'Brien 04-20-2018 17:18

    No matter how you label it or deal with it, a statistical model that is fundamentally flawed needs to be rethought.

    Using log(Y1/Y2) = log(Y1) - log(Y2), the model in question is

       [1]   log(Y1) - log(Y2) = b.0 + b.1*Y1 + b.2*Y2 + b.X*X (+ noise).

    Suppose Y1 is logNormal(meanlog=3.5, sdlog=0.4), which gives 0.025, 0.50, and 0.975 quantiles near 15, 33, and 73. Likewise, let Y2 be logNormal(meanlog=3.2, sdlog=0.4), giving the same three quantiles near 11, 25, and 54. These reflect the kind of distributions we often see in practice. Here, even when Y1 and Y2 are independent, a simulation with N = 1 million observations (R code below) reveals that the correlation between log(Y1) - log(Y2) and Y1 - Y2 is about 0.95. Thus, Model [1] is what I call a tautological model: its dependent variable is being predicted by a functional near-twin of itself.

    Accordingly, since b.1 and b.2 are wholly uninteresting in [1], I presume that b.X is the focal parameter, the one that tightly quantifies the essential research question.

    Let us consider a common, similar problem.

    Suppose we need to compare two independent groups (G=0 vs. 1) with respect to how much Y changes from baseline (Y1) to some specific follow-up time (Y2). If Y is logNormal-like, we could use a model somewhat like [1],

       [2]   log(Y2/Y1) = log(Y2) - log(Y1) ~ b.0 + b.1*log(Y1) + b.G*G

    However, using base 2 logging eases interpretation: a one unit change in log2(x) is a doubling or halving of x. Thus

       [2*]   log2(Y2/Y1) = log2(Y2) - log2(Y1) ~ b.0 + b.1*log2(Y1) + b.G*G

    is functionally identical to [2] but easier (at least for me) to work with. Either one addresses how much the two groups changed over time given that we "adjusted for baseline." b.1 is rather uninteresting. b.G is the focal parameter.

    But models [2] and [2*] have the same problem as Model [1]. However, one loses nothing and gains simplicity by using the model

       [3]   log(Y2) ~ b.0 + b.1*log(Y1) + b.G*G

    or

       [3*]   log2(Y2) ~ b.0 + b.1*log2(Y1) + b.G*G

    Exponentiating Models [3] and [3*] makes them readily interpretable.

       [3i]   exp(log(Y2)) = Y2 ~ exp(b.0 + b.1*log(Y1) + b.G*G)

                     Y2 ~ exp(b.0) * exp(b.1)^log(Y1) * exp(b.G)^G

       [3*i]    2^(log2(Y2)) = Y2 ~ 2^(b.0 + b.1*log2(Y1) + b.G*G)

                     Y2 ~ (2^b.0) * (2^b.1)^log2(Y1) * (2^b.G)^G

    At this point, exp(b.G) in [3i] equals 2^b.G in [3*i]. If 2^b.G = 1.65 in [3*i], then comparing two hypothetical subjects who have the same Y1 value but are in different groups, the Y2 for the G=1 subject tends to be 65% greater (times 1.65) than the Y2 for the G=0 subject. If 2^b.1 = 1.80 in [3*i], then comparing two subjects in the same group with Subject A having Y1 that is twice Subject B's Y1, A's Y2 tends to be 80% greater (times 1.80) than B's Y2.
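    The relationships among Models [2*], [3*], and [3] can be checked numerically. The R sketch below uses simulated data whose true values are chosen (as an assumption, purely for illustration) so that the group ratio is 1.65 and doubling Y1 multiplies Y2 by 1.80.

```r
# Hypothetical data generated to mimic the 1.65 and 1.80 in the text.
set.seed(6)
n  <- 500
g  <- rep(0:1, each = n / 2)                       # group indicator G
y1 <- rlnorm(n, 3.5, 0.4)                          # baseline
y2 <- exp(0.5 + log2(1.80) * log(y1) + log(1.65) * g + rnorm(n, sd = 0.3))

fit2s <- lm(log2(y2 / y1) ~ log2(y1) + g)   # Model [2*]: change score outcome
fit3s <- lm(log2(y2)      ~ log2(y1) + g)   # Model [3*]: follow-up outcome
fit3  <- lm(log(y2)       ~ log(y1)  + g)   # Model [3]:  natural log version

print(rbind("[2*]" = coef(fit2s), "[3*]" = coef(fit3s), "[3]" = coef(fit3)))

# Same group ratio from either log base
ratio_g <- c(exp(coef(fit3)["g"]), 2^coef(fit3s)["g"])
print(ratio_g)

# Effect on Y2 of doubling Y1, from Model [3*]
print(2^coef(fit3s)["log2(y1)"])
```

    The b.G estimates from [2*] and [3*] are identical and the baseline slopes differ by exactly 1, which is why nothing is lost by dropping the change score from the left-hand side.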

    Notes. When there is virtually no relationship between the baseline (Y1) and the follow-up score (Y2), the model should only be

       [4]   log(Y2) ~ b.0 + b.G*G

    or

       [4*]   log2(Y2) ~ b.0 + b.G*G

    and NOT the model log(Y2/Y1) ~ b.0 + b.G*G. Also, if the relationship between Y1 and Y2 differs between groups, then you are faced with building a sound interaction model. This is trickier than many people seem to realize. I've witnessed good professional statisticians code the model in a correct way but incorrectly interpret its coefficients.

    For much more on this problem, study Senn, S. (2006). Change from baseline and analysis of covariance revisited. Stat Med, 25(24):4334–44.

    So how do you apply the principles just covered to handle the "mathematically coupled" model, [1]? To opine on that requires knowing far more about the study's design, its research questions, and its variables than what has been described.

    But I might start by applying the modeling corollary to Occam's Razor and George Box's "All models are wrong; some are useful." Ask: Is log2(Y1/Y2) ~ b.0 + b.X*X or (often better, when X is logNormal-like) log2(Y1/Y2) ~ b.0 + b.X*log2(X) useful enough? If log(Y1/Y2) needs to be adjusted for some general magnitude of Y1 and Y2, then one could add meanY = (Y1+Y2)/2 or log2(gmeanY) = log2(sqrt(Y1*Y2)) as the "adjustment" predictor. In the simulation mentioned above, the correlation between log(Y1/Y2) and log(sqrt(Y1*Y2)) was nearly 0.00, but that data generation had no built-in relationship between Y1/Y2 and sqrt(Y1*Y2).

    So, could it be that

       [5]   log(Y1/Y2) = b.0 + b.1*log(sqrt(Y1*Y2)) + b.X*X (+ noise)

    will work in this application?
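    A hedged sketch of fitting a model of the form [5] in R. The data are simulated under the assumption that a shared overall level u and a log-ratio d (driven by X with slope 0.25) combine as log(Y1) = u + d/2 and log(Y2) = u - d/2; all numbers are invented.

```r
# Hypothetical data: shared level u, log-ratio d that depends on x.
set.seed(7)
n <- 200
x <- rnorm(n)                                  # stand-in for the X variable
u <- rnorm(n, 3.3, 0.4)                        # shared log abundance level
d <- 0.3 + 0.25 * x + rnorm(n, sd = 0.2)       # true log-ratio
y1 <- exp(u + d / 2)
y2 <- exp(u - d / 2)

log_ratio <- log(y1 / y2)                      # outcome of Model [5]
log_gmean <- log(sqrt(y1 * y2))                # magnitude adjustment

fit5 <- lm(log_ratio ~ log_gmean + x)
print(cor(log_ratio, log_gmean))
print(round(coef(summary(fit5)), 3))
```

    One caution: the covariance between log(Y1/Y2) and log(sqrt(Y1*Y2)) equals (Var(log Y1) - Var(log Y2))/2, so the geometric-mean adjustment is free of built-in coupling only when the two log variances are similar, as they are in this construction.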

    Finally, per Voltaire, keep in mind that perfection is the enemy of done. 

    R code:
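    # Parameters of the two independent logNormal distributions
    mean.logy1 <- 3.5
    mean.logy2 <- 3.2
    sd.logy <- 0.4
    n <- 1000000

    set.seed(170322)
    y1 <- rlnorm(n, mean.logy1, sd.logy)
    # 0.025, 0.50, and 0.975 quantiles of Y1
    qlnorm(c(0.025, 0.50, 0.975), mean.logy1, sd.logy)
    # [1] 15.11994 33.11545 72.52894

    y2 <- rlnorm(n, mean.logy2, sd.logy)
    # 0.025, 0.50, and 0.975 quantiles of Y2
    qlnorm(c(0.025, 0.50, 0.975), mean.logy2, sd.logy)
    # [1] 11.20113 24.53253 53.73076

    # Difference versus log-ratio: near-twins
    cor(y1 - y2, log(y1 / y2))
    # [1] 0.9498388
    # Log-ratio versus log geometric mean: essentially uncorrelated
    cor(log(y1 / y2), log(sqrt(y1 * y2)))
    # [1] 0.0009271855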



    ------------------------------
    Ralph O'Brien
    Professor of Biostatistics (officially retired; still keenly active)
    Case Western Reserve University
    http://rfuncs.weebly.com/about-ralph-obrien.html
    ------------------------------



  • 12.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 05-30-2018 09:36
    In a structural equation model, the log abundance ratio is a latent variable and the other variables are observed. Do not forget to model the autocorrelation or autoregression between years, or some smooth trend (perhaps cosinor plus linear). This will take you halfway between linear regression and fitting differential equations.
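    As one simple way to account for year-to-year autocorrelation (a sketch only, not a structural equation model), base R's arima() can fit a regression with AR(1) errors. The series below is simulated with made-up values: 60 years, an environmental slope of 0.25, and an AR(1) coefficient of 0.6.

```r
# Hypothetical yearly series with AR(1) errors.
set.seed(8)
n_years <- 60
env <- rnorm(n_years)                                          # environmental variable
err <- as.numeric(arima.sim(model = list(ar = 0.6), n = n_years, sd = 0.2))
log_ratio <- 0.3 + 0.25 * env + err

# Regression of the log-ratio on env with AR(1) errors
fit_ar1 <- arima(log_ratio, order = c(1, 0, 0), xreg = cbind(env = env))
print(fit_ar1)

# For comparison: ordinary least squares ignoring the autocorrelation
fit_ols <- lm(log_ratio ~ env)
print(round(coef(summary(fit_ols)), 3))
```

    gls() with corAR1() from the nlme package would be another route to the same kind of model, and a trend term can be added as an extra column of xreg.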

    ------------------------------
    Reinhard Vonthein
    Universitaet Zu Luebeck
    ------------------------------



  • 13.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 04-19-2018 13:37
    If Abundance_1 and Abundance_2 are known independent variables, why do you want to use least squares at all?  Why not just calculate log(Abundance_1/Abundance_2) and be done with it?  There won't be any adjustable parameters or an error term.

    ------------------------------
    Emil M Friedman, PhD
    emilfriedman@gmail.com
    http://www.statisticalconsulting.org
    ------------------------------



  • 14.  RE: Mathematical Coupling - Is it possible to overcome it?

    Posted 05-31-2018 16:07
    Hi,

    What you are referring to as "mathematical coupling" is just another term for having a model with "endogenous variables" (as some of the other commentators have indirectly pointed at).   Although in current usage this is understood as having any explanatory variable in a model correlated with the error term, it originated from simultaneous equations models which, in many cases, have variables that are the dependent variable of one equation while being simultaneously an explanatory variable in another equation.   

    The risk in this case (as others have pointed out) is that if one runs a simple OLS regression with endogenous variables, estimates and inferences can be biased.  The usual way to deal with this situation is through instrumental variables and two-stage least squares. For a discussion of this in tutorial form, see:

    http://ocw.uc3m.es/economia/econometrics/lecture-notes-1/Topic6_logo.pdf
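    For readers unfamiliar with two-stage least squares, here is a bare-bones R sketch on simulated data. The instrument z, the confounder u, and all coefficients are hypothetical; whether a valid instrument exists for the fish abundance problem is a separate question that this example does not address.

```r
# Hypothetical setup: x is endogenous because it shares u with the error in y.
set.seed(9)
n <- 5000
z <- rnorm(n)                       # instrument: drives x, unrelated to u
u <- rnorm(n)                       # unobserved confounder
x <- 0.8 * z + u + rnorm(n)
y <- 1 + 0.5 * x + u + rnorm(n)     # true slope on x is 0.5

b_ols <- coef(lm(y ~ x))["x"]               # biased by the shared u

x_hat  <- fitted(lm(x ~ z))                 # stage 1: project x on the instrument
b_2sls <- coef(lm(y ~ x_hat))["x_hat"]      # stage 2: regress y on fitted x

print(round(c(ols = unname(b_ols), tsls = unname(b_2sls)), 3))
```

    The standard errors reported by the second-stage lm() are not valid for 2SLS; dedicated routines such as ivreg() in the AER package compute them properly.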

    ------------------------------
    William Finnoff
    ------------------------------