No matter how you label it or deal with it, a statistical model that is fundamentally flawed needs to be rethought.
Using log(Y1/Y2) = log(Y1) - log(Y2), the model in question is
[1] log(Y1) - log(Y2) = b.0 + b.1*Y1 + b.2*Y2 + b.X*X (+ noise).
Suppose Y1 is logNormal(meanlog=3.5, sdlog=0.4), which gives 0.025, 0.50, and 0.975 quantiles near 15, 33, and 73. Likewise, let Y2 be logNormal(meanlog=3.2, sdlog=0.4), giving corresponding quantiles near 11, 24, and 54. These reflect the kind of distributions we often see in practice. Here, even when Y1 and Y2 are independent, a simulation with N = 1 million observations (R code below) reveals that the correlation between log(Y1) - log(Y2) and Y1 - Y2 is about 0.95. Thus, Model [1] is what I call a tautological model: its dependent variable is being predicted by a functional near-twin of itself.
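To see the tautology directly, here is a minimal sketch (an illustrative simulation, not the original analysis): regress log(Y1) - log(Y2) on Y1, Y2, and a predictor X that is pure noise by construction, and note how much variance Y1 and Y2 alone soak up.

```r
# Simulate independent logNormal Y1 and Y2 plus an X that is unrelated to both
set.seed(170322)
n  <- 100000
y1 <- rlnorm(n, meanlog = 3.5, sdlog = 0.4)
y2 <- rlnorm(n, meanlog = 3.2, sdlog = 0.4)
x  <- rnorm(n)                    # independent of Y1 and Y2 by construction

# Fit the tautological model [1]
fit <- lm(log(y1) - log(y2) ~ y1 + y2 + x)
summary(fit)$r.squared            # large, even though x explains nothing
```

The R-squared here exceeds 0.90 even though X carries no information at all, which is the point: almost all of the "explained" variance is the outcome predicting itself through Y1 and Y2.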
Accordingly, since b.1 and b.2 are wholly uninteresting in [1], I presume that b.X is the focal parameter, the one that tightly quantifies the essential research question.
Let us consider a common, similar problem.
Suppose we need to compare two independent groups (G=0 vs. 1) with respect to how much Y changes from baseline (Y1) to some specific follow-up time (Y2). If Y is logNormal-like, we could use a model somewhat like [1],
[2] log(Y2/Y1) = log(Y2) - log(Y1) ~ b.0 + b.1*log(Y1) + b.G*G
However, using base 2 logging eases interpretation: a one unit change in log2(x) is a doubling or halving of x. Thus
[2*] log2(Y2/Y1) = log2(Y2) - log2(Y1) ~ b.0 + b.1*log2(Y1) + b.G*G
is functionally identical to [2] but easier (at least for me) to work with. Either one addresses how much the two groups changed over time given that we "adjusted for baseline." b.1 is rather uninteresting. b.G is the focal parameter.
But Models [2] and [2*] have the same problem as Model [1]. One loses nothing and gains simplicity by using the model
[3] log(Y2) ~ b.0 + b.1*log(Y1) + b.G*G
or
[3*] log2(Y2) ~ b.0 + b.1*log2(Y1) + b.G*G
Exponentiating Models [3] and [3*] makes them readily interpretable.
[3i] exp(log(Y2)) = Y2 ~ exp(b.0 + b.1*log(Y1) + b.G*G)
Y2 ~ exp(b.0) * exp(b.1)^log(Y1) * exp(b.G)^G
[3*i] 2^(log2(Y2)) = Y2 ~ 2^(b.0 + b.1*log2(Y1) + b.G*G)
Y2 ~ (2^b.0) * (2^b.1)^log2(Y1) * (2^b.G)^G
At this point, exp(b.G) in [3i] equals 2^b.G in [3*i]. If 2^b.G = 1.65 in [3*i], then comparing two hypothetical subjects who have the same Y1 value but are in different groups, the Y2 for the G=1 case tends to be 65% greater (times 1.65) than the Y2 for the G=0 case. If 2^b.1 = 1.80 in [3*i], then comparing two subjects in the same group, with Subject A having a Y1 that is twice Subject B's Y1, A's Y2 tends to be 80% greater (times 1.80) than B's Y2.
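A short sketch of Model [3*] on simulated data may make the interpretation concrete. All parameter values below (b.0 = 1, b.1 = 0.85, 2^b.G = 1.65, the noise SD) are illustrative assumptions, not estimates from any real study.

```r
# Simulate data obeying Model [3*] with a known group effect of 2^b.G = 1.65
set.seed(1)
n  <- 50000
g  <- rbinom(n, 1, 0.5)                      # group indicator, G = 0 or 1
y1 <- rlnorm(n, meanlog = 3.5, sdlog = 0.4)  # baseline

# True model on the log2 scale: b.0 = 1, b.1 = 0.85, b.G = log2(1.65)
log2.y2 <- 1 + 0.85*log2(y1) + log2(1.65)*g + rnorm(n, 0, 0.3)
y2 <- 2^log2.y2

# Fit Model [3*] and back-transform the coefficients
fit <- lm(log2(y2) ~ log2(y1) + g)
2^coef(fit)        # multiplicative effects; 2^b.G lands near 1.65
```

The fitted 2^b.G recovers the built-in 1.65, i.e., "G = 1 subjects tend to have Y2 values 65% greater than G = 0 subjects with the same baseline."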
Notes. When there is virtually no relationship between the baseline (Y1) and the follow-up score (Y2), the model should only be
[4] log(Y2) ~ b.0 + b.G*G
or
[4*] log2(Y2) ~ b.0 + b.G*G
and NOT the model log(Y1/Y2) ~ b.0 + b.G*G. Also, if the relationship between Y1 and Y2 differs between groups, then you are faced with building a sound interaction model. This is trickier than many people seem to realize. I've witnessed good professional statisticians code the model in a correct way but incorrectly interpret its coefficients.
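The interaction trap mentioned above can be demonstrated with a small simulation (all parameter values are illustrative assumptions): once a log2(Y1)-by-G interaction is in the model, the coefficient on G is the group contrast at log2(Y1) = 0, i.e., at Y1 = 1, which is usually far outside the data and rarely meaningful. Centering the baseline moves that contrast to the average log2(Y1).

```r
# Simulate group-specific baseline slopes: 0.9 for G = 0, 1.1 for G = 1
set.seed(2)
n  <- 50000
g  <- rbinom(n, 1, 0.5)
y1 <- rlnorm(n, meanlog = 3.5, sdlog = 0.4)
log2.y2 <- 1 + 0.9*log2(y1) + log2(1.5)*g + 0.2*log2(y1)*g + rnorm(n, 0, 0.3)
y2 <- 2^log2.y2

# Uncentered: coef on g is the group contrast AT log2(Y1) = 0 (Y1 = 1)
fit <- lm(log2(y2) ~ log2(y1)*g)
coef(fit)["g"]            # an extrapolation far below the observed baselines

# Centered: coef on g is the group contrast at the mean log2(Y1)
c.l2y1 <- log2(y1) - mean(log2(y1))
fit.c  <- lm(log2(y2) ~ c.l2y1*g)
coef(fit.c)["g"]          # noticeably different, and actually interpretable
```

The two "g" coefficients differ substantially, yet both models fit the data identically; misreading the uncentered one as "the group effect" is exactly the interpretation error described above.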
For much more on this problem, study Senn, S. (2006). Change from baseline and analysis of covariance revisited. Stat Med, 25(24):4334–44.
So how do you apply the principles just covered to handle the "mathematically coupled" model, [1]? To opine on that requires knowing far more about the study's design, its research questions, and its variables than what has been described.
But I might start by applying the modeling corollary to Occam's Razor and George Box's "All models are wrong; some are useful." Ask: Is log2(Y1/Y2) ~ b.0 + b.X*X or (often better, when X is logNormal-like) log2(Y1/Y2) ~ b.0 + b.X*log2(X) useful enough? If log(Y1/Y2) needs to be adjusted for some general magnitude of Y1 and Y2, then one could add meanY = (Y1+Y2)/2 or log2(gmeanY) = log2(sqrt(Y1*Y2)) as the "adjustment" predictor. In the simulation mentioned above, the correlation between log(Y1/Y2) and log(sqrt(Y1*Y2)) was nearly 0.00, but that data generation had no built-in relationship between Y1/Y2 and sqrt(Y1*Y2).
So, could it be that
[5] log(Y1/Y2) = b.0 + b.X*X + b.1*log(sqrt(Y1*Y2)) (+ noise)
will work in this application?
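A quick sanity check of a model like [5] on simulated data (an illustrative sketch with assumed parameter values, constructed so that X truly shifts the log ratio while the geometric mean carries no information about it):

```r
# Build Y1, Y2 from a log ratio that depends on X and an independent log geometric mean
set.seed(3)
n  <- 50000
x  <- rnorm(n)
lr <- 0.3 + 0.25*x + rnorm(n, 0, 0.3)   # log(Y1/Y2), with true X effect 0.25
lg <- 3.35 + rnorm(n, 0, 0.3)           # log(sqrt(Y1*Y2)), independent of lr
y1 <- exp(lg + lr/2)
y2 <- exp(lg - lr/2)

# Fit Model [5]: the ratio outcome with the geometric mean as the adjuster
fit <- lm(log(y1/y2) ~ x + log(sqrt(y1*y2)))
coef(fit)["x"]        # recovers the built-in X effect, near 0.25
```

Unlike Model [1], the adjuster here is not a near-twin of the outcome, so the X coefficient is estimated cleanly. Whether this holds in Isabella's application depends on how Y1/Y2 and sqrt(Y1*Y2) are actually related in her data.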
Finally, per Voltaire, keep in mind that perfection is the enemy of done.
R code:
# logNormal parameters for Y1 and Y2
mean.logy1 <- 3.5
mean.logy2 <- 3.2
sd.logy <- 0.4
set.seed(170322)

# Simulate 1 million independent observations of Y1 and check its quantiles
y1 <- rlnorm(1000000, mean.logy1, sd.logy)
qlnorm(c(0.025, 0.50, 0.975), mean.logy1, sd.logy)
# [1] 15.11994 33.11545 72.52894

# Same for Y2
y2 <- rlnorm(1000000, mean.logy2, sd.logy)
qlnorm(c(0.025, 0.50, 0.975), mean.logy2, sd.logy)
# [1] 11.20113 24.53253 53.73076

# The log ratio is a functional near-twin of the raw difference...
cor(y1 - y2, log(y1/y2))
# [1] 0.9498388

# ...but (here) nearly uncorrelated with the log geometric mean
cor(log(y1/y2), log(sqrt(y1*y2)))
# [1] 0.0009271855
------------------------------
Ralph O'Brien
Professor of Biostatistics (officially retired; still keenly active)
Case Western Reserve University
http://rfuncs.weebly.com/about-ralph-obrien.html
------------------------------
Original Message:
Sent: 04-17-2018 16:52
From: Kieran McCaul
Subject: Mathematical Coupling - Is it possible to overcome it?
There is another paper by Tu and Gilthorpe that you might want to have a look at:
Tu YK and Gilthorpe MS (2007). Revisiting the relation between change and initial value: a review and evaluation. Stat Med 26(2): 443-457.
Mathematical coupling arises when people are looking at change in a variable from baseline to some time t, while accounting for the baseline level of the variable.
For example, in an RCT of an antihypertensive medication you would have blood pressure at baseline, BP0, and at 6 months, BP1, and a treatment indicator, TREAT.
Mathematical coupling arises when you regress the change in BP on the baseline value:
BP1-BP0 = BP0 + TREAT.
Apparently people used to do stuff like this (and probably still do).
It also arises in studies of agreement between two different measures. Problems in studies like these led to the paper by Bland and Altman, which was itself strongly influenced by the more general work of Oldham.
Bland JM and Altman DG (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1(8476): 307-310.
Oldham PD (1962). A note on the analysis of repeated measurements of the same subjects. J Chronic Dis 15: 969-977.
------------------------------
Kieran McCaul
Research Associate Professor
University of Western Australia
Original Message:
Sent: 04-16-2018 14:00
From: Isabella Ghement
Subject: Mathematical Coupling - Is it possible to overcome it?
Hi everyone,
While working on a project, I ran into the issue of mathematical coupling and am wondering if there is any way to overcome it.
For this project, I am fitting a linear regression model like the one below:
log(Abundance_1/Abundance_2) = beta0 + beta1*Abundance_1 + beta2*Abundance_2 + beta3*Environmental + error
where Abundance_1 and Abundance_2 are yearly measures of fish abundance in regions 1 and 2, respectively, and Environmental is an environmental variable measured yearly across both regions.
For this model, mathematical coupling arises because the outcome variable log(Abundance_1/Abundance_2) is mathematically related with the predictors Abundance_1 and Abundance_2.
After reading the article Misuses of correlation and regression analyses in orthodontic research: The problem of mathematical coupling by Yu-Kang Tu and co-authors, I understand that mathematical coupling between log(CPUE_BC/CPUE_US) and the predictors CPUE_BC and CPUE_US (the catch-per-unit-effort abundance measures playing the roles of Abundance_1 and Abundance_2 above) obscures the relationship between log(CPUE_BC/CPUE_US) and the environmental variable. This is because the variance in the values of the outcome variable log(CPUE_BC/CPUE_US) is almost completely explained by CPUE_BC and CPUE_US, and there is very little or no variance remaining to be explained by the environmental variable, whose relationship with log(CPUE_BC/CPUE_US) is of specific interest.
What I don't understand yet is if there is a principled way to estimate the relationship between log(CPUE_BC/CPUE_US) and the environmental variable while guarding against the ill-effects of mathematical coupling.
Any ideas or references you can share that would help me estimate this relationship are greatly appreciated.
Thanks very much,
Isabella
------------------------------
Isabella R. Ghement, Ph.D.
Ghement Statistical Consulting Company Ltd.
E-mail: isabella@ghement.ca
Tel: 604-767-1250
------------------------------