As I recall, this thread started out about regression through the origin, whose R^2 I have not thought about in years; I don't remember what I concluded. Sorry about how long this got.
1) R^2 does not measure model fit; it is a measure of predictive ability.
It is easy to find incorrect models with high R^2s and to construct perfect models with low R^2.
R^2 works fine as a measure of fit for COMPARING models on a single population with the usual caveat
that fitting more worthless variables typically raises R^2.
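Here is a minimal numpy sketch of that caveat (the data and variable names are my own toy example, not anything from the thread): adding pure-noise predictors never lowers, and typically raises, R^2 = 1 - SSE/SST.

import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)          # true model uses only x

def r2(X, y):
    """R^2 from a least squares fit of y on the columns of X, intercept included."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

print(r2(x, y))                               # honest model
junk = rng.normal(size=(n, 10))               # ten worthless predictors
print(r2(np.column_stack([x, junk]), y))      # R^2 creeps up anyway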
2) R^2 estimates the squared multiple correlation coefficient, i.e., the square of the maximum correlation between y and any linear predictor.
More interestingly, it is also the estimated squared correlation between a random observation and its (theoretical) best linear predictor.
As such, it is estimating something defined by its expectation over both x and y, which is why
R^2 rarely works for comparing different data sets: they have different x distributions.
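A hedged numpy illustration of that point (the simulation design is my own assumption): the same true regression function with the same error variance gives very different R^2 values when only the spread of x changes.

import numpy as np

rng = np.random.default_rng(1)

def r2_for_x(x):
    y = 1 + 2 * x + rng.normal(scale=2, size=x.size)   # identical model both times
    X = np.column_stack([np.ones(x.size), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2_for_x(rng.normal(scale=0.5, size=200)))   # narrow x spread: low R^2
print(r2_for_x(rng.normal(scale=5.0, size=200)))   # wide x spread: high R^2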
If you are doing nonlinear prediction, estimating [Corr(y,yhat)]^2 seems the obvious thing to do
and continues to share many of the nice properties of the R^2 from best linear prediction.
The theoretical best (nonlinear) predictor, E(y|x), maximizes [Corr(y,yhat)]^2.
E(y|x) can be hard to estimate but it has the same mean as y,
so requiring mean(yhat) = ybar is little enough to ask of any estimated predictor yhat.
You can even fit a transformed y variable, then back transform the predictions to the original
scale, and use [Corr(y,yhat)]^2 to compare that to untransformed fits.
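A sketch of that transform-and-back-transform comparison (the lognormal setup is my assumption, just to have something concrete): fit log(y), back-transform the fitted values, and put both fits on the same [Corr(y,yhat)]^2 scale.

import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1, 10, size=n)
y = np.exp(0.5 + 0.3 * x + rng.normal(scale=0.2, size=n))   # multiplicative errors

X = np.column_stack([np.ones(n), x])

def fitted(X, z):
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return X @ beta

def corr2(y, yhat):
    return np.corrcoef(y, yhat)[0, 1] ** 2

yhat_raw = fitted(X, y)                  # fit y directly
yhat_log = np.exp(fitted(X, np.log(y)))  # fit log(y), back-transform the predictions

print(corr2(y, yhat_raw))   # both are comparable on the original y scale
print(corr2(y, yhat_log))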
[Corr(y,yhat)]^2 also works fine when y is binary, but in that case no interesting data ever predict well.
(To predict well, the cases have to have probabilities mostly near 0 or 1.)
However, such cases rarely satisfy the condition mean(yhat) = ybar.
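A toy check of the binary point (the simulation design is mine): even the true success probabilities predict a 0/1 outcome poorly unless those probabilities sit mostly near 0 or 1.

import numpy as np

rng = np.random.default_rng(3)
n = 10_000

def corr2_binary(p):
    y = rng.binomial(1, p)                 # binary outcomes with known probabilities
    return np.corrcoef(y, p)[0, 1] ** 2

p_mid = rng.uniform(0.3, 0.7, size=n)      # probabilities near 1/2
p_ext = np.where(rng.random(n) < 0.5,
                 rng.uniform(0.0, 0.05, size=n),
                 rng.uniform(0.95, 1.0, size=n))   # probabilities near 0 or 1

print(corr2_binary(p_mid))   # small: poor prediction is unavoidable
print(corr2_binary(p_ext))   # large: only extreme probabilities predict well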
3) Yes, you can have lousy prediction with high values of [Corr(y,yhat)]^2 but that
is easy to fix. Create a new predictor by regressing y on yhat.
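A numpy sketch of that fix (toy numbers are mine): a badly shifted and scaled predictor has high [Corr(y,yhat)]^2 but terrible predictions; regressing y on yhat recalibrates it without changing the squared correlation.

import numpy as np

rng = np.random.default_rng(4)
n = 100
mu = rng.normal(size=n)
y = mu + rng.normal(scale=0.3, size=n)
yhat_bad = 50 + 10 * mu                 # right ordering, wrong location and scale

print(np.corrcoef(y, yhat_bad)[0, 1] ** 2)   # high squared correlation
print(np.mean((y - yhat_bad) ** 2))          # huge prediction error

# the fix: least squares regression of y on yhat_bad with an intercept
Z = np.column_stack([np.ones(n), yhat_bad])
gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)
yhat_new = Z @ gamma
print(np.corrcoef(y, yhat_new)[0, 1] ** 2)   # squared correlation unchanged
print(np.mean((y - yhat_new) ** 2))          # prediction error now sensible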
4) Regression through the origin is pretty strange stuff and seems inappropriate unless you
are actually collecting data near x=0 (because you are imposing a condition that is outside the
range of reasonable approximation). In particular, it assumes E(y|x=0) = 0. Regression with
any assumption of the form E(y|x=x_0) = y_0 does not seem to be very common.
In linear models with an intercept (J being the vector of 1s), [Corr(y,yhat)]^2 is the square of the cosine of the angle
between the vectors (Y-ybar J) and (Yhat-ybar J). Small angles mean good agreement and a cosine near one.
As mentioned earlier, this does not necessarily mean good prediction but that is easy to fix. [Corr(y,yhat)]^2 seems most
sensible when mean(yhat) = ybar, which always happens with least squares fits in linear regression with an intercept.
Without mean(yhat) = ybar (say, regression through the origin), you still get the squared cosine of the angle
between Y and Yhat as (Y'Yhat)^2/[(Y'Y)(Yhat'Yhat)]. Again, that does not mean yhat is a good predictor, but it is easy
to fix by regressing y on yhat either with or without an intercept.
Relative to a linear model with an intercept fitted by least squares,
this formula just adds the correction factor n(ybar)^2 to the numerator and denominator of
R^2 = SSReg/SST, which makes the number (much) closer to one.
(This new "R^2" gives us a lot of undeserved credit for being smart enough to fit a mean value to the data, something which is not being done in
regression through the origin.)
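A numerical check of this geometry, on my own simulated data: for a least squares fit with an intercept, the centered squared cosine equals R^2 = SSReg/SST, while the uncentered version (Y'Yhat)^2/[(Y'Y)(Yhat'Yhat)] equals (SSReg + n*ybar^2)/(SST + n*ybar^2).

import numpy as np

rng = np.random.default_rng(5)
n = 60
x = rng.normal(size=n)
y = 5 + 2 * x + rng.normal(size=n)      # nonzero mean so the correction matters

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta

ybar = y.mean()
ssreg = np.sum((yhat - ybar) ** 2)
sst = np.sum((y - ybar) ** 2)

r2 = ssreg / sst
cos2_centered = (np.dot(y - ybar, yhat - ybar) ** 2 /
                 (np.sum((y - ybar) ** 2) * np.sum((yhat - ybar) ** 2)))
cos2_uncentered = np.dot(y, yhat) ** 2 / (np.dot(y, y) * np.dot(yhat, yhat))

print(r2, cos2_centered)                                              # equal
print(cos2_uncentered, (ssreg + n * ybar**2) / (sst + n * ybar**2))   # equal, and much closer to one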
The formula [Corr(y,yhat)]^2 = R^2 = 1 - SSE/SST only works because it has built into it the
least squares estimation of the best linear predictor of y. If you do robust or penalized estimation,
even on a linear model, it can break down. It requires
(Y-ybar J)'(Yhat-ybar J) = (Yhat-ybar J)'(Yhat-ybar J) = (Y-ybar J)'(Y-ybar J) - (Y-Yhat)'(Y-Yhat),
all of which hold for least squares estimation in linear models with an intercept but
are not likely to hold otherwise. In particular, they do hold after creating a new predictor by
regressing y on an old predictor and an intercept.
(Least squares estimation for linear models without
an intercept has (Y'Yhat)^2/[(Y'Y)(Yhat'Yhat)] = 1 - SSE/Y'Y.
More generally, best linear prediction through the origin seems completely analogous to least squares regression
through the origin.)
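A small numpy check of these last two claims (the ridge penalty value and the data are my assumptions): with a penalized fit, 1 - SSE/SST and [Corr(y,yhat)]^2 no longer agree, while for least squares through the origin the uncentered squared cosine equals 1 - SSE/Y'Y.

import numpy as np

rng = np.random.default_rng(6)
n = 80
x = rng.normal(size=n)
y = 3 + 1.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# a crude ridge estimate (it also shrinks the intercept): (X'X + lam*I)^{-1} X'y
lam = 25.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
yhat_r = X @ beta_ridge
sst = np.sum((y - y.mean()) ** 2)
print(1 - np.sum((y - yhat_r) ** 2) / sst)    # "R^2" from 1 - SSE/SST
print(np.corrcoef(y, yhat_r)[0, 1] ** 2)      # squared correlation: a different number

# least squares through the origin (no intercept column)
b0 = np.dot(x, y) / np.dot(x, x)
yhat_0 = b0 * x
cos2 = np.dot(y, yhat_0) ** 2 / (np.dot(y, y) * np.dot(yhat_0, yhat_0))
print(cos2, 1 - np.sum((y - yhat_0) ** 2) / np.dot(y, y))   # these two agree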
------------------------------
Ronald Christensen
Univ of New Mexico
------------------------------
Original Message:
Sent: 06-02-2020 13:59
From: Leland Wilkinson
Subject: R^2
Thanks, everyone, for your helpful contributions on this topic. I am still bothered by the terminology. The "textbook" coefficient of determination (COD) formula for R^2 isn't the square of anything and it isn't a correlation, except in a limited OLS case. It is a misnomer perpetuated by people's desire to have a quasi-R^2 statistic they can extend to nonlinear and machine learning models. As you know, there are several other quasi-R^2 statistics, and none of them is very satisfying. I've also decided to stop using the terminology R^2 for this statistic and instead call it by its more proper name, Coefficient of Determination.
There was one correction, made by two commenters, that the COD cannot be negative in simple nonlinear regression. I misstated that case. For other models, however, it definitely can be negative.
The discussion reminded me of one of my favorite papers by my dissertation advisor. (https://en.wikipedia.org/wiki/Abelson%27s_paradox). Robert Abelson was a student of John Tukey and we used the pre-publication copy of Tukey's EDA book in his analysis of variance course. Like Tukey, Abelson resisted the oversimplifications brought on by the worship of single summary statistics. He showed in this paper how important scientific results were often paradoxically associated with small R^2 values. It was then, in the early 1970's, that I also learned how important graphical data analysis was. Abelson and Tukey taught us that presenting statistical (or machine learning) models without examining residuals graphically is irresponsible.
Thanks for this discussion. I learned a lot, which was my original intention.
------------------------------
Leland Wilkinson
H2O
Original Message:
Sent: 05-29-2020 10:24
From: Leland Wilkinson
Subject: R^2
Forgive me for such an elementary question, but it seems to come up on the help lines I've seen running all the way from SYSTAT in the 1980's to H2O (where I now work). The question arises from the customary formula for R^2 when used as a coefficient of determination, namely 1 - rss/tss. Under certain instances (e.g., regression without a constant) this number can be negative.

At H2O, where we use R^2 as a goodness of fit measure for a variety of classical statistical and machine learning models, we use an obvious alternative: it is the square of the Pearson correlation between the observed and fitted values on the predicted variable -- that is unarguably a goodness of fit measure and it corresponds to the 1 - rss/tss formula in the appropriate instances. Now, I have nothing against the 1 - rss/tss formula, but don't users understand that it is defined (and arose historically) only for the restricted case of linear regression with a constant?

In response, some users say that they want to see a negative value of R^2 as an indicator that their model is doing worse than the null model. But to them, I say the proper thing for a stat program to print is "undefined" or "missing" or maybe that R is an imaginary number! As for Wikipedia entries and numerous intro stat books, this isn't a popularity contest. Extending the 1 - rss/tss value to negative values requires some sort of argument regarding its meaning. I haven't heard one.
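A quick numpy illustration of the negative case (the made-up data are mine): for a no-intercept fit, 1 - rss/tss can go well below zero, while the squared Pearson correlation between observed and fitted values stays in [0, 1].

import numpy as np

rng = np.random.default_rng(7)
n = 40
x = rng.uniform(1, 2, size=n)
y = 10 - x + rng.normal(scale=0.5, size=n)   # large positive mean, negative slope

b = np.dot(x, y) / np.dot(x, x)              # least squares through the origin
yhat = b * x
rss = np.sum((y - yhat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

print(1 - rss / tss)                         # negative "coefficient of determination"
print(np.corrcoef(y, yhat)[0, 1] ** 2)       # well-defined squared correlation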
------------------------------
Leland Wilkinson
H2O
------------------------------