ASA Connect


R^2

  • 1.  R^2

    Posted 05-29-2020 10:24
    Forgive me for such an elementary question, but it seems to come up on the help lines I've seen running all the way from SYSTAT in the 1980's to H2O (where I now work). The question arises from the customary formula for R^2 when used as a coefficient of determination, namely 1 - rss/tss. Under certain instances (e.g., regression without a constant) this number can be negative. At H2O, where we use R^2 as a goodness of fit measure for a variety of classical statistical and machine learning models, we use an obvious alternative: it is the square of the Pearson correlation between the observed and fitted values on the predicted variable -- that is unarguably a goodness of fit measure and it corresponds to the 1 - rss/tss formula in the appropriate instances. Now, I have nothing against the 1 - rss/tss formula, but don't users understand that it is defined (and arose historically) only for the restricted case of linear regression with a constant? In response, some users say that they want to see a negative value of R^2 as an indicator that their model is doing worse than the null model. But to them, I say the proper thing for a stat program to print is "undefined" or "missing" or maybe that R is an imaginary number! As for Wikipedia entries and numerous intro stat books, this isn't a popularity contest. Extending the 1 - rss/tss value to negative values requires some sort of argument regarding its meaning. I haven't heard one. 

    ------------------------------
    Leland Wilkinson
    H2O
    ------------------------------


  • 2.  RE: R^2

    Posted 05-31-2020 16:11
    Hi Lee,
    I couldn't agree with you more. I've been advocating using corr^2(yhat, y) as a goodness-of-fit measure for years. It's equivalent to R^2 for regression, but as you point out, it generalizes much better to any other model -- neural nets, SVMs, or whatever. I've had this running argument with John Sall for about 20 years now.

    ------------------------------
    Dick De Veaux
    Vice President ASA 2019-2022
    Past Chair Statistical Learning and Data Science
    Williams College
    ------------------------------



  • 3.  RE: R^2

    Posted 06-01-2020 07:45
    Hi Lee, Hi Dick,

    I agree as well. R^2 in many situations does not sum up a model well. Slightly more generally, I have found common situations where every summary performance metric works poorly (including likelihood). I ended up going back to Deming's 1938 guidance: "The object of collecting data is to provide a basis for action." So, like many, I begin modeling by defining the purpose (or set of purposes) for which the model will be used. (Yes, of course, always opportunistically looking for unanticipated uses too.) In the end, the measure of model performance is how well it does for that purpose.

    For example, models are often used to make a decision -- above some predicted value, "say yes"; below it, "say no"; or whatever. Thus a Type 1 vs. Type 2 error trade-off is called for. I will weight the costs of each error and find the optimal threshold (if the model will be used over a range of thresholds, I will weight over that range); a sketch of this appears below. This is the optimal performance of the model. Note that this measure is not isotonic in likelihood (or R^2, or any other measure for that matter). This is the common experience of looking at the AUC (or Gini, etc.) and seeing "worse" likelihood-based models perform better. So, I will explore a family of models and select the one that (out of sample) makes the best decisions. Note again that economic weighting of the error costs is essential to model selection -- balancing Type 1 and Type 2 error rates one-to-one, of course, often produces dumb decisions.
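
    A minimal sketch of the cost-weighted threshold search Bill describes (the function name and the exhaustive scan over observed scores are illustrative, and in practice the search should run on out-of-sample data):

    ```python
    import numpy as np

    def optimal_threshold(y_true, scores, cost_fp, cost_fn):
        """Scan candidate cutoffs and return the one minimizing total
        misclassification cost: cost_fp per Type 1 error (false positive)
        and cost_fn per Type 2 error (false negative)."""
        y_true, scores = np.asarray(y_true), np.asarray(scores)
        best_t, best_cost = None, np.inf
        for t in np.unique(scores):
            pred = (scores >= t).astype(int)
            fp = np.sum((pred == 1) & (y_true == 0))
            fn = np.sum((pred == 0) & (y_true == 1))
            cost = cost_fp * fp + cost_fn * fn
            if cost < best_cost:
                best_t, best_cost = t, cost
        return best_t, best_cost
    ```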

    I have puzzled a few times over how to directly search for such a decision-optimal model. Never figured it out. All the standard approaches (least squares, logistic, ...) are maximum likelihood, which does not find the optimal decision-making coefficients. Even statistical learning approaches (boosted trees, RF, etc.), deep inside, still have a likelihood measure of goodness to guide the growth, though they will often look at a family of likelihood-optimized models and select the decision-optimal best one.

    The search for optimal decision-making models probably needs some operations-research class of optimization -- outside of what I know how to explore. In practice, exploring a family of models and selecting the best performer (on an out-of-sample basis) has been sensible. (I appeal to Good's guidance on honest ad-hockery.)

    Best,

    Bill Kahn

    ------------------------------
    William Kahn
    ------------------------------



  • 4.  RE: R^2

    Posted 06-02-2020 10:27
    Bill K.,

    You wrote, "For example, models are often used to make a decision" and "I have puzzled a few times on how to directly search for such a decision-optimal model." 

    Have you tried something like http://www.statsathome.com/2017/10/12/bayesian-decision-theory-made-ridiculously-simple/?   If so, what was your experience?
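
    For anyone who hasn't read the linked post, the core move in Bayesian decision theory is choosing the action that minimizes expected loss over the posterior. A generic sketch of that idea (the function and argument names here are illustrative, not the blog's code):

    ```python
    import numpy as np

    def best_action(posterior_draws, actions, loss):
        """For each candidate action, average loss(action, theta) over the
        posterior parameter draws, then return the expected-loss minimizer."""
        expected = [np.mean([loss(a, theta) for theta in posterior_draws])
                    for a in actions]
        return actions[int(np.argmin(expected))]
    ```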

    Bill H.

    ------------------------------
    Bill Harris
    Data & Analytics Consultant
    Snohomish County PUD
    ------------------------------



  • 5.  RE: R^2

    Posted 06-02-2020 12:10

    This has been an interesting thread! Inspired by the initial posts, John Sall, Russ Wolfinger, Lee Wilkinson, and I engaged in a prolonged email discussion yesterday. I love learning from smart people. Here are some of the insights that became clearer to me after our discussion:

    As we all know, for simple linear regression, the two measures R^2 = 1 - SSE/SST and cor(yhat, y)^2 are the same. The question is how these compare as measures of goodness of fit when generalizing to more advanced models from machine learning, like neural networks and random forests, among others.

    The disadvantage of R^2 = 1 - SSE/SST is that it can become negative when the fit is particularly bad -- that is, when the residual sum of squares (SSE) is greater than the variation of y around its mean (SST). Now, some might argue that that's a feature, not a bug, since it highlights a particularly bad fit. Others don't like the fact that a quantity called R "squared" can go negative.

    Many researchers look at the plot of y vs. predicted y on both the training and test sets when assessing models, and cor(yhat, y) measures the correlation between them. But there are some anomalous cases where it doesn't directly indicate a good model fit. That's because correlation doesn't change under linear transformations of the variables, and so, for example, in the case of linear regression without an intercept, cor(yhat, y)^2 can be very high even though the line is a terrible fit (see the sketch below). In this case it simply means that there is a linear transformation of yhat that would be a good fit -- which is not particularly useful. It doesn't help to know that a good model would be a + bx when our model is c + dx. For this reason some researchers avoid it altogether.
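
    A minimal numeric illustration of that anomaly, on made-up toy data (all numbers here are for illustration only):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)
    y = x + rng.normal(0, 0.1, 100)   # the data are essentially y = x

    y_hat = 100 * x + 50              # a badly miscalibrated "model"

    r2_cod = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
    r2_cor = np.corrcoef(y, y_hat)[0, 1]**2

    print(r2_cod)  # hugely negative: predictions are far worse than the mean
    print(r2_cor)  # ~1.0: correlation is blind to the linear miscalibration
    ```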

    However, when used for testing and training for most models, these quantities are generally the same. R^2 on a test set can go negative, and in theory cor(yhat, y)^2 can be high even when a linear transformation is needed, but these situations are rare. It might be a good idea to consider both, and of course the RMSE (root mean squared error), as an indication of the strength of the predictive ability as well. And, of course, for any model, it is wise to consider whether it makes sense and is useful for the problem at hand, not just whether it happened to predict well in this instance.



    ------------------------------
    Dick De Veaux
    Vice President ASA 2019-2022
    Past Chair Statistical Learning and Data Science
    Williams College
    ------------------------------



  • 6.  RE: R^2

    Posted 06-01-2020 11:56
    Leland, 

    The way I see a negative R^2 is that such a model predicts worse than one that just fits the average of y (without accounting for X). I think there is some value in knowing this.

    Best regards
    Sven

    ------------------------------
    Sven Serneels
    Director, Data Analytics
    .: posting as a private person :.
    ------------------------------



  • 7.  RE: R^2

    Posted 06-01-2020 12:07

    Regression through the origin:

    SST = Σ y_i^2

    b = Σ x_i y_i / Σ x_i^2,  so  SSR = b^2 Σ x_i^2

    SSE = SST - SSR

    R^2 = SSR/SST = 1 - SSE/SST.

    This is never negative: both SSR and SST are sums of squares, so SSR/SST >= 0, and SSE = Σ (y_i - b x_i)^2 >= 0 keeps it at most 1.



    ------------------------------
    [S.R.S. Rao] [Poduri][Professor of Statistics][University of Rochester][Rochester, NY 14627]
    ------------------------------




  • 8.  RE: R^2

    Posted 06-01-2020 12:31
    Hi,

    Please note that in the absence of an intercept, SST equals sum(y^2), not (n-1)*var(y). R^2 = 1 - RSS/TSS will never be negative if the former (the correct one) is used. It can be negative if the latter definition of SST (the wrong one) is used; that definition is correct only when you're fitting a model with an intercept. My best guess is that you've been using an SST derived from a model with an intercept together with an RSS computed from a model without an intercept.
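
    A quick sketch of the two definitions side by side, on made-up data whose true line has a large intercept (the data and numbers are purely illustrative):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(1, 10, 50)
    y = 3 * x + 20 + rng.normal(0, 1, 50)   # true line has a large intercept

    b = np.sum(x * y) / np.sum(x * x)       # least squares through the origin
    sse = np.sum((y - b * x) ** 2)

    r2_uncentered = 1 - sse / np.sum(y ** 2)             # SST = sum(y^2)
    r2_centered = 1 - sse / np.sum((y - y.mean()) ** 2)  # intercept-model SST

    print(r2_uncentered)  # nonnegative, consistent with the previous post's algebra
    print(r2_centered)    # negative here: the origin constraint fits badly
    ```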


    ------------------------------
    Mohammad Hattab
    Biostatistician
    ------------------------------



  • 9.  RE: R^2

    Posted 06-01-2020 16:02
    In addition to the postings responding directly to this question, I'll add that R^2 is likely the second most overused tool in the box, next to an isolated p-value; it is often assumed to mean more, all by itself, than it really does. There is a tendency, I think, to assume that it is better to quantify with one number, and to think that less subjective than studying a graph, but I would generally rather have a graphical residual analysis than even know the R^2 value. (To avoid overfitting to your particular sample, cross-validation would be good.)

    R^2 may not tell you much. Estimated variances of prediction errors, for individual cases and for predicted totals in finite populations, may be helpful. Also, there is cross-validation.

    ------------------------------
    James Knaub (Jim)
    Retired Lead Mathematical Statistician
    ------------------------------



  • 10.  RE: R^2

    Posted 06-02-2020 13:59
    Thanks, everyone, for your helpful contributions on this topic. I am still bothered by the terminology. The "textbook" coefficient of determination (COD) formula for R^2 isn't the square of anything and it isn't a correlation, except in a limited OLS case. It is a misnomer perpetuated by people's desire to have a quasi-R^2 statistic they can extend to nonlinear and machine learning models. As you know, there are several other quasi-R^2 statistics, and none of them is very satisfying. I've also decided to stop using the terminology R^2 for this statistic and instead call it by its more proper name, Coefficient of Determination.

    There was one correction, made by two commenters: the COD cannot be negative in simple linear regression through the origin when the total sum of squares is computed appropriately. I misstated that case. For other models, however, it definitely can be negative.

    The discussion reminded me of one of my favorite papers, by my dissertation advisor (https://en.wikipedia.org/wiki/Abelson%27s_paradox). Robert Abelson was a student of John Tukey, and we used the pre-publication copy of Tukey's EDA book in his analysis of variance course. Like Tukey, Abelson resisted the oversimplifications brought on by the worship of single summary statistics. He showed in this paper how important scientific results were often paradoxically associated with small R^2 values. It was then, in the early 1970s, that I also learned how important graphical data analysis was. Abelson and Tukey taught us that presenting statistical (or machine learning) models without examining residuals graphically is irresponsible.

    Thanks for this discussion. I learned a lot, which was my original intention.

    ------------------------------
    Leland Wilkinson
    H2O
    ------------------------------



  • 11.  RE: R^2

    Posted 06-02-2020 17:35
    I share your preference for, if not your eloquence about, graphical analysis over single-number summary statistics. Yet there is at least one other single-number statistic that seems worth mentioning for anyone who hasn't seen it: the Nash-Sutcliffe efficiency. Its range extends from -infinity to +1.
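
    For reference, a minimal sketch of the computation (it is the 1 - SSE/SST form applied to observed vs. predicted values, which is why the range is (-infinity, 1]):

    ```python
    import numpy as np

    def nash_sutcliffe(y_obs, y_pred):
        """Nash-Sutcliffe efficiency: 1 - SSE/SST. 1 is a perfect fit; 0 matches
        predicting the observed mean; negative values do worse than the mean."""
        y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
        sse = np.sum((y_obs - y_pred) ** 2)
        sst = np.sum((y_obs - y_obs.mean()) ** 2)
        return 1.0 - sse / sst
    ```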

    Bill

    ------------------------------
    Bill Harris
    Data & Analytics Consultant
    Snohomish County PUD
    ------------------------------



  • 12.  RE: R^2

    Posted 06-03-2020 12:32
    Most statistical measures can be deceiving. There are two final measures of how well a model is performing.

    One is how it does against an independent data set. Bear in mind, I do not mean a holdout; holdout performance only confirms that the model performs well within the data on which it was developed. Performance on an independent data set (out of time and/or space) will measure how well the model will perform in the future or in other spaces.

    The second measure is how well the model performs in actual application -- how much better it does than average. A very significant model can be only slightly better than average performance if the sample is large enough. The key here is to look at the space into which the model will be applied. A significant prediction of performance is only useful if it predicts significantly useful results. For example, if a risk model predicts risk that is 5x the average for a significant part of the population, then it is useful. If it predicts risk that is 1.01x average for all but a small group, even if that is at a very significant level, it may not be very useful.


    ------------------------------
    Michael Mout
    ------------------------------



  • 13.  RE: R^2

    Posted 06-03-2020 15:16
    Yes, Michael, I couldn't agree more. We were really talking about comparing models based on the data we have. Admittedly, it was perhaps an academic exercise in goodness-of-fit measures. Certainly cross-validation is almost always too optimistic because, as you say, a new data set will not be collected under the same conditions. The proof really is in the pudding -- which is what I was trying to hint at in my last sentence about whether the model turns out to be useful. It certainly won't last forever, but even how long it's useful will depend on the area of application. We should always remember George Box -- especially his warning that statisticians, like artists, have the bad habit of falling in love with their models.

    ------------------------------
    Dick De Veaux
    Vice President ASA 2019-2022
    Past Chair Statistical Learning and Data Science
    Williams College
    ------------------------------



  • 14.  RE: R^2

    Posted 06-03-2020 15:26
    Richard, my point is that modelers sometimes fall in love with their models and/or tools and lose sight of how they will be used and what the "business/practical" impact will be.

    ------------------------------
    Michael Mout

    ------------------------------



  • 15.  RE: R^2

    Posted 06-03-2020 19:48

    As I recall, this thread started out about regression through the origin, the R^2 for which I have not thought about in years, and I don't remember what I concluded. Sorry about how long this got.

    1) R^2 does not measure model fit; it is a measure of predictive ability. It is easy to find incorrect models with high R^2s and to construct perfect models with low R^2. R^2 works fine as a measure of fit for COMPARING models on a single population, with the usual caveat that fitting more worthless variables typically raises R^2.

    2) R^2 estimates the squared multiple correlation coefficient, i.e., the maximum squared correlation between y and any linear predictor. More interestingly, it is also the estimated squared correlation between a random observation and its (theoretical) best linear predictor. As such, it is estimating something defined by its expectation over both x and y, which is why R^2 rarely works for comparing different data sets: they have different x distributions.

    If you are doing nonlinear prediction, estimating [Corr(y,yhat)]^2 seems the obvious thing to do, and it continues to share many of the nice properties of the R^2 from best linear prediction. The theoretical best (nonlinear) predictor, E(y|x), maximizes [Corr(y,yhat)]^2. E(y|x) can be hard to estimate, but it has the same mean as y, so requiring mean(yhat) = ybar is little enough to ask of any estimated predictor yhat. You can even fit a transformed y variable, then back-transform the predictions to the original scale, and use [Corr(y,yhat)]^2 to compare that to untransformed fits. [Corr(y,yhat)]^2 also works fine when y is binary, but in that case no interesting data ever predict well. (To predict well, the cases have to have probabilities mostly near 0 or 1.) However, such cases rarely satisfy the condition mean(yhat) = ybar.

    3) Yes, you can have lousy prediction with high values of [Corr(y,yhat)]^2, but that is easy to fix: create a new predictor by regressing y on yhat (see the sketch below).
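
    A minimal sketch of that fix, applied to the kind of miscalibrated toy example shown earlier in the thread (the function name and data are illustrative):

    ```python
    import numpy as np

    def recalibrate(y, y_hat):
        """Regress y on y_hat (simple least squares with an intercept) and
        return the corrected predictions a*y_hat + b."""
        a, b = np.polyfit(y_hat, y, 1)   # slope, intercept
        return a * y_hat + b

    # Example: y essentially equals x, but the predictor is 100*x + 50.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)
    y = x + rng.normal(0, 0.1, 100)
    y_hat = 100 * x + 50
    y_fixed = recalibrate(y, y_hat)
    print(1 - np.sum((y - y_fixed)**2) / np.sum((y - y.mean())**2))  # now near 1
    ```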

    4) Regression through the origin is pretty strange stuff and seems inappropriate unless you are actually collecting data near x = 0 (because you are imposing a condition that is outside the range of reasonable approximation). In particular, it assumes E(y|0) = 0. Regression with any assumption E(y|x_0) = y_0 does not seem to be very common.

    In linear models with an intercept (vector J), [Corr(y,yhat)]^2 is the square of the cosine of the angle between the vectors (Y - ybar J) and (Yhat - ybar J). Small angles mean good agreement and a cosine near one. As mentioned earlier, this does not necessarily mean good prediction, but that is easy to fix. [Corr(y,yhat)]^2 seems most sensible when mean(yhat) = ybar, which always happens with least squares fits in linear regression with an intercept.

    Without mean(yhat) = ybar (say, regression through the origin), you still get the squared cosine of the angle between Y and Yhat as (Y'Yhat)^2 / [(Y'Y)(Yhat'Yhat)]. Again, that does not mean yhat is a good predictor, but it is easy to fix by regressing y on yhat, either with or without an intercept. Relative to a linear model with an intercept fitted by least squares, this formula just adds the correction factor n(ybar)^2 to the numerator and denominator of R^2 = SSReg/SST, which makes the number (much) closer to one. (This new "R^2" gives us a lot of undeserved credit for being smart enough to fit a mean value to the data, something which is not being done in regression through the origin.)

    The formula [Corr(y,yhat)]^2 = R^2 = 1 - SSE/SST only works because it has built into it the least squares estimation of the best linear predictor of y. If you do robust or penalized estimation, even on a linear model, it can break down. It requires

    (Y - ybar J)'(Yhat - ybar J) = (Yhat - ybar J)'(Yhat - ybar J) = (Y - ybar J)'(Y - ybar J) - (Y - Yhat)'(Y - Yhat),

    all of which holds for least squares estimation in linear models with an intercept but is not likely to hold otherwise. In particular, it does work after creating a new predictor by regressing y on an old predictor and an intercept. (Least squares estimation for linear models without an intercept has (Y'Yhat)^2 / [(Y'Y)(Yhat'Yhat)] = 1 - SSE/Y'Y. More generally, best linear prediction through the origin seems completely analogous to least squares regression through the origin.)

    ------------------------------
    Ronald Christensen
    Univ of New Mexico
    ------------------------------



  • 16.  RE: R^2

    Posted 06-02-2020 14:40
    Re: the square of the correlation coefficient between observed and predicted as goodness of fit. This requires some (strong?) assumptions about the process leading to a fitted model. If you cast about measuring the goodness of fit of various randomly constructed models (and ML is sometimes something like that!), you could come up with a situation like this. Data: y = x. Model: y_hat = 100*x + 50. Rho(y, y_hat) = 1; therefore goodness of fit is perfect and we can stop looking?

    ------------------------------
    John Major
    Guy Carpenter & Co., LLC
    ------------------------------



  • 17.  RE: R^2

    Posted 06-05-2020 09:54
    There is a poster related to this thread being presented this morning at SDSS 2020 by Gyasi K. Dapaa.
    It references this blog post: https://gkdblog.com/two-notes-about-the-two-faces-of-r-squared/

    ------------------------------
    Arthur Carbonare De Avila
    ------------------------------