ASA Connect

  • 1.  Coefficient of Determination

    Posted 11-30-2017 16:20
    I've been reading the Wikipedia entry for the Coefficient of Determination (CoD) and am dumbfounded by the amount of misinformation on that page. My reading led me to a bunch of other statistics and machine learning blogs and sites containing similar errors. First of all, the Wikipedia article equates R^2 with the CoD, without noting that these two statistics have different interpretations and computational formulas depending on the type of model. Only in the context of certain OLS models are they equal.

    The CoD was introduced almost 100 years ago, not long after the multiple correlation coefficient was so named. Both were explained in terms of an OLS linear model. Applying the 1 - SSR/SST formula in the case of nonlinear least squares or certain machine learning models, however, can lead to negative values for R^2. Wikipedia explains this circumstance as follows: "Important cases where the computational definition of R^2 can yield negative values, depending on the definition used, arise where the predictions that are being compared to the corresponding outcomes have not been derived from a model-fitting procedure using those data, and where linear regression is conducted without including an intercept." I have no idea what that sentence means.

    I got on to this topic because I now work in a machine learning company (H2O) where users frequently ask, "How can R^2 be negative when it's the square of something?" (I got the same question when my SYSTAT users ran nonlinear regression models.) My answer to their reasonable question is that the software is using the wrong formula. Instead of the "proportion of variance" formula, the software should be correlating observed and predicted values and then squaring that correlation. This latter calculation generalizes beautifully to the results of many nonlinear models, GLMs, random forests, etc. But the interpretation of the value in these cases should not be made in terms of proportion of variance accounted for by the model. That works only for OLS linear models.
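
    A minimal sketch of the two calculations contrasted above, added for illustration and not taken from any particular package. The function names, the simulated data, and the deliberately biased predictions are all assumptions made for the example. It computes 1 - SSR/SST and the squared correlation between observed and predicted values, first for an OLS fit with an intercept and then for predictions that did not come from such a fit.

```python
import random


def mean(values):
    return sum(values) / len(values)


def r2_variance_formula(y, y_hat):
    """The "proportion of variance" formula: 1 - SSR/SST."""
    y_bar = mean(y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    return 1.0 - ssr / sst


def r2_squared_correlation(y, y_hat):
    """Squared Pearson correlation between observed and predicted values."""
    y_bar, f_bar = mean(y), mean(y_hat)
    cov = sum((yi - y_bar) * (fi - f_bar) for yi, fi in zip(y, y_hat))
    var_y = sum((yi - y_bar) ** 2 for yi in y)
    var_f = sum((fi - f_bar) ** 2 for fi in y_hat)
    return cov ** 2 / (var_y * var_f)


def ols_fit(x, y):
    """Simple linear regression with an intercept; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Simulated data (hypothetical): y = 2 + 3x + noise
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1) for xi in x]

# Case 1: predictions from OLS with an intercept. The two formulas agree.
intercept, slope = ols_fit(x, y)
ols_pred = [intercept + slope * xi for xi in x]

# Case 2: predictions that are not a least-squares fit to these data
# (here, the OLS predictions shifted by a constant). The variance
# formula goes negative; the squared correlation is unchanged.
biased_pred = [f + 10.0 for f in ols_pred]

for label, pred in [("OLS with intercept", ols_pred), ("shifted predictions", biased_pred)]:
    print(label, r2_variance_formula(y, pred), r2_squared_correlation(y, pred))
```

    In the second case the squared correlation still lies between 0 and 1, but, as noted above, it should not be read as a proportion of variance accounted for.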

    The problem is not only with negative values. Some nonlinear regression programs print ridiculously large R^2 values and then try to adjust them or explain that they are not useful in the context of nonlinear models. It's not that their R^2 values are not useful; they're wrong.

    When I discussed this problem around the office, Erin LeDell suggested the following link: 


    Makes sense to me.

    Lee Wilkinson


  • 2.  RE: Coefficient of Determination

    Posted 12-01-2017 10:16
    Lee,
    While I agree with your observations completely, I am in awe of the amount of time and energy you have to (a) read Wikipedia for fun and (b) suggest repairs/corrections. I suspect that you have too much spare time.
    Hope all is well with you and your family and that your moving/house closings/etc. are progressing smoothly.
    H





  • 3.  RE: Coefficient of Determination

    Posted 12-01-2017 16:19

    Lee,

        Isn't Wikipedia supposed to be "crowd sourced", so that "everybody" can go in and fix errors (I guess even minor ones that are far less egregious than the one you describe)?  Have you tried getting engaged in that?

    Steve

     






  • 4.  RE: Coefficient of Determination

    Posted 12-04-2017 09:49
    Yes, Steve, I did try that on another piece of statistical misinformation. The editors reverted my changes.
    Lee





  • 5.  RE: Coefficient of Determination

    Posted 12-05-2017 08:02
    Perhaps ASA should offer to partner with Wikipedia and form an editorial board to
    arbitrate the correctness of edits on statistical content. I'm sure Lee's experience of
    working to improve an article and having his work rejected was very frustrating. On
    the whole, I find Wikipedia useful and believe it is a resource worthy of investment.

    Elgin S. Perry, Ph.D.
    Statistics Consultant
    377 Resolutions Rd.
    Colonial Beach, Va. 22443
    ph. 410.610.1473




  • 6.  RE: Coefficient of Determination

    Posted 12-06-2017 11:45
    I edited a few Wikipedia pages a while back, and if someone reverts your edits, you should look at the "Talk" tab for that page and possibly add a comment explaining the rationale for your change. It may also be wise to look at the following page before editing anything of a statistical nature on Wikipedia.

    https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Statistics

    If you are like me, you don't like other people tinkering with the stuff you've written, and that's not a good mindset for working on Wikipedia. But the quality of Wikipedia's articles on Statistics is quite uneven and they certainly need input from us, whether they realize it or not. If you have the time, keep plugging away, because persistence pays off on Wikipedia.

    --
    Steve Simon, mail@pmean.com
    I'm blogging now! blog.pmean.com




  • 7.  RE: Coefficient of Determination

    Posted 12-06-2017 04:27
    A solo contributor has little chance with the Wikipedia masters.  In today's world it might make sense to launch a Twitter shaming campaign.  Outright errors should not be tolerated and neither should incomprehensible text.

    ------------------------------
    Dan Steinberg
    Chief Scientist and Product Evangelist
    Salford Systems, A Minitab Company
    ------------------------------



  • 8.  RE: Coefficient of Determination

    Posted 12-04-2017 12:40

    It's very easy to edit Wikipedia articles. And there's a Statistics WikiProject solely devoted to improving coverage of statistics topics on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Statistics. There you can find a matrix of articles by topic and quality. I encourage anyone to dive in!

     

    Jason






  • 9.  RE: Coefficient of Determination

    Posted 12-06-2017 11:52
    This thread appears to have devolved into a discussion of Wikipedia. That was not my intent. Wikipedia, in this instance, reflects the opinions of a segment of the statistical software user and developer community. I was trying to point out that the misuse of CoD is widespread in the machine learning and nonlinear modeling communities and that this misuse represents a false generalization of R^2 from linear models to nonlinear contexts. Editing the Wikipedia page will not solve the problem, in my opinion, because opponents could easily dredge up other published resources that promulgate the same misinformation. I don't want to get involved with this controversy. I just wanted to point it out.

    ------------------------------
    Leland Wilkinson
    H2O
    ------------------------------