I've been reading the Wikipedia entry for the Coefficient of Determination (CoD) and am dumbfounded by the amount of misinformation on that page. My reading led me to a number of other statistics and machine learning blogs and sites containing similar errors. First of all, the Wikipedia article equates R^2 with the CoD, without recognizing that these two statistics have different interpretations and computational formulas depending on the type of model. Only in the context of certain OLS models are they equal.
The CoD was introduced almost 100 years ago, not long after the multiple correlation coefficient was so named. Both were explained in terms of an OLS linear model. Applying the 1 - SSR/SST formula in the case of nonlinear least squares or certain machine learning models, however, can lead to negative values for R^2. Wikipedia explains this circumstance as follows: "Important cases where the computational definition of R^2 can yield negative values, depending on the definition used, arise where the predictions that are being compared to the corresponding outcomes have not been derived from a model-fitting procedure using those data, and where linear regression is conducted without including an intercept." I have no idea what that sentence means.
I got on to this topic because I now work in a machine learning company (H2O) where users frequently ask, "How can R^2 be negative when it's the square of something?" (I got the same question when my SYSTAT users ran nonlinear regression models.) My answer to their reasonable question is that the software is using the wrong formula. Instead of the "proportion of variance" formula, the software should be correlating observed and predicted values and then squaring that correlation. This latter calculation generalizes beautifully to the results of many nonlinear models, GLMs, random forests, etc. But the interpretation of the value in these cases should not be made in terms of proportion of variance accounted for by the model. That works only for OLS linear models.
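To make the contrast concrete, here is a small sketch (not from the original post; the poor "predictions" are fabricated for illustration) comparing the two computations on the same data. The "proportion of variance" formula goes negative when predictions fit worse than the mean, while the squared correlation of observed and predicted values always lands in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)
# Deliberately bad predictions: anti-correlated with y, plus noise.
# A poorly fitting nonlinear model can behave like this.
y_pred = -y + rng.normal(size=50)

# "Proportion of variance" formula: 1 - SSR/SST
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_variance = 1 - ss_res / ss_tot   # negative here: SSR > SST

# Squared correlation of observed and predicted values
r = np.corrcoef(y, y_pred)[0, 1]
r2_correlation = r ** 2             # always between 0 and 1

print(r2_variance, r2_correlation)
```

Only for an OLS linear model with an intercept do the two formulas agree; outside that setting, the second is the one that remains a bona fide square.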
The problem is not limited to negative values. Some nonlinear regression programs print absurdly large R^2 values and then try to adjust them, or explain that they are not useful in the context of nonlinear models. It's not that their R^2 values are not useful; they're wrong.
When I discussed this problem around the office, Erin LeDell suggested the following link:
Makes sense to me.
Lee Wilkinson