ASA Connect

  • 1.  Intercept Value

    Posted 03-26-2016 14:42

    By changing the value of the intercept (beta 0) to 1, the R-squared and adjusted R-squared values change from 30 percent to about 95 percent. Though such an increase seems highly desirable, I am unable to understand the exact theory behind it. Could someone explain this?
    Thank you

    ------------------------------
    Uday Jha
    Rochester Institute of Technology
    ------------------------------


  • 2.  RE: Intercept Value

    Posted 03-28-2016 03:09

    Uday -  

    I saw that kind of result once, from a software program years ago, even though the intercept was small.  (I assume 1 is small in your case.)  I had to conclude it was a software glitch. (I checked, but could not get an answer, as I recall.) However, when you eliminate the intercept term, you are changing the model. Willett and Singer had a couple of articles some time ago, I think in The American Statistician, where they showed how R-squared is redefined for different models, such that the values are not directly comparable.  I have long felt that R-squared is much too volatile anyway, perhaps especially in multiple regression. 

    If you are in a situation where a zero for your regressor (or regressors) should yield a y that is also zero, but changing the intercept by a small amount seems to make such a difference, I don't think you can really use that.  You should stick to regression through the origin if that is what makes sense.  In a scatterplot, a result like the one you found should look suspicious.  At the very least, you can learn a great deal about your data by examining scatterplots. 

    When you resolve this, please let us know what happened.  

    Cheers -Jim

    ------------------------------
    James Knaub
    Lead Mathematical Statistician
    Retired



  • 3.  RE: Intercept Value

    Posted 03-28-2016 06:52

    Hi Uday,

     

    Are you sure that you changed a parameter value (in this case the intercept), rather than enabling the procedure to fit the intercept?  I suggest that you compare the two sets of model output.  If the change was allowing the intercept to be fitted, then your first model will report only one estimate (the slope), while the second will report two (see the quick sketch below).

     

    This could explain the increase in R-squared and adjusted R-squared.
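
    A quick way to see the difference described above (a minimal numpy sketch with made-up data, not from the original post):

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0, 5, 20)
    y = 4.0 + 1.5 * x + rng.normal(size=x.size)

    # Intercept fitted: the design matrix has a column of ones, so two
    # estimates come back (intercept and slope).
    X_with = np.column_stack([np.ones_like(x), x])
    coef_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)
    print(coef_with)      # [intercept, slope]

    # Intercept not fitted: a single column, so only the slope is estimated.
    X_without = x[:, None]
    coef_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)
    print(coef_without)   # [slope]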

     

    Cheers,

    PDM  

     

    ------------------------------
    Pat Mitchell, MA
    Statistical Science Director
    AstraZeneca Early Clinical Biometrics
    35 Gatehouse Dr
    Waltham, MA 02451-1215
    Office: 1-781-839-4982
    Cell: 1-302-420-3612 (Text OK)
    Home Office: 1-508-309-3813
    ------------------------------

     








  • 4.  RE: Intercept Value

    Posted 03-28-2016 08:46

    Hi Uday,

    It sounds like you are fitting a regression line to some data, estimating both the slope and intercept, and then changing the model so that the intercept is fixed at 1, and only the slope is estimated from the data.  This would be a variation of fitting a straight line through the origin, but instead of setting the intercept at zero, you have set it to be one.

    R-Square is the amount of variation explained by the model divided by the total variation: ModelSS / TotalSS. Since ErrorSS = TotalSS - ModelSS, R-Square can also be calculated as 1 - (ErrorSS/TotalSS), which provides a simpler explanation.  

    When the intercept is estimated from the data, TotalSS is calculated as the sum of squared differences of the Y values from their average, Sum((Yi - Y-bar)^2). This means that the R-Square for a model with both slope and intercept estimated from the data compares the estimated line to a horizontal line through Y-bar.  That is why the p-value for the slope is the same as the p-value for the regression model - it is really testing whether adding the slope parameter improves the fit compared to a horizontal line through the mean Y-bar.

    But when the intercept is fixed at zero (or one, as in your case), the "initial" model is not a line through Y-bar but instead a line through the specified intercept (i.e., a horizontal line through Y=0 or Y=1). So in this case both ModelSS and TotalSS change compared to their values when both intercept and slope are estimated.  And if the Y values are all far from zero, the TotalSS when the intercept is fixed will be MUCH LARGER than the TotalSS used for calculating R-Square when the intercept is estimated from the data.  If your model with the intercept fixed at 1 provides a reasonable fit, then its ErrorSS will be only slightly greater than the ErrorSS obtained when both slope and intercept are estimated, so the BIG DIFFERENCE in TotalSS shows up in the ModelSS, producing a much larger R-Square.

    So the reason for the great increase in R-Square when the intercept is specified to be zero (or one) is that the TotalSS used in the calculation changes to the sum of squared differences between each Y value and the specified intercept (zero or one). This quantity will generally be much larger than the sum of squared differences about Y-bar.  Since the ErrorSS doesn't change much, nearly all of the increase in TotalSS accumulates in ModelSS, thereby increasing ModelSS/TotalSS.

    For comparison of models in this case I sometimes like to re-calculate R-Square as 1 - (ErrorSS/TotalSS), with TotalSS calculated as the sum of squared differences from Y-bar, as would be done if both slope and intercept were estimated from the data (see the sketch below).
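
    A minimal numpy sketch of the bookkeeping described above (made-up data, not from the original post; any statistical package should show the same pattern):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(10, 20, 30)
    y = 5.0 + 2.0 * x + rng.normal(scale=9.0, size=x.size)  # true intercept far from 1

    # Model A: slope and intercept both estimated by ordinary least squares.
    slope_a, intercept_a = np.polyfit(x, y, 1)
    error_ss_a = np.sum((y - (intercept_a + slope_a * x))**2)
    total_ss_mean = np.sum((y - y.mean())**2)         # TotalSS about Y-bar
    r2_a = 1 - error_ss_a / total_ss_mean

    # Model B: intercept fixed at 1, only the slope estimated.
    intercept_b = 1.0
    slope_b = np.sum(x * (y - intercept_b)) / np.sum(x**2)  # least squares slope given the fixed intercept
    error_ss_b = np.sum((y - (intercept_b + slope_b * x))**2)
    total_ss_fixed = np.sum((y - intercept_b)**2)     # TotalSS about the fixed intercept
    r2_b_reported = 1 - error_ss_b / total_ss_fixed   # what many packages report
    r2_b_comparable = 1 - error_ss_b / total_ss_mean  # recomputed on the same scale as Model A

    print(r2_a, r2_b_reported, r2_b_comparable)
    # error_ss_b >= error_ss_a always, yet r2_b_reported is typically much larger
    # than r2_a here, because total_ss_fixed dwarfs total_ss_mean when the Y
    # values are far from 1.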

    ------------------------------
    Walter Morgan
    Master Scientist
    RJ Reynolds Tobacco Company



  • 5.  RE: Intercept Value

    Posted 03-28-2016 08:50

    That doesn't sound desirable at all.  Biasing one's model by manipulating the intercept just means that one thinks one's "guess" is a better reflection of reality than whatever the data might say.  Furthermore, by itself, an R-squared value of 95% doesn't explain anything.  The context of that R-squared and of the model, and probably several additional statistics related to the model, would be necessary for any sort of reasonable understanding.  

    Cheers,
    Joe

    ------------------------------
    Joseph Nolan
    Associate Professor of Statistics
    Director, Burkardt Consulting Center
    Northern Kentucky University
    Department of Mathematics & Statistics



  • 6.  RE: Intercept Value

    Posted 03-28-2016 11:42

    Dear Uday Jha,

          If I'm understanding correctly what you are saying, you first modeled the data with the intercept of the fitted ordinary least squares (OLS) regression line forced to be zero, and the line then fit as best it could given that restriction (which, judging by the R-squared value, was apparently not very good in your example).  However, by setting the intercept "values" all to one (i.e., including a column of ones in the design matrix), the coefficient for the intercept effectively places the intercept where it should be, provides the true best-fitting OLS regression line, and the R-squared value increases (substantially in your case).     

    ------------------------------
    Joseph J. Locascio, Ph.D.,
    Assistant Professor of Neurology,
    Harvard Medical School,
    and Statistician,
    Memory and Movement Disorders Units,
    Massachusetts Alzheimer's Disease Research Center,
    Neurology Dept.,
    Massachusetts General Hospital (MGH),
    Boston, Massachusetts 02114
    Phone: (617) 724-7192
    Email: JLocascio@partners.org



  • 7.  RE: Intercept Value

    Posted 03-28-2016 14:58

    Walter Morgan already gave a good explanation. Let me just add a few more things:

    1) Intuition seems to tell us that r^2 cannot increase when we fix one or more parameters in the linear model. After all, the sum of squared residuals (SSR) is minimized by an unconstrained OLS procedure, so any constraints should result in a larger SSR and hence a worse fit (note that the two different models are nested). The surprising fact is not that r^2 changes when we change the model; it's surprising that r^2 increases when we restrict our model space.

    2) The answer has to be that different definitions of r^2 have been used. Whereas there's no debate about the proper definition of r^2 in the unconstrained OLS setting, different software packages use different definitions of r^2 in the presence of constraints on parameters. Why is that? With unconstrained OLS, r^2 has several nice properties, e.g., it is always between 0 and 1, and it can be written in several equivalent forms. With constrained OLS, the different textbook formulas are no longer equivalent, and depending on which formula one uses, some of the nice properties of r^2 from unconstrained OLS are lost.

    3) Often, r^2 is computed as follows:  r^2 = 1 - SSR/sum((y_i - mean(y))^2).  If one uses this formula with constrained OLS, it follows our intuition: the SSR gets larger, therefore r^2 decreases. But there is a drawback: with constrained OLS, this formula is not guaranteed to be >= 0, so you can get a negative r^2. Some users might not like this. Yet this formula is actually used in software products, e.g. in Excel: if beta_0 is fixed when calculating a "trend line", a negative value of r^2 might result.

    4) If you want an r^2 which is always between 0 and 1 in the constrained case, you can look more closely into why r^2 actually is between 0 and 1 in the unconstrained case: it's because of the equation

    sum((y_i - mean(y))^2) = SSR + sum((hat(y_i) - mean(y))^2)

    This equation is a decomposition of the total sum of squares into two non-negative terms, and its derivation relies on the arithmetic of unconstrained OLS estimation. When the intercept is fixed at beta_0, this equation is no longer valid. However, if we substitute beta_0 for mean(y), we get a valid equation in the case of OLS estimation with a fixed intercept (it's not obvious; you have to work through the algebra):

    sum((y_i - beta_0)^2) = SSR + sum((hat(y_i) - beta_0)^2)

    From this it follows that  1 - SSR/sum((y_i - beta_0)^2)  is always between 0 and 1. One might thus take this formula as the definition of r^2 in the case of a fixed intercept, and this is actually done in several software packages. Is this better than the definition used in Excel, or than some other measure of model fit? You can argue about this indefinitely (as has been done in the literature). The sketch below illustrates both definitions.
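
    As a numerical illustration of points 3) and 4), here is a minimal numpy sketch (made-up data, not from the original post) comparing the two definitions of r^2 when the intercept is fixed:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 25)
    y = 50.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # y values far from the fixed intercept

    beta_0 = 1.0                                     # intercept fixed in advance
    slope = np.sum(x * (y - beta_0)) / np.sum(x**2)  # OLS slope given the fixed intercept
    y_hat = beta_0 + slope * x
    ssr = np.sum((y - y_hat)**2)                     # sum of squared residuals

    # Point 3): total SS taken about mean(y); can be (very) negative here.
    r2_mean_based = 1 - ssr / np.sum((y - y.mean())**2)

    # Point 4): the decomposition about beta_0 holds for this constrained fit,
    #   sum((y_i - beta_0)^2) = SSR + sum((hat(y_i) - beta_0)^2),
    # so the corresponding r^2 stays between 0 and 1.
    lhs = np.sum((y - beta_0)**2)
    rhs = ssr + np.sum((y_hat - beta_0)**2)
    r2_beta0_based = 1 - ssr / lhs

    print(r2_mean_based, r2_beta0_based, np.isclose(lhs, rhs))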

    (BTW, since you did not tell us which software you use, your software might implement yet another formula for r^2.) 

    Best,

    -Hans-

    ------------------------------
    Hans Kiesl
    Regensburg University of Applied Sciences
    Germany




  • 8.  RE: Intercept Value

    Posted 03-29-2016 02:57

    Hans -

    Your response should be very helpful.  Very nice.  I just have one comment, though. You wrote "The surprising fact is not that r^2 changes when we change the model; it's surprising that r^2 increases when we restrict our model space."  Perhaps that might be explained in the case of regression through the origin because in such a case, you know from the subject matter that when x=0, y must be 0.  That is additional information, so a better 'fit.' 

    Cheers - Jim

    ------------------------------
    James Knaub
    Lead Mathematical Statistician
    Retired



  • 9.  RE: Intercept Value

    Posted 03-30-2016 15:27

    I agree with James that "Perhaps that might be explained in the case of regression through the origin because in such a case, you know from the subject matter that when x=0, y must be 0."  However, I would add that, even when you "know the subject matter," it is worth fitting an unrestricted OLS regression to check whether the estimated intercept differs significantly from zero before forcing the intercept to zero (a quick sketch follows below).
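
    A minimal sketch of that check (hypothetical data, not from the thread), using the statsmodels package for the standard regression output:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = np.linspace(1, 10, 40)
    y = 2.0 * x + rng.normal(scale=1.5, size=x.size)  # generated with a zero intercept

    # Unrestricted fit: estimate both intercept and slope.
    X = sm.add_constant(x)                  # prepends the column of ones
    fit = sm.OLS(y, X).fit()
    print(fit.params)                       # [intercept, slope]
    print(fit.pvalues[0])                   # p-value for H0: intercept = 0

    # If the intercept is not significantly different from zero (and theory says
    # y should be 0 when x is 0), refitting through the origin is defensible.
    fit_origin = sm.OLS(y, x[:, None]).fit()
    print(fit_origin.params)                # slope only
    # Note: statsmodels reports the "uncentered" R-squared for the no-constant
    # model, i.e., total SS taken about zero rather than about the mean of y.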

    ------------------------------
    Jerome Yurow
    Retired