ASA Connect

  • 1.  Heteroscedasticity in regression versus unequal variances among data samples

    Posted 07-08-2019 20:45

    To me, the term "heteroscedasticity" refers only to estimated residuals for various predictions.  Various data samples with different variances would be a completely different phenomenon.  Yet I see people on ResearchGate routinely referring to the latter as "heteroscedasticity" also.  I asked a question about it on ResearchGate, and my concern was dismissed with a big "So what?"

    Well, for one thing, if you use "heteroscedasticity" as a key word in a search, you'd like to know what you are going to get.  For another, laughable considering how messy I am, this conflation offends my sense of order. 
      :-) 

    Comments? 

    Cheers - Jim

    ------------------------------
    James Knaub
    Retired Lead Mathematical Statistician
    Retired
    ------------------------------


  • 2.  RE: Heteroscedasticity in regression versus unequal variances among data samples

    Posted 07-09-2019 11:29
    The two archenemies, R. A. Fisher and Karl Pearson, agreed on at least one thing: scedasticity (or skedasticity) is, by definition, the spread in a variable.  Of course, homo means equal or constant and hetero means unequal or variable.  In the linear regression context, scedasticity refers to the variability of a dependent variable across the range of values of a predictor (independent) variable, as expressed by the residuals or variances.  The two expressions heteroscedasticity and heterogeneity of variances mean the same thing, which a physicist would call dispersion.  The same is true for homoscedasticity and homogeneity of variances.  In the regression context, the following two articles describe methods for examining scedasticity:

    1. T. Breusch and A. Pagan, Econometrica, 47(5): 1287-1294, 1979.
    2. RD Cook and S. Weisberg, Biometrika, 70(1): 1-10, 1983. (They generalized the Breusch-Pagan method).
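
    For readers who want to see what such a test computes, here is a minimal sketch of the original Breusch-Pagan score statistic for a single predictor: half the explained sum of squares from regressing the scaled squared residuals on x, referred to a chi-square distribution with 1 degree of freedom. The simulated data, the seed, and the function names are made up for illustration and come from neither paper; with several predictors one would normally rely on a package routine instead.

```python
import math
import random


def simple_ols(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, residuals)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, residuals


def breusch_pagan(x, y):
    """Return (score statistic, chi-square(1) p-value) for variance depending on x."""
    n = len(x)
    _, _, e = simple_ols(x, y)
    sigma2 = sum(ei ** 2 for ei in e) / n        # ML estimate of the error variance
    g = [ei ** 2 / sigma2 for ei in e]           # scaled squared residuals, mean 1
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xg = sum((xi - x_bar) * (gi - 1.0) for xi, gi in zip(x, g))
    explained_ss = s_xg ** 2 / s_xx              # explained SS from regressing g on x
    lm = explained_ss / 2.0
    p_value = math.erfc(math.sqrt(lm / 2.0))     # upper tail of chi-square with 1 df
    return lm, p_value


# Hypothetical data: same mean line, constant versus x-proportional error spread.
rng = random.Random(0)
x = [1 + 9 * i / 199 for i in range(200)]
y_homo = [2 + 3 * xi + rng.gauss(0, 2.0) for xi in x]
y_hetero = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

lm_homo, p_homo = breusch_pagan(x, y_homo)
lm_hetero, p_hetero = breusch_pagan(x, y_hetero)
print("constant spread:     LM =", round(lm_homo, 2), " p =", round(p_homo, 4))
print("spread grows with x: LM =", round(lm_hetero, 2), " p =", round(p_hetero, 4))
```

    A large statistic (small p-value) suggests that the residual variance changes with x; it says nothing about whether different data samples have different variances outside a regression.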

    In general, linear regression techniques are robust to slight-to-moderate heteroscedasticity.  If there is a serious concern, one might try a simple logarithmic or Box-Cox transformation to alleviate the problem.  Also, SAS, BMDP, and R+ routines for regression should be able to help in building models under heteroscedasticity.

    Hope it clarifies the situation.

    ------------------------------
    Ajit K. Thakur, Ph.D.
    Retired Statistician
    ------------------------------



  • 3.  RE: Heteroscedasticity in regression versus unequal variances among data samples

    Posted 07-09-2019 14:05
    In response to "In general, linear regression techniques are robust to slight-to-moderate heteroscedasticity," I'd say that is usually true for prediction, but not the estimated variance of the prediction error.  In response to heteroscedasticity being a "problem," I'd say it is a natural phenomenon.  But this does not address my issue.  I have worked a good deal with heteroscedasticity. My issue is what I consider the misuse of the term heteroscedasticity, as described in my original post. Thank you.  

    ------------------------------
    James Knaub
    Retired Lead Mathematical Statistician
    ------------------------------



  • 4.  RE: Heteroscedasticity in regression versus unequal variances among data samples

    Posted 07-09-2019 11:41
    I've seen enough shoddy statistical advice/information on ResearchGate that I don't consider anything I read in the Q/A there reliable. (:
    Vince

    ------------------------------
    Vincent Staggs, PhD
    Research Faculty, Biostatistics & Epidemiology Core, Children's Mercy Hospitals & Clinics;
    Associate Professor, School of Medicine, University of Missouri-Kansas City
    ------------------------------



  • 5.  RE: Heteroscedasticity in regression versus unequal variances among data samples

    Posted 07-09-2019 15:34
    I'm not sure my issue was understood here.

    This is an elementary explanation of heteroscedasticity that I wrote for Sage:
     https://www.researchgate.net/publication/262972023_HETEROSCEDASTICITY_AND_HOMOSCEDASTICITY
    (Today I would redo the graph to attempt to clearly show that estimated residuals are measured vertically.)

    This explains the nature of heteroscedasticity and why it should be expected:
    https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity

    And other issues including measuring heteroscedasticity are addressed here:
    https://www.researchgate.net/project/OLS-Regression-Should-Not-Be-a-Default-for-WLS-Regression

    So we know what heteroscedasticity addresses.  But none of this is addressing the issue I raised.


    My issue is that many people look at multiple data samples with different variances, nothing to do with regression, and use the term "heteroscedasticity" for that.  I have an issue with that.  This is a different kind of variance.  Yes, the different data sets may have different estimates of population variance, but that is only a superficial relationship to the word "heteroscedasticity," and it has nothing to do with what should actually be meant by it.  At the least, I do not think the same term should be used for these two different things.  Yes, in multilevel models the issue is more complex, but when not referring to regression, I would not use the word "heteroscedasticity."

    Thank you.

    ------------------------------
    James Knaub
    Retired Lead Mathematical Statistician
    Retired
    ------------------------------



  • 6.  RE: Heteroscedasticity in regression versus unequal variances among data samples

    Posted 07-10-2019 14:59
    Yes, James, they are the same thing, at least within a limited context.
    Consider 1-way completely randomized ANOVA or even a 2-group "independent samples" test of means, the so-called t-test.

    These can be done as linear models using indicator (dummy) or effect (deviation) coding to create the IVs for group membership.  In the most common/simple of those applications, these regression models produce yhats that are the observed group means for the outcome variable.  Therefore, the regression residuals are the differences of the Y values from their yhat values.  These difference quantities are the same thing that is computed as a "within-group" deviation in the ANOVA or t-test context.  And it is these within-group deviations that produce the "cell" variances that are assumed to be homogeneous.

    One perspective that helps is that in a 2-group comparison, there are only two yhat values generated by the regression.  These are the two group means.  Visualize the standard residual vs. yhat diagnostic plot (yhat on the X axis).  There would be only two vertical "stacks" of points visible, located along the X axis at the two yhat values.

    So, homogeneity of within group variance (HOV) is the same as homogeneity of residual variance (homoscedasticity).  The direct parallel may not extend to other contexts where the term homoscedasticity is used, but I would bet that the ResearchGate conversations are not focused on anything but the parallel I outlined here.  Perhaps a way to say it is that in ANOVA type models, homogeneity of variance is always the same thing as homoscedasticity, but other types of homoscedasticity may also be a characteristic of different prediction models.
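
    The parallel above can be checked numerically with a minimal sketch: a two-group comparison is fitted as a regression on a 0/1 indicator, and the fitted values and residual variances are compared with the group means and within-group variances. The two small samples and all variable names are invented for illustration.

```python
import math
from statistics import mean, variance

# Two hypothetical samples with clearly different spreads.
group_a = [4.1, 5.0, 5.9, 4.6, 5.4]
group_b = [7.0, 11.5, 3.2, 9.8, 13.5]

# Stack the samples and code group membership with an indicator (dummy) variable.
x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

# Ordinary least squares for y = intercept + slope * x.
x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Split the regression residuals by group (the two "stacks" in the residual plot).
resid_a = [r for r, xi in zip(residuals, x) if xi == 0]
resid_b = [r for r, xi in zip(residuals, x) if xi == 1]

# The only two fitted values are the two group means ...
print("yhat for group a:", intercept, " group mean:", mean(group_a))
print("yhat for group b:", intercept + slope, " group mean:", mean(group_b))
# ... and the residual variance within each stack is the within-group variance.
print("residual variance a:", variance(resid_a), " sample variance a:", variance(group_a))
print("residual variance b:", variance(resid_b), " sample variance b:", variance(group_b))
```

    In this setup, unequal variances between the two samples and heteroscedastic regression residuals are literally the same numbers, which is the limited context in which the two usages coincide.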

    Hope that helps.

    Bruce
    University at Albany

    ------------------------------
    Bruce Dudek
    Professor of Psychology
    University of Albany
    ------------------------------



  • 7.  RE: Heteroscedasticity in regression versus unequal variances among data samples

    Posted 07-10-2019 16:28
    Thanks Bruce. 

    To me it is important to distinguish between y given a value of a continuous variable x (or a function of x's, say predicted y) on one hand, and a mean y from a data set on the other.  I concentrate on heteroscedasticity, not homoscedasticity, which I think is often more artificially enforced, but I get your point.  Thank you for addressing the issue.  That was exactly what I was looking for.  I appreciate your input.

    I do see a lot of people confusing data with estimated residuals, and thinking the data always need to be 'normal' to do a regression, rather than looking at the estimated residuals, where normality is helpful.  But I see the context that you are pointing out.  Thanks.

    Cheers - Jim

    ------------------------------
    James Knaub
    Retired Lead Mathematical Statistician
    Retired
    ------------------------------



  • 8.  RE: Heteroscedasticity in regression versus unequal variances among data samples

    Posted 07-10-2019 19:10
    I think a difficulty with the approach you outlined is that heteroskedasticity of residuals is also called "conditional heteroskedasticity," implying that unconditional heteroskedasticity is something different. And of course the underlying Greek just means non-uniform (hetero) variation (skedasticity), which could apply equally well to either.
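
    For reference, one way to write down the two usages under discussion is sketched below. The notation is an assumption introduced here, not taken from any post: i indexes observations in a regression, g indexes separate samples, and j indexes observations within a sample.

```latex
% Regression (conditional) usage: the error variance depends on the predictor.
\operatorname{Var}(\varepsilon_i \mid x_i) = \sigma^2(x_i),
\qquad \sigma^2(x_i) \text{ not constant in } x_i .

% Separate-samples usage: each sample has its own variance.
\operatorname{Var}(y_{gj}) = \sigma_g^2, \qquad g = 1, \dots, G,
\qquad \sigma_g^2 \text{ not all equal} .

% If x is simply the label of the sample, the first form reduces to the second:
\operatorname{Var}(y \mid x = g) = \sigma_g^2 .
```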

    In general, I think it is a weakness of statisticians that we tend to be more pedantic than necessary about definitions. We tend to take whatever we were taught or first learned and then treat other usages as wrong, even upsetting. Like grammarians and others with these issues, we sometimes make fine distinctions that may not serve much purpose other than to identify us as educated.

    In general, doing applied work requires being able to work with others outside one's field, or in one's field but in a different subsector of it. It will often happen that two different people will use the same word but each give it a slightly different meaning. Statistics is a very broad discipline with many subfields. Semantic changes can occur through essentially random processes. So when we observe semantic variation, we shouldn't be surprised. We are seeing yet another application of our discipline's theory.

    In general flexibility and open-mindedness in communication - trying to neutrally understand what the other person is trying to say - can be critical to progress. A mindset reflecting understanding and appreciation for variation can help in this. A too-developed sense of order when others use a word somewhat differently can be an impediment, just as in any other situation where we use a model that makes assumptions that reflect a disconnect from what we observe. The effective applied statistician is able to explain variation to a wide variety of people, accepting them where they are and working with them from there.



    ------------------------------
    Jonathan Siegel
    Deputy Director Clinical Statistics
    ------------------------------



  • 9.  RE: Heteroscedasticity in regression versus unequal variances among data samples

    Posted 07-11-2019 08:44
    All,

    The discussion is interesting. If you want to research the intersection between sampling variability, sample design, and model heteroscedasticity, search for model-assisted sampling. Richard Royall, Carl Särndal, Roger Wright, and others looked at optimal or near-optimal sample designs under heteroscedasticity models. This includes model-assisted construction of strata to minimize model-based variance, but the strata can also be highly efficient under classical survey statistics variance estimation.

    Good luck.

    ------------------------------
    Alan Roshwalb
    Senior Vice President
    Ipsos
    ------------------------------



  • 10.  RE: Heteroscedasticity in regression versus unequal variances among data samples

    Posted 07-11-2019 13:14
    Alan, turning discussion to that, note that in
    Särndal, C-E, Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, Springer-Verlag, section 7.3.3, notably page 254, we see optimal unequal probability sampling with regard to heteroscedasticity and an accordingly weighted ratio estimator.  But Royall would probably use balanced sampling and a strictly model-based ratio estimator approach.  The resulting samples would likely be similar.
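
    As a small illustration of how a ratio estimator ties to a heteroscedasticity model, the sketch below assumes a regression through the origin, y_i = beta * x_i + e_i, with error variance proportional to x_i. That variance model is one common choice assumed here and is not necessarily the one treated on the cited page. Under it, weighted least squares with weights 1/x_i gives the slope sum(y) / sum(x), the classical ratio estimator. The data and function name are hypothetical.

```python
import math


def wls_slope_through_origin(x, y, weights):
    """Weighted least squares slope for the no-intercept model y = beta * x."""
    numerator = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    denominator = sum(w * xi * xi for w, xi in zip(weights, x))
    return numerator / denominator


# Hypothetical sample: x is a size measure, y the variable of interest.
x = [12.0, 30.0, 45.0, 80.0, 150.0]
y = [10.5, 33.0, 41.0, 88.0, 139.0]

# Weights are inverse to the assumed error variance, Var(e_i) proportional to x_i.
weights = [1.0 / xi for xi in x]

beta_wls = wls_slope_through_origin(x, y, weights)
beta_ratio = sum(y) / sum(x)  # classical ratio estimator of the slope

print("WLS slope with weights 1/x:", beta_wls)
print("ratio estimator sum(y)/sum(x):", beta_ratio)
print("agree:", math.isclose(beta_wls, beta_ratio))
```

    Changing the assumed variance function changes the weights and therefore the estimator; for example, equal weights give the ordinary least squares slope through the origin instead.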

    ------------------------------
    James Knaub
    Retired Lead Mathematical Statistician
    ------------------------------