ASA Connect


CLT rule of thumb n>=30: not quite true!

  • 1.  CLT rule of thumb n>=30: not quite true!

    Posted 01-11-2017 14:11
    The central limit theorem (CLT) is one of the 'central' arguments (pun intended) for treating the sample mean as approximately Normal. However, the theorem does not say when a sample is large enough. It can't, because how large is large enough depends on the distribution of the population.

    Traditionally, the rule of thumb has been that a sample size greater than or equal to 30 suffices in most cases. Based on a quick search online, the n>=30 rule of thumb appears to be referenced more or less blindly in academia and research. At the very least, this rule of thumb should come with a warning that the more asymmetric a population distribution is, the larger the sample size needed for the CLT to apply. For example, the Normal approximation applies to a random variable with a binomial distribution under two conditions: np and n(1-p) are both greater than or equal to 10 (I prefer this more conservative version). What these two conditions essentially do is check that the binomial distribution is not too left or right skewed. If the probability of success is very high, say 0.999, then a sample size of 10,000 is needed so that n(1-p)=10. Similar examples can be found if p is too small.

    Furthermore, simulations show that for an exponential distribution, or even a less skewed gamma distribution (say with shape parameter 5 and scale parameter 1), the sample mean at n=30 is still not well approximated by a Normal distribution. In confidence intervals and hypothesis tests, the error incurred by using a Normal distribution for the sample mean may be compensated for in two-sided scenarios, but not necessarily in one-sided scenarios.
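
A minimal simulation sketch of the kind described above, using only the Python standard library. The number of replications, the seed, the helper names, and the choice to standardize with the known population mean and standard deviation are all assumptions made for this illustration, not details from the post. If the Normal approximation were accurate at n=30, the skewness of the standardized sample mean would be near 0 and each tail beyond ±1.645 would hold about 5% of the draws.

```python
import math
import random

def standardized_means(draw, mu, sigma, n=30, reps=20000, seed=1):
    """Simulate `reps` sample means of size n and standardize them
    with the known population mean and standard deviation."""
    rng = random.Random(seed)
    scale = sigma / math.sqrt(n)
    return [(sum(draw(rng) for _ in range(n)) / n - mu) / scale
            for _ in range(reps)]

def skewness(z):
    """Moment-based sample skewness."""
    m = sum(z) / len(z)
    s2 = sum((x - m) ** 2 for x in z) / len(z)
    m3 = sum((x - m) ** 3 for x in z) / len(z)
    return m3 / s2 ** 1.5

def tail_rates(z, crit=1.645):
    """Fraction of values below -crit and above +crit (nominal 5% each)."""
    lower = sum(x < -crit for x in z) / len(z)
    upper = sum(x > crit for x in z) / len(z)
    return lower, upper

# (sampler, population mean, population standard deviation)
populations = {
    "exponential(rate=1)": (lambda rng: rng.expovariate(1.0), 1.0, 1.0),
    "gamma(shape=5, scale=1)": (lambda rng: rng.gammavariate(5.0, 1.0),
                                5.0, math.sqrt(5.0)),
}

for name, (draw, mu, sigma) in populations.items():
    z = standardized_means(draw, mu, sigma)
    lower, upper = tail_rates(z)
    print(f"{name}: skewness={skewness(z):.3f}, "
          f"lower tail={lower:.4f}, upper tail={upper:.4f}")
```

The two tail rates drift in opposite directions, which is why the error can roughly cancel in a two-sided procedure but not in a one-sided one.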

    It is possible to put more emphasis on the skewness of the distribution when determining 'large enough n'. Boos and Hughes-Oliver (2000) present a way to do this for confidence intervals.



    I was not able to find much about this online or on AMSTAT education tools. If someone has further references or comments on this they will be much appreciated.

    --
    Roberto Rivera, PhD
    College of Business
    University of Puerto Rico, Mayaguez
    President of CAAEPR


  • 2.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 00:39
    I discuss the effect of skewness on the accuracy of t procedures in

    What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum, The American Statistician.  

    It turns out that you need n > 5000 before t procedures are "reasonably accurate" (to within 10%) for inferences for a one-sample mean when populations are exponential. There are much more accurate procedures, as well as bootstrap diagnostics that tell you how far off you are for a given sample.
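
As a rough illustration of the point about t procedures and skewed populations, the sketch below estimates how often each one-sided 95% t bound misses the true mean when sampling from an exponential population at n=30. It is not taken from the paper cited above: the sample size, replication count, seed, and function name are assumptions, and the critical value 1.699 is the tabulated 95th percentile of Student's t with 29 degrees of freedom, hard-coded to avoid a dependency.

```python
import math
import random

T_CRIT_29 = 1.699  # tabulated 95th percentile of Student's t, 29 df

def one_sided_miss_rates(n=30, reps=20000, seed=2, t_crit=T_CRIT_29):
    """Return (lower-bound miss rate, upper-bound miss rate) for one-sided
    95% t bounds on the mean of an exponential(rate=1) population.

    Each rate would be about 0.05 if the t procedure were accurate.
    """
    rng = random.Random(seed)
    mu = 1.0  # true mean of exponential(rate=1)
    lower_bound_misses = 0
    upper_bound_misses = 0
    for _ in range(reps):
        x = [rng.expovariate(1.0) for _ in range(n)]
        mean = sum(x) / n
        s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))
        half = t_crit * s / math.sqrt(n)
        if mean - half > mu:   # lower bound lands above the true mean
            lower_bound_misses += 1
        if mean + half < mu:   # upper bound lands below the true mean
            upper_bound_misses += 1
    return lower_bound_misses / reps, upper_bound_misses / reps

low_miss, up_miss = one_sided_miss_rates()
print(f"lower bound misses: {low_miss:.4f} (nominal 0.05)")
print(f"upper bound misses: {up_miss:.4f} (nominal 0.05)")
```

With a right-skewed population the upper bound tends to miss too often and the lower bound too rarely, so the two one-sided errors are far from the nominal 5% even when their sum looks less alarming.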


    ------------------------------
    Tim Hesterberg
    Senior Quantitative Analyst
    Google
    ------------------------------



  • 3.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-13-2017 08:16

    I have argued that the asymmetry of a distribution does not affect the coverage rate of a t-based two-sided interval as it clearly does that of a one-sided interval, because there are analogous two-sided intervals for the mirror image of an asymmetric distribution and for the symmetric distribution formed as half the sum of an asymmetric distribution and its mirror image. Is that wrong?  

     

    Btw, for binomial proportions, I recommend using two-sided Wilson (aka score) intervals rather than Wald intervals.  One can show that the logistic transformation works (when it works) because it approximates a Wilson interval.
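
For readers who have not seen the two intervals side by side, here is a small sketch using the standard textbook formulas for the Wald and Wilson (score) intervals. The function names and the example counts are made up for illustration; only the Python standard library is used.

```python
import math
from statistics import NormalDist

def wald_interval(x, n, level=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(x, n, level=0.95):
    """Wilson (score) interval, obtained by inverting the score test."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = x / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n
                                   + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# Hypothetical counts chosen to show behavior near the boundary.
for x, n in ((0, 10), (1, 10), (5, 10)):
    print(x, n, "Wald:", wald_interval(x, n), "Wilson:", wilson_interval(x, n))
```

With 0 successes in 10 trials the Wald interval collapses to a single point at 0, and with 1 success its lower limit is negative, while the Wilson interval stays inside [0, 1] with positive width.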

     

    Phil Kott

    RTI International






  • 4.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-13-2017 09:41
    The influence of skewness on the estimate of the mean is a problem that has long been addressed in hydrologic research.  A lot of material has been published in AGU's Water Resources Research.  

    For parametric analysis, suggest you might also look at discussions in a reference like 
    Techniques of Water-Resources Investigations of the United States Geological Survey Book 4, Hydrologic Analysis and Interpretation
    https://pubs.usgs.gov/twri/twri4a3/pdf/twri4a3-new.pdf

    Violations of normal assumptions are one of the reasons that non-parametric methods are also frequently used in hydrology.  
    Suggest you might look at the still very useful classical reference by W.J. Conover, Practical Nonparametric Statistics (1999, 3rd Edition). 

    By the way, not only skewness but other attributes such as serial correlation, even when sampling from a symmetric distribution, have a large effect on the necessary sample size.  
    Suggest you might look at a classic, and still very relevant, paper by N.C. Matalas and W.B. Langbein, "Information content of the mean", Journal of Geophysical Research, 1962. 
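
One common way to put a number on the serial correlation remark is an "effective sample size". The sketch below assumes a stationary lag-one autoregressive, AR(1), process with lag-one correlation rho; that model is an assumption made here for illustration and is not stated in the post. Under it, the variance of the mean of n observations is the independent-sample variance multiplied by 1 + 2 * sum over k of (1 - k/n) * rho^k, so n correlated observations carry about as much information about the mean as n divided by that factor independent ones.

```python
def effective_sample_size(n, rho):
    """Effective number of independent observations for estimating the mean
    of n equally spaced observations from a stationary AR(1) process.

    Assumes lag-k autocorrelation rho**k (hypothetical model choice).
    """
    if n < 1 or not -1 < rho < 1:
        raise ValueError("need n >= 1 and -1 < rho < 1")
    inflation = 1.0 + 2.0 * sum((1.0 - k / n) * rho ** k for k in range(1, n))
    return n / inflation

for rho in (0.0, 0.2, 0.5, 0.8):
    print(rho, round(effective_sample_size(100, rho), 1))
```

Even a moderate positive correlation cuts the effective sample size to a fraction of the nominal one, regardless of whether the marginal distribution is symmetric.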

    ____________________________________
    J.M. Landwehr
    The Sumanim Group 
    ____________________________________







  • 5.  RE: CLT rule of thumb n>=30: not quite true!
