ASA Connect

CLT rule of thumb n>=30: not quite true!

  • 1.  CLT rule of thumb n>=30: not quite true!

    Posted 01-11-2017 14:11
    The central limit theorem (CLT) is one of the 'central' arguments (pun intended) for claiming that the sample mean has approximately a Normal distribution. However, the central limit theorem does not state when a sample is large enough. It cannot, because how large is large enough depends on the distribution of the population.

    Now, traditionally the rule of thumb of n greater than or equal to 30 has been established, on the argument that it suffices in most cases. Based on a quick search online, it appears that the n>=30 rule of thumb is generally referenced blindly in academia and research. At the very least, this rule of thumb should come with a warning that the more asymmetric a population distribution is, the larger the sample size needed for the CLT to apply. For example, the Normal approximation applies to a random variable with a binomial distribution under two conditions: np and n(1-p) are both greater than or equal to 10 (I prefer this more conservative version). What these two conditions do is essentially check that the binomial distribution is not too left or right skewed. If the probability of success is very high, say 0.999, then a sample size of 10,000 is needed so that n(1-p)=10. Similar examples can be found if p is too small.
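
    The arithmetic behind the two binomial conditions can be sketched in R as follows. The helper name min_n_normal_approx and the threshold argument k are illustrative assumptions, not something from the original post; k = 10 is the conservative threshold mentioned above.

```r
# Smallest n for which both n*p >= k and n*(1-p) >= k hold.
# min_n_normal_approx is a hypothetical helper; k = 10 is the conservative threshold.
min_n_normal_approx <- function(p, k = 10) {
  stopifnot(p > 0, p < 1)
  # round() guards against floating-point noise, e.g. 10 / (1 - 0.9)
  # evaluating to a value marginally above 100.
  ceiling(round(k / min(p, 1 - p), 8))
}

p_values <- c(0.5, 0.9, 0.99, 0.999)
data.frame(p = p_values, min_n = sapply(p_values, min_n_normal_approx))
```

    The required n is driven entirely by the rarer of the two outcomes, which is why the table grows so quickly as p moves toward 0 or 1.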

    Furthermore, it can be shown through simulations that in the case of an exponential distribution, or even a less skewed gamma distribution (say with shape parameter 5 and scale parameter 1), the sample mean when n=30 is not well approximated by a Normal distribution. In the case of confidence intervals and hypothesis testing, the error incurred when using a Normal distribution for the sample mean may be compensated for in two-sided scenarios, but not necessarily in one-sided scenarios.
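
    One way to run such a simulation is sketched below. It is only an illustration under stated assumptions: the helper tail_errors, the seed, and the number of replications are arbitrary choices, and the interval used is the usual nominal 95% t interval. It estimates separately how often the interval misses the true mean on each side, which is what matters in the one-sided case.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Estimate the miss rate on each side of a nominal 95% two-sided t interval.
# rdist draws a sample of size n; mu is the true population mean.
# tail_errors is a hypothetical helper written for this illustration.
tail_errors <- function(rdist, mu, n = 30, nsim = 20000) {
  tcrit <- qt(0.975, df = n - 1)
  miss <- replicate(nsim, {
    x <- rdist(n)
    half_width <- tcrit * sd(x) / sqrt(n)
    c(lower_above_mu = mean(x) - half_width > mu,
      upper_below_mu = mean(x) + half_width < mu)
  })
  rowMeans(miss)  # each rate would be 0.025 if the approximation were exact
}

exp_err   <- tail_errors(function(n) rexp(n, rate = 1), mu = 1)
gamma_err <- tail_errors(function(n) rgamma(n, shape = 5, scale = 1), mu = 5)
print(rbind(exponential = exp_err, gamma_5_1 = gamma_err))
```

    For right-skewed populations the two rates are unequal, with the upper limit falling below the true mean more often than the lower limit falls above it. The two-sided total hides part of that imbalance; a one-sided bound does not.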

    It is possible to put more emphasis on the dependence on the skewness of the distribution when determining 'large enough n'. Boos and Hughes-Oliver (2000) present a way to do this for confidence intervals.



    I was not able to find much about this online or in the AMSTAT education tools. Further references or comments would be much appreciated.

    --
    Roberto Rivera, PhD
    College of Business
    University of Puerto Rico, Mayaguez
    President of CAAEPR


  • 2.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 00:39
    I discuss the effect of skewness on the accuracy of t procedures in

    What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum, The American Statistician.  

    It turns out that you need n > 5000 before t procedures are "reasonably accurate" (to within 10%) for inferences for a one-sample mean when populations are exponential. There are much more accurate procedures, and bootstrap diagnostics that tell how far off you are for a given sample.


    ------------------------------
    Tim Hesterberg
    Senior Quantitative Analyst
    Google
    ------------------------------



  • 3.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-13-2017 08:16

    I have argued that the asymmetry of a distribution does not affect the coverage rate of a t-based two-sided interval as it clearly does a one-sided interval, because there are analogous two-sided intervals for the mirror-image of an asymmetric distribution and for the symmetric distribution composed of half the sum of an asymmetric and its mirror-image distributions. Is that wrong?  

     

    Btw, for binomial proportions, I recommend using two-sided Wilson (aka score) intervals rather than Wald intervals.  One can show that the logistic transformation works (when it works) because it approximates a Wilson interval.
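
    For readers who want to compare the two intervals, here is a minimal R sketch. The helpers wald_ci and wilson_ci are written for illustration only and are not from this post; x is the number of successes in n trials.

```r
# Wald interval: estimate +/- z * estimated standard error.
wald_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  p + c(-1, 1) * z * sqrt(p * (1 - p) / n)
}

# Wilson (score) interval: inverts the score test instead of plugging in p-hat.
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  center <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  center + c(-1, 1) * half
}

wald_ci(1, 30)    # lower limit is negative
wilson_ci(1, 30)  # both limits stay inside [0, 1]
```

    Base R's prop.test(x, n, correct = FALSE) reports the same score interval, so it can serve as a cross-check on wilson_ci.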

     

    Phil Kott

    RTI International






  • 4.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-13-2017 09:41
    The influence of skewness on the estimate of the mean is a problem that has long been addressed in hydrologic research. A lot of material has been published in AGU's Water Resources Research.

    For parametric analysis, you might also look at the discussions in a reference like
    Techniques of Water-Resources Investigations of the United States Geological Survey Book 4, Hydrologic Analysis and Interpretation
    https://pubs.usgs.gov/twri/twri4a3/pdf/twri4a3-new.pdf

    Violations of normal assumptions are one of the reasons that non-parametric methods are also frequently used in hydrology.  
    You might look at the still very useful classical reference by W.J. Conover, Practical Nonparametric Statistics (1999, 3rd Edition).

    By the way, not only skewness but other attributes such as serial correlation, even when sampling from a symmetric distribution, have a large effect on the necessary sample size.
    You might look at a classic, and still very relevant, paper by N.C. Matalas and W.B. Langbein, "Information content of the mean", Journal of Geophysical Research, 1962.
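
    To give a feel for the size of the serial-correlation effect, the sketch below computes an effective sample size for the mean. It assumes a stationary first-order autoregressive, AR(1), series with lag-1 correlation rho, which is an assumption made for illustration and not a claim about any particular data set; effective_n is a hypothetical helper.

```r
# Effective sample size for the mean of a stationary AR(1) series.
# Var(mean) = (sigma^2 / n) * (1 + 2 * sum_{k=1}^{n-1} (1 - k/n) * rho^k),
# so the n correlated values carry as much information about the mean
# as n / (that inflation factor) independent values.
effective_n <- function(n, rho) {
  k <- seq_len(n - 1)
  n / (1 + 2 * sum((1 - k / n) * rho^k))
}

effective_n(30, 0)     # independent data: 30
effective_n(30, 0.5)   # positively correlated data: fewer than 30
effective_n(100, 0.5)
```

    For large n the result approaches n * (1 - rho) / (1 + rho), so with rho = 0.5 roughly two thirds of the nominal sample size is lost.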

    ____________________________________
    J.M. Landwehr
    The Sumanim Group 
    ____________________________________







  • 5.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-13-2017 15:24
    There should be no unfortunate consequences of the slowness of the CLT to convert from exponential to near-normal because a sample with n=30 is easily big enough for the analyst to be able to discern its empirical distribution. If the distribution is wildly asymmetric then the analysis should not be based on a normal distribution without data transformation. Teach that rather than teaching archaic and flawed rules of thumb.

    None of our students or colleagues should be so naive as to follow an n>30 rule of thumb blindly, or so careless as to analyse data with which they are not sufficiently familiar to tell the difference between an exponential and a normal distribution!

    Michael Lew
    Department of Pharmacology
    The University of Melbourne




  • 6.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 03:16
    Not sure if this is fact or factoid, but I remember reading that the n>30 myth came about when the printer of a textbook published soon after Gosset's work on Student's t said the table of critical values could have lines for only 30 degrees of freedom. Thereafter, students assumed 30 was enough to assume normality. Transfer of the myth to binomial sampling was easy. True or not, it's too good a story to lose.

    ------------------------------
    Robert Lovell
    Retired
    ------------------------------



  • 7.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 07:49
    In my opinion the CLT is far less useful for everyday practice than we have taught students. In effect it does not work if you have to estimate the variance (which is always the case), and the sample size must sometimes be enormous to achieve good accuracy in confidence interval coverage, as demonstrated in "Is there a reliable nonparametric confidence interval for the mean of a skewed distribution?" using this R code:

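    # Simulate the tail errors of the nominal 95% t interval for the mean
    n    <- 20000            # sample size; also try 150 or 25
    nsim <- 10000            # number of simulated samples
    mul  <- 0; sdl <- 1.65   # mean and SD on the log scale
    dist <- c('normal', 'lognormal')[2]
    switch(dist,
           normal    = {g <- function(x) x; mu <- mul},
           lognormal = {g <- exp; mu <- exp(mul + sdl * sdl / 2)})
    count <- c(lower = 0, upper = 0)   # misses on each side of the interval
    z <- qt(0.975, n - 1)              # t critical value

    for (i in 1 : nsim) {
      x  <- g(rnorm(n, mul, sdl))
      ci <- mean(x) + c(-1, 1) * z * sqrt(var(x) / n)
      count[1] <- count[1] + (ci[1] > mu)   # lower limit above the true mean
      count[2] <- count[2] + (ci[2] < mu)   # upper limit below the true mean
    }
    count / nsim   # both proportions should be 0.025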

    With the log-normal distribution used above and n=20000 the confidence coverage is still bad  (left tail error 0.012, right 0.047 when both should be 0.025).

    Part of the problem is that when the distribution is skewed the variance estimate is no longer independent of the mean estimate, so convergence to the t distribution no longer works as assumed.
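
    The dependence between the two estimates can be checked with a short simulation. This is only a sketch: the helper mean_var_cor, the seed, n = 30, and the number of replications are arbitrary illustrative choices; the log-normal parameters match the code above.

```r
set.seed(2)  # arbitrary seed so the run is repeatable

# Correlation, across simulated samples, between the sample mean and the
# sample variance. mean_var_cor is a hypothetical helper for illustration.
mean_var_cor <- function(rdist, n = 30, nsim = 5000) {
  est <- replicate(nsim, {
    x <- rdist(n)
    c(mean(x), var(x))
  })
  cor(est[1, ], est[2, ])
}

cor_normal    <- mean_var_cor(function(n) rnorm(n))
cor_lognormal <- mean_var_cor(function(n) rlnorm(n, meanlog = 0, sdlog = 1.65))
c(normal = cor_normal, lognormal = cor_lognormal)
```

    For normal data the two estimates are independent, so the first value should be close to zero. For the log-normal the same large observations inflate both the mean and the variance, so the correlation is clearly positive.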

    ------------------------------
    Frank Harrell
    Vanderbilt University School of Medicine
    ------------------------------



  • 8.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 13:57
    Whether the t-table-based account for the "rule" is true or not, it does point to the heart of a fallacy in relying blindly on the rule.

    The t-table assumes that the underlying population is normally distributed; and the convergence of the t- to the z-table's probabilities (at "n = about 30") is conditional on that assumption. At about n = 30, I suppose you could say the z-table is about as good to use as the t-table.

    But it's still an open question whether the true sampling distribution (for a very asymmetrical population) is close enough to normal to use the z-table either, in that case.
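
    The convergence of the t critical values to the z value can be seen directly in R. This small sketch uses only base functions; the particular degrees of freedom listed are arbitrary choices for illustration.

```r
# Two-sided 95% critical values: Student's t for several degrees of freedom
# next to the standard normal value.
df <- c(5, 10, 29, 100, 1000)
crit <- data.frame(df = df, t_crit = qt(0.975, df), z_crit = qnorm(0.975))
print(crit, digits = 4)
```

    At 29 degrees of freedom the t value is about 2.05 against 1.96 for z. That comparison assumes a normal population throughout, so it says nothing about how a skewed population behaves at n = 30.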

     
     

    ------------------------------
    William (Bill) Goodman
    Professor, Faculty of Business and Information Technology
    University of Ontario Institute of Technology
    ------------------------------



  • 9.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 14:01
    This rule of thumb is not true.
    The sample size requirement increases with skewness.
    For highly skewed data sets (e.g., with sd of logged data > 2), samples of size 70 or 100 may not be large enough to use
    the CLT, let alone the rule-of-thumb size of 30.

    Environmental projects abound with these kinds of skewed data sets.
    Great day!

    --
    Anita Singh





  • 10.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 10:27

    Try this:

     

           Sugden, R. A., et al. (2000) "Cochran's Rule for Simple Random Sampling,"

           J of the Royal Statistical Society, Series B, Statistical Methodology. 62(4):787-793.

     






  • 11.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 21:18
    Thank you all for the comments and additional references. I bring up the point because the rule of thumb is treated as fact with little to no mention of the importance of the skewness of the population distribution. In fact, the suggestion is that even high asymmetry is of no concern as long as the rule of thumb holds. Introductory statistics books, and even undergraduate-level mathematical statistics textbooks, generally treat the rule of thumb this way.

    Although those with advanced degrees in statistics are aware of the limitations of the CLT, many without an advanced degree are not. As a result, the rule of thumb continues to be taught as fact. The hope is that this post creates some awareness.

    ------------------------------
    Roberto Rivera
    Associate professor
    University of Puerto Rico Mayaguez
    ------------------------------



  • 12.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-27-2017 12:01