ASA Connect


CLT rule of thumb n>=30: not quite true!

  • 1.  CLT rule of thumb n>=30: not quite true!

    Posted 01-11-2017 14:11
    The central limit theorem (CLT) is one of the 'central' arguments (pun intended) used to claim that the sample mean has approximately a Normal distribution. However, the central limit theorem does not state when a sample is large enough; it cannot, since how large is large enough depends on the distribution of the population.

    Now, traditionally the rule of thumb of n greater than or equal to 30 has been established, with the argument that it suffices in most cases. Based on a quick search online, it appears that the n>=30 rule of thumb is generally cited uncritically in academia and research. At the very least, this rule of thumb should come with a warning: the more asymmetric a population distribution is, the larger the sample size needed for the CLT to apply. For example, the Normal approximation applies to a random variable with a binomial distribution under two conditions: np and n(1-p) are both greater than or equal to 10 (I prefer this more conservative version). What these two conditions essentially do is check that the binomial distribution of the population is not too left- or right-skewed. If the probability of success is very high, say 0.999, then a sample size of 10,000 is needed so that n(1-p) = 10. Similar examples can be constructed when p is very small.
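
    A quick R sketch of this point (the simulation settings are my own illustrative choices): the skewness of the sample proportion at p = 0.999 is severe at n = 30 and still noticeable at n = 10,000.

    set.seed(1)
    for (n in c(30, 10000)) {                        # 10000 is where n*(1-p) reaches 10
      phat <- rbinom(1e5, size = n, prob = 0.999) / n
      s <- mean((phat - mean(phat))^3) / sd(phat)^3  # skewness of the sample proportion
      cat("n =", n, " skewness of phat =", round(s, 2), "\n")
    }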

    Furthermore, it can be shown through simulations that in the case of an exponential distribution, or even a less skewed gamma distribution (say, with shape parameter 5 and scale parameter 1), the sample mean when n = 30 does not follow a Normal distribution. In the case of confidence intervals and hypothesis testing, the error incurred by using a Normal distribution for the sample mean may partially cancel in two-sided scenarios, but not necessarily in one-sided scenarios.
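
    For instance (a minimal simulation sketch, not taken from any reference):

    set.seed(1)
    xbar <- replicate(10000, mean(rexp(30)))   # sample means, n = 30, exponential(1)
    qqnorm(xbar); qqline(xbar)                 # the right tail pulls away from the line
    shapiro.test(sample(xbar, 5000))           # normality is rejected decisively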

    It is possible to emphasize a bit more the dependence on the skewness of the distribution when determining a 'large enough' n. Boos and Hughes-Oliver (2000) present a way to do this for confidence intervals.



    I was not able to find much about this online or in AMSTAT education tools. Any further references or comments on this would be much appreciated.

    --
    Roberto Rivera, PhD
    College of Business
    University of Puerto Rico, Mayaguez
    President of CAAEPR


  • 2.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 00:39
    I discuss the effect of skewness on the accuracy of t procedures in "What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum," The American Statistician.

    It turns out that you need n > 5000 before t procedures are "reasonably accurate" (to within 10%) for inferences about a one-sample mean when the population is exponential. There are much more accurate procedures, and bootstrap diagnostics that tell how far off you are for a given sample.
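
    A minimal simulation sketch of this (my own illustrative code, not taken from the paper) tracks the one-sided error rates of the t interval for exponential data:

    set.seed(1)
    nsim <- 10000; mu <- 1                     # exponential(1) population, true mean 1
    for (n in c(30, 500, 5000)) {
      ci <- replicate(nsim, t.test(rexp(n))$conf.int)
      cat("n =", n,
          " left error =", mean(ci[1, ] > mu),
          " right error =", mean(ci[2, ] < mu), "\n")   # nominal 0.025 each
    }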


    ------------------------------
    Tim Hesterberg
    Senior Quantitative Analyst
    Google
    ------------------------------



  • 3.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-13-2017 08:16

    I have argued that the asymmetry of a distribution does not affect the coverage rate of a t-based two-sided interval the way it clearly does a one-sided interval, because a two-sided interval for an asymmetric distribution has analogues for the mirror image of that distribution and for the symmetric distribution formed as half the sum of the asymmetric distribution and its mirror image. Is that wrong?
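
    A quick simulation check of the practical side of this claim (a sketch under an assumed lognormal(0, 1) population, not a proof of the mirror-image argument):

    set.seed(1)
    n <- 30; nsim <- 20000
    mu <- exp(0.5)                             # true mean of a lognormal(0, 1)
    ci <- replicate(nsim, t.test(rlnorm(n))$conf.int)
    err <- c(left = mean(ci[1, ] > mu), right = mean(ci[2, ] < mu))
    err        # the one-sided errors are badly unbalanced (nominal 0.025 each)
    sum(err)   # their sum sits closer to the two-sided nominal 0.05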

     

    Btw, for binomial proportions, I recommend using two-sided Wilson (aka score) intervals rather than Wald intervals.  One can show that the logistic transformation works (when it works) because it approximates a Wilson interval.
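
    A minimal sketch of the two intervals (my own illustrative numbers, chosen so the sample proportion is near 1):

    x <- 29; n <- 30
    phat <- x / n; z <- qnorm(0.975)
    wald <- phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)
    wilson <- (phat + z^2 / (2 * n) + c(-1, 1) * z *
               sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2))) / (1 + z^2 / n)
    rbind(wald, wilson)   # Wald spills past 1; Wilson stays inside [0, 1]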

     

    Phil Kott

    RTI International






  • 4.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-13-2017 09:41
    The influence of skewness on the estimation of the mean is a problem that has long been addressed in hydrologic research.  A lot of material has been published in AGU's Water Resources Research.

    For parametric analysis, I suggest you might also look at discussions in a reference like
    Techniques of Water-Resources Investigations of the United States Geological Survey Book 4, Hydrologic Analysis and Interpretation
    https://pubs.usgs.gov/twri/twri4a3/pdf/twri4a3-new.pdf

    Violations of normal assumptions are one of the reasons that non-parametric methods are also frequently used in hydrology.  
    Suggest you might look at the still very useful classic reference by W. J. Conover, Practical Nonparametric Statistics (1999, 3rd edition).

    By the way, not only skewness but also other attributes, such as serial correlation - even when sampling from a symmetric distribution - have a large effect on the necessary sample size.
    Suggest you might look at a classic - and still very relevant - paper by N. C. Matalas and W. B. Langbein, "Information content of the mean," Journal of Geophysical Research, 1962.

    ____________________________________
    J.M. Landwehr
    The Sumanim Group 
    ____________________________________







  • 5.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-13-2017 15:24
    There should be no unfortunate consequences of the slowness of the CLT to converge from exponential to near-normal, because a sample with n = 30 is easily big enough for the analyst to discern its empirical distribution. If the distribution is wildly asymmetric then the analysis should not be based on a normal distribution without data transformation. Teach that rather than teaching archaic and flawed rules of thumb.
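
    For example (a sketch using a lognormal sample, where a log transform happens to work exactly):

    set.seed(1)
    x <- rlnorm(30, meanlog = 0, sdlog = 1.5)   # a wildly asymmetric sample, n = 30
    hist(x)                                     # the skew is unmistakable even at n = 30
    hist(log(x))                                # after a log transform, near-symmetric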

    None of our students or colleagues should be so naive as to follow an n>30 rule of thumb blindly, or so careless as to analyse data with which they are not sufficiently familiar to tell the difference between an exponential and a normal distribution!

    Michael Lew
    Department of Pharmacology
    The University of Melbourne




  • 6.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 03:16
    Not sure if this is fact or factoid, but I remember reading that the n>30 myth came about when the printer of a textbook published soon after Gosset's work on Student's t said the table of critical values could have lines for only 30 degrees of freedom. Thereafter, students assumed 30 was enough to assume normality. Transfer of the myth to binomial sampling was easy. True or not, it's too good a story to lose.

    ------------------------------
    Robert Lovell
    Retired
    ------------------------------



  • 7.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 07:49
    In my opinion the CLT is far less useful for everyday practice than we have taught students.  In effect it does not work if you have to estimate the variance (which is always the case), and the sample size must sometimes be enormous to achieve good accuracy in confidence interval coverage, as demonstrated at "Is there a reliable nonparametric confidence interval for the mean of a skewed distribution?" using this R code:

    n <- 20000   # also try 150 or 25
    nsim <- 10000
    mul <- 0; sdl <- 1.65   # mean and SD on the log scale
    dist <- c('normal', 'lognormal')[2]
    # g maps normal draws to the target distribution; mu is the true mean
    switch(dist,
           normal    = {g <- function(x) x; mu <- mul},
           lognormal = {g <- exp; mu <- exp(mul + sdl * sdl / 2)})
    count <- c(lower = 0, upper = 0)
    z <- qt(0.975, n - 1)

    # tally how often the true mean falls outside each end of the t interval
    for(i in 1 : nsim) {
      x <- g(rnorm(n, mul, sdl))
      ci <- mean(x) + c(-1, 1) * z * sqrt(var(x) / n)
      count[1] <- count[1] + (ci[1] > mu)
      count[2] <- count[2] + (ci[2] < mu)
    }
    count / nsim   # tail non-coverage; each should be near 0.025

    With the log-normal distribution used above and n = 20000, the confidence coverage is still bad (left-tail error 0.012, right 0.047, when both should be 0.025).

    Part of the problem is that when the distribution is skewed, the variance estimate is no longer independent of the mean estimate, so convergence to the t distribution no longer works as assumed.
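
    That dependence is easy to see in a small sketch (illustrative parameters, matching the log-normal above):

    set.seed(1)
    stats <- replicate(10000, {x <- rlnorm(50, 0, 1.65); c(m = mean(x), v = var(x))})
    cor(stats['m', ], stats['v', ])   # strongly positive under skewness
    stats <- replicate(10000, {x <- rnorm(50); c(m = mean(x), v = var(x))})
    cor(stats['m', ], stats['v', ])   # essentially zero under normality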

    ------------------------------
    Frank Harrell
    Vanderbilt University School of Medicine
    ------------------------------



  • 8.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 13:57
    Whether the t-table-based account for the "rule" is true or not, it does point to the heart of a fallacy in relying blindly on the rule.

    The t-table assumes that the underlying population is normally distributed, and the convergence of the t-table's probabilities to the z-table's (at n = about 30) is conditional on that assumption.  At about n = 30, I suppose you could say the z-table is about as good to use as the t-table.

    But it's still an open question whether the true sampling distribution (for a very asymmetrical population) is close enough to normal to use the z-table, either, in that case.
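
    For reference, the t-to-z convergence itself is easy to check, and it is all that the "30" buys you:

    qt(0.975, df = 29)   # 2.045, the t critical value at n = 30
    qnorm(0.975)         # 1.960, the z value; roughly a 4% difference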

     
     

    ------------------------------
    William (Bill) Goodman
    Professor, Faculty of Business and Information Technology
    University of Ontario Institute of Technology
    ------------------------------



  • 9.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 14:01
    This rule of thumb is not true. Depending upon skewness, the sample size requirement increases. For highly skewed data sets (e.g., with sd of logged data > 2), samples of sizes 70 to 100 may not be large enough for the CLT to apply, let alone the rule-of-thumb size of 30.

    Environmental projects abound with these kinds of skewed data sets.
    Great day!

    --
    Anita Singh





  • 10.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 10:27

    Try this:

     

           Sugden, R. A., et al. (2000). "Cochran's Rule for Simple Random Sampling," Journal of the Royal Statistical Society, Series B (Statistical Methodology), 62(4): 787-793.

     






  • 11.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-12-2017 21:18
    Thank you all for the comments and additional references. I bring up the point because the rule of thumb is treated as fact, with little to no mention of the importance of the skewness of the population distribution. In fact, the suggestion is that even high asymmetry is of no concern as long as the rule of thumb holds. Introductory statistics books, and even undergraduate-level mathematical statistics textbooks, generally treat the rule of thumb this way.

    Although those with advanced degrees in statistics are aware of the limitations of the CLT, many without an advanced degree are not. As a result, the rule of thumb continues to be taught as fact. The hope is that this post creates some awareness.

    ------------------------------
    Roberto Rivera
    Associate professor
    University of Puerto Rico Mayaguez
    ------------------------------



  • 12.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-27-2017 12:01
    There is an important aspect of this discussion that has not appeared yet.  Many seem to be criticizing this "Rule of Thumb" on the grounds that there are situations where n > 30, yet the data clearly don't exactly follow a normal distribution.  My objection to this is that it somehow gives the impression that Rules of Thumb are about absolutes.  Several contributors have used terminology along the lines of "the data following the normal distribution" and "the normal distribution is invalid".
        Instead, I recommend that the whole thing be viewed in terms of "degrees of approximation".  In some data-analytic situations the normal approximation is "pretty good", and in others not so good (not "True" versus "False").  For this reason, when I teach that Rule of Thumb in an elementary course, I say "Most statisticians often feel comfortable when n > 30", and of course point out some of the situations where this can fail and teach how to do diagnostics.
        I see a parallel between this and the prevalent misinterpretation of p-values:  p = 0.049 --> "positive result, strong evidence, let's publish"; p = 0.051 --> "negative result, nothing there, no pub from this".  Of course we should be interpreting p-values in terms of gradations of strength of evidence, just as these rules of thumb should be viewed as rough guidelines to the goodness of an approximation, and not as absolute cut-off rules.
        One more missing component of this conversation: while skewness is indeed a serious issue (as several have noted), heaviness of tails (e.g., kurtosis) can also be a major factor.  A simple way to see this is to note that the Cauchy distribution (recall: no finite moments) has the "semigroup property", i.e., sample means of Cauchy observations are again Cauchy, and thus can never converge to Normal.
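
    A two-line check of that semigroup property (a sketch; the sample size is an arbitrary choice):

    set.seed(1)
    m <- replicate(10000, mean(rcauchy(1000)))         # means of n = 1000 Cauchy draws
    qqplot(qcauchy(ppoints(10000)), m); abline(0, 1)   # straight line: the means are Cauchy again
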
        About real life applications, one of the most non-Normal worlds I've worked in is the study of internet traffic.  A particularly wild data analytic experience (which generated some fun related probability theory) is detailed in:
    Hernandez-Campos, F., Marron, J. S., Samorodnitsky, G., & Smith, F. D. (2004). Variable heavy tails in internet traffic. Performance Evaluation, 58(2), 261-284.
    Best,
    Steve

    ------------------------------
    J. S. Marron
    Univ. of North Carolina At Chapel Hill
    ------------------------------



  • 13.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 02-01-2017 07:49
    James Stephen Marron states the following: "Many seem to be criticizing this Rule of Thumb on the grounds that there are situations where n > 30, yet the data clearly don't exactly follow a normal distribution. My objection to this is that it somehow gives the impression that Rules of Thumb are about absolutes." He later adds: "Most statisticians often feel comfortable when n > 30".

    I believe these remarks best illustrate how the rule of thumb is communicated inappropriately. First, the intention of this post was precisely to criticize how the rule of thumb is stated in textbooks. Here's an example: "Usually a value of n greater than 30 will ensure that the distribution of the sample mean can be closely approximated by a normal distribution" (see Wackerly, Mendenhall and Scheaffer, 2002, page 348). Here's another: "You may have heard the rule of thumb that n ≥ 30 is required to ensure a normal distribution for the sample mean, but actually a much smaller n will suffice if the population is symmetric" (Doane and Seward, 2015). Indeed, most textbooks relay the rule of thumb as some version of the above. Although these statements are not absolutes, they are expressed in strong terms, and students will interpret them as either an 'approximately absolute' statement or an absolute one. Moreover, these statements do not tend to be followed by others explaining the dependence of the central limit theorem on skewness (and kurtosis) or how the data should be assessed for skewness.

    The second statement in the first paragraph, "Most statisticians often feel comfortable when n > 30", is another example of an 'approximately absolute' statement, and one that, in my opinion, is untrue. Many statisticians would rather assess the skewness of the data. Practitioners often treat the rule of thumb as an absolute statement, in part because of how it was communicated to them in school. In fact, academic papers have been published relying on the rule of thumb beyond any doubt.

    In summary, my concern with the rule of thumb is how it is communicated in textbooks, and this issue is not simply academic. Rule of thumb statements should include clarifications of the influence of the skewness of the population distribution. People who will apply statistical methods in the future must understand that the rule of thumb is useful but that high skewness of the population distribution may require the sample size to be much larger than 30 for the central limit theorem to apply.

    ------------------------------
    Roberto Rivera
    Associate professor
    University of Puerto Rico Mayaguez
    ------------------------------



  • 14.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-13-2017 09:07

    It may be that there are three separate issues here relating to sound practice:

    1. Even when data reasonably conform to a Normal distribution, I have noticed a serious problem that is sometimes overlooked. It's not how big n is but rather what n counts: n should not be the number of records but rather the number of records of interest (see the sketch at the end of this post). For example, when doing a study on a disease, n should not be the number of people in the study but rather the number who got sick. In the case of rare events, this can very dramatically increase the number of records required in order to get the n >= 30 outcomes we are trying to model. For example, when attempting to identify an unusual cause of death - my favorite remains Bortkiewicz's study of death by horse kick in the Prussian army - we want 30+ deaths after being kicked by a horse, not 30+ soldiers or 30+ horses. (And of course, this distribution isn't even Normal, which brings us to...)

    2. A number of commenters have observed that many more records are needed to get a good fit for highly skewed data. I only wish to point out - since we are discussing limitations of the CLT - that it assumes normality. Yet many of the problems we find involve fitting data that are far from a Normal distribution. Non-parametric methods, and methods designed for specific non-normal distributions (e.g., the gamma for highly skewed data), are worth considering. Bortkiewicz's horse kick study is a great example of this, because the distribution in that famous example is Poisson, not Normal.

    Certainly, better fitting can come with more records. However, perhaps the real answer in many cases may be to take the CLT off the table altogether. 

    3. Rules of Thumb. Rather than a set of rules memorized by rote, which can often go wrong, I advise students and colleagues to use data-driven performance metrics. Instead of saying 30 is enough, we need to turn the question around and ask how many are enough for a particular use case. With all the excellent techniques we have today for sample size estimation, there is no need for a fixed Rule of Thumb.

    n >= 30, or even using a Normal distribution at all (see #2) simply as a matter of course, are standard practices we can live without. Students at university, new practitioners in our workplaces, and sometimes less experienced ASA members at our local meetings will benefit from learning the best methods for sample size estimation instead of a one-size-fits-all rule that seldom fits anything very well.
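
    A back-of-the-envelope sketch of point 1 (the event probabilities are made-up illustrations):

    p <- c(0.5, 0.1, 0.01, 0.001)                     # event probability per record
    data.frame(p, records_needed = ceiling(30 / p))   # records for >= 30 expected events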



    ------------------------------
    David J Corliss, PhD
    Analytics Architecture / Advanced Analytics Lead
    Ford Motor Company
    ------------------------------



  • 15.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-13-2017 11:09
    Hello everyone,

    I was recently looking this very topic up, and I believe that the articles below are the original papers that helped define the n >= 30 guideline. The articles are old but have been scanned to PDF, so they are easily accessible. The CLT guideline is relevant to inference about means but not variances (which are extremely sensitive to deviations from the assumption of Normality); it is recommended as a guideline rather than a 'rule', as the effect of the CLT is progressive and depends on the exact shape of the distribution (for example, a smaller n can be sufficient for moderate skews, as illustrated through simulations for various distributions).
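
    The fragility of variance inference is easy to demonstrate (a sketch with an assumed heavy-tailed t(5) population):

    set.seed(1)
    n <- 30; nsim <- 20000
    v.true <- 5 / 3                               # variance of a t distribution, 5 df
    cover <- replicate(nsim, {
      s2 <- var(rt(n, df = 5))
      ((n - 1) * s2 / qchisq(0.975, n - 1) < v.true) &&
        (v.true < (n - 1) * s2 / qchisq(0.025, n - 1))
    })
    mean(cover)   # far below the nominal 0.95, despite n = 30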


    Relation between the shape of population distribution and the robustness of four simple test statistics
    E. S. Pearson and N. W. Please
    Biometrika (1975) 62 (2): 223-241.
    DOI: https://doi.org/10.1093/biomet/62.2.223

     

    The robustness of the one-sample t-test over the Pearson system
    Harry O. Posten
    Journal of Statistical Computation and Simulation, Volume 9, 1979 - Issue 2, Pages 133-149
    DOI: http://dx.doi.org/10.1080/00949657908810305  



    ------------------------------
    Brigitte Baldi
    Lecturer
    University of California, Irvine
    ------------------------------



  • 16.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-16-2017 07:37
    This is a good discussion.

    The fact that ramifications of the CLT are difficult for non-statisticians to grasp implies to me that we need to quit teaching this altogether to non-statisticians.

    ------------------------------
    Frank Harrell
    Vanderbilt University School of Medicine
    ------------------------------



  • 17.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-17-2017 12:32

    "This is a good discussion. The fact that ramifications of the CLT are difficult for non-statisticians to grasp implies to me that we need to quit teaching this altogether to non-statisticians."
    Frank Harrell, yesterday
    Hi Frank,
    What would you suggest that we teach in its place?  Bootstrapping and non-parametrics?

    ------------------------------
    Andrew McDavid
    Biostatistics and Computational Biology
    University of Rochester
    ------------------------------



  • 18.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-17-2017 14:19

    As an industrial statistician, I found that some of the data collected on airplane tire parameters became nearly normal by averaging 4 test values. One issue with theoretical statisticians is that they are great at showing theoretical areas of non-normality, but their ideas may not be practical in the real world of collecting data. In industry we are trying to solve issues while minimizing the cost of collecting data. Of course, we must confirm what we have found.
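
    A small sketch of that averaging effect (a gamma population is my stand-in assumption for the tire data):

    skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
    set.seed(1)
    raw  <- rgamma(40000, shape = 2)          # skewed stand-in for raw test values
    avg4 <- colMeans(matrix(raw, nrow = 4))   # averages of 4, as in rational subgrouping
    c(raw = skew(raw), avg4 = skew(avg4))     # skewness roughly halves after averaging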

    John Stickler






  • 19.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-17-2017 15:45

    John,

    Thanks for posting this.  While I don't think Shewhart wrote about tires, I recall something in his Economic Control of Quality of Manufactured Product that sounded a lot like what you wrote about n = 4.  From something else I read, I gather that Shewhart's work wasn't derived from Neyman-Pearson thinking but was something he developed himself, almost like a third alternative to Fisher and Neyman-Pearson.

    AFAICT, Shewhart-style control charts are still used today, even as statistics has moved forward in other areas.  Are they still appropriate, or are there better statistical approaches to help management deal with process variation appropriately?  Much of the SPC literature I read seems separate from NHST thinking and from current Bayesian inference and decision making, although I do recall Stu Hunter's Bayesian Approaches to Teaching Engineering Statistics.

    Bill



    ------------------------------
    Bill Harris
    Data & Analytics Consultant
    Snohomish County PUD
    ------------------------------



  • 20.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-18-2017 03:14
    As regards the relationship between control charts and testing,
    this is a copy of a letter I wrote to the editor of JQT.  It was published in 2001.

    The article Woodall (2000) and its discussion examine the relationship between control
    charts and hypothesis tests. I would like to add, what I consider to be, key references to
    this discussion. Shewhart (1986, p.42) states, “For a prediction to have an operationally
    definite meaning, it is necessary that there be given or implied a perfectly definite way
    of determining whether it is true or false. Hence it is necessary that there be implied an
    operationally definite meaning of the statistical state of control in terms of characteristics
    of the sequence [of data]. There are two senses in which we may have such a meaning.
    One is the theoretical sense in which we include all possible criteria that the mathematical
    statistician may impose upon the infinite sequence [of data] as a characterization of what
    he means by a mathematical state of control. The other is the practical sense in which one
    chooses a limited group of criteria to be applied in some specific way to a finite portion of the
    sequence ...” Earlier, on page 40, Shewhart notes that Neyman-Pearson theory “involves
    the assumption that the observed data constitute a random sample, and we have already
    considered some of the difficulties involved in trying to give this term an empirical and
    operationally verifiable meaning. In fact, we may think of the whole operation of statistical
    control as an attempt to give such meaning to the term random.”

    Given Shewhart’s and Deming’s emphasis on operational definitions, I believe their
    argument is that control charts should be viewed as providing an operational definition of
    what it means for a process to be under control. Indeed, it is a simple exercise to see that
    a basic means chart is sensitive, not only to shifts in the process mean, but also to shifts
    in its variance and to positive correlation within the rational subgroups, cf. Christensen
    and Bedrick (1997). Thus, means charts provide quite a good beginning for creating an
    operational definition of what it might mean to have independent identically distributed
    observations. Philosophically, this is a far cry from assuming randomness and testing an
    hypothesis.

    References
    Christensen, Ronald and Bedrick, Edward J. (1997). “Testing the independence assumption
    in linear models,” Journal of the American Statistical Association, 92, 1006-1016.
    Shewhart, Walter A. (1986). Statistical Method from the Viewpoint of Quality Control.
    Dover, New York.
    Woodall, William H. (2000). "Controversies and Contradictions in Statistical Process Control,"
    with discussion, Journal of Quality Technology, 32, 341-378.

    ------------------------------
    Ronald Christensen
    Univ of New Mexico
    ------------------------------



  • 21.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-18-2017 11:45
    Frank,

    On the question of what to teach or not, I think this thread complements discussions going on in the ASA and elsewhere about the use of p-values.

    If there's a key teachable take-away for applied statistics, I'd think it's the importance of carefully inquiring about and checking the assumptions for whatever method one is considering.  I don't think there's a 'magic bullet' method that avoids this requirement, for either estimation or hypothesis testing.

    I find this confirmed in papers being published in a journal where p-values are 'banned': just transforming one measure (p-values) into another ('measures of effect size'), without asking the familiar - and important - questions about sample size, population distribution, etc., does not make someone's finding more valid.

    A writer in this thread reminds us about industrial applications where control charts make interpretive sense, even though the nominal sample sizes are small.  In other applications they couldn't be used that way.  So from a teaching viewpoint, I'd be wary of conveying that any method is the last word, in itself and without qualification, for every application context.

    ------------------------------
    William (Bill) Goodman
    Professor, Faculty of Business and Information Technology
    University of Ontario Institute of Technology
    ------------------------------



  • 22.  RE: CLT rule of thumb n>=30: not quite true!

    Posted 01-17-2017 15:22
    David Corliss said, "I only wish to point out - since we are discussing limitations of the CLT - that it assumes normality. Yet many of the problems we find involve fitting data that are far from a Normal distribution."
    I'd like to clarify some possible confusion here: there is more than one "Central Limit Theorem."  The simplest one says that the sum of independent normal random variables is normal.  But there are others that say that the sum of independent random variables satisfying certain conditions is "approximately normal" (with "approximately normal" being defined in various ways, depending on the particular CLT).  See https://en.wikipedia.org/wiki/Central_limit_theorem for more details.


    ------------------------------
    Martha Smith
    University of Texas
    ------------------------------