I discuss the effect of skewness on the accuracy of t procedures in "What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum," The American Statistician.
It turns out that you need n > 5000 before t procedures are "reasonably accurate" (to within 10%) for inferences about a one-sample mean when the population is exponential. There are much more accurate procedures, as well as bootstrap diagnostics that tell you how far off you are for a given sample.
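
For anyone who wants to see the one-sided error directly, a small Monte Carlo check is easy to write. The sketch below is only an illustration, not the analysis from the article: it assumes an exponential population with mean 1, n = 30, a nominal 95% two-sided t interval, and a hard-coded critical value of 2.045 (the 0.975 quantile of t with 29 degrees of freedom). The function name and defaults are made up for this example.

```python
import math
import random

def t_interval_miss_rates(n=30, reps=20000, t_crit=2.045, seed=1):
    """Estimate how often a nominal 95% t interval misses the true mean
    on each side, sampling from an exponential population with mean 1.

    t_crit=2.045 is the 0.975 t quantile for 29 df, so it only matches n=30.
    """
    rng = random.Random(seed)
    mu = 1.0
    above = 0  # whole interval lies above mu (lower limit > mu)
    below = 0  # whole interval lies below mu (upper limit < mu)
    for _ in range(reps):
        x = [rng.expovariate(1.0) for _ in range(n)]
        xbar = sum(x) / n
        s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
        se = math.sqrt(s2 / n)
        if xbar - t_crit * se > mu:
            above += 1
        elif xbar + t_crit * se < mu:
            below += 1
    return above / reps, below / reps

# Each rate would be 0.025 if the t procedure were exact.
above_rate, below_rate = t_interval_miss_rates()
print("interval entirely above mu:", above_rate)
print("interval entirely below mu:", below_rate)
```

With a right-skewed population, the sample mean and standard deviation are positively correlated, so the interval should fall entirely below the true mean noticeably more often than the nominal 2.5%, and entirely above it less often.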
------------------------------
Tim Hesterberg
Senior Quantitative Analyst
Google
------------------------------
Original Message:
Sent: 01-11-2017 14:10
From: Roberto Rivera
Subject: CLT rule of thumb n>=30: not quite true!
The central limit theorem (CLT) is one of the 'central' arguments (pun intended) for claiming that the sample mean has approximately a Normal distribution. However, the central limit theorem does not state when a sample is large enough. It cannot, because how large is large enough depends on the distribution of the population.
Now, traditionally the rule of thumb of n greater than or equal to 30 has been established, on the argument that it suffices in most cases. Based on a quick search online, it appears that the n>=30 rule of thumb is generally blindly referenced in academia and research. At the very least, this rule of thumb should come with a warning that the more asymmetric a population distribution is, the larger the sample size needed for the CLT to apply. For example, the Normal approximation applies to a random variable with a binomial distribution under two conditions: np and n(1-p) are both greater than or equal to 10 (I prefer this more conservative version). What these two conditions do is essentially check that the binomial distribution is not too left or right skewed. If the probability of success is very high, say 0.999, then a sample size of 10,000 will be needed so that n(1-p)=10. Similar examples can be found if p is too small.
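
The binomial condition above can be written as a short check. This is a minimal sketch with hypothetical helper names (`normal_approx_ok`, `min_n`); it uses exact fractions so that values such as 0.999 do not pick up floating-point rounding, and it takes the threshold of 10 from the rule as stated.

```python
import math
from fractions import Fraction

def normal_approx_ok(n, p, threshold=10):
    """True if both np and n(1-p) are at least `threshold`."""
    p = Fraction(str(p))  # exact decimal value, e.g. 999/1000
    return n * p >= threshold and n * (1 - p) >= threshold

def min_n(p, threshold=10):
    """Smallest n for which both np and n(1-p) reach `threshold`."""
    p = Fraction(str(p))
    rarer = min(p, 1 - p)  # the smaller of the two outcome probabilities
    return math.ceil(threshold / rarer)

print(min_n(0.999))                    # 10000
print(normal_approx_ok(9999, 0.999))   # False
print(normal_approx_ok(10000, 0.999))  # True
```

The required n is driven entirely by the rarer outcome, which is why a very small p gives the same answer as a very large one.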
Furthermore, it can be shown through simulations that in the case of an exponential distribution, or even a less skewed gamma distribution (say with shape parameter = 5 and scale parameter = 1), the sample mean when n=30 is still not well approximated by a Normal distribution. In the case of confidence intervals and hypothesis testing, the error incurred when using a Normal distribution for the sample mean may be partly compensated for in two-sided scenarios, because the errors in the two tails have opposite signs, but not necessarily in one-sided scenarios.
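
A simulation along these lines might look like the sketch below. It is only an illustration under stated assumptions: n = 30, the population mean and standard deviation treated as known, and a one-sided 5% Normal cutoff of 1.645. The helper names are hypothetical. It also uses the standard fact that the skewness of the mean of n independent draws is the population skewness divided by the square root of n.

```python
import math
import random

def mean_skewness(pop_skewness, n):
    """Skewness of the sample mean of n i.i.d. draws."""
    return pop_skewness / math.sqrt(n)

def tail_rates(draw, mean, sd, n=30, reps=40000, z=1.645, seed=1):
    """Estimate P(Z > z) and P(Z < -z) for the standardized sample mean.

    Both would be about 0.05 if the Normal approximation were exact.
    """
    rng = random.Random(seed)
    se = sd / math.sqrt(n)
    upper = lower = 0
    for _ in range(reps):
        xbar = sum(draw(rng) for _ in range(n)) / n
        zstat = (xbar - mean) / se
        if zstat > z:
            upper += 1
        elif zstat < -z:
            lower += 1
    return upper / reps, lower / reps

# Exponential(1): mean 1, sd 1, skewness 2.
expo_upper, expo_lower = tail_rates(
    lambda r: r.expovariate(1.0), mean=1.0, sd=1.0)

# Gamma(shape=5, scale=1): mean 5, sd sqrt(5), skewness 2/sqrt(5).
gam_upper, gam_lower = tail_rates(
    lambda r: r.gammavariate(5.0, 1.0), mean=5.0, sd=math.sqrt(5.0))

print("skewness of mean, exponential:", mean_skewness(2.0, 30))
print("skewness of mean, gamma(5,1): ", mean_skewness(2.0 / math.sqrt(5.0), 30))
print("exponential tails (upper, lower):", expo_upper, expo_lower)
print("gamma(5,1) tails (upper, lower): ", gam_upper, gam_lower)
```

Because both populations are right skewed, the upper-tail rate should come out above 0.05 and the lower-tail rate below it. The two errors roughly offset when the tails are added, which is the two-sided compensation described above.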
It is possible to place more emphasis on the skewness of the distribution when determining 'large enough n'. Boos and Hughes-Oliver (2000) present a way to do this for confidence intervals.
I was not able to find much about this online or among the AMSTAT education tools. If anyone has further references or comments on this, they would be much appreciated.
--
Roberto Rivera, PhD
College of Business
University of Puerto Rico, Mayaguez
President of CAAEPR