In my opinion the CLT is far less useful for everyday practice than we have taught students. In effect it does not work if you have to estimate the variance (which is always the case), and the sample size must sometimes be enormous to achieve good accuracy in confidence interval coverage, as demonstrated at "Is there a reliable nonparametric confidence interval for the mean of a skewed distribution?" using this R code:

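    n    <- 20000   # sample size (alternatives: 150, 25)
    nsim <- 10000   # number of simulated samples
    mul  <- 0; sdl <- 1.65   # mean and SD on the log scale
    dist <- c('normal', 'lognormal')[2]
    # g transforms the normal draws; mu is the true mean being estimated
    switch(dist,
           normal    = {g <- function(x) x; mu <- mul},
           lognormal = {g <- exp; mu <- exp(mul + sdl * sdl / 2)})
    count <- c(lower=0, upper=0)   # non-coverage counts in each tail
    z <- qt(0.975, n - 1)
    for(i in 1 : nsim) {
      x  <- g(rnorm(n, mul, sdl))
      ci <- mean(x) + c(-1, 1) * z * sqrt(var(x) / n)
      count[1] <- count[1] + (ci[1] > mu)   # whole interval lies above mu
      count[2] <- count[2] + (ci[2] < mu)   # whole interval lies below mu
    }
    count / nsim   # observed tail error rates; nominal value is 0.025 each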

With the log-normal distribution used above and n = 20000, the confidence interval coverage is still poor (left tail error 0.012, right 0.047, when both should be 0.025).

Part of the problem is that when the distribution is skewed the variance estimate is no longer independent of the mean estimate, so convergence to the t distribution no longer works as assumed.
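
The dependence between the mean and variance estimates can be checked directly. The following is only a rough sketch: the per-sample size, number of replications, seed, and the helper name `mean_var_cor` are illustrative assumptions, not part of the simulation above. It computes the rank correlation between the sample mean and sample variance across repeated samples, once for a normal and once for the log-normal used above.

```r
set.seed(1)       # arbitrary seed, for reproducibility only
n    <- 200       # per-sample size (illustrative assumption)
nsim <- 2000      # number of simulated samples (illustrative assumption)

# Hypothetical helper: Spearman correlation between the sample mean and
# the sample variance over nsim samples drawn by rgen(k).
mean_var_cor <- function(rgen) {
  stats <- replicate(nsim, { x <- rgen(n); c(mean(x), var(x)) })
  cor(stats[1, ], stats[2, ], method = "spearman")
}

cor_normal    <- mean_var_cor(function(k) rnorm(k, 0, 1.65))
cor_lognormal <- mean_var_cor(function(k) exp(rnorm(k, 0, 1.65)))
c(normal = cor_normal, lognormal = cor_lognormal)
```

For the normal the correlation should be near zero, since the sample mean and variance are independent there; for the log-normal it should be strongly positive, so samples that underestimate the mean also tend to produce intervals that are too short.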

------------------------------

Frank Harrell

Vanderbilt University School of Medicine

------------------------------

Original Message:

Sent: 01-12-2017 03:16

From: Robert Lovell

Subject: CLT rule of thumb n>=30: not quite true!

Not sure if this is fact or factoid, but I remember reading that the n>30 myth came about when the printer of a textbook published soon after Gosset's work on Student's t said the table of critical values could have lines for only 30 degrees of freedom. Thereafter, students assumed 30 was enough to assume normality. Transfer of the myth to binomial sampling was easy. True or not, it's too good a story to lose.

------------------------------

Robert Lovell

Retired

------------------------------

Original Message:

Sent: 01-11-2017 14:10

From: Roberto Rivera

Subject: CLT rule of thumb n>=30: not quite true!

The central limit theorem (CLT) is one of the 'central' arguments (pun intended) for claiming that the sample mean has approximately a Normal distribution. However, the central limit theorem does not state when a sample is large enough. It cannot, because the sample size required depends on the distribution of the population.

Now, traditionally the **rule of thumb of n greater than or equal to 30** has been established, arguing that it suffices in most cases. Based on a quick search online, it appears that the n>=30 rule of thumb is generally referenced blindly in academia and research. At the very least, this rule of thumb should come with a warning that the more asymmetric a population distribution is, the larger the sample size needed for the CLT to apply. For example, the Normal approximation applies to a random variable with a binomial distribution under two conditions: np and n(1-p) are both greater than or equal to 10 (I prefer this more conservative version). What these two conditions do is essentially check that the binomial distribution is not too left or right skewed. If the probability of success is very high, say 0.999, then a sample size of 10,000 will be needed so that n(1-p)=10. Similar examples can be found if p is too small.
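
The arithmetic behind the binomial conditions can be sketched in a few lines of R. The helper names `min_n_binomial` and `binom_skewness` are hypothetical, introduced only for illustration; the skewness expression (1 - 2p) / sqrt(np(1-p)) is the standard one for a binomial distribution.

```r
# Hypothetical helper: smallest n with n*p >= k and n*(1-p) >= k.
# The small tolerance guards against floating-point error in 1 - p.
min_n_binomial <- function(p, k = 10) {
  ceiling(k / min(p, 1 - p) - 1e-8)
}

# Hypothetical helper: skewness of a Binomial(n, p) distribution.
binom_skewness <- function(n, p) {
  (1 - 2 * p) / sqrt(n * p * (1 - p))
}

n_needed <- min_n_binomial(0.999)   # the p = 0.999 case from the text
n_needed
binom_skewness(30, 0.999)           # heavily left skewed at n = 30
binom_skewness(n_needed, 0.999)     # much closer to zero at the required n
```

Comparing the two skewness values shows what the np and n(1-p) conditions are guarding against: at n = 30 the distribution is still far from symmetric when p is close to 1.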

Furthermore, it can be shown through simulations that in the case of an exponential distribution, or even a less skewed gamma distribution (say with shape parameter 5 and scale parameter 1), the sample mean when n=30 is not well approximated by a Normal distribution. In the case of confidence intervals and hypothesis testing, the error incurred when using a Normal distribution for the sample mean may be compensated for in two-sided scenarios, but not necessarily in one-sided scenarios.
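
One way such a simulation might look is sketched below. The seed, the number of replications, the rate of 1 for the exponential, and the helper name `tail_errors` are assumptions made for illustration. It records, for nominal 95% t intervals at n=30, how often the whole interval falls above or below the true mean, so the two one-sided error rates can be compared with the nominal 0.025.

```r
set.seed(2017)    # arbitrary seed, for reproducibility only
n    <- 30        # the rule-of-thumb sample size
nsim <- 5000      # number of simulated samples (illustrative assumption)

# Hypothetical helper: per-tail non-coverage rates of the 95% t interval
# for the mean, for samples of size n drawn by rgen(k) with true mean mu.
tail_errors <- function(rgen, mu) {
  miss <- replicate(nsim, {
    x  <- rgen(n)
    ci <- mean(x) + c(-1, 1) * qt(0.975, n - 1) * sqrt(var(x) / n)
    c(ci_above_mu = ci[1] > mu, ci_below_mu = ci[2] < mu)
  })
  rowMeans(miss)
}

err_exp   <- tail_errors(function(k) rexp(k, rate = 1), mu = 1)
err_gamma <- tail_errors(function(k) rgamma(k, shape = 5, scale = 1), mu = 5)
rbind(exponential = err_exp, gamma_shape5 = err_gamma)
```

If the Normal approximation were adequate, both columns would be close to 0.025. For right-skewed populations the two columns are expected to differ noticeably, which is why a two-sided interval can look acceptable overall while a one-sided procedure does not.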

It is possible to put more emphasis on the skewness of the distribution when determining a 'large enough n'. Boos and Hughes-Oliver (2000) present a way to do this for confidence intervals.

I was not able to find much about this online or in the AMSTAT education tools. If anyone has further references or comments on this, they would be much appreciated.

--

Roberto Rivera, PhD

College of Business

University of Puerto Rico, Mayaguez

President of CAAEPR