Well, this is probably the best thing for Dale and the young professionals--to see a discussion among other statisticians. So, I do appreciate the opportunity to respond to JoAnn, who took the trouble to help Dale out, which is the goal for all of us. (Why else would I be doing this at 10:30 at night! But JoAnn beat me there. Her post is timed at a heroic 12:15 am.)
I am going to ignore issues of model building, etc., and just focus on the issue of normality that was raised. It is not necessary to have normally distributed residuals to make proper inferences about the regression coefficient. Before quoting the authorities, consider this simple example. I think that everyone here recognizes that if you calculate a mean, the mean will be asymptotically normally distributed. It is just a matter of sample size until the normality kicks in adequately. (Let's assume that the underlying population distribution of the statistic of interest has a finite variance.)
The mean is a regression coefficient. It is the intercept (the coefficient of "1", the value of a variable that is constant for all observations). If you have a highly skewed distribution, even a horrible distribution--with finite variance, the intercept (estimating the mean of the population) will approach a normal distribution at some point and you can do proper inference about the population mean from this regression intercept coefficient and its standard error. But, if you subtract the sample mean (i.e., the regression intercept) from all of the observations, the residuals will still be skewed or horrible. They will look horrible and non-normal no matter how large the sample size is, yet you are making an appropriate inference about the mean--i.e., that simple intercept regression coefficient.
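Here is a small simulation of the intercept example above, offered only as a sketch. The exponential population, the seed, and the replication count are assumptions chosen for illustration; the sample size of 150 matches the one mentioned in this thread. It fits the intercept-only regression, then compares the skewness of the residuals with the skewness of the sample mean across repeated samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(v):
    """Sample skewness: third central moment over the cubed standard deviation."""
    d = np.asarray(v, dtype=float) - np.mean(v)
    return np.mean(d**3) / np.mean(d**2) ** 1.5

# One large sample from a skewed population (exponential with mean 1;
# an assumed example, not anyone's real data).
y = rng.exponential(scale=1.0, size=100_000)

# Intercept-only regression: the design matrix is a single column of ones.
X = np.ones((y.size, 1))
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept = beta[0]
residuals = y - intercept

# Sampling distribution of the intercept (the sample mean) at n = 150.
n, reps = 150, 5000
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print("intercept equals sample mean:", np.isclose(intercept, y.mean()))
print("skewness of residuals:", round(skewness(residuals), 2))
print("skewness of sample means:", round(skewness(means), 2))
```

Under these assumptions the residuals should keep roughly the skewness of the exponential population (which is 2) however large the sample, while the skewness of the sample means should be much closer to zero (in theory 2/sqrt(150), about 0.16).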
There is a wonderful article by Lumley, Diehr, Emerson and Chen that is on this very topic. Here is the reference:
Thomas Lumley, Paula Diehr, Scott Emerson, and Lu Chen. The Importance of the Normality Assumption in Large Public Health Data Sets. Annual Review of Public Health, Vol. 23: 151-169 (volume publication date May 2002). Here is a quote from that article:
"Normality is not required to fit a linear regression; but Normality of the coefficient estimates [beta-hat] is needed to compute confidence intervals and perform tests. As [beta-hat] is a weighted sum of Y (see Appendix 1), the Central Limit Theorem guarantees that it will be normally distributed if the sample size is large enough, and so tests and confidence intervals can be based on the associated t-statistic." I had to type in "beta-hat" because the Greek and its hat did not cut and paste well.
In any case, Lumley et al. are pointing out that the regression coefficients are under the shelter of the Central Limit Theorem, given a large enough sample size. Because Dale's data are quite bounded (scores of 5-35 if the behavior is present) and the non-behaviors are about 15%, and the sample size is 150, I do not see any problem with the exploratory analysis that I mentioned (OLS) simply to identify the types of variables that are associated with these outcomes. Others in this thread have made other suggestions for modeling, and those should be considered, too, aside from the issue of testing coefficients.
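The Central Limit Theorem point can also be checked by simulation. The sketch below is only an illustration: the predictor, the true slope of 0.5, the intercept of 2.0, and the centered exponential errors are all hypothetical and are not Dale's data. Only n = 150 comes from this thread. It estimates how often the usual t-based 95% confidence interval for an OLS slope covers the true slope when the errors are clearly non-normal.

```python
import numpy as np

rng = np.random.default_rng(1)

n, reps = 150, 2000
true_slope = 0.5   # hypothetical value chosen for the illustration
t_crit = 1.976     # approximate 97.5th percentile of t with n - 2 = 148 df

# Hypothetical predictor, held fixed across replications.
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

covered = 0
for _ in range(reps):
    # Skewed, mean-zero errors (exponential shifted to have mean 0).
    errors = rng.exponential(scale=1.0, size=n) - 1.0
    y = 2.0 + true_slope * x + errors

    beta = XtX_inv @ X.T @ y                  # OLS coefficient estimates
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)          # residual variance estimate
    se_slope = np.sqrt(sigma2 * XtX_inv[1, 1])

    # Does the nominal 95% interval contain the true slope?
    covered += abs(beta[1] - true_slope) <= t_crit * se_slope

coverage = covered / reps
print("coverage of nominal 95% interval:", coverage)
```

If the coverage comes out close to 0.95, that is consistent with the quoted argument for this setup. It does not say anything about heteroscedasticity, dependence, or much smaller samples, which this sketch does not examine.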
If Dale and his investigators decide to go forth and fully model this thing, then the other choices (Tobit, Ancova-like model, etc.) may be helpful.
In journals, there usually seems to be a presentation, a comment and then a rejoinder. So, since I made my presentation, JoAnn made a comment, and I have provided a rejoinder, I am probably going to stop here, but others may wish to comment. It would be fun.
Best wishes,
Nayak
-------------------------------------------
Nayak Polissar
Principal Statistician
The Mountain-Whisper-Light Statistics
-------------------------------------------
Original Message:
Sent: 05-27-2014 00:14
From: JoAnn Alvarez
Subject: appropriate estimator/procedure for nonnormal response variable
Hi Dale,
I actually disagree with Nayak's response on a couple of points:
"What is your sample size? The normality assumption needed if you are getting p-values from OLS is that the coefficient estimates are normally distributed (or, close enough). If you have a very large sample size, then your inference from p-values is probably fine. Whatever the sample size, you may wish to fit a two-stage model: fit a logistic regression model first to the dichotomous indicator (dep. var.) for presence/absence of a behavior and then linear regression for the score (dep. var.) among those with a non-zero score (i.e., the behavior occurred)."
While I often start by looking at the distribution of the response variables in deciding on a model, the assumptions for OLS are not about the response variable itself, but about the residuals. (Nayak is saying that the assumption you are relying on is that the coefficient estimates are normally distributed. In fact, the assumption that must be satisfied is that the residuals are independent and normally distributed, and as a *result,* the coefficient estimators are normally distributed.) It is also not about sample size here.
I would fit a carefully-selected OLS model with the predictors you are interested in and look only at the distribution of the residuals. If they are ok, you could proceed with the inference, but if they are skewed as you suspect they might be, you could try another option.
You have said that your data are not natural count data, but if a Poisson model fits your data, there is not any reason to avoid using that model. I'm not sure if you've done this already, but if you look at the questions themselves, and maybe discuss this particular issue with your investigators, there might in fact be a count-type interpretation for the score.
A good option would be to use proportional odds logistic regression. That would require you to assume that your response is naturally ordered, which it sounds like it is, and that the coefficients for each way of dichotomizing your response are the same.
JoAnn
-------------------------------------------
JoAnn Alvarez
Biostatistician
Department of Biostatistics, Vanderbilt Univ School of Medicine
-------------------------------------------
Original Message:
Sent: 05-26-2014 17:00
From: Dale Smith
Subject: appropriate estimator/procedure for nonnormal response variable
Thanks for the quick response, Nayak,
The sample size is about 150. I have looked at separate logistic regressions and OLS regressions for the risk DVs, splitting the sample in some screening analyses, and (interestingly) the predictors seem to change somewhat. This may make for an interesting (and more in-depth) analysis moving forward, but for now the researchers I am working with are just hoping for a general answer to their questions about what types of variables may serve as predictors of risk. And, yes, we are doing this behavior by behavior, and are finding that the predictors likely differ by behavior as well (though with some common factors as well).
Also, the 'young professionals group' is one of the many communities in the American Statistical Association.
Thanks again,
Dale
-------------------------------------------
Dale Smith
-------------------------------------------