appropriate estimator/procedure for nonnormal response variable

  • 1.  appropriate estimator/procedure for nonnormal response variable

    Posted 05-26-2014 16:28
    I am bringing this from the young professionals group in hopes of a few additional thoughts.
    I'm examining a variety of predictors of several health-risk behaviors (using a composite score (scaled 5-35) on a questionnaire for risk made up of ordinal-scale questions). My initial plan was to use OLS multivariate regression, since the health DVs are correlated, but the fact that about 15% of my sample is simply not engaging in such behaviors has led to violations of multivariate (& obviously univariate) normality. The typical transformations do not adequately correct this, and fitting skew-normal or skew-t distributions appears to improve the fit only marginally, likely due to the two clearly separate distributions.
    Since the data are not count-based, ZIP and ZINB are not likely appropriate, and finite mixture models (fmm) seem to fail to converge in STATA, possibly due to the 'distribution' of 'zeros' being non-normal, though the distribution of those who do engage in the behaviors does appear normal. I'm not aware of a command that allows for two separate distribution forms for fmm in STATA. Additionally, the primary research question is whether several predictor variables can be used to predict presence and magnitude of risk (ideally together).
    Would robust regression (rreg in Stata) or quantile regression with bootstrapping be inappropriate for such a distribution?
    Any suggestions other than discretizing? All thoughts are welcome and appreciated.

    Dale

    -------------------------------------------
    Dale Smith
    -------------------------------------------


  • 2.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-26-2014 16:48
    Hello Dale,

    What is your sample size? The normality assumption needed if you are getting p-values from OLS is that the coefficient estimates are normally distributed (or close enough). If you have a very large sample size, then your inference from p-values is probably fine. Whatever the sample size, you may wish to fit a two-stage model: first fit a logistic regression model to the dichotomous indicator (dep. var.) for presence/absence of a behavior, and then fit a linear regression for the score (dep. var.) among those with a non-zero score (i.e., the behavior occurred). Our team does these kinds of analyses, but I do not do the computing on this, so I do not know specifically how you get the p-value, but someone will.

    By the way, it sounds like you are doing this analysis behavior by behavior, right? I.e., you might have different models for different behaviors. 

    What is the "young professionals group"? Sounds interesting. 

    Nayak



    -------------------------------------------
    Nayak Polissar
    Principal Statistician
    The Mountain-Whisper-Light Statistics
    -------------------------------------------
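
    The two-stage idea above (a logistic model for whether the behavior occurs, then a linear model for the score among those who engage) can be sketched in plain Python. This is an illustrative sketch only, not the analysis described by any poster: the data are simulated, the single predictor `x`, the floor score of 5 for non-engagers, and all coefficient values are assumptions, and the helper names `fit_logistic`, `fit_ols`, and `expected_score` are hypothetical. In Stata the two stages would more likely be fit with `logit` and `regress`.

```python
import math
import random


def fit_logistic(x, z, iterations=25):
    """Newton-Raphson fit of P(z = 1) = 1 / (1 + exp(-(a + b * x)))."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi in zip(x, z):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += zi - p            # gradient w.r.t. intercept
            g1 += (zi - p) * xi     # gradient w.r.t. slope
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def fit_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def expected_score(x0, logit_coef, ols_coef, floor=5.0):
    """Combine both stages: P(engage) * E[score | engage] + P(not) * floor."""
    a, b = logit_coef
    c, d = ols_coef
    p = 1.0 / (1.0 + math.exp(-(a + b * x0)))
    return (1.0 - p) * floor + p * (c + d * x0)


# Simulated data (assumed values): n = 150, roughly 15-20% non-engagers at 5.
rng = random.Random(150)
x, engaged, score = [], [], []
for _ in range(150):
    xi = rng.gauss(0.0, 1.0)
    p_engage = 1.0 / (1.0 + math.exp(-(1.7 + 1.0 * xi)))
    zi = 1 if rng.random() < p_engage else 0
    if zi:
        yi = min(35.0, max(6.0, 20.0 + 3.0 * xi + rng.gauss(0.0, 4.0)))
    else:
        yi = 5.0
    x.append(xi)
    engaged.append(zi)
    score.append(yi)

# Stage 1: presence/absence on the full sample.
logit_coef = fit_logistic(x, engaged)
# Stage 2: magnitude, using only those who engage.
x_pos = [xi for xi, zi in zip(x, engaged) if zi]
y_pos = [yi for yi, zi in zip(score, engaged) if zi]
ols_coef = fit_ols(x_pos, y_pos)

print("logit (intercept, slope):", logit_coef)
print("OLS   (intercept, slope):", ols_coef)
print("expected score at x = 0: ", expected_score(0.0, logit_coef, ols_coef))
```

    Keeping the two sets of coefficients separate is what lets a predictor matter for occurrence but not for intensity, or the reverse.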








  • 3.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-26-2014 17:01
    Thanks for the quick response, Nayak,
    The sample size is about 150. I have looked at separate logistic regressions and OLS regressions for the risk DVs, splitting the sample in some screening analyses, and (interestingly) the predictors seem to change somewhat. This may make for an interesting (and more in-depth) analysis moving forward, but for now the researchers I am working with are just hoping for a general answer to their questions about what types of variables may serve as predictors of risk. And, yes, we are doing this behavior by behavior, and are finding that the predictors likely differ by behavior as well (though with some common factors as well).

    Also- the 'young professionals group' is one of the many communities in the American Statistical Association.
    Thanks again,
    Dale

    -------------------------------------------
    Dale Smith
    -------------------------------------------








  • 4.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-26-2014 17:06


    -------------------------------------------
    Laurel Beckett
    -------------------------------------------
    Have you thought about a zero-inflated Poisson or other zero-inflated model? That would allow you to consider predictors of complete absence of risky behaviors separately from predictors of how many risky behaviors, or how intense they are, among those who do have some risky behavior. 
    Laurel







  • 5.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-26-2014 17:16
    Thanks Laurel.
    Yes, I think this would certainly be ideal. Unfortunately count data were not collected, and zero-inflated Gaussian models don't seem nearly as common in the literature (I could not find an analogue or application among Stata users).

    -------------------------------------------
    Dale Smith
    -------------------------------------------





  • 6.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-26-2014 17:42
    Another possibility is to add binary dummy variables for the zero/"no-bad-behaviors" folks vs not, and add in the interaction terms
    by multiplication with the score variables.

    <http://www.kenbenoit.net/courses/quant1/Quant1_Week10_interactions.pdf> shows a simple example.

    This would essentially give you a multivariate ANCOVA-like model.  It doesn't necessarily solve all the normality-assumption
    problems, but it keeps everything in one analysis, and has the advantage of being relatively straightforward to understand
    and explain.

    -------------------------------------------
    Katherine Godfrey
    -------------------------------------------








  • 7.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-26-2014 20:00
    Thank you Katherine,
    I had not considered adding binary risk variable interaction terms. It certainly sounds worth looking into.

    -------------------------------------------
    Dale Smith
    -------------------------------------------





  • 8.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-26-2014 17:22


    -------------------------------------------
    Nayak Polissar
    Principal Statistician
    The Mountain-Whisper-Light Statistics
    -------------------------------------------
    Hi Dale,

    If n = 150, and 15% or fewer of the people are without a specific behavior, and the investigators are "just hoping for a general answer to their questions about what types of variables may serve as predictors of risk", I would go ahead with the OLS and tell the investigators that this is a preliminary (but decent) look and more definitive modeling would need more time. If a predictor influences occurrence and/or intensity of a behavior, then the OLS should give a positive coefficient in a univariate analysis. When you go multivariate, a host of other issues appear.

    Also, it is fine if the predictors differ between the log reg and the linear reg models in the 2-stage method (zero-inflated data.) That may be the reality of the phenomenon. 

    And, lurking in the background here is the multiple testing issue. If you have lots of predictors and/or lots of behaviors, you may have lots of false positives (type I error.) 

    Good luck!

    Nayak
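
    One common way to handle the multiple testing issue raised above is Holm's step-down adjustment, which controls the familywise error rate and is never less powerful than Bonferroni. The thread does not name a specific correction, so the choice of Holm here is an assumption, the p-values are made up for illustration, and `holm_adjust` is a hypothetical helper name.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for four predictors of one behavior.
raw = [0.001, 0.04, 0.03, 0.20]
adjusted = holm_adjust(raw)
for p, q in zip(raw, adjusted):
    print(f"raw p = {p:.3f}   Holm-adjusted p = {q:.3f}")
```

    With several behaviors and many predictors, the adjustment would be applied across whatever family of tests the investigators regard as one question.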









  • 9.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-26-2014 17:34
    Thanks again for the helpful response, Nayak.
    Also, I completely agree on your points concerning different predictors for log and continuous components (I think it makes the topic particularly interesting) and familywise error concerns (which is why I would like to stick with one general analysis, rather than running pieces separately).

    -------------------------------------------
    Dale Smith
    -------------------------------------------





  • 10.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-27-2014 00:15
    Hi Dale,

    I actually disagree with Nayak's response on a couple of points:

    "What is your sample size? The normality assumption needed if you are getting p-values from OLS is that the coefficient estimates are normally distributed (or, close enough.). If you have a very large sample size, then your inference from p-values is probably fine. Whatever the sample size, you may wish to fit a two-stage model: fit a logistic regression model first to the dichotomous indicator (dep. var.) for presence/absence of a behavior and then linear regression for the score (dep. var) among those with a a non-zero-score (i.e., the behavior occurred.)"

    While I often start by looking at the distribution of the response variables in deciding on a model, the assumptions for OLS are not about the response variable itself, but about the residuals. (Nayak is saying that the assumption you are relying on is that the coefficient estimates are normally distributed. In fact, the assumption that must be satisfied is that the residuals are independent and normally distributed, and as a *result,* the coefficient estimators are normally distributed.) It is also not about sample size here.

    I would fit a carefully-selected OLS model with the predictors you are interested in and look only at the distribution of the residuals. If they are ok, you could proceed with the inference, but if they are skewed as you suspect they might be, you could try another option.

    You have said that your data are not natural count data, but if a Poisson model fits your data, there is not any reason to avoid using that model. I'm not sure if you've done this already, but if you look at the questions themselves, and maybe discuss this particular issue with your investigators, there might in fact be a count-type interpretation for the score.

    A good option would be to use proportional odds logistic regression. That would require you to assume that your response is naturally ordered, which it sounds like it is, and that the coefficients for each way of dichotomizing your response are the same.

    JoAnn



    -------------------------------------------
    JoAnn Alvarez
    Biostatistician
    Department of Biostatistics, Vanderbilt Univ School of Medicine
    -------------------------------------------








  • 11.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-27-2014 02:18
    Well, this is probably the best thing for Dale and the young professionals--to see a discussion among other statisticians. So, I do appreciate the opportunity to respond to JoAnn, who took the trouble to help Dale out, which is the goal for all of us. (Why else would I be doing this at 10:30 at night! But, JoAnn beat me there. Her post is timed at a heroic 12:15 am.)

    I am going to ignore issues of model building, etc., and just focus on the issue of normality that was raised. It is not necessary to have normally distributed residuals to make proper inferences about the regression coefficient. Before quoting the authorities, consider this simple example. I think that everyone here recognizes that if you calculate a mean, the mean will be asymptotically normally distributed. It is just a matter of sample size until the normality kicks in adequately. (Let's assume that the underlying population distribution of the statistic of interest has a finite variance.)

    The mean is a regression coefficient. It is the intercept (the coefficient of "1", the value of a variable that is constant for all observations.) If you have a highly skewed distribution, even a horrible distribution--with finite variance, the intercept (estimating the mean of the population) will approach a normal distribution at some point and you can do proper inference about the population mean from this regression intercept coefficient and its standard error. But, if you subtract the sample mean (i.e., the regression intercept) from all of the observations, the residuals will still be skewed or horrible. They will look horrible and non-normal no matter how large the sample size is, yet you are making an appropriate inference about the mean--i.e., that simple intercept regression coefficient.

    There is a wonderful article by Lumley, Diehr, Emerson and Chen that is on this very topic. Here is the reference: 
    Thomas Lumley, Paula Diehr, Scott Emerson, and Lu Chen. THE IMPORTANCE OF THE NORMALITY ASSUMPTION IN LARGE PUBLIC HEALTH DATA SETS. Annual Review of Public Health, Vol. 23: 151-169 (Volume publication date May 2002.)  

    Here is a quote from that article:
    "Normality is not required to fit a linear regression; but Normality of the coefficient estimates [beta-hat] is needed to compute confidence intervals and perform tests. As [beta-hat] is a weighted sum of Y (see Appendix 1), the Central Limit Theorem guarantees that it will be normally distributed if the sample size is large enough, and so tests and confidence intervals can be based on the associated t-statistic." I had to type in "beta-hat" because the Greek and its hat did not cut and paste well.

    In any case, Lumley et al are pointing out that the regression coefficients are under the shelter of the Central Limit Theorem, given enough sample size. Because Dale's data are quite bounded (scores of 5-35 if the behavior is present) and the non-behaviors are about 15%, and the sample size is 150, I do not see any problem with the exploratory analysis that I mentioned (OLS) simply to identify the types of variables that are associated with these outcomes. Others in this thread have made other suggestions for modeling, and those should be considered, too, aside from the issue of testing coefficients. 

    If Dale and his investigators decide to go forth and fully model this thing, then the other choices (Tobit, Ancova-like model, etc.) may be helpful.

    In journals, there usually seems to be a presentation, a comment and then a rejoinder. So, since I made my presentation, JoAnn made a comment, and I have provided a rejoinder, I am probably going to stop here, but others may wish to comment. It would be fun.

    Best wishes,

    Nayak









    -------------------------------------------
    Nayak Polissar
    Principal Statistician
    The Mountain-Whisper-Light Statistics
    -------------------------------------------
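
    The intercept-only example above can be checked with a small simulation. This is a sketch under stated assumptions: the population is exponential with mean 1 (chosen only because it is strongly skewed with finite variance), n = 150 matches the sample size mentioned in the thread, and the number of replications is arbitrary. It compares the skewness of the residuals within each sample to the skewness of the estimated intercept (the sample mean) across samples.

```python
import random
import statistics


def skewness(values):
    """Moment-based sample skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(2014)
n, replications = 150, 2000

sample_means = []      # the fitted intercept from each simulated data set
residual_skews = []    # skewness of the residuals within each data set
for _ in range(replications):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    mean = sum(sample) / n  # OLS estimate in an intercept-only model
    sample_means.append(mean)
    residual_skews.append(skewness([v - mean for v in sample]))

avg_residual_skew = statistics.mean(residual_skews)
intercept_skew = skewness(sample_means)

print("average skewness of residuals:      ", round(avg_residual_skew, 2))
print("skewness of the intercept estimates:", round(intercept_skew, 2))
```

    The residuals stay strongly skewed in every sample, while the distribution of the intercept estimates across samples is much closer to symmetric, which is the point being made about the Central Limit Theorem.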








  • 12.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-27-2014 09:07


    -------------------------------------------
    Nagaraj Neerchal
    Professor and Chair
    UMBC
    -------------------------------------------


    One more point that might be relevant in this context:

    While Normal-theory based inference regarding the model parameters is valid in large samples regardless of the normality assumption, the construction of prediction (or forecast) intervals would still need distributional assumptions.

    Nagaraj
    Professor and Chair
    Math and Stat, UMBC
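
    The point about prediction intervals can be illustrated by simulation. This is a sketch under assumptions not taken from the thread: the outcome is exponential with mean 1, the model is intercept-only, and an 80% level is used purely for illustration. A normal quantile stands in for the t quantile, which makes little difference at n = 150. The normal-theory interval is compared with an interval built from the empirical 10th and 90th percentiles of the training sample.

```python
import random
import statistics

rng = random.Random(12)
n_train, n_new, replications = 150, 200, 200
z = statistics.NormalDist().inv_cdf(0.90)  # quantile for a two-sided 80% interval

normal_hits = 0
quantile_hits = 0
for _ in range(replications):
    train = [rng.expovariate(1.0) for _ in range(n_train)]
    mean = statistics.mean(train)
    sd = statistics.stdev(train)
    # Normal-theory prediction interval for one new observation.
    half_width = z * sd * (1.0 + 1.0 / n_train) ** 0.5
    # Distribution-free alternative: empirical 10th and 90th percentiles.
    deciles = statistics.quantiles(train, n=10)
    q_low, q_high = deciles[0], deciles[-1]
    for _ in range(n_new):
        y_new = rng.expovariate(1.0)
        normal_hits += (mean - half_width) <= y_new <= (mean + half_width)
        quantile_hits += q_low <= y_new <= q_high

total = replications * n_new
normal_coverage = normal_hits / total
quantile_coverage = quantile_hits / total

print("nominal coverage:                  0.80")
print("normal-theory interval coverage:  ", round(normal_coverage, 3))
print("empirical-quantile coverage:      ", round(quantile_coverage, 3))
```

    In this setup the normal-theory interval misses its nominal level even though inference about the mean itself would be fine at the same sample size.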




  • 13.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-29-2014 10:40
    Thanks, Nayak. Please see my responses in-line.

    -------------------------------------------
    JoAnn Alvarez
    Biostatistician
    Department of Biostatistics, Vanderbilt Univ School of Medicine
    -------------------------------------------



  • 14.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-26-2014 23:51
    Have you tried tobit regression? It fits a truncated normal distribution and a point mass at zero (or in your application, a point mass at 5).

    -------------------------------------------
    Stephen Simon
    Independent Statistical Consultant
    P. Mean Consulting
    -------------------------------------------
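
    For readers unfamiliar with Tobit, the following sketch shows the censored-normal likelihood it maximizes, in the simplest intercept-only case with a lower limit of 5. Everything here is an assumption for illustration: the data are simulated from a latent normal with mean 13 and standard deviation 8 (which puts roughly 15% of values at the limit), n = 500 is used for stability rather than the thread's 150, and a crude grid search replaces the numerical optimizer a real package would use. In Stata the corresponding command is `tobit` with the `ll()` option.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def tobit_loglik(mu, sigma, n_cens, n_obs, sum_y, sum_y2, limit):
    """Censored-normal log-likelihood (up to a constant), intercept only."""
    # Censored cases contribute P(latent <= limit).
    cens_part = n_cens * math.log(STD_NORMAL.cdf((limit - mu) / sigma))
    # Uncensored cases contribute the usual normal density.
    ss = sum_y2 - 2.0 * mu * sum_y + n_obs * mu * mu
    obs_part = -n_obs * math.log(sigma) - ss / (2.0 * sigma ** 2)
    return cens_part + obs_part


# Simulated scores: latent values below 5 are recorded as exactly 5.
rng = random.Random(14)
limit = 5.0
latent = [rng.gauss(13.0, 8.0) for _ in range(500)]
observed = [max(limit, v) for v in latent]

uncensored = [v for v in observed if v > limit]
n_cens = len(observed) - len(uncensored)
n_obs = len(uncensored)
sum_y = sum(uncensored)
sum_y2 = sum(v * v for v in uncensored)

# Crude maximum likelihood by grid search over (mu, sigma).
best = (-math.inf, None, None)
for i in range(161):
    mu = 9.0 + 0.05 * i
    for j in range(121):
        sigma = 5.0 + 0.05 * j
        ll = tobit_loglik(mu, sigma, n_cens, n_obs, sum_y, sum_y2, limit)
        if ll > best[0]:
            best = (ll, mu, sigma)
_, mu_hat, sigma_hat = best

naive_mean = sum(observed) / len(observed)
print("share at the limit:       ", round(n_cens / len(observed), 3))
print("naive mean of scores:     ", round(naive_mean, 2))
print("Tobit estimate of mu:     ", round(mu_hat, 2))
print("Tobit estimate of sigma:  ", round(sigma_hat, 2))
```

    Note that Tobit ties the probability of sitting at the limit to the same latent mean and variance as the continuous part, unlike a two-part model, which estimates them separately.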








  • 15.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-27-2014 14:34
    Thank you, everyone, for the very helpful and considerate responses. Obtaining the help and opinions of such experts in the field certainly makes these boards invaluable.
    A few notes to respond to some points made.
    -Residual vs. fitted and residual vs. predictor plots do not indicate particularly problematic trends, though standardized residuals are not normal to the extent that they 'pass' Kolmogorov-Smirnov or Shapiro-Wilk tests at this sample size.
    -Risk measure distributions show a peak at 'zero' followed by an approximately normal distribution, though I will certainly examine the effectiveness of the Poisson model (or ZIP models) to fit to the data as we move forward with model building.
    -I had not considered the Tobit model, as the dataset contained no intuitive censoring/truncating, though the distribution and residual plots do appear to (somewhat) approximate what I have seen in some prior applications of Tobit. Perhaps this is worth a look as well.
    Thanks again,
    Dale


    -------------------------------------------
    Dale Smith
    -------------------------------------------








  • 16.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-27-2014 15:19
    Dear Dale,

    Since your outcome seems to be a mixture of two distributions, I would prefer to explore either a zero-inflated normal or a zero-hurdle normal model. There is no direct command for these regressions in STATA or SAS, but you can write a program using the "ml" command in STATA or the proc nlmixed command in SAS. You may also try log gamma regression. If this does not work, then you may use a semi-parametric approach such as quantile/logistic quantile regression with bootstrap standard errors. These approaches will allow you to appropriately predict the outcome as well as draw inferences. Robust regression can also be used if your goal is to draw inference only. Another way is to categorize the outcome into three or more meaningful categories and use multinomial logit.

    Thanks
    Alok

    -------------------------------------------
    Alok Dwivedi
    Assistant Professor
    Texas Tech University Health Sciences Center
    -------------------------------------------
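
    The bootstrap standard error mentioned above can be sketched for the simplest case, the sample median, which is what a median (0.5 quantile) regression reduces to with no predictors. This is only an illustration: the scores are simulated with about 15% at an assumed floor of 5 and the rest roughly normal, the number of bootstrap draws is arbitrary, and `bootstrap_se` is a hypothetical helper name. A full quantile regression would resample whole rows (outcome and predictors together) and refit the model on each resample.

```python
import random
import statistics


def bootstrap_se(data, statistic, rng, draws=2000):
    """Nonparametric bootstrap standard error of a statistic."""
    estimates = []
    for _ in range(draws):
        # Resample the observations with replacement, same size as the data.
        resample = rng.choices(data, k=len(data))
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)


rng = random.Random(16)
# Simulated scores on a 5-35 scale: a spike at 5 plus a roughly normal bulk.
scores = [
    5.0 if rng.random() < 0.15 else min(35.0, max(6.0, rng.gauss(20.0, 5.0)))
    for _ in range(150)
]

median_hat = statistics.median(scores)
median_se = bootstrap_se(scores, statistics.median, rng)

print("sample median:          ", round(median_hat, 2))
print("bootstrap standard error:", round(median_se, 2))
```

    Because the median is largely unaffected by the spike at the floor, this kind of summary can be easier to defend than a mean when the two-component shape is pronounced.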








  • 17.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-27-2014 15:38
    SAS now has a procedure called PROC FMM which can fit hurdle models.
    nagaraj

    -------------------------------------------
    Nagaraj Neerchal
    Professor and Chair
    UMBC
    -------------------------------------------


  • 18.  RE: appropriate estimator/procedure for nonnormal response variable

    Posted 05-27-2014 15:55
    Thanks Dr. Neerchal.  

    -------------------------------------------
    Alok Dwivedi
    Assistant Professor
    Texas Tech University Health Sciences Center
    -------------------------------------------