ASA Connect


To treat as categorical or continuous

  • 1.  To treat as categorical or continuous

    Posted 05-11-2016 10:12

    Hello All,

    I was conducting an analysis in which the main point was to determine whether an interaction was significant. My outcome was a count of how many correct answers a participant gave on a set of 7 questions (range 0 to 7, although everyone had at least 1 correct answer). All predictors except one were categorical. It was first suggested that I use a linear regression model, but upon reviewing the data it didn't appear that the assumptions were met, despite a sample size of about 500. I then went on to use Poisson regression, as I have been taught that such a model is good for count data, but the model fit statistics suggested that the model wasn't a good fit (Value/DF ~ 0.21) and that the data were in fact underdispersed.

    My questions are:

    1. What other steps should I have taken with the data in order to get a better model (if I am to keep all of the predictors in the model), especially if my main point is to determine if the interaction is significant?

    2. Should the dependent variable just have been transformed and then utilized in a linear regression model or does that lose interpretation?

    3. It seems that treating something as continuous when it is not is the default but is that always the case?

    4. What are some good sources of underdispersion, and what does that mean in a general sense?

    Thanks for any insight you all are able to provide.

    ------------------------------
    Nicole Mack
    ------------------------------


  • 2.  RE: To treat as categorical or continuous

    Posted 05-12-2016 03:41

    What do you want to find out from the data?

    Suppose that you want to create a predictive model of what factors contribute to right answers. You can use CART models or Random Forests.

    You might want to try a logistic regression. Redo the scores so that 7 out of 7 gets 100%, 0 out of 7 gets 0% and everyone else falls in between.  

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 3.  RE: To treat as categorical or continuous

    Posted 05-12-2016 04:39

    Hello Nicole

     

    My first question is whether it is appropriate to take the count of correct answers as the response variable. This assumes that there is some kind of homogeneity among the questions, so it doesn't matter which questions were answered correctly. Did all participants have exactly the same set of questions? If so, it is possible to see if some questions were more likely to be answered correctly than others – this might help to account for the fact that everyone had at least one correct answer, and might also have some relevance to the observed under-dispersion. With this set-up, it might be appropriate to consider a multivariate response – the 7-element vector of correct/incorrect scores. Maybe some form of logistic regression would be appropriate.

     

    If on the other hand different participants had different sets of questions, I think we need to know more about how the questions were selected, and how you can ensure appropriate homogeneity. Without more details it is not possible to make any useful suggestions.

     

    I hope this is helpful.

     

    Peter Kenny






  • 4.  RE: To treat as categorical or continuous

    Posted 05-12-2016 04:45

    Hi Nicole,

    The underdispersion is likely the result of an upper bound on the data (max value = 7).

    One option is to use a binomial model (e.g. 3 correct out of 7), which might be more appropriate. Another alternative is a proportional odds model (Wikipedia).
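
    The link between the upper bound and the low Value/DF can be illustrated with a small simulation. This is only a sketch under stated assumptions: scores are simulated as binomial with 7 questions and a hypothetical success probability of 0.8 (not estimated from Nicole's data), and the dispersion is the Pearson chi-square over its degrees of freedom for an intercept-only Poisson fit. For a binomial count the variance is n*p*(1-p) and the mean is n*p, so the ratio should land near 1 - p, well below the value of 1 that the Poisson assumes.

```python
import random

def pearson_dispersion(counts):
    """Pearson chi-square / df for an intercept-only Poisson fit."""
    n = len(counts)
    mean = sum(counts) / n
    chi2 = sum((y - mean) ** 2 / mean for y in counts)
    return chi2 / (n - 1)

rng = random.Random(2016)
n_items, p_correct = 7, 0.8  # hypothetical values, not taken from the thread's data

# Each simulated score is the number of correct answers out of n_items.
scores = [sum(rng.random() < p_correct for _ in range(n_items)) for _ in range(500)]

dispersion = pearson_dispersion(scores)
print(round(dispersion, 2))  # expected to be close to 1 - p_correct
```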

    I hope that helps.

    Kind regards,

    Stan

    ------------------------------
    Stanley Lazic



  • 5.  RE: To treat as categorical or continuous

    Posted 05-12-2016 06:41

    Nicole

    The assumption that a rv is Poisson entails that its mean is equal to its variance, which is a pretty strong assumption that often fails with real-world data. I would suggest replacing the Poisson with a negative binomial assumption. The negative binomial is the marginal distribution for a Poisson random variable when the rate parameter has a Gamma(alpha,beta) distribution. It's often used as a robust alternative to the Poisson.
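
    The mixture can be written out to see what it implies for the mean-variance relationship. This sketch assumes that beta is a rate parameter, which the post does not specify:

```latex
\begin{aligned}
Y \mid \lambda &\sim \mathrm{Poisson}(\lambda), \qquad \lambda \sim \mathrm{Gamma}(\alpha, \beta) \\
\mathrm{E}[Y] &= \mathrm{E}[\lambda] = \frac{\alpha}{\beta} \\
\mathrm{Var}[Y] &= \mathrm{E}[\mathrm{Var}(Y \mid \lambda)] + \mathrm{Var}(\mathrm{E}[Y \mid \lambda])
                = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^{2}} > \mathrm{E}[Y]
\end{aligned}
```

    Under this mixture the variance always exceeds the mean, so the negative binomial relaxes the Poisson assumption in the direction of overdispersion only.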

    Blaise F Egan

    ------------------------------
    Blaise Egan
    Lead Data Scientist
    British Telecommunications PLC



  • 6.  RE: To treat as categorical or continuous

    Posted 05-13-2016 07:21

    On second thoughts, Stanley Lazic is right: the underdispersion is caused by the bounded nature of the data, so neither a Poisson nor a negative binomial is appropriate. I would go for a cumulative logistic regression. That would give you P(y <= 1), P(y <= 2), ..., P(y <= 6). (No need for P(y <= 7), as it is known to be 1.)
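
    To make the cumulative probabilities concrete, here is a minimal sketch of how a proportional-odds model turns cutpoints and a linear predictor into P(y <= k) and P(y = k). It assumes the parameterization logit P(y <= k) = cutpoint_k - eta; the cutpoints and eta below are hypothetical numbers, not fitted values.

```python
import math

def cumulative_probs(cutpoints, eta):
    """P(y <= k) for each cutpoint, using logit P(y <= k) = cutpoint_k - eta."""
    return [1.0 / (1.0 + math.exp(-(c - eta))) for c in cutpoints]

def category_probs(cutpoints, eta):
    """P(y = k), obtained by differencing the cumulative probabilities."""
    cum = cumulative_probs(cutpoints, eta) + [1.0]  # the last category closes at 1
    return [cum[0]] + [cum[i] - cum[i - 1] for i in range(1, len(cum))]

# Hypothetical increasing cutpoints for scores 1..7 (six thresholds) and a
# hypothetical linear predictor value for one participant.
cutpoints = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]
eta = 0.5

cum = cumulative_probs(cutpoints, eta)   # P(y<=1), ..., P(y<=6)
probs = category_probs(cutpoints, eta)   # P(y=1), ..., P(y=7)
```

    In a fitted model the cutpoints and the coefficients behind eta would be estimated from the data; the same eta shifts all six cumulative logits, which is the proportional-odds assumption.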

    Blaise F Egan

    ------------------------------
    Blaise Egan
    Principal Research Statistician
    British Telecommunications PLC



  • 7.  RE: To treat as categorical or continuous

    Posted 05-16-2016 15:44

    Not only is the response upper-bounded at 7, it is also the within-subject sum of 7 binary (0/1) sub-responses to the seven questions. This raises the possibility that the response could be analyzed via logistic regression with a repeated-measures component, if one has access to the subjects' sub-responses to the individual questions. In SAS, Proc Genmod and Proc Glimmix come to mind.
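
    A repeated-measures logistic model needs one row per subject and question rather than one total per subject. A minimal sketch of that reshaping, assuming the item-level answers are available; the subject IDs and responses below are hypothetical:

```python
def to_long(item_responses):
    """Turn {subject_id: [0/1 per question]} into one row per subject-question pair."""
    rows = []
    for subject_id, answers in item_responses.items():
        for question, correct in enumerate(answers, start=1):
            rows.append({"subject": subject_id, "question": question, "correct": correct})
    return rows

# Hypothetical item-level data for two subjects and seven questions.
wide = {
    "s1": [1, 1, 0, 1, 1, 0, 1],
    "s2": [1, 0, 0, 1, 0, 0, 0],
}
long_rows = to_long(wide)
```

    Each row then carries a binary outcome, with the subject identifier available as the clustering or random-effect variable.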

    ------------------------------
    Eric Siegel, MS
    Research Associate
    Department of Biostatistics
    Univ. Arkansas Medical Sciences



  • 8.  RE: To treat as categorical or continuous

    Posted 05-12-2016 10:42

    Some other suggested approaches:

    Within a linear model context:

    1) With N=500 the central limit theorem should apply, so linear regression should be OK; alternatively, you can bootstrap it and avoid any distributional assumptions. There is still the problem of confidence intervals or predicted values falling outside the possible range of 0-7.

    2) Use a beta regression model. The beta distribution is bounded to (0, 1), so you would need to rescale the 0-7 range to [0.01, 0.99]. The model is typically estimated with a logit link, so for interpretation you would need to back-transform from the logit scale to (0, 1) and from there to the original range. This can be done using the betareg package in R, or PROC GLIMMIX in SAS.
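
    The rescaling and back-transformation for suggestion 2) can be sketched as follows. This assumes a simple linear squeeze of the 0-7 score into [0.01, 0.99]; other squeezing rules exist, and the function names are made up for illustration.

```python
import math

LOW, HIGH, MAX_SCORE = 0.01, 0.99, 7  # bounds suggested in the post; 7 questions

def score_to_unit(score):
    """Map a score in 0..7 linearly into [0.01, 0.99]."""
    return LOW + (HIGH - LOW) * score / MAX_SCORE

def unit_to_score(p):
    """Inverse of score_to_unit."""
    return (p - LOW) / (HIGH - LOW) * MAX_SCORE

def inv_logit(x):
    """Logistic function: logit scale -> (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit_to_score(linear_predictor):
    """Back-transform a fitted logit-scale mean to the original score scale."""
    return unit_to_score(inv_logit(linear_predictor))
```

    For example, a fitted value of 0 on the logit scale corresponds to 0.5 on the unit scale and 3.5 on the score scale.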

    Nonlinear:

    3) A nonparametric tree model, like the one in the R package party (which uses permutation tests for splitting the predictor space). No distributional assumptions, and predictions are within range. Hopefully you can obtain an interpretable tree, and maybe your interaction shows up.
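
    Returning to the bootstrap mentioned in suggestion 1), here is a minimal sketch of a percentile interval for an interaction contrast. It assumes a 2x2 design with two binary factors, resamples participants with replacement within each cell, and uses made-up scores; a real analysis with more predictors would resample rows and refit the full model each time.

```python
import random
from statistics import mean

def interaction_contrast(cells):
    """(mean[1,1] - mean[1,0]) - (mean[0,1] - mean[0,0]) for a 2x2 design."""
    m = {cell: mean(values) for cell, values in cells.items()}
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])

def bootstrap_ci(cells, n_boot=2000, level=0.95, seed=0):
    """Percentile interval, resampling with replacement within each cell."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        resampled = {cell: rng.choices(values, k=len(values))
                     for cell, values in cells.items()}
        stats.append(interaction_contrast(resampled))
    stats.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical scores (0..7) for each combination of two binary factors.
cells = {
    (0, 0): [3, 4, 4, 5, 3, 4, 5, 4],
    (0, 1): [4, 5, 5, 6, 4, 5, 6, 5],
    (1, 0): [4, 4, 5, 5, 3, 4, 5, 4],
    (1, 1): [6, 7, 7, 6, 7, 6, 7, 7],
}
estimate = interaction_contrast(cells)
lower, upper = bootstrap_ci(cells)
```

    An interval that excludes 0 would point to an additive interaction on the score scale.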

         

    ------------------------------
    Andres Azuero
    UAB



  • 9.  RE: To treat as categorical or continuous

    Posted 05-12-2016 15:21

    Nicole,

    You may also consider using a robust regression method such as Least Median of Squares (LMS) [1]. This will allow you to analyze the data without transformations and would probably give you some reasonable estimates given the nature of your outcome variable.

    The other questions you pose, regarding transformations and when to treat a variable as continuous, are ones that I have dealt with myself. As far as transformations go, I use them sparingly and only in a few circumstances. The reason is that the scale of the transformed variable often has no practical interpretation. For example, I have used natural log transformations when dealing with biomarkers because you can then back-translate the mean of the transformed variable onto its original scale of measurement [2-4]. Otherwise, my preference is to use non-parametric tests on the raw data.

    Whether the outcome should be treated as a continuous variable or otherwise depends on what the variable is intended to measure and whether it is intended to be treated as a continuous variable. For example, in autopsy studies of Alzheimer's disease cases, a very common measurement of neuronal tangles is the Braak stage, which goes from 0 to 6 (ordinal scale). Ostensibly you could treat this as a continuous variable, but a mean of 3.37 is not particularly valuable since this is not a possible value for an individual case to have. Treating this as a categorical variable is much more meaningful, as the various stages represent distinct differences in the degree of pathology that is present.
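
    The back-translation of a log-scale mean can be sketched in a few lines. The biomarker values are hypothetical; note that exponentiating the mean of the logs gives the geometric mean, not the arithmetic mean, of the original values.

```python
import math
from statistics import mean

def back_transformed_mean(values):
    """exp(mean(log x)): the geometric mean on the original scale."""
    return math.exp(mean([math.log(v) for v in values]))

# Hypothetical positive biomarker values.
biomarker = [2.0, 4.0, 8.0]
geo_mean = back_transformed_mean(biomarker)   # geometric mean
arith_mean = mean(biomarker)                  # arithmetic mean, for comparison
```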

    I hope all of this is helpful.

    Mike

    1. Rousseeuw PJ. Least median of squares regression. Journal of the American Statistical Association 1984;79:871-880.

    2. Bland JM, Altman DG. Transformations, means, and confidence intervals. British Medical Journal 1996;312:1079.

    3. Malek-Ahmadi M, Patel A, Sabbagh MN. KIF6 719Arg carrier status association with homocysteine and c-reactive protein in amnestic mild cognitive impairment and Alzheimer’s disease patients. International Journal of Alzheimer’s Disease 2013;2013:242303. doi:10.1155/2013/242303.

    4. Ravaglia G, Forti P, Maioli F, et al. Apolipoprotein E e4 allele affects risk of hyperhomocysteinemia in the elderly. American Journal of Clinical Nutrition 2006;84:1473-1480.

    ------------------------------
    Mike Malek-Ahmadi
    Banner Alzheimer's Institute



  • 10.  RE: To treat as categorical or continuous

    Posted 05-12-2016 16:44

    Would treating it as negative binomial work?

    ------------------------------
    Gabriel Farkas



  • 11.  RE: To treat as categorical or continuous

    Posted 05-13-2016 10:50

    Nicole,

    There are so many different ways to approach your problem:

    1. Stata has a command called mlogit (multinomial logit regression), which is an option.

    2. SPSS has a procedure for multinomial logistic regression.

    3. R version 2.13.1: MLR; try WWW.unt.edu/class/jon/benchmarks/MLR_JDS_Aug2011.pdf, which shows R code for MLR (by Drs. John Starkweather and Amanda Moske).

    4. SAS- Proc MLOGIT: try SAS Data Analysis Examples: Multinomial Logistic Regression.

    5. Read the article by Ying So and Warren Kuhfeld on Multinomial logit model with SAS codes (www.Support.SAS.com/techsup/technote/mr2010g.pdf)

    6. Also might try SAS's Proc GEE, Proc GENMOD.

    7. SAS's Proc CATMOD (hopefully all the errors in the original introduction have been corrected).

    There may be other programs that can do what you need. With your sample size (500) being "large", the normal-theory-based regression techniques (with or without transformations) may not be all that inappropriate.

    Hope this helps.

    Dr. Ajit Thakur

    Retired Statistician

    ------------------------------
    Ajit Thakur
    Associate Director



  • 12.  RE: To treat as categorical or continuous

    Posted 05-16-2016 09:13

    The decision about a model should be made on the basis of what you want to measure rather than the distribution of the data. Except for very small samples or very strange data, the central limit theorem protects the validity of your inference. You could use linear regression even with binary data if you were interested in additive changes. This matters especially when you are testing for interactions: there are situations where there is no interaction in a linear model but there would be an interaction in a log-linear model. No interaction in the additive model means that the additive difference in response is the same across levels of the other factor, while no interaction in the log-linear model means that the proportional difference is the same.
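
    A small numeric example may help show how the presence of an interaction depends on the scale. The cell means below are hypothetical: factor B doubles the response at both levels of factor A, so there is no interaction on the log (proportional) scale, yet the additive differences are unequal.

```python
import math

# Hypothetical cell means for two binary factors A and B, keyed as (A, B).
means = {(0, 0): 10.0, (0, 1): 20.0, (1, 0): 30.0, (1, 1): 60.0}

def additive_interaction(m):
    """Difference of differences on the scale of m."""
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])

def log_scale_interaction(m):
    """Difference of differences after taking logs (0 when the ratios are equal)."""
    logs = {cell: math.log(value) for cell, value in m.items()}
    return additive_interaction(logs)

additive = additive_interaction(means)          # (60 - 30) - (20 - 10)
multiplicative = log_scale_interaction(means)   # log(60/30) - log(20/10)
```

    Here the additive contrast is 20 while the log-scale contrast is zero up to rounding, so the same data show an interaction in the linear model and none in the log-linear model.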

    ------------------------------
    David Schoenfeld
    Professor of Medicine
    Massachusetts General Hospital