How to categorize score?

  • 1.  How to categorize score?

    Posted 06-30-2012 02:59
    Dear All,

    I have a questionnaire for which a total score >25 means the person is considered to be in "total control", a score between 20-24 means "well controlled", and a score of less than 20 means "out of control".

    In my sample of 566 individuals, 404 are "out of control", 162 are "well controlled", and no one is in the "total control" category.

    Would it make more sense to include the score in the regression model as a binary variable ("out of control" versus "well controlled"), or as a categorical variable with three categories?

    I look forward to your suggestions and comments.

    Best Regards,
    Tasneem

    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------


  • 2.  RE:How to categorize score?

    Posted 06-30-2012 08:25
    I would avoid categorizing the score at all; it is rarely a good idea to categorize continuous or nearly continuous variables. It throws away information and invokes "magical thinking": that is, it says something amazing happens at the cutoff point. In your example, it treats 19 as radically different from 20, but treats 19 the same as the lowest score.

    If you must categorize, you can't use a 3 level model if one level has no people in it.

    -------------------------------------------
    Peter Flom
    -------------------------------------------








  • 3.  RE:How to categorize score?

    Posted 06-30-2012 12:54
    I completely agree with Peter.  Quantitative data are often richer than categorical data.  In addition, for scale generation, it is important that the distances between selections are clear.  To elaborate on Peter's example:

    In the proposed scheme, if the "out of control" group is coded as 0 and the "well controlled" group coded as 1, then:

    - A numerical score of 20 is just one scale unit away from 19 AND is just one scale unit away from 1. 

    If you use the numerical score as is, then:

    - 20 is just one unit from 19 and is a full 19 units away from 1.

    This seems to be more likely since a person with a 20 score is probably not much different from a person with a score of 19 (even though they are in different groups), but is probably completely different than a person with a score of 1.
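
    The comparison above can be sketched in a few lines. This is only an illustration: the function names are made up for the example, and the cutoff of 20 is the one described in the thread.

```python
# Hypothetical illustration: compare distances between scores before and
# after dichotomizing at the cutoff of 20 described in the thread.

def dichotomize(score, cutoff=20):
    """Return 1 ("well controlled") at or above the cutoff, else 0 ("out of control")."""
    return 1 if score >= cutoff else 0

def distances(a, b):
    """Return (raw distance, coded distance) between two scores."""
    return abs(a - b), abs(dichotomize(a) - dichotomize(b))

for a, b in [(20, 19), (20, 1), (19, 1)]:
    raw, coded = distances(a, b)
    print(f"scores {a} and {b}: raw distance = {raw}, coded distance = {coded}")
```

    On the coded scale, 20 is equally far from 19 and from 1, while 19 and 1 become indistinguishable; the raw score keeps all three distances apart.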

    As another example, consider the typical grading scale where 60-69 is a D, 70-79 is a C, 80-89 is a B, and 90-100 is an A.  A student who scores an 89 will receive a B while a student who scores a 90 will receive an A.  If a bar chart were created to display the grade distribution, the 89 student would get lumped into the same group as the student who scores an 80.  But it is most likely the case that the 89 student knows about as much as the student who scores the 90.  I would be leery of imposing a "B-student" intervention on the 89 student on the assumption that his needs are the same as those of the other B students.

    On a different note - I am really curious, did the project expect to observe any "totally controlled" people?  If so, what are the effects of not observing these people on your regression?  Should there be a penalty imposed on your regression to account for the fact that these people and their characteristics are unanalyzed?

    I'd be interested to hear comments from others in this eGroup.   



    -------------------------------------------
    Raymond Mooring
    Senior Statistical Consultant
    Analysis Made Easy
    -------------------------------------------








  • 4.  RE:How to categorize score?

    Posted 06-30-2012 13:44
    Hello all,

    I agree that if the purpose is to find or measure an association of something with this control spectrum variable, then the continuous score is the way to go. This sounds like research, so that is probably what is needed.

    However, in some settings there are consequences of getting above a certain cutpoint. In that case you may wish to see what the probability of getting that consequence is in relation to some other variable. For example, in our state (Washington) there are enhanced penalties for an arrested driver whose breath alcohol is determined to be above a concentration of 0.15**. So, the question is: what is the probability of getting the enhanced penalty in relation to some other variable, X? Logistic regression would be a good way to go for that analysis (with the usual careful diagnostics that go with model-fitting).

    Even if the logistic regression is done, it would be good to analyze the data using the dependent variable in its continuous form for a deeper understanding of the phenomenon.

    Best wishes,

    Nayak

    **The units of breath alcohol measurement are grams/210 liters.   

    -------------------------------------------
    Nayak Polissar
    Principal Statistician
    The Mountain-Whisper-Light Statistics
    -------------------------------------------








  • 5.  RE:How to categorize score?

    Posted 06-30-2012 13:55
    This does sound like one of the psychological scales that I often see.  While the cutoffs seem somewhat arbitrary, the psychological literature deals with the diagnosis rather than with the score as a continuous outcome.  And since most of these scales are non-linear, I usually end up categorizing them to look at the relationship without putting an a priori model on it.  A smoother is another approach that helps if you are examining the interpretation of the scale (I usually use a kernel smoother with some adjustment of the bandwidth), but there are LOTS of choices for smoothers.
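
    As a rough sketch of the smoother idea, the snippet below uses a Nadaraya-Watson estimator with a Gaussian kernel, which is only one of the many possible choices. The function name, the data, and the two bandwidths are all invented for illustration; they are not taken from any real scale.

```python
import math

def kernel_smooth(x, y, grid, bandwidth):
    """Nadaraya-Watson estimate of E[y | x] at each grid point, Gaussian kernel."""
    fitted = []
    for g in grid:
        # Weight each observation by its kernel distance from the grid point
        weights = [math.exp(-0.5 * ((xi - g) / bandwidth) ** 2) for xi in x]
        total = sum(weights)
        fitted.append(sum(w * yi for w, yi in zip(weights, y)) / total)
    return fitted

# Hypothetical data: scale scores and an outcome with a threshold-like jump near 20
scores = [5, 8, 11, 13, 15, 17, 18, 19, 20, 21, 22, 23, 24]
outcome = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 1.2, 1.6, 2.8, 3.0, 3.1, 2.9, 3.0]

grid = list(range(5, 25))
for bw in (1.0, 3.0):  # a narrow and a wide bandwidth
    fit = kernel_smooth(scores, outcome, grid, bw)
    print(f"bandwidth {bw}:", [round(v, 2) for v in fit])
```

    A narrow bandwidth follows the jump closely and a wide one blurs it, which is why some adjustment of the bandwidth is usually needed before reading anything into the shape.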

    So it does depend on what your goal is as to whether to use the categories or whether to explore the ordinal relationship. 

    Going along with the assumption that the scale is diagnostic of pathology, the question about what you expect is very important to think about; it is not unusual to find no extreme pathology in a "normal" group; conversely, if you expect to find a lot of pathology, your results are an indication of potential errors in coding (reverse-coded questions are always tricky, especially if mixed in with normally coded questions).  I don't think we ever get a data set from the various clinicians that we deal with that doesn't have range errors, coding errors, data entry errors, etc.  So it is always good to think about whether the data make sense.

    Ray

    -------------------------------------------
    Raymond Hoffmann
    Professor
    Medical College of Wisconsin
    -------------------------------------------








  • 6.  RE:How to categorize score?

    Posted 06-30-2012 14:42
    Yes, diagnosis is often dichotomous. To some extent, this is legitimate. But only a limited extent. It is legitimate only to the extent that treatment is dichotomous. But often, it is not.

    One situation where this happens a lot is predicting low birth weight of infants, with the usual cutoff being 2.5 kg. But does this make any sense, even diagnostically? Should a baby that weighs 2.49 kg be treated the same as one who weighs 1.5 kg? Probably not. There's a range of treatment. Some are dichotomous (e.g. NICU or not NICU?) but others are not - e.g. degree of "watch" that is placed on the infant.  And there are often other symptoms as well (e.g. APGAR score).

    Closer to the current case, some scales (e.g. the Beck Depression inventory) are categorized. Yet, does this make sense? Is treatment dichotomous? No. Therapy can be more or less frequent. Dosages of drugs can be higher or lower. The way the problem is explained to the patient can vary. And, again, there are other symptoms.

    Re "In control" - again, neither the condition nor the treatment is dichotomous.

    I think I will do a blog post on this soon.

    -------------------------------------------
    Peter Flom
    -------------------------------------------








  • 7.  RE:How to categorize score?

    Posted 06-30-2012 22:40
    Tasneem,
    The practical answer to your question is, if you try it both ways, you should get the same answer.  If your variable is formatted to be categorical with three categories, but one of your categories has no observations in it, your statistical program will treat the variable as being binary. 

    (Try it both ways and see what happens.)   
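
    A small sketch of why the two codings coincide here. The scores are hypothetical, the helper names are made up, and treating a score of exactly 25 as "total control" is an assumption, since the original post leaves that score unassigned. How a given package reacts to the resulting empty column (dropping it, warning, or reporting a non-estimable coefficient) depends on the software.

```python
from collections import Counter

LEVELS = ["out of control", "well controlled", "total control"]

def categorize(score):
    # Cutoffs follow the original post; treating exactly 25 as "total control"
    # is an assumption, because the post leaves that score unassigned.
    if score >= 25:
        return "total control"
    if score >= 20:
        return "well controlled"
    return "out of control"

def dummy_code(labels, levels=LEVELS, reference="out of control"):
    """One 0/1 indicator column per non-reference level."""
    return {lvl: [1 if lab == lvl else 0 for lab in labels]
            for lvl in levels if lvl != reference}

# Hypothetical scores, none of which reaches 25 (as in the sample described)
scores = [12, 18, 19, 20, 22, 24, 15, 21, 9, 23]
labels = [categorize(s) for s in scores]
columns = dummy_code(labels)

print(Counter(labels))
for level, col in columns.items():
    note = "<- constant, carries no information" if len(set(col)) == 1 else ""
    print(level, col, note)
```

    The "total control" indicator is zero for everyone, so the only column that can enter the model is the "well controlled" one, which is exactly the binary coding.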

    -------------------------------------------
    Eric Siegel
    Biostatistician
    Univ of Arkansas for Medical Sciences
    -------------------------------------------








  • 8.  RE:How to categorize score?

    Posted 07-02-2012 10:30
    I believe the consensus is not to categorize (dichotomize) a continuous scale.  Let me give a list of reasons not to dichotomize.

    1. Power - you need to enroll more patients into your trial.
    2. We throw away interval level information (hence means) and ordinal level information (hence medians).
    3. The statistical approaches often assume large Ns.
    4. The statistical approaches limit the type of analyses.

    Power
    At BEST (if you dichotomize at the median) you throw away information.  At BEST, you would need to increase the study's N by 60%.  If the dichotomy was at 90:10, you would need to increase N by a factor of four.
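
    One common way to arrive at figures like these is a back-of-the-envelope calculation under bivariate normality, where dichotomizing attenuates a correlation by phi(z) / sqrt(p(1 - p)) and the required N grows roughly as the inverse square of that factor. The sketch below is under those assumptions only; the function name is made up, and other models or criteria will give different numbers.

```python
from statistics import NormalDist

def sample_size_inflation(p):
    """Approximate factor by which N must grow after dichotomizing a normal
    variable so that a proportion p falls above the cutpoint.

    Assumes bivariate normality: the correlation is attenuated by
    phi(z) / sqrt(p * (1 - p)), and required N scales roughly with 1 / r**2.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - p)  # cutpoint on the standard normal scale
    attenuation = std_normal.pdf(z) / (p * (1 - p)) ** 0.5
    return 1 / attenuation ** 2

for p in (0.5, 0.3, 0.1):
    split = f"{round((1 - p) * 100)}:{round(p * 100)}"
    print(f"split {split} -> N multiplied by about {sample_size_inflation(p):.2f}")
```

    Under these assumptions a median split gives a factor of about 1.57, in line with the roughly 60% increase quoted above. For a 90:10 split the same approximation gives a factor closer to three than four, so the exact penalty clearly depends on the model used, but the direction is the same: the more lopsided the split, the larger the loss.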

    Interval Level info
    As was pointed out before, what is the difference between a score of 20 and 24?  Zero.  Between a score of 1 and 19?  Zero.  Between 19 and 20?  Well, they are maximally different.  If you dichotomize and use the dichotomy for analysis, how do you summarize the data?  Well, if the data are dichotomous, you have no right to present means, sd, medians, min or max.  You can only summarize the proportions.  Again you are throwing all that information out.

    Large N
    The analogue for the analysis of variance test is the logistic regression test.  One of its key assumptions is something called 'asymptotic normality'.  What that means is that it assumes that the Ns need to be quite large. Logistic regression routinely uses hundreds of observations.

    Limit of type of analysis
    With a dichotomy, one can easily compare your scale and some outcome.  However, having two i.v. is a problem.  Take time.  Assume you measured your control scale at three time points.  How do you analyze it?  If continuous, you would do a control, time, and control by time model with some d.v.  Can you do a full two-way model (and allow for correlated errors in time) with a dichotomy?  Perhaps, but only with very cumbersome models.  It is trivial for continuous data.

    Conclusion:  I agree that dichotomizing data into success and failure makes interpretation much easier.  However, to plan a trial for a dichotomy would necessitate at a minimum a 60% increase in patients.  A small study would, at best, need to be doubled in size.  If the split into the two groups is not the ideal 50/50, then the increase would need to be much larger.  A statistical analysis of a dichotomy also requires a large N.  It also makes factorial designs almost impossible to analyze or interpret.

    Recommendation:  If simplicity of interpretation is desired, then analyze the data as a continuum, but present (descriptive [no p-values or CI]) summary tables with the dichotomy. 
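
    A minimal sketch of what such a descriptive table could contain: summary statistics for the continuous score alongside the purely descriptive split at the cutoff. The scores are hypothetical (the real data are not shown in this thread) and the function name is made up.

```python
from collections import Counter
from statistics import mean, median, stdev

def describe(scores, cutoff=20):
    """Continuous summary plus a purely descriptive dichotomy at the cutoff."""
    groups = Counter(
        "well controlled" if s >= cutoff else "out of control" for s in scores
    )
    n = len(scores)
    return {
        "n": n,
        "mean": mean(scores),
        "sd": stdev(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
        "percent": {g: 100 * c / n for g, c in groups.items()},
    }

# Hypothetical scores; the real data set is not shown in the thread
scores = [12, 18, 19, 20, 22, 24, 15, 21, 9, 23]
summary = describe(scores)
for key, value in summary.items():
    print(key, value)
```

    The model itself would still use the continuous score; the percentages are reported only to help readers who think in terms of the categories.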

    -------------------------------------------
    Allen Fleishman
    Allen Fleishman Biostatistics Inc.
    -------------------------------------------




  • 9.  RE:How to categorize score?

    Posted 07-02-2012 10:42
    It's important to remember that these psychological scales are never interval.
    And while apparently ordinal, they are rarely, if ever, evaluated to see if they actually are ordinal.

    So as stated by many, it is important to not believe the thresholds, but in modeling the relationships,
    it is important not to assume simple continuous models.  Indeed, often the relationship is threshold
    among many other possibilities.  Again with sufficient numbers, a smoother will give a lot of insight
    into the relationship.

    Ray

    -------------------------------------------
    Raymond Hoffmann
    Professor
    Medical College of Wisconsin
    -------------------------------------------








  • 10.  RE:How to categorize score?

    Posted 07-02-2012 12:54
    Dear All,

    Thank you for your comments and suggestions.

    I will discuss some of the points raised and will most probably stick to my original analysis, which is the same as what Allen suggested: analyze the data as a continuum, but present descriptive summary tables with the dichotomy.

    Raymond, I have noted your point, and after further discussion with the group I will get back to you on your question of whether the project expected to observe any "totally controlled" people.

    Actually, it's a very good question. However, even if the project did expect totally controlled people, I am not sure how I can impose a penalty on my regression to account for the fact that these people and their characteristics are not analyzed. I would also appreciate any further comments and suggestions from the e-group on it.

    Thank you 
    Tasneem

    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------








  • 11.  RE:How to categorize score?

    Posted 07-02-2012 13:26
    Tasneem-
    Consider also presenting summary statistics for continuous variables, if included in the analysis as such. Then summary statistics will be consistent with the analysis.
    David


    -------------------------------------------
    David Bristol
    Statistical Consulting Services
    -------------------------------------------








  • 12.  RE:How to categorize score?

    Posted 07-02-2012 13:37

    Thanks David, I will for sure do that.
    Best Regards,
    Tasneem
    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------