
  • 1.  small sample size with odds ratios

    Posted 09-08-2016 14:33

    An investigator has developed a scoring system to quantify a certain behavior (interval scaled from 0 to 1000, but not normally distributed) that she'd like to retrospectively test for association with a binary outcome. The data set has roughly 1500 cases (had the outcome of interest) and 150,000 randomly selected controls (did not have the outcome of interest). We've grouped scores into 10 bins, using 0 as the reference bin. The rationale for 0 as the reference bin is that it represents patients who did not exhibit the behavior at all. The other bins represent increasing levels of the behavior. (Given the way the scale was developed, it does not make sense to use it as a continuous variable.)


    Here's my problem: there are only 5 cases in the reference bin. With so few cases as the reference, I'm concerned that the odds ratios are unstable. For instance, the odds ratio for the outcome among patients with low scores compared to the reference bin is 113 (CI: 47 to 275), and the odds ratio for the patients with the largest scores compared to the reference bin is ">999.999 (CI: 577 to 999.999)". (I assume SAS prints 999.999 by default when the values are just too big to fit another digit in the output.) I'm thinking the solution is to widen the reference bin to include subjects with at least a low level of the behavior, in order to increase the sample size of the reference bin and make the odds ratios more stable, but the investigator is really interested in 0 alone as the reference. I'd appreciate any ideas or citations that might help support the idea that n=5 is adequate or inadequate for a reference group.



    --
    Nancy Buderer, MS
    Biostatistician and Research Consultant
     


  • 2.  RE: small sample size with odds ratios

    Posted 09-08-2016 14:47
    Please give more information. The interval scale of 0 to 1000 is a bit strange. How do you know that it is a true interval scale? It is not just the statistics; the underlying data is also a consideration. Have you thought of transforming the data to make it more normal?

    Michael





  • 3.  RE: small sample size with odds ratios

    Posted 09-08-2016 15:05

    Nancy-

    I would like to help but I am still a little unclear on the data and objective.

    1.  Is the binary outcome of interest present in all of the 1,500 cases and not present in the 150,000 controls?

    2. Do you have the score value for the behavior (0 to 1000) for both groups?

    3. Is the objective to see if the score is different between the 2 groups, to use the score to predict the binary outcome, or something else?

    4. What is the importance of score = 0?

    5. Can you show us histograms of the behavior score?

    Thanks,

    -Walt

    ------------------------------
    Walter Flom



  • 4.  RE: small sample size with odds ratios

    Posted 09-08-2016 15:06
    You have good statistical insights, Nancy! Yes, I think looking at the data as a plot and seeing if there are some natural bins is the right thing to do. It sounds like you may have done that already. Anyway, you're definitely getting results that suggest it's not working. There are probably methods for finding an optimal bin size, but I don't know them. Finding cuts that are natural in the distribution, or that you can write a scientific finding about, seems like the way to go. No point in distinguishing 0 from 1 if they're essentially the same thing (or if virtually no one is a 0).

    Or normalize the score and treat it as continuous, as Michael mentioned.

    Info on the original questions/coding would be helpful of course. 

    Good luck!

    -Matt




  • 5.  RE: small sample size with odds ratios

    Posted 09-08-2016 15:38
    What you need is the "blood from a turnip test." There are lots of
    alternative ways of analyzing this data set, and you need to convince
    your client of the folly of his/her approach given the large number of
    better alternatives.

    There are several ways to do this. One is to show the asymptotic formula
    for the standard error of the log odds ratio for a two by two table:
    sqrt(1/a + 1/b + 1/c + 1/d), where a, b, c, and d are the four cell
    counts. The precision is clearly dominated by the smallest cell count.
    Set up a spreadsheet with
    this standard error and show how the standard error is pretty much the
    same bad value when you increase any of the other cell sizes, but when
    you increase the 5 cell, it drops dramatically.
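
    The spreadsheet comparison above can also be sketched in a few lines of
    Python. Only the 5 cases in the reference bin come from this thread; every
    other count below is a made-up placeholder, so treat the printed numbers as
    an illustration of the pattern, not as results for the real data.

```python
import math


def se_log_or(a, b, c, d):
    """Asymptotic standard error of the log odds ratio for a 2x2 table."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


# Hypothetical counts; only ref_cases = 5 is taken from the thread.
bin_cases, bin_controls = 150, 15000   # comparison bin (assumed)
ref_cases, ref_controls = 5, 15000     # reference bin (controls assumed)

# Standard error with the counts as they are.
base = se_log_or(bin_cases, bin_controls, ref_cases, ref_controls)

# Multiply every cell EXCEPT the 5 cell by ten.
bigger_other = se_log_or(bin_cases * 10, bin_controls * 10,
                         ref_cases, ref_controls * 10)

# Multiply ONLY the 5 cell by ten.
bigger_ref = se_log_or(bin_cases, bin_controls, ref_cases * 10, ref_controls)

print(f"as observed:           SE = {base:.3f}")
print(f"other cells x10:       SE = {bigger_other:.3f}")
print(f"reference cases x10:   SE = {bigger_ref:.3f}")
```

    The first two standard errors are nearly identical because the 1/5 term
    swamps the other three; only enlarging the smallest cell moves the result.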

    You could compute the effective sample size as the harmonic mean of the
    four cell counts and compare that to other harmonic means using other
    reference categories. By insisting on using the zero bin as a reference,
    your client is reducing the effective sample size by a factor of at
    least a hundred. You could argue that this is an ethical
    violation--collecting thousands of observations and then frittering them
    away with an analysis that would have had the same precision if you had
    collected only a few dozen observations.
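
    A minimal sketch of that harmonic-mean comparison, using Python's standard
    library. The cell counts are hypothetical apart from the 5 reference-bin
    cases, and the second table simply assumes some better-populated reference
    bin, so the size of the gap in the real data will depend entirely on the
    real counts.

```python
from statistics import harmonic_mean

# Cell order: [cases in bin, controls in bin,
#              cases in reference, controls in reference]
# All counts are assumptions except the 5 reference-bin cases.
zero_as_reference = [150, 15000, 5, 15000]
other_reference = [150, 15000, 150, 15000]

ess_zero = harmonic_mean(zero_as_reference)
ess_other = harmonic_mean(other_reference)

print(f"harmonic mean, zero bin as reference:  {ess_zero:.1f}")
print(f"harmonic mean, fuller reference bin:   {ess_other:.1f}")
```

    The harmonic mean H ties directly to the standard error formula for the log
    odds ratio: the sum 1/a + 1/b + 1/c + 1/d equals 4/H, so a smaller harmonic
    mean means a larger standard error.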

    You could run a sensitivity analysis where you change one of the case
    values and show how sensitive your results are to the change of a single
    value out of thousands. This could also be easily set up in a spreadsheet.
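
    As a rough illustration of that sensitivity analysis, the sketch below moves
    a single subject in the reference bin between "case" and "control" and
    recomputes the odds ratio. The bin sizes are hypothetical placeholders; only
    the starting value of 5 reference-bin cases comes from the thread.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: (a/b) / (c/d)."""
    return (a * d) / (b * c)


# Hypothetical counts for the comparison bin and the reference bin total.
bin_cases, bin_controls = 150, 15000
ref_total = 15005  # assumed: 5 cases + 15000 controls in the zero bin

results = {}
for ref_cases in (4, 5, 6):  # one fewer, as observed, one more
    ref_controls = ref_total - ref_cases
    results[ref_cases] = odds_ratio(bin_cases, bin_controls,
                                    ref_cases, ref_controls)
    print(f"reference cases = {ref_cases}: OR = {results[ref_cases]:.1f}")
```

    With these made-up counts, reclassifying one subject out of roughly 30,000
    shifts the odds ratio from 30 to about 25 or about 37.5.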

    You could also argue that sample sizes of 5 are unpublishable. We all
    know that's a white lie, but I'm not above using white lies to talk
    someone out of a bad data analysis choice.

    Since there are a gazillion alternatives that are better, I hesitate to
    suggest any, but one possibility is to swap the variables so that the
    bins become your outcome variable. Then you can use ordinal logistic
    regression, which is going to be fairly insensitive to the small cell
    counts in one of the bins. Another possibility is to fit some type of
    spline model which would allow you to show graphically what the
    predicted log odds would be at 0 and how those log odds change as the
    original variable (or the variable transformed to bin number) changes.
    Again, the spline would be better because the prediction at zero will
    rely on the very reasonable assumption of continuity and allows you to
    use information from the values close to the reference category to
    improve precision. Every approach has disadvantages, of course, but it
    would be hard to argue that any disadvantage is as bad as having a cell
    size of 5.

    Steve Simon, blog.pmean.com



  • 6.  RE: small sample size with odds ratios

    Posted 09-08-2016 18:50

    If n=5 cases is adequate for any group (or bin), it is adequate for the reference group. Being selected to be the reference group does not confer any special privileges or restrictions on the sample-size front. So the question becomes, is n=5 cases adequate for any group? 

    In section 6.6 of his Statistical Rules of Thumb, Gerald van Belle cites a simulation study by Peduzzi et al 1996, the conclusion of which was that one needs 10 expected events per variable in one's logistic-regression model. In a univariable logistic regression where the single variable is binary (such as treatment versus placebo), the 10 expected events per variable becomes 5 expected events per treatment arm. Since being a case seems to be the event of interest, it would appear that having n=5 cases in the group is the bare minimum considered adequate for univariable logistic regression, and inadequate for multivariable logistic regression.

    However, sometimes arguments like that don't work on investigators. What you may have to do in the end is show the investigator how the results change if you redefine the reference bin to be 0+1 or 0+1+2 instead of 0 alone.
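
    One way to sketch that comparison: compute the odds ratio and a 95% Wald
    confidence interval for the top bin against each candidate reference group.
    The per-bin counts below are invented for illustration (only the 5 cases in
    bin 0 come from the thread), and the bin labels 0, 1, 2, and 9 are assumed.

```python
import math


def or_with_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval on the log scale."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))


# Hypothetical counts per bin; bin 9 stands in for the highest-score bin.
cases = {0: 5, 1: 60, 2: 90, 9: 400}
controls = {0: 15000, 1: 15000, 2: 15000, 9: 15000}

ci_ratio = {}  # upper limit / lower limit, a scale-free measure of CI width
for ref_bins in ([0], [0, 1], [0, 1, 2]):
    ref_cases = sum(cases[b] for b in ref_bins)
    ref_controls = sum(controls[b] for b in ref_bins)
    est, lo, hi = or_with_ci(cases[9], controls[9], ref_cases, ref_controls)
    ci_ratio[tuple(ref_bins)] = hi / lo
    print(f"reference bins {ref_bins}: OR = {est:.1f} "
          f"(95% CI {lo:.1f} to {hi:.1f})")
```

    Note that pooling bins changes what the odds ratio is compared against, so
    the point estimate moves along with the interval width; the investigator
    would need to accept the new reference definition, not just the tighter
    interval.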

    ------------------------------
    Eric Siegel, MS
    Research Associate
    Department of Biostatistics
    Univ. Arkansas Medical Sciences



  • 7.  RE: small sample size with odds ratios

    Posted 09-09-2016 13:26

    Why create bins at all? Why not just use logistic regression with score as a continuous independent variable?

    Or, if group is the independent variable, you have a standard Y by X with categorical X (group) and continuous (score) Y.

    ------------------------------
    Emil M Friedman, PhD
    emilfriedman@gmail.com
    http://www.statisticalconsulting.org