small sample size with odds ratios

1. small sample size with odds ratios

Nancy Buderer
Posted 09-08-2016 14:33
An investigator has developed a scoring system to quantify a certain behavior (interval scaled from 0 to 1000, but not normally distributed) that she'd like to retrospectively test for association with a binary outcome. The data set has roughly 1500 cases (the outcome of interest) and 150,000 randomly selected controls (did not have the outcome of interest). We've grouped scores into 10 bins, using 0 as the reference bin. The rationale for 0 as the reference bin is this represents patients who did not exhibit the behavior at all. The other bins represent increasing levels of the behavior. (The way the scale was developed, it does not make sense to use it as a continuous variable)

Here's my problem: there are only 5 cases in the reference bin. With so few cases in the reference bin, I'm concerned that the odds ratios are unstable. For instance, the odds ratio for the outcome among patients with low scores compared to the reference bin is 113 (CI: 47 to 275), and the odds ratio for the patients with the largest scores compared to the reference bin is ">999.999 (CI: 577 to 999.999)". (I assume SAS prints 999.999 by default when the values are too big to fit another digit in the output.) I'm thinking the solution is to enlarge the reference bin to include subjects with at least a low level of the behavior, which would increase the sample size of the reference bin and make the odds ratios more stable, but the investigator is really interested in 0 alone as the reference. I would appreciate any ideas or citations that might help support the idea that n=5 is adequate or inadequate for a reference group.

--
Nancy Buderer, MS
Biostatistician and Research Consultant
nancy@budererdrug.com
2. RE: small sample size with odds ratios

Michael Maranda
Posted 09-08-2016 14:47
Please give more information. The interval scale of 0 to 1000 is a bit strange. How do you know that it is a true interval scale? It is not just the statistics; the underlying data is also a consideration. Have you thought of transforming the data to make it more normal?

Michael

3. RE: small sample size with odds ratios

Walter Flom
Posted 09-08-2016 15:05
Nancy-

I would like to help but I am still a little unclear on the data and objective.

1. Is the binary outcome of interest present in all of the 1,500 cases and not present in the 150,000 controls?

2. Do you have the score value for the behavior (0 to 1000) for both groups?

3. Is the objective to see whether the score differs between the 2 groups, to use the score to predict the binary outcome, or something else?

4. What is the importance of score = 0?

5. Can you show us histograms of the behavior score?

Thanks,

-Walt
------------------------------
Walter Flom
4. RE: small sample size with odds ratios

Matt Jans
Posted 09-08-2016 15:06
You have good statistical insights, Nancy! Yes, I think plotting the data and looking for natural bins is the right thing to do. It sounds like you may have done that already. Either way, you're definitely getting results that suggest the current bins aren't working. There are probably methods for finding an optimal bin size, but I don't know them. Finding cuts that are natural in the distribution, or that you can attach a scientific finding to, seems like the way to go. No point in distinguishing 0 from 1 if they're essentially the same thing (or if virtually no one is a 0).

Or normalize the score and treat it as continuous, as Michael mentioned.

Info on the original questions/coding would be helpful of course.

Good luck!

-Matt

5. RE: small sample size with odds ratios

Stephen Simon
Posted 09-08-2016 15:38
What you need is the "blood from a turnip test." There are lots of
alternative ways of analyzing this data set, and you need to convince
your client of the folly of his/her approach given the large number of
better alternatives.

There are several ways to do this. One is to show the asymptotic formula
for the standard error for log odds ratio for a two by two table:
sqrt(1/a+1/b+1/c+1/d) where a-d are the cell counts. The precision is
clearly dominated by the smallest cell count. Set up a spreadsheet with
this standard error and show how the standard error is pretty much the
same bad value when you increase any of the other cell sizes, but when
you increase the 5 cell, it drops dramatically.
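
The spreadsheet exercise can also be sketched in a few lines of Python.
Only the 5 reference-bin cases and the reported odds ratio of 113
(CI: 47 to 275) come from this thread; every other count below is a
made-up placeholder, and the back-calculation assumes the reported
interval is a 95% Wald interval on the log scale.

```python
import math

def se_log_or(a, b, c, d):
    """Asymptotic standard error of the log odds ratio for a 2x2 table."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Hypothetical 2x2 table: only the 5 reference-bin cases come from the thread.
ref_cases, ref_controls = 5, 15000      # reference bin (score = 0)
bin_cases, bin_controls = 150, 15000    # one comparison bin (placeholder counts)

se_base = se_log_or(ref_cases, ref_controls, bin_cases, bin_controls)
# Multiply every cell except the 5 by ten: the standard error barely moves.
se_others_x10 = se_log_or(ref_cases, 10 * ref_controls,
                          10 * bin_cases, 10 * bin_controls)
# Multiply only the 5 by ten: the standard error drops sharply.
se_small_x10 = se_log_or(10 * ref_cases, ref_controls, bin_cases, bin_controls)

# Standard error implied by the reported OR of 113 (CI: 47 to 275),
# assuming a 95% Wald interval: (ln(upper) - ln(lower)) / (2 * 1.96).
se_implied = (math.log(275) - math.log(47)) / (2 * 1.96)
# Contribution of the 5-count cell on its own.
se_from_five_alone = math.sqrt(1 / 5)

print(f"SE, placeholder counts:       {se_base:.3f}")
print(f"SE, other cells x10:          {se_others_x10:.3f}")
print(f"SE, the 5 cell x10:           {se_small_x10:.3f}")
print(f"SE implied by reported CI:    {se_implied:.3f}")
print(f"sqrt(1/5), the 5 cell alone:  {se_from_five_alone:.3f}")
```

Under that Wald assumption, the standard error implied by the reported
interval is almost exactly sqrt(1/5), which is another way of showing
the client that the cell of 5 is driving the precision.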

You could compute the effective sample size as the harmonic mean of the
four cell counts and compare that to other harmonic means using other
reference categories. By insisting on using the zero bin as a reference,
your client is reducing the effective sample size by a factor of at
least a hundred. You could argue that this is an ethical
violation--collecting thousands of observations and then frittering them
away with an analysis that would have had the same precision if you had
collected only a few dozen observations.
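
A minimal sketch of the harmonic mean comparison, using Python's
standard library. The cell counts are hypothetical placeholders (only
the 5 comes from this thread), so the size of the reduction shown here
is illustrative and will differ with the real table.

```python
import math
from statistics import harmonic_mean

# Hypothetical cell counts in the order:
# reference cases, reference controls, comparison cases, comparison controls.
zero_bin_reference = [5, 15000, 150, 15000]     # score = 0 as reference
other_bin_reference = [150, 15000, 150, 15000]  # a better-populated reference

h_zero = harmonic_mean(zero_bin_reference)
h_other = harmonic_mean(other_bin_reference)

# The harmonic mean H of four cells ties back to the standard error of the
# log odds ratio: sqrt(1/a + 1/b + 1/c + 1/d) equals 2 / sqrt(H).
se_zero = 2 / math.sqrt(h_zero)
se_other = 2 / math.sqrt(h_other)

print(f"Effective n, zero bin as reference:  {h_zero:.1f} (SE {se_zero:.3f})")
print(f"Effective n, other bin as reference: {h_other:.1f} (SE {se_other:.3f})")
print(f"Reduction factor: {h_other / h_zero:.1f}")
```

With the zero bin as reference the effective sample size stays below 20
no matter how large the other three cells are, because the harmonic
mean of four counts can never exceed four times the smallest count.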

You could run a sensitivity analysis where you change one of the case
values and show how sensitive your results are to the change of a single
value out of thousands. This could also be easily set up in a spreadsheet.
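
Here is one way that sensitivity check might look in Python rather
than a spreadsheet. The counts are hypothetical placeholders apart
from the 5 reference-bin cases, and "changing one value" is
interpreted here as reclassifying a single subject between case and
control.

```python
def odds_ratio(ref_cases, ref_controls, bin_cases, bin_controls):
    """Odds of being a case in the comparison bin relative to the reference bin."""
    return (bin_cases / bin_controls) / (ref_cases / ref_controls)

# Hypothetical starting table: 5 cases / 15000 controls in the reference bin,
# 150 cases / 15000 controls in the comparison bin.
base = odds_ratio(5, 15000, 150, 15000)

# Reclassify one reference-bin subject: a case becomes a control, or vice versa.
one_fewer_case = odds_ratio(4, 15001, 150, 15000)
one_more_case = odds_ratio(6, 14999, 150, 15000)

# The same single-subject change in the comparison bin, for contrast.
comparison_shift = odds_ratio(5, 15000, 151, 14999)

print(f"Baseline OR:                        {base:.2f}")
print(f"One fewer case in reference bin:    {one_fewer_case:.2f}")
print(f"One more case in reference bin:     {one_more_case:.2f}")
print(f"One more case in comparison bin:    {comparison_shift:.2f}")
```

A single reclassified subject in the reference bin moves the estimate
by a large fraction of its value, while the same change in the
comparison bin barely registers.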

You could also argue that sample sizes of 5 are unpublishable. We all
know that's a white lie, but I'm not above using white lies to talk
someone out of a bad data analysis choice.

Since there are a gazillion alternatives that are better, I hesitate to
suggest any, but one possibility is to swap the variables so that the
bins become your outcome variable. Then you can use ordinal logistic
regression, which is going to be fairly insensitive to the small cell
counts in one of the bins. Another possibility is to fit some type of
spline model which would allow you to show graphically what the
predicted log odds would be at 0 and how those log odds change as the
original variable (or the variable transformed to bin number) changes.
Again, the spline would be better because the prediction at zero will
rely on the very reasonable assumption of continuity and allows you to
use information from the values close to the reference category to
improve precision. Every approach has disadvantages, of course, but it
would be hard to argue that any disadvantage is as bad as having a cell
size of 5.

Steve Simon, blog.pmean.com

6. RE: small sample size with odds ratios

Eric Siegel
Posted 09-08-2016 18:50
If n=5 cases is adequate for any group (or bin), it is adequate for the reference group. Being selected to be the reference group does not confer any special privileges or restrictions on the sample-size front. So the question becomes, is n=5 cases adequate for any group?

In section 6.6 of his Statistical Rules of Thumb, Gerald van Belle cites a simulation study by Peduzzi et al 1996, the conclusion of which was that one needs 10 expected events per variable in one's logistic-regression model. In a univariable logistic regression where the single variable is binary (such as treatment versus placebo), the 10 expected events per variable becomes 5 expected events per treatment arm. Since being a case seems to be the event of interest, it would appear that having n=5 cases in the group is the bare minimum considered adequate for univariable logistic regression, and inadequate for multivariable logistic regression.

However, sometimes arguments like that don't work on investigators. In the end, you may have to show the investigator how the results change if you redefine the reference bin to be 0+1 or 0+1+2 instead of 0 alone.
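
A rough sketch of that side-by-side comparison in Python, pooling bins into the reference group one at a time. All bin counts are hypothetical placeholders except the 5 cases in bin 0, and the intervals are plain 95% Wald intervals from 2x2 tables rather than output from a fitted logistic model.

```python
import math

# Hypothetical (cases, controls) per bin; only the 5 cases in bin 0 are real.
bins = {0: (5, 30000), 1: (40, 25000), 2: (60, 20000)}
top_bin = (400, 3000)  # highest-score bin, placeholder counts

def pooled(bin_ids):
    """Sum cases and controls over the bins that make up the reference group."""
    cases = sum(bins[i][0] for i in bin_ids)
    controls = sum(bins[i][1] for i in bin_ids)
    return cases, controls

def or_with_ci(ref, other, z=1.96):
    """Odds ratio of `other` versus `ref` with a Wald confidence interval."""
    a, b = ref
    c, d = other
    log_or = math.log((c / d) / (a / b))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

results = {}
for label, ids in [("0", [0]), ("0+1", [0, 1]), ("0+1+2", [0, 1, 2])]:
    results[label] = or_with_ci(pooled(ids), top_bin)
    est, lo, hi = results[label]
    print(f"reference {label:>5}: OR {est:7.1f}  CI {lo:7.1f} to {hi:7.1f}  "
          f"upper/lower {hi / lo:.1f}")
```

Pooling changes what the odds ratio is compared against ("none or low" rather than "none"), so the point estimate is expected to shift. The thing to point out to the investigator is how much the interval tightens, shown here by the upper/lower ratio.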
------------------------------
Eric Siegel, MS
Research Associate
Department of Biostatistics
Univ. Arkansas Medical Sciences
7. RE: small sample size with odds ratios

Emil Friedman
Posted 09-09-2016 13:26
Why create bins at all? Why not just use logistic regression with score a continuous independent variable?

Or, if group is the independent variable, you have a standard Y by X with categorical X (group) and continuous (score) Y.
------------------------------
Emil M Friedman, PhD
emilfriedman@gmail.com
http://www.statisticalconsulting.org
