What you need is the "blood from a turnip test." There are lots of
alternative ways of analyzing this data set, and you need to convince
your client of the folly of his/her approach given the large number of
better alternatives.
There are several ways to do this. One is to show the asymptotic formula
for the standard error of the log odds ratio in a two by two table:
sqrt(1/a+1/b+1/c+1/d), where a through d are the four cell counts. The
precision is clearly dominated by the smallest cell count. Set up a
spreadsheet with this formula and show that the standard error stays at
pretty much the same bad value when you increase any of the other cell
counts, but drops dramatically when you increase the cell with only 5
cases.
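If a spreadsheet is not handy, the same demonstration can be sketched in a few lines of Python. The cell counts here are made up for illustration (only the 5 comes from the question), and the function name is my own:

```python
import math

def se_log_odds_ratio(a, b, c, d):
    """Asymptotic standard error of the log odds ratio for a 2x2 table."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Hypothetical counts, NOT the real data:
# a = cases in the reference bin, b = controls in the reference bin,
# c = cases in a comparison bin,  d = controls in the comparison bin.
base = {"a": 5, "b": 15000, "c": 150, "d": 15000}
print("baseline SE:", round(se_log_odds_ratio(**base), 4))

# Multiply each cell by 10 in turn and see which one moves the SE.
for cell in base:
    bumped = dict(base)
    bumped[cell] *= 10
    print("10x cell", cell, "->", round(se_log_odds_ratio(**bumped), 4))
```

Only the line for cell a should show a large drop; the other three barely move, because 1/5 swamps every other term under the square root.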
You could compute the effective sample size as the harmonic mean of the
four cell counts and compare that to other harmonic means using other
reference categories. By insisting on using the zero bin as a reference,
your client is reducing the effective sample size by a factor of at
least a hundred. You could argue that this is an ethical
violation--collecting thousands of observations and then frittering them
away with an analysis that would have had the same precision if you had
collected only a few dozen observations.
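Here is a rough sketch of the harmonic mean comparison, again with invented cell counts (only the 5 is taken from the question), so the size of the ratio you get depends entirely on those made-up numbers. Two facts hold regardless: the squared standard error of the log odds ratio equals 4 divided by the harmonic mean, and the harmonic mean of four counts can never exceed 4 times the smallest one.

```python
from statistics import harmonic_mean

# Hypothetical 2x2 tables as (ref cases, ref controls, bin cases, bin controls).
# These are illustrative values, not the investigator's data.
zero_bin_as_reference = (5, 15000, 150, 15000)
other_bin_as_reference = (150, 15000, 150, 15000)

ess_zero = harmonic_mean(zero_bin_as_reference)
ess_other = harmonic_mean(other_bin_as_reference)

print("effective sample size, zero bin as reference: ", round(ess_zero, 1))
print("effective sample size, other bin as reference:", round(ess_other, 1))
print("ratio:", round(ess_other / ess_zero, 1))

# The squared SE of the log odds ratio is 4 / harmonic mean.
print("SE, zero bin as reference:", round((4 / ess_zero) ** 0.5, 4))
```

With a cell of 5 in the table, the effective sample size is stuck below 20 no matter how many thousands of observations sit in the other cells.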
You could run a sensitivity analysis where you change the value for just
one of the cases (moving it into or out of the reference bin, for
example) and show how sensitive your results are to the change of a
single value out of thousands. This could also be easily set up in a
spreadsheet.
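A minimal sketch of that sensitivity check follows. It assumes made-up counts for everything except the 5 reference cases, and it assumes the "change" is a single reference-bin subject flipping between case and control; other ways of perturbing one value would work just as well.

```python
def odds_ratio(ref_cases, ref_controls, bin_cases, bin_controls):
    """Odds ratio for a comparison bin relative to the reference bin."""
    return (bin_cases / bin_controls) / (ref_cases / ref_controls)

# Hypothetical counts, NOT the real data.
ref_cases, ref_controls = 5, 15000
bin_cases, bin_controls = 150, 15000

baseline = odds_ratio(ref_cases, ref_controls, bin_cases, bin_controls)

# Flip one reference-bin subject from case to control, and the reverse.
one_fewer = odds_ratio(ref_cases - 1, ref_controls + 1, bin_cases, bin_controls)
one_more = odds_ratio(ref_cases + 1, ref_controls - 1, bin_cases, bin_controls)

print("baseline OR:          ", round(baseline, 2))
print("one fewer ref case OR:", round(one_fewer, 2))
print("one more ref case OR: ", round(one_more, 2))
```

With these illustrative numbers, reclassifying one subject out of about thirty thousand shifts the odds ratio by roughly a quarter in one direction and a sixth in the other.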
You could also argue that sample sizes of 5 are unpublishable. We all
know that's a white lie, but I'm not above using white lies to talk
someone out of a bad data analysis choice.
Since there are a gazillion alternatives that are better, I hesitate to
suggest any, but one possibility is to swap the variables so that the
bins become your outcome variable. Then you can use ordinal logistic
regression, which is going to be fairly insensitive to the small cell
counts in one of the bins. Another possibility is to fit some type of
spline model which would allow you to show graphically what the
predicted log odds would be at 0 and how those log odds change as the
original variable (or the variable transformed to bin number) changes.
Again, the spline would be better because the prediction at zero relies
on the very reasonable assumption of continuity, which lets you use
information from the values close to the reference category to
improve precision. Every approach has disadvantages, of course, but it
would be hard to argue that any disadvantage is as bad as having a cell
size of 5.
Steve Simon, blog.pmean.com
------Original Message------
An investigator has developed a scoring system to quantify a certain behavior (interval scaled from 0 to 1000, but not normally distributed) that she'd like to retrospectively test for association with a binary outcome. The data set has roughly 1500 cases (the outcome of interest) and 150,000 randomly selected controls (did not have the outcome of interest). We've grouped scores into 10 bins, using 0 as the reference bin. The rationale for 0 as the reference bin is this represents patients who did not exhibit the behavior at all. The other bins represent increasing levels of the behavior. (The way the scale was developed, it does not make sense to use it as a continuous variable)
Here's my problem: there are only 5 cases in the reference bin. With so few cases as the reference, I'm concerned that the odds ratios are unstable. For instance, the odds ratio for the outcome among patients with low scores compared to the reference bin is 113 (CI: 47 to 275) and the odds ratio for the patients with the largest scores compared to the reference bin is ">999.999 (CI: 577 to 999.999)". (I assume SAS prints 999.999 as default when the values are just too big to fit another digit in the output.) I'm thinking the solution is to increase the reference bin to include subjects with at least a low level of the behavior in order to increase the sample size of the reference bin and make the odds ratios more stable, but the investigator is really interested in 0 alone as the reference. I appreciate any ideas or citations that might help support the idea that n=5 is adequate or inadequate for a reference group.
--
Nancy Buderer, MS
Biostatistician and Research Consultant