I wrote this yesterday around 2pm EDT, then hit "send," but it does not seem to have gone out.
==========================
So far, this is all reasonable advice, but let me contribute this general sermon:
When the statistical plan gets complex like this, only a custom simulation study will handle the sample-size questions in a way that will tell savvy reviewers that a solid professional statistician truly collaborated in designing the study and writing the proposal. Yes, this can be a lot of work. Better investigators will understand and support it; the best will expect it. Those who just want a quickie, pedestrian stat considerations section for a complex problem will be nothing but trouble in the long run and the statistician should try to politely disengage--for the sake of both parties.
I find it is usually possible to link each key research question to a single estimate and statistical interval (or plot of the posterior distribution) that quantifies the "oomph" that answers the question directly (to use that wonderful term coined by Stephen Ziliak and Deirdre McCloskey; see the Porter reference below).
This will come from using a given statistical model and method (frequentist, likelihood, or Bayesian).
Accordingly, the stat planning (including sample-size) questions become:
(1) How biased, if at all, is the given estimate? Is the degree of bias large enough to matter, or will the hypothesized "oomph" of the statistical effect overwhelm it?
(2) How large might the sampling error (e.g., standard error) be? Given the hypothesized "oomph" factor, is this large enough to really matter?
Note that in frequentist-land, (1) and (2) can be combined via the mean squared error of the estimate versus the "true" point value. For example, if the true adjusted odds ratio is conjectured to be 2.50, does it really matter if its MSE will almost surely be less than 0.10 for a total sample size of N versus less than 0.07 for a total sample size of 2N? 0.10 vs. 0.07 may not matter much, but N versus 2N may matter a lot in terms of cost and feasibility.
(3) How well will the associated statistical interval inform us about the true value for the parameter? In other words, how tight will the interval (or posterior distribution) tend to be? What is the distribution of the key limit (lower or upper) of the interval?
(4) Answering (3) can include the estimated statistical "power." For example (frequentist), if you hypothesize that the given parameter, theta, exceeds some null value, theta_0, then the power is just the probability that the lower confidence limit for theta exceeds theta_0. For equivalence testing, it is the probability that the entire interval will fall inside the prescribed equivalence region. The simulation sketch below makes (1)-(4) concrete.
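Everything in the sketch is illustrative and assumed, not prescriptive: a two-arm trial with a binary outcome, a control risk of 0.30, the conjectured true odds ratio of 2.50 from above, and a Wald interval with a 0.5 cell correction standing in for whatever model-based interval the real analysis would use.

import numpy as np

rng = np.random.default_rng(2011)

def plan_metrics(n_per_arm, p0=0.30, true_or=2.50, n_sims=20000):
    # Simulate a two-arm binary trial and report planning quantities (1)-(4).
    odds0 = p0 / (1 - p0)
    p1 = true_or * odds0 / (1 + true_or * odds0)   # treatment risk implied by the OR
    z = 1.959964                                   # two-sided 95% normal quantile
    x1 = rng.binomial(n_per_arm, p1, n_sims)       # treatment responders
    x0 = rng.binomial(n_per_arm, p0, n_sims)       # control responders
    # Haldane 0.5 correction keeps the log-OR finite when a cell is zero.
    a, b = x1 + 0.5, n_per_arm - x1 + 0.5
    c, d = x0 + 0.5, n_per_arm - x0 + 0.5
    log_or = np.log(a * d / (b * c))
    se = np.sqrt(1/a + 1/b + 1/c + 1/d)
    or_hat = np.exp(log_or)
    lower, upper = np.exp(log_or - z * se), np.exp(log_or + z * se)
    # For equivalence testing, (4) would instead be
    # np.mean((lower > lo_eq) & (upper < hi_eq)) for prescribed limits.
    return {"(1) bias": or_hat.mean() - true_or,
            "(2) MSE": np.mean((or_hat - true_or) ** 2),
            "(3) median CI width": np.median(upper - lower),
            "(4) power, P(lower limit > 1)": np.mean(lower > 1.0)}

for n in (100, 200):   # N vs. 2N per arm, per the MSE question in (2)
    print(n, plan_metrics(n))

Comparing the two printed rows answers (2) directly: is the drop in MSE and interval width worth the doubled cost?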
Note: My emphasis on using estimates and intervals (and, thus, avoiding p-values unless "forced" to use them) is hardly new or original. Related to this is my lack of interest in basing sample-size analyses on classical power computations, which are too often nothing more than statistical fables designed to appease today's typical grant or article reviewer.
Three quick reads; their content is as good as their titles:
Porter, T. M. (2008). Signifying little. Science, 320(5881):1292.
http://www.sciencemag.org/content/320/5881/1292.full.pdf. This is a book review of The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives, by Stephen T. Ziliak and Deirdre N. McCloskey (University of Michigan Press, Ann Arbor, 2008; 348 pp.).
Connor, J. T. (2004). The value of a p-valueless paper. Am J Gastroenterol, 99(9):1638-40.
http://www.nature.com/ajg/journal/v99/n9/full/ajg2004321a.html
Bacchetti, P., Deeks, S. G., and McCune, J. M. (2011). Breaking free of sample size dogma to perform innovative translational research. Science Translational Medicine, 3(87):87ps24.
http://stm.sciencemag.org/content/3/87/87ps24.short
Such thinking needs to be stressed anew in our teaching, but that's another sermon.
Thanks for listening.
-------------------------------------------
Ralph O'Brien
Case Western Reserve University
-------------------------------------------
Original Message:
Sent: 08-18-2011 15:10
From: Scott Berry
Subject: Power -- ITT and attrition
Agree with Dr. Lesser completely. In cases where the treatment of those lost to follow-up (LTFU) is more complex, as with continuous outcomes, simulation is frequently the way to understand the impact. You could set up a simulation of subjects with different dropout patterns, apply the final analysis you will actually do (however you impute, or multiply impute, the missing values), and so understand the impact on power.
This is certainly much more work... (but you can simulate data that are missing not at random, with missingness a function of the outcome, etc.)
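A toy version of that idea, with every parameter invented for illustration (the imputation rule here, filling treatment-arm dropouts with the observed control mean, is just one crude single-imputation choice among many):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def power_with_mnar(n_per_arm=100, effect=0.5, sd=1.0, n_sims=4000, alpha=0.05):
    # Dropout is missing not at random: the worse (lower) the outcome,
    # the more likely the subject is to be lost.
    rejections = 0
    for _ in range(n_sims):
        y0 = rng.normal(0.0, sd, n_per_arm)              # control outcomes
        y1 = rng.normal(effect, sd, n_per_arm)           # treatment outcomes
        p_drop = 1.0 / (1.0 + np.exp(2.0 + 2.0 * y1))    # low y1 -> high dropout
        observed = rng.random(n_per_arm) > p_drop
        y1_analyzed = np.where(observed, y1, y0.mean())  # crude single imputation
        p = stats.ttest_ind(y1_analyzed, y0).pvalue      # the planned final analysis
        rejections += p < alpha
    return rejections / n_sims

print(power_with_mnar())  # compare with the no-dropout power to see the hit

Swapping in multiple imputation, or a different missingness mechanism, is a matter of editing a line or two.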
Then, of course, you may be more than halfway to doing a flexible sample-size design that escapes the problems with fixed sample sizes and power!
-------------------------------------------
Scott Berry, PhD
Statistical Scientist
Berry Consultants
-------------------------------------------
Original Message:
Sent: 08-18-2011 14:37
From: Martin Lesser
Subject: Power -- ITT and attrition
In my opinion, ITT sample size calculations are almost always done incorrectly (and I am guilty of that crime, too). Typically, whether for a longitudinal or a fixed-time study (it doesn't matter), the common practice is to inflate the required n to allow for dropout, attrition, etc. However, this approach assumes that the dropouts will not be included in the analysis. In ITT, the dropouts are supposed to be included, even if the outcome is assumed to be a treatment failure.
Here is a more correct approach, illustrated with a binary response endpoint. In your power calculation, suppose you assume that in the control group (C) the response rate is 30% and that in the intervention group (I) the rate is 50%. nQuery says that for 80% power, you need n=93 per arm (chi-square, alpha=0.05). You then assume that 20% of the I subjects drop out and revert to control, so that for those 20% the true response rate is 30% (it could be higher if they benefit from some of the I dose that they did get...). The true response rate in the I group is then the weighted percentage: (.2 x 30%) + (.8 x 50%) = 46%. You now have to redo the calculation comparing 46% to 30%, which yields n=144 per arm. (This example was kept simple by assuming that only I subjects drop out; you could do the same kind of weighted calculation if C subjects also dropped out. You just need to make assumptions about dropout rates and response rates for the dropouts.)
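For what it's worth, the standard normal-approximation formula for comparing two proportions reproduces both of those numbers (nQuery's chi-square calculation may differ by a subject or so from this approximation):

from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    # Usual normal-approximation sample size for a two-sided two-proportion test.
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    pbar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * pbar * (1 - pbar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

print(n_per_arm(0.30, 0.50))         # 93: the naive (per-protocol) calculation
diluted = 0.2 * 0.30 + 0.8 * 0.50    # 0.46: the ITT-weighted response rate
print(n_per_arm(0.30, diluted))      # 144: the ITT calculation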
Of course, you can make other types of assumptions about what the response rate is for a dropout. Also, it's a bit more difficult to do this with continuous outcomes.
Moral of the story: just increasing the sample size to account for attrition does not deal with ITT. In fact, what one is doing is stating the sample size for the per-protocol analysis. One must factor in the weighted outcome corresponding to what happens to the dropouts in each arm.
-------------------------------------------
Martin L Lesser, PhD, EMT-CC
Director and Investigator,
Biostatistics Unit,
Feinstein Institute for Medical Research
Professor, Dep't of Molecular Medicine,
Hofstra North Shore-LIJ School of Medicine
Chair, IRB Committee "B"
Mailing Address:
Biostatistics Unit
Feinstein Institute for Medical Research
North Shore - LIJ Health System
350 Community Drive
Manhasset, NY 11030
Phone: 516-562-0300
FAX: 516-562-0344
-------------------------------------------