  • 1.  sample size calculation for prediction

    Posted 08-19-2011 15:04
    Dear All,

    I am new to this e-group.  I have been following the sample size discussion re ITT and attrition.  Very interesting. The recent NAS report on missing data in clinical trials notes that this is an important area for research. 

    I was curious to know thoughts on sample size calculations when the aim is to build a prediction model/algorithm.
    Seems to be a dearth of literature on this topic, but I may be wrong.

    All the best,

    Dan

    -------------------------------------------
    Daniel Scharfstein
    Professor of Biostatistics
    Director, Graduate Program
    Johns Hopkins School of Public Health
    -------------------------------------------


  • 2.  RE: sample size calculation for prediction

    Posted 08-23-2011 17:08
    Hi, Daniel

    If, by prediction, you mean classification, then people associated with Edward R. Dougherty at Texas A&M have published on sample size in this context.  Below are two that appeared in 2005 in Bioinformatics:

    (1)  How many samples are needed to build a classifier: a general sequential approach.  Fu WJ, Dougherty ER, Mallick B, Carroll RJ.  Bioinformatics. 2005 Jan 1;21(1):63-70. Epub 2004 Aug 5.  PMID: 15297303

    (2)  Optimal number of features as a function of sample size for various classification rules.  Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER.  Bioinformatics. 2005 Apr 15;21(8):1509-15. Epub 2004 Nov 30.  PMID: 15572470

    Also, search Dougherty's name in PubMed and browse some of the other titles that come up.

    However, if, by prediction, you mean the prediction of an individual's risk of coming down with something undesirable at a future time, then that's an area in which (a) people are coming to realize that classification algorithms are inadequate to the task, and (b) they've been trying to develop new methods, and indeed, new metrics of performance, in order to get a better handle on individual-risk prediction.  Two good papers to look for in this context are as follows:

    (1)  Use and misuse of the receiver operating characteristic curve in risk prediction.  Cook NR.  Circulation. 2007 Feb 20;115(7):928-35.  PMID: 17309939

    (2)  Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond.  Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS.  Stat Med. 2008 Jan 30;27(2):157-72; discussion 207-12.  PMID: 17569110

    Additionally, Margaret Pepe has published commentary on both these papers as well as contributed relevant methodology of her own.

    Have fun

    -------------------------------------------
    Eric Siegel
    Biostatistician
    Univ of Arkansas for Medical Sciences
    -------------------------------------------