Dear Colleagues,
The ASA Statistical Learning and Data Science Section is pleased to announce its November webinar, featuring Dr. Nathaniel (Nate) O'Connell from Wake Forest University. Dr. O'Connell will discuss practical guidance for prediction model development with limited sample sizes. Hope to see you there!
Title: A Comparison of Methods of Cross-Validation for Small Data – Practical Guidance for Prediction Model Development with Limited Sample Sizes
Speaker: Dr. Nathaniel (Nate) O'Connell, Department of Biostatistics and Data Science, School of Medicine, Wake Forest University
Date and Time: November 19, 2024, 1:00 to 2:30 pm Eastern Time
Registration Link: ASA SLDS Webinar Registration Link [eventbrite.com]
Abstract: This work was motivated by real-world collaborations with numerous clinical investigators seeking to develop prediction models. Conventional recommendations for prediction model development suggest researchers have large amounts of data, thousands or tens of thousands of observations, for training, testing, and validating prediction models. In healthcare research, this often simply is not practical. Data can be expensive and time-consuming to collect, and in the case of rare diseases, large datasets simply aren't available. But this does not mean we can't explore the development of prediction models. Cross-validation (CV), a broad term for resampling and testing from our existing dataset, can be used to estimate the performance metrics of a prediction model. Several common methods of CV exist, and recommendations on which to use vary in the literature. In this study, we leverage 40+ large open-use datasets from OpenML and perform a benchmarking study to compare common methods of CV on small datasets with 100-500 observations, and to assess how estimates from CV generalize to the broader population that the subsample represents. Across datasets with varying degrees of outcome class imbalance and N:P ratio, we compare common binary classification methods, including logistic LASSO and random forests, in terms of mean/median bias and variance of bias with respect to estimates of ROC AUC, precision-recall AUC, and Brier scores. The results of this work provide practical, empirically grounded guidance on how the choice of CV method impacts the generalizability of your results, and which method(s) of CV (and values of 'k' for k-fold) are optimal.
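For readers unfamiliar with the benchmarking design described above, a minimal sketch of the idea follows. This is not the authors' code: the dataset is synthetic (standing in for an OpenML source), and the model settings, subsample size, and values of k are illustrative placeholders. The sketch compares k-fold CV estimates of ROC AUC from a small subsample against the "true" generalization AUC of a model fit on that subsample and scored on the rest of the population.

```python
# Illustrative sketch (assumptions flagged above): bias of k-fold CV
# ROC AUC estimates on a small subsample, relative to performance on
# the broader "population" the subsample was drawn from.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)

# "Population": a large synthetic dataset standing in for an OpenML source,
# with some outcome class imbalance.
X_pop, y_pop = make_classification(n_samples=20000, n_features=20,
                                   weights=[0.7, 0.3], random_state=0)

# Small training subsample (100-500 observations, as in the study).
idx = rng.choice(len(X_pop), size=300, replace=False)
X_small, y_small = X_pop[idx], y_pop[idx]
holdout = np.ones(len(X_pop), dtype=bool)
holdout[idx] = False  # everything outside the subsample

models = {
    "logistic_lasso": LogisticRegression(penalty="l1", solver="liblinear"),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    # CV estimates of ROC AUC computed from the small sample alone.
    for k in (5, 10):
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
        results[(name, f"{k}-fold CV")] = cross_val_score(
            model, X_small, y_small, scoring="roc_auc", cv=cv).mean()
    # "True" generalization AUC: fit on the subsample, score on the rest
    # of the population.
    model.fit(X_small, y_small)
    probs = model.predict_proba(X_pop[holdout])[:, 1]
    results[(name, "population")] = roc_auc_score(y_pop[holdout], probs)

# Comparing each CV estimate to the population AUC gives the bias of
# that CV method for this model and subsample.
for key, auc in sorted(results.items()):
    print(key, round(auc, 3))
```

The study's benchmarking repeats comparisons like this across many real datasets, subsamples, imbalance levels, and N:P ratios, summarizing the bias distribution for each CV method.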
Presenter: Dr. Nathaniel (Nate) O'Connell is an assistant professor in the Department of Biostatistics and Data Science at the Wake Forest University School of Medicine. He earned his PhD in Biostatistics from the Medical University of South Carolina in 2018 and joined the faculty at Wake Forest shortly after. Dr. O'Connell is a collaborative biostatistician and data scientist, with most of his applied research focused in the medical disciplines of cancer, pediatrics, neurology, emergency medicine, and EHR data analysis. With experience developing clinical prediction models with various types of data and sample sizes in these domains of medicine, his methodological research interests are motivated and guided by his practical experience. In his own research, Dr. O'Connell is largely interested in the practical development and implementation of prediction models using machine learning, assessing optimal approaches and strategies for applying ML models to medical research data, with the ultimate goal of easing the adoption of developed models into practice.
------------------------------
Zhihua Su, PhD
Associate Professor
Department of Statistics
University of Florida
zhihuasu@stat.ufl.edu
------------------------------