UPSTAT 2024 Abstracts for Talks

Parallel Talk Abstracts

Ordered by presenter last name. 

§ indicates eligibility for student awards.

** indicates invited talks.

 

AI-Assisted Exploration of RPD Traffic Stop Data **
Irshad Altheimer, R.I.T. Department of Criminal Justice and Center for Public Safety Initiatives
Jillian Antol, R.I.T. Department of Criminal Justice and Center for Public Safety Initiatives
Ava Douglas, R.I.T. Department of Criminal Justice and Center for Public Safety Initiatives

Police traffic stop data presents an opportunity to examine the administration of justice for a common police practice and provides a window for exploring disparities in police contacts. We utilize ChatGPT 4.0 to examine data from 40,000 Rochester Police Department (RPD) traffic stops. We discuss the key findings, our overall impressions of using generative AI tools for research purposes, and the broader implications of AI research tools for action researchers seeking to solve problems associated with violence and justice.

 

Enhancing Statistical Learning: Leveraging Kolb's Cycle in CourseKata Curriculum Analysis
DataFest Team Positiv Correlation: 
Jonathan Bateman, Cara Stievater, Nishka Desai, Vivian Hernandez, Jack Cawthon, Undergraduate Students, Rochester Institute of Technology

This report examines how CourseKata can enhance the student learning experience in statistics by employing Kolb's cycle as a framework. Kolb's cycle comprises four stages: concrete learning, reflective observation, abstract conceptualization, and active experimentation. We investigate the completion of this cycle and its impact on student performance. Through hypothesis testing and data analysis, we explore the relationship between students' engagement, review habits, psychological constructs, and performance. Using statistical tools like Python and JMP Pro, we analyze end-of-chapter review scores and other relevant variables. Our findings suggest potential improvements to CourseKata's curriculum, emphasizing practical experimentation and real-world application to enhance student learning outcomes. Additionally, we propose further investigation into the alignment of CourseKata's curriculum with the stages of Kolb's cycle for comprehensive learning assessment and refinement.

 

Differential abundance analysis method for identifying biologically relevant microbes §
Robert Beblavy, Graduate Student, University of Rochester

Identifying biologically relevant microbes has been a crucial task in human microbiome studies, as different microbial communities have been found between healthy and diseased groups in almost every investigated disease. Numerous computational and statistical methods have been proposed for this task, commonly known as differential abundance analysis. However, existing methods ignore the fact that different microbes can perform the same or very similar functions, known as functional redundancy, as well as the contribution of unknown microbes to functional shifts. In targeting microbes for therapeutic applications, it is crucial to determine differentially abundant (DA) microbes that are associated with differences in function across groups. We introduce a two-stage procedure involving L1-penalized probit regression models with a linear constraint on the regression parameters and a debiasing procedure to identify DA taxa given DA functions/pathways. We performed simulation studies to illustrate the consequences of ignoring functional redundancy and the contribution of unknown microbes, and to assess the performance of the proposed method. We also applied the proposed method to a real microbiome dataset and showed substantial differences between the typical taxonomic-based DA analysis and the proposed method.

 

Community-Engaged Data & Analytics Education: Empowering the Student & the Community Organization
Travis Brodbeck, PhD, Associate Professor, Department of Mathematics, Siena College

Community-engaged learning matches an opportunity to support a local organization with the skills of undergraduate students. Leveraging this socially-oriented approach to learning with a quantitative project enables students to enhance their data management and analysis skills while supporting the needs of the community. Through offering community-engaged data projects, students gain firsthand experience working with real data, performing analysis, and communicating findings. Data-based community engagement projects can emerge in any style of class and can be at any depth.

Rather than guiding students through an academic or textbook example of a data analysis problem, a tangible and relevant community-engaged challenge potentially provides students with better insight into the problem and potential solutions. The benefits of community-engaged data projects go beyond teaching the course concepts: they provide students opportunities to connect with residents and organizations and to broaden their perspectives through diverse experiences.

This session discusses four examples of using community-engaged data and analysis projects to enhance student learning while giving back to the local organizations. Data collection, analysis, and presentation are skills that can be taught in an introductory accounting course, in an advanced accounting course, and in research methods courses. Across a collection of projects over time, instructors will be able to develop a portfolio of projects to offer to nonprofit organizations, streamlining the project development process and leading to a more comprehensive community-engaged experience for students and community members.

 

Analyzing the Impact of the Learning Assistant Program on Student Success in Introductory Physics Courses §
Cameron Bundy, Graduate Student, Rochester Institute of Technology

Compared to lecture-based teaching, active learning teaching methods have been shown to be associated with notable benefits, including reductions in course failure rates and increased graduation rates. The Learning Assistant (LA) program, widely implemented at RIT and across the US, involves students (LAs) aiding their peers through evidence-based collaborative activities in STEM courses. We can measure the effectiveness of the LA program by calculating the rate at which students earn a grade of D or F or withdraw from the course (DFW rates), or by computing the six-year graduation (SYG) rates for cohorts of students. While the LA program is extensively utilized, its specific impact on students from marginalized identities at RIT remains relatively unexplored. In this study, we use logistic regression and hierarchical linear models (HLM) to assess the impact of LAs in introductory physics courses. In both statistical modeling approaches, we found a significant reduction in DFW rates and enhanced SYG rates. Notably, the positive effects are particularly pronounced for non-male and first-generation students. These findings align with similar studies on university LA programs, reinforcing their validity. Utilizing both logistic regression and HLM approaches, we analyze the LA program's effectiveness through multiple lenses. These results underscore the program's importance and support the case for continued investment at RIT. In an era demanding a robust STEM workforce, the identification of factors promoting student success is crucial. This study highlights the LA program as an alternative pathway toward fostering student success at RIT and other US universities.
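To make the first modeling approach concrete, the sketch below fits a logistic regression of a DFW outcome on an LA-supported-section indicator and a first-generation flag. The data, variable names, and effect sizes are simulated for illustration; this is a sketch of the model class, not the authors' analysis.

```python
# Minimal sketch (simulated data, hypothetical effects): logistic regression of a
# DFW outcome on an LA-section indicator, a first-generation flag, and their interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "la_section": rng.integers(0, 2, n),   # 1 = section supported by Learning Assistants
    "first_gen": rng.integers(0, 2, n),    # 1 = first-generation student
})
logit_p = -1.0 - 0.5 * df["la_section"] + 0.6 * df["first_gen"] \
          - 0.3 * df["la_section"] * df["first_gen"]
df["dfw"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

fit = smf.logit("dfw ~ la_section * first_gen", data=df).fit(disp=0)
print(np.exp(fit.params).round(2))         # odds ratios for earning a D/F or withdrawing
```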

 

Improving Federated Learning Robustness towards Security Violations and Data Quality Degradations OR 
How to Learn Better via Extracting Knowledge and Accumulating History? § **
Sergei Chuprov, Graduate Student, Rochester Institute of Technology

In this part of our talk, we highlight our novel approach to enhance Federated Learning (FL) security against local Data Quality (DQ) variations caused by malicious manipulations. This method leverages the inherent FL feature, where only model updates are exchanged, to detect and exclude compromised local units from the learning process. By analyzing the indirect relationship between local DQ and the resulting model updates, we can identify units contributing poor DQ, which might be the result of an adversarial attack. We further evaluate the overall quality of updates and extract knowledge on the local units through clustering of these updates. Based on this combined analysis, we leverage our Reputation and Trust mechanisms to accumulate history and categorize units as trusted or not. Updates from untrusted units are excluded from the aggregation, and the local units that provided them are denied access to the global model update, preventing potential inference attacks. We validate the effectiveness of our approach in both normal and data poisoning scenarios using real-world datasets, demonstrating its ability to improve both FL robustness and security.

 

Exploring Statistical Ethics with Students
Caitlin Cunningham, Associate Professor, Department Chair of Mathematics, Le Moyne College

Modern statistics education increasingly needs to address the importance of scientific and statistical ethics. In our capstone course for the applied statistics minor at Le Moyne College, we spend three weeks exploring ethical issues in statistical research. This discussion begins by exploring the reproducibility crisis in psychology and the Many Labs projects run by the Open Science Framework. Students are encouraged to brainstorm possible causes of this crisis, and to explore the systemic causes of research fraud. We examine the concept of p-hacking and explore the replicability of p-values with a simulation exercise in class. Students develop their own sense about the reliability of p-values and learn about the concept of pre-registration, reading other pre-registrations and developing their own. To end the unit, students work in small groups to examine specific cases including Amy Cuddy, Brian Wansink, and Diederik Stapel, and the class discusses the scale, significance, and consequences of their potential fraud. While most courses don’t have the luxury of this much time, this talk will provide a number of resources including readings, videos, and podcasts that can be used in class or shared with students. In addition, it will share in-class activities, homework assignments, and possible group projects that can be used, modified, or adjusted depending on an instructor’s goals and the time available.
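The in-class replicability simulation could look something like the sketch below (an illustration with made-up sample size and effect size, not the course's actual exercise): repeatedly rerun the same small two-group study and watch how widely the p-values scatter.

```python
# How much do p-values vary across replications of the same small study?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, true_diff, n_reps = 30, 0.3, 1000            # small study, modest true effect

pvals = []
for _ in range(n_reps):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_diff, 1.0, n)
    pvals.append(stats.ttest_ind(treatment, control).pvalue)

pvals = np.array(pvals)
print(f"share of replications with p < 0.05: {np.mean(pvals < 0.05):.2f}")
print("p-value quartiles:", np.round(np.quantile(pvals, [0.25, 0.5, 0.75]), 3))
```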

 

Building a Data Science Collaboratory
William Cipolli, PhD, Associate Professor, Department of Mathematics, Colgate University
Joshua Finnell, Director of Research & Scholarly Initiatives, Colgate University

The Data Science Collaboratory was established at Colgate University in 2018 to address the gap in statistical research support at smaller institutions. During this work, we discovered several opportunities to support the teaching and learning of statistics through the pedagogical lens of the liberal arts, which the work of this grant aims to address. We have worked to (1) develop software that addresses barriers to teaching and learning quality statistical practice, along with accompanying materials for guiding the learning of foundational concepts in statistics; (2) create a series of advanced case studies that instructors can use to guide quantitative analyses in courses that use statistics; and (3) cultivate a multi-institutional community that fosters opportunities for cross-institutional student collaboration and learning. We will discuss the interactive R Shiny web apps we developed to conduct technically sound statistical analyses that empower instructors, students, and the broader community. These free resources provide exposure to current quantitative research and equip users with the skills to select, apply, and interpret appropriate techniques in real-world contexts, regardless of their prior mathematical or coding experience. We believe this approach will make statistics and quantitative research more accessible and approachable for all learners.

 

Probability Playground: Five Years On §
Adam Cunningham, Undergraduate Student, University at Buffalo

Every year several hundred thousand university students take classes in statistics and probability theory in departments of mathematics, statistics, economics, business, social science, and psychology. These classes typically introduce students to a range of “named” or “special” distributions which describe situations commonly encountered in statistical applications. An understanding of these distributions is therefore a fundamental part of students’ statistical education.

Probability Playground (http://www.acsu.buffalo.edu/~adamcunn/probability/probability.html) is a website designed to introduce the most commonly encountered univariate probability distributions and their relationships. The website provides a dynamic and interactive environment for exploring how the shapes of distributions relate to their parameters, how distributions are visually related through transformations and limiting cases, and how their visual form relates to their data-generating processes. Unique features include the interactive loading of examples, parameter editing through either text or sliders, direct interactive editing of distribution means and variances, manual and automatic axis scaling, interactive graphing of distribution relationships, animated simulations of data-generating processes, and accessible navigation using either a mouse or keyboard.

A prototype of Probability Playground was first presented at UPSTAT-2019. Over the intervening years it has evolved into a robust and extensive resource for visually illustrating probability distributions and their relationships. The website has been widely adopted, with users from 140 countries over the last year and over 2,000 users each month. It has been tested for compatibility with all major operating systems, web browsers, and devices, providing an “off-the-shelf” solution for educators looking to incorporate interactive and exploratory learning into their classrooms.

 

Exploring race and crime in a Rust Belt town using AI-assisted analysis **
O. Nicholas Robertson, PhD, Associate Professor, Department of Criminal Justice, Rochester Institute of Technology
Venita D’Angelo, Graduate Student, Department of Criminal Justice, Rochester Institute of Technology 

Using 30 years of aggregated data from the New York State Division of Criminal Justice Services, analysis was conducted using traditional tools to understand racial disparities in arrests as the town of Irondequoit, NY became more diverse. This presentation builds upon previous work by using AI to improve data visualization and to explore suggested analyses of the data.

 

Improving Coastal Adaptation Decision-Making Using a Robust Decision Criterion §
Carolina Estevez, Graduate Student, Rochester Institute of Technology

Coastal areas can be significantly impacted by rising sea levels in many ways, including infrastructure, wetland, and dryland losses, coastal erosion, flooding, saltwater intrusion, and economic instability in vulnerable regions. Coastal adaptation models are useful tools for developing strategies to mitigate the risks associated with sea level rise. In this work, we use MimiCIAM, a coastal impacts and adaptation model that considers future scenarios of sea level rise as one of its input variables. Like many adaptation models, MimiCIAM assumes that coastal decision-makers will act optimally to safeguard against future coastal hazards by minimizing a decision criterion. However, this representation likely misses risk aversion and other robustness elements of real on-the-ground decision-making. We address this issue by introducing a robust decision-making criterion into MimiCIAM. We incorporate decision-maker uncertainty about future sea level rise and use this framework to compute the economic regret of each candidate adaptation decision as the new decision criterion. We find that using regret leads to lower coastal adaptation costs and damages compared to a strategy that minimizes expected costs and damages. Additionally, there is a global trend of favoring Retreat over Protect, and implementing regret results in higher protection levels driven by decision-makers’ risk aversion. By integrating regret into MimiCIAM’s decision criterion, we improve realism and address the risk-aversion preferences of decision-makers.
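The regret calculation itself is simple and can be illustrated with a generic example (hypothetical cost numbers, not MimiCIAM output): for each sea level rise scenario, regret is the cost of a decision minus the cost of the best decision for that scenario.

```python
# Generic illustration of a regret-based decision criterion (hypothetical numbers).
import numpy as np

# costs[i, j] = total cost of adaptation decision i under sea level rise scenario j
costs = np.array([
    [10.0, 11.0, 60.0],    # minimal adaptation: cheap unless sea level rise is severe
    [18.0, 19.0, 30.0],    # moderate protection
    [32.0, 32.0, 33.0],    # high protection standard: expensive but scenario-insensitive
])

best_per_scenario = costs.min(axis=0)        # cheapest achievable cost in each scenario
regret = costs - best_per_scenario           # economic regret of each decision
print("decision minimizing expected cost    :", costs.mean(axis=1).argmin())
print("decision minimizing expected regret  :", regret.mean(axis=1).argmin())
print("decision minimizing worst-case regret:", regret.max(axis=1).argmin())
```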

 

Hypothesis Testing about the Pearson and Spearman Correlation Coefficients: Navigating Pitfalls, Software Anomalies, and Alternative Approaches §
Mengyu Fang, Graduate Student, Roswell Park Comprehensive Cancer Center

Evaluating the significance of a correlation coefficient is a cornerstone of statistical analysis. In this study, we evaluated the Type I error control of the tests commonly found in statistical software packages for testing H0: ρ = 0 vs. HA: ρ > 0 based on the two widely employed measures, the Pearson and Spearman correlation coefficients. Our findings reveal that conventional tests often exhibit substantially inflated Type I error rates, a flaw that persists even under the assumption of bivariate normality. To address this issue, we propose a robust permutation test for H0: ρ = 0 based on an appropriately studentized statistic, extending DiCiccio and Romano’s method. A comprehensive set of simulation studies across a range of sample sizes and various underlying distributions shows that the proposed test exhibits robust Type I error control, even when the sample size is small. The practical utility of our approach is further underscored through its application in two real-world case studies, showcasing its potential to enhance the rigor of statistical inference in empirical research.
Additional Authors: Dr. Alan D. Hutson, Dr. Han Yu from Roswell Park Comprehensive Cancer Center
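The flavor of the proposed test can be conveyed with a short sketch: permute one variable, recompute a studentized correlation each time, and compare the observed statistic to the permutation distribution. The studentization below is one common robust choice in the spirit of DiCiccio and Romano; the authors' exact statistic and implementation are described in the talk.

```python
# Schematic permutation test for H0: rho = 0 using a studentized Pearson correlation.
import numpy as np

def studentized_corr(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))
    # variance estimate intended to remain valid without bivariate normality
    tau2 = np.mean(xc**2 * yc**2) / (np.mean(xc**2) * np.mean(yc**2))
    return np.sqrt(len(x)) * r / np.sqrt(tau2)

def permutation_pvalue(x, y, n_perm=5000, seed=0):
    rng = np.random.default_rng(seed)
    observed = studentized_corr(x, y)
    perm = np.array([studentized_corr(x, rng.permutation(y)) for _ in range(n_perm)])
    return np.mean(perm >= observed)            # one-sided: HA rho > 0

rng = np.random.default_rng(1)
x = rng.standard_t(df=3, size=40)               # heavy-tailed data, small sample
y = 0.4 * x + rng.standard_t(df=3, size=40)
print("one-sided permutation p-value:", permutation_pvalue(x, y))
```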

 

Understanding the Impact of Structural Uncertainty in Sea Level Rise Models on Adaptation Costs and Strategies §
Kelly Feke, Undergraduate Student, Rochester Institute of Technology

Sea level rise (SLR) is a major consequence of climate change, posing significant threats to coastal communities and infrastructure worldwide. Adaptation actions must be taken to protect these communities from SLR, and these actions can have sizable costs. There are many climate models predicting global mean sea level (GMSL) rise. However, the variety of models and their associated uncertainties leads to uncertainty in future sea level rise predictions, which influences the actions that should be taken to mitigate the negative impacts of local sea level rise. Although uncertainties in projections of SLR have been well studied, the variability of adaptation costs stemming from model structural uncertainty is relatively understudied. Here we characterize structural uncertainties in SLR across different global climate models and their impacts on predicted coastal damages along the U.S. Gulf of Mexico Coast. Using the integrated modeling framework Mimi and the models MimiCIAM and MimiBRICK, the range of adaptation costs is calculated from SLR predictions of 20 different models from the Coupled Model Intercomparison Project Phase 6 (CMIP6). We show how higher-impact scenarios have wider uncertainty ranges in adaptation cost and strategy. The heightened variability makes it uncertain how severe the impacts of SLR will be and what solutions can effectively mitigate them. This drives the need to quantify the uncertainties associated with SLR models and evaluate their impacts on adaptation decisions.
Additional Authors: Tony Wong, RIT

 

Improving Students' Perspectives on Statistics: From Loathing to Loving
Kylee A. Healy, Undergraduate Student, Niagara University

Teaching statistics provides the opportunity to share valuable information with students across multiple majors. However, students are often apprehensive about learning math in general, and statistics in particular. This poses a potential problem for educators when it comes to encouraging students to continue to explore this expanding field. The methods employed in teaching statistics have remained largely unchanged over the years, while other subjects have adapted to the constantly changing classroom environment. Research regarding these techniques has resulted in varied conclusions, with some proposing the potential for new methods to be employed to promote student success and interest. Incorporating humor, real-life examples, and active learning within the classroom are just a few of the adjustments that educators can make while designing their lesson plans. Prior studies have shown positive feedback from students after teachers use these tactics. In this presentation we will review the statistics education literature and critically evaluate teaching techniques. Furthermore, we will suggest new ways to keep students interested and motivated in the classroom. Although statistics is a riveting topic to many, these modifications may allow more students the opportunity to engage in, and even enjoy, the benefits of understanding statistics.
Additional Authors: Joseph D. Martino, and Susan E. Mason

 

A Bayesian framework for medical product safety assessment using correlated spontaneous reporting system data §
Xin-Wei Huang, Graduate Student, University at Buffalo

Determining adverse events (AEs) of concern for drugs or other medical products from spontaneous reporting system databases is a core challenge of pharmacovigilance. These databases catalog reports on a multitude of AEs and drugs. However, the similarity and interaction between drugs and AEs often induce correlations in the recorded data. Existing statistical approaches for pharmacovigilance, including likelihood ratio test-based methods, do not adequately address these correlations, potentially leading to suboptimal assessments. We propose a formal multilevel Bayesian framework that acknowledges these correlations more faithfully. Our approach can handle zero-inflated report counts, rigorously controls the (positive) false discovery rate when used for AE signal discovery, and, through appropriately articulated shrinkage prior layers, aids statistically stable and interpretable inference. Our implementation uses state-of-the-art MCMC sampling (via Stan). Extensive simulation experiments document notable superiority of our approach over existing approaches. We present a case study involving statin drugs. An R package implementing our methods has been developed.

 

Mixed Models and Imputation: Tackling the DataFest Dataset
DataFest Team Big Data Energy: 
Daniel Illera, George Daher, Isabella Wang, Undergraduate Students, University of Rochester

One of the goals of CourseKata, an online textbook service, is to “improve important learning outcomes for all students” (https://coursekata.org/about). We analyzed data collected from their online textbooks with the intention of finding relationships between students' self-reported metric of understanding and other variables in the data, in order to evaluate student success and provide suggestions to CourseKata on how best to improve their product. We used a cumulative logit mixed model to model this response as a function of how useful each student found the material and the number of questions each student answered correctly in each chapter, while accounting for both random effects within classes and students as well as fixed effects and interactions involving institution, the edition of the textbook used, and the chapter the student was completing.

Additionally, we found evidence that the missingness in the data is driven by the class the student is in, which is why we argued that this data is Missing At Random. Consequently, we used Predictive Mean Matching within the framework of multiple imputation to fill in the missing values using most of the variables in our dataset, then we compared the results to those from complete case analysis. We found that the confidence intervals of one of our variables became tighter, showing some benefit to imputation here.

All in all, we did not find any significant relationships between the self-reported metric of understanding and the number of questions answered and how useful a student found the topic.
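For readers unfamiliar with predictive mean matching, the bare-bones, single-variable sketch below conveys the idea on simulated data (the team's analysis used a full multiple-imputation workflow, not this code): regress the variable on its predictors, then impute each missing value by sampling from observed donors whose predicted values are closest.

```python
# Bare-bones predictive mean matching for one variable (illustration only).
import numpy as np

def pmm_impute(y, X, n_donors=5, seed=0):
    """Impute missing entries of y by predictive mean matching on predictors X."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])            # add intercept
    beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    yhat = Xd @ beta                                       # predictions for all cases
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        # donors: observed cases whose predictions are closest to case i's prediction
        donors = np.argsort(np.abs(yhat[obs] - yhat[i]))[:n_donors]
        y_imp[i] = rng.choice(y[obs][donors])
    return y_imp

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=200)
y[rng.choice(200, size=40, replace=False)] = np.nan       # make 40 values missing
print("first few imputed values:", np.round(pmm_impute(y, X)[np.isnan(y)][:5], 2))
```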

 

Detection of interpretable communities in multilayer biological networks using a penalized sparse factor model §
Saiful Islam, Graduate Student, Institute for Artificial Intelligence and Data Science, University at Buffalo

Data of multilayer biological networks, such as gene expression data from different tissues and populations, and multi-omics data, are increasingly available. There exist dozens of methods to detect communities (i.e., densely connected groups of nodes) in multilayer networks, which may inform us of, e.g., sets of genes sharing biological functions. However, a fundamental limitation of most of these methods is interpretability. These methods often yield communities that span some nodes and some layers but in an inhomogeneous manner, making it difficult to interpret the results and hindering insights into the biological processes underlying the observed data. Here we propose an approach to identify interpretable semi-pillar communities within multilayer correlational biological networks, particularly focusing on application to multilayer gene co-expression data. By leveraging sparse factor models, we incorporate layer-specific, community-specific, and gene information into the loading matrix, enhancing the interpretability of community structures and reducing the computational complexity. By optimizing the likelihood function using the expectation-maximization method, we aim to uncover semi-pillar communities. We validate our method on synthetic data generated from a factor model with planted semi-pillar communities. Application to empirical gene co-expression data is forthcoming. The detection of semi-pillar communities in multilayer gene co-expression networks will facilitate intuitive understanding and help domain experts focus their in-depth investigation of biological systems.
Additional Authors: Naoki Masuda (Dept of Mathematics & Institute for Artificial Intelligence and Data Science, SUNY at Buffalo)

 

Bayesian Infinite Mixture Models for Clustering Exposure Data §
Jonathan Klus, Graduate Student, University of Rochester

Dirichlet process mixture models (DPMM) are an attractive way to avoid specifying the number of groups in a mixture when this is not known a priori. The Dirichlet process prior treats the number of groups as unknown, which allows for improved uncertainty quantification relative to many other methods. However, this choice of prior complicates estimation and posterior inference. We explore the implications of using different algorithms to fit a DPMM, depending on the chosen variance structure and whether priors for the mean and variance are conjugate. We evaluate the sampling performance of the algorithms under different model assumptions (equal variance, diagonal variance, unstructured variance-covariance matrix) for clustering problems of varying difficulty in simulations, and make recommendations for preferred algorithms based on properties of the data to be analyzed.
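As a point of reference for readers, the snippet below fits a truncated variational approximation to a Dirichlet process Gaussian mixture with scikit-learn on simulated data. It is a stand-in for the idea only; the talk compares MCMC samplers under different variance structures and prior choices, which this variational fit does not reproduce.

```python
# Dirichlet-process-style mixture clustering via scikit-learn's truncated
# variational approximation (illustration only, simulated data).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ([0, 0], [3, 3], [0, 4])])

dpmm = BayesianGaussianMixture(
    n_components=10,                                   # truncation level, not the true K
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",                            # unstructured variance-covariance
    random_state=0,
)
labels = dpmm.fit_predict(X)
print("effective number of clusters:", len(np.unique(labels)))
print("mixture weights:", np.round(dpmm.weights_, 3))
```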

 

Statistical perspectives on the use of neural networks for clinical prognosis **
Douglas Landsittel, Professor, Department of Biostatistics, SPHHP, University at Buffalo

Neural networks (NNs) have become extremely popular for complex (but largely deterministic) problems in diagnostic imaging, character recognition, large language models, and other so-called AI applications. However, the complexity of NNs has yielded far less utility for clinical prognosis and predicting treatment response, where the focus is on future outcomes and on characterizing individual variability that depends on largely unknown underlying stochastic models. These types of problems are far more challenging for AI methods, which offer little ability to control the underlying model complexity and parameterization. Our past work investigated one such limitation: overfitting. More specifically, our group developed a measure called the null degrees of freedom (NDF) to estimate the model complexity of a NN under the null of no association. The NDF is estimated through the simulated chi-squared distribution, and associated mean, of the likelihood ratio statistic for testing model independence. For the simple case of a binary outcome and only binary predictors, the NDF of a single hidden layer NN simplifies to the total number of predictor variables plus the total number of potential interactions. For continuous predictor variables, we provide an estimated NDF that further accounts for non-linearity. We describe how these findings can be applied to prognostic models for kidney failure in patients with polycystic kidney disease. These approaches should be considered to motivate the application of statistical principles, and the role of statisticians, to help overcome limitations of AI methods and improve clinical prognosis.

 

The Asymptotic Distributions of Generalized Win Odds §
Honghong Liu, Graduate Student, University of Rochester

The win ratio was coined by Pocock and his colleagues (Pocock et al., 2012) to study treatment effects in heart research. Since then, it has been widely used in many different areas with prioritized multiple outcomes. The win ratio is the ratio of the probability of a win to the probability of a loss (Oakes, 2016). Like the risk ratio for binary outcomes, the win ratio has a wide range and may be very sensitive to the probability of loss. A further critical concern with the win ratio is the presence of ties, as emphasized by Pocock et al. (2012), who cautioned against its use when ties are substantial. In this paper, we define a family of generalized win odds (GWO) that can easily take the tie probability into account. U-statistics theory is used to study the asymptotic distributions of GWOs. We also develop an analytic method to find the optimal GWO, which has the largest power to detect the treatment difference. We illustrate our findings using data from a prostate cancer study.
Additional Authors: Changyong Feng, University of Rochester
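The basic quantities behind the paper can be estimated directly from all treatment-control pairs; the sketch below computes the win ratio and a simple tie-adjusted win odds on made-up ordinal data. The paper's GWO family and its asymptotic theory are not reproduced here.

```python
# Estimating the win ratio and a tie-adjusted win odds from pairwise comparisons
# of a single prioritized outcome (hypothetical data, higher value = better).
import numpy as np

rng = np.random.default_rng(0)
treatment = rng.integers(0, 5, size=60)
control = rng.integers(0, 5, size=55)

diff = treatment[:, None] - control[None, :]                 # all pairwise comparisons
p_win, p_loss, p_tie = np.mean(diff > 0), np.mean(diff < 0), np.mean(diff == 0)

win_ratio = p_win / p_loss
win_odds = (p_win + 0.5 * p_tie) / (p_loss + 0.5 * p_tie)    # ties split evenly
print(f"P(win)={p_win:.3f}, P(loss)={p_loss:.3f}, P(tie)={p_tie:.3f}")
print(f"win ratio = {win_ratio:.3f}, tie-adjusted win odds = {win_odds:.3f}")
```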

 

Impact of COVID-19 on English Football Premier League: Analyzing Rankings and Home-Field Advantage Using New Extended Bradley-Terry Models §
Honghong Liu, Graduate Student, Department of Biostatistics & Computational Biology, University of Rochester

COVID-19 had a profound impact on the format of sports tournaments around the world. Not only were teams affected by COVID-19 outbreaks, but a significant change was the restriction or elimination of spectators at matches. These circumstances led to fluctuating team compositions and highly variable match results, offering a unique opportunity to investigate the pandemic's impact on overall performance and to better understand home-field advantage. We proposed a new extended Bradley-Terry model. Using data from the English Premier League across 10 seasons, we applied and compared the new model and two other extended Bradley-Terry models. These models were employed to assess changes in team rankings and home-field advantage before, during, and after the pandemic-related disruptions. Our analysis revealed significant variations in team performance and home-field advantage during the pandemic. The findings suggest a noticeable shift in team dynamics and competitive balance, particularly during the seasons played under pandemic restrictions. The study provides insights into the tangible impacts of COVID-19 on English Premier League football, highlighting the pandemic's influence in altering team dynamics and disrupting the traditional advantage of playing at home. These findings offer a valuable perspective for sports analysts, clubs, and governing bodies in understanding and adapting to the challenges posed by such global disruptions.
Additional Authors: Tong Tong Wu, Associate Professor, Department of Biostatistics & Computational Biology, University of Rochester
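For orientation, the baseline sketch below fits the standard Bradley-Terry model with a shared home-field term as a logistic regression on simulated decisive matches. Team names, strengths, and the home advantage are invented, and drawn matches (which the extended models in the talk are designed to handle) are ignored.

```python
# Standard Bradley-Terry model with a home-field term, fit as logistic regression
# on simulated decisive matches (illustration only).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
strength = {"Arsenal": 0.8, "Liverpool": 0.6, "Chelsea": 0.4, "Everton": 0.0}
teams, home_adv, rows = list(strength), 0.3, []
for _ in range(5):                                    # several home-and-away rounds
    for h in teams:
        for a in teams:
            if h != a:
                p = 1 / (1 + np.exp(-(home_adv + strength[h] - strength[a])))
                rows.append({"home": h, "away": a, "home_win": int(rng.random() < p)})
matches = pd.DataFrame(rows)

X = pd.DataFrame(0.0, index=matches.index, columns=teams)
for i, row in matches.iterrows():
    X.loc[i, row["home"]] = 1.0                       # +1 for the home team's strength
    X.loc[i, row["away"]] = -1.0                      # -1 for the away team's strength
X = X.drop(columns="Everton")                         # reference team, strength fixed at 0
X.insert(0, "home_advantage", 1.0)                    # shared home-field advantage term

fit = sm.Logit(matches["home_win"], X).fit(disp=0)
print(fit.params.round(2))                            # home-field term and relative strengths
```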

 

A focus on the strengths of Julia for Reproducibility §
Yang Liu, Graduate Student, Rochester Institute of Technology

This talk explores how Julia, a high-performance language with an intuitive syntax, empowers scientists to achieve significant productivity gains. By bridging the gap between scientific and engineering code, Julia allows researchers to focus on problem-solving rather than grappling with complex programming tasks. The talk will showcase how Julia can deliver speedups of hundreds to thousands of times compared to traditional scientific computing languages.

Furthermore, Julia's rich ecosystem fosters reproducible research, a cornerstone of scientific progress. The talk will delve into overcoming two key reproducibility challenges: ensuring code can be reliably run by the researcher themselves (self-reproducibility) and by others (cross-reproducibility). Practical examples using Julia's DrWatson library and reactive Pluto notebooks will be provided, equipping researchers with the tools to streamline their workflow and guarantee robust research outcomes.

 

Backtrack to Success with CourseKata
DataFest Team Feisty Folk: 
Quan Lu, Gabriel Casselman, Jonah Witte and Benjamin Knigin 
Undergraduate Students, Rochester Institute of Technology

Our analysis of the CourseKata data set provided for the 2024 ASA DataFest aimed to provide solutions that would help improve the student experience of learning statistics. We focused our analysis on the tendency of students to backtrack within the CourseKata modules and review previous lessons, which indicates points of failure within the courses’ educational progression.

 

A Model-Based Clustering Method for High-Dimensional, Dependent Data with Categorical Outcomes §
Samantha Manning, Graduate Student, University of Rochester

A model-based clustering method for high-dimensional, longitudinal data with categorical outcomes via regularization is proposed. The development of this method was motivated in part by a study on 177 Thai mother-child dyads to identify risk factors for early childhood caries (ECC). Another considerable motivation was a dental visit study of 308 pregnant women to ascertain determinants of successful dental appointment attendance. There is no available method capable of clustering longitudinal categorical outcomes while also selecting relevant variables. Within each cluster, a generalized linear mixed-effects model is fit with a convex penalty function imposed on the fixed effect parameters. Through the expectation-maximization algorithm, model coefficients are estimated using the Laplace approximation within the coordinate descent algorithm, and the estimated values are then used to cluster subjects via k-means clustering for longitudinal data. The Bayesian information criterion can be used to determine the optimal number of clusters and the tuning parameters through a grid search. Simulation studies demonstrate that this method has satisfactory performance and is able to accommodate high-dimensional, multi-level effects as well as identify longitudinal patterns in categorical outcomes.

 

Correlation Network Analysis **
Naoki Masuda, Professor, Department of Mathematics & Institute of Artificial Intelligence and Data Science, University at Buffalo

Networks constructed from correlation matrix data have been analyzed in various domains, including the statistical sciences. In this talk, I review pitfalls and recommended practices of correlation network analysis. Then, I will introduce some of my work in this direction, such as reference models for correlation matrices inspired by methods in network science, estimation of sparse covariance networks, and clustering coefficients and multilayer community detection for correlation matrices.

 

A new class of odd Lindley type distributions and its applications
Nonhle Channon Mdziniso, Assistant Professor, Rochester Institute of Technology

Probability distributions and their applications continue to be of interest in mathematical statistics because of the convenience that comes with fitting data using parametric models. In this work, we analyzed lifetime data using a new class of odd Lindley type distributions which we developed by considering the T-X family of distributions and the power series distribution. Mathematical properties of this family of distributions are presented. A small simulation study is presented to assess coverage probabilities of estimated parameters for one of its special cases. Application examples based on lifetime data are then used to illustrate its versatility.
Additional authors: Fastel Chipepa, Shahid Mohammad, and Sher Chhetri

 

Evaluation of the TreeScan method for Adverse Event Identification §
Raktim Mukhopadhyay, Graduate Student, University at Buffalo

Adverse events from medications, vaccines, therapeutic biologics, and medical devices are a major public health concern, leading to hospitalizations and fatalities. Continuous monitoring of medical products after their market release is critical for patient and consumer safety. While clinical trial data serve as the primary source of information regarding the safety of medicinal products, we explore alternative data sources and describe a tool to collect data from these various sources. We briefly review the methods for identification of potential signals, with emphasis on the TreeScan method proposed by Kulldorff et al. (2003). This method serves as a bridge between likelihood ratio test-based methods and pattern discovery. We perform extensive simulations based on the statin drug class from the United States Food and Drug Administration (FDA) Adverse Event Reporting System database to gain a better understanding of the advantages and challenges of TreeScan. Its performance is evaluated using metrics such as level, power, sensitivity, and false discovery rate, among others. We further demonstrate the performance of TreeScan on a real-world dataset of COVID-19 vaccines obtained using our tool from the Australian Database of Adverse Event Notifications for medicines.
Kulldorff, M., Fang, Z., & Walsh, S. J. (2003). A tree-based scan statistic for database disease surveillance. Biometrics, 59(2), 323-331.
Additional Authors: Marianthi Markatou

 

Identifying brain network trajectories after traumatic brain injury **
Sarah Muldoon, Associate Professor, Department of Mathematics, Institute of Artificial Intelligence & Data Science & Neuroscience Program, University at Buffalo

Traumatic brain injury (TBI) damages white matter tracts and changes brain connectivity, but how specific changes relate to differences in clinical/behavioral outcomes is not known. Here, I’ll present recent work classifying different patterns of changes in brain network structure after injury in a rat model of TBI. We find that local changes in network structure can be used to define subgroups within injured rats that display different patterns of injury induced change. I’ll then discuss how ongoing work hopes to develop methods to detect different network trajectories as these networks differentially evolve over time and link the different trajectories to clinical outcomes.

 

Unaccounted for Uncertainty: The Significance of Uncertainty in the Optimization of the Kidney Exchange Problems §
Calvin Nau, Graduate Student, Rochester Institute of Technology

The kidney exchange problem (KEP) is an NP-hard problem that seeks to form kidney donation chains and cycles from a pool of non-directed donors (NDDs) and incompatible patient-donor pairs (PDPs). Often, the KEP is solved assuming a deterministic set of inputs such that the feasible exchanges from NDDs and donors in PDPs to the patient in a PDP are known prior to optimization. A few select works include uncertainty dependent on the donor’s biological characteristics. Regardless, a common result is that few of the exchanges obtained from optimization translate into transplants. The result is reduced quality of health from the patient’s perspective and a large financial cost to the healthcare system. However, this work argues that other factors, such as physician-perceived risk and acceptance threshold, contribute a significant amount of uncertainty to solutions. Further, this uncertainty is conditioned both on the characteristics of the kidney being offered and on the patient receiving the offer. Therefore, this work proposes a machine learning (ML) architecture that seeks to optimize exchanges while considering uncertainty, to achieve optimization results that translate to implementation more often. To develop this methodology, an overview will be given of previous ML architectures that operate over similar graph structures, and their relevance to previous KEP formulations will be addressed. Initial results are presented. The potential significance of this work will be discussed from the perspective of its public health impact and the optimization field at large.
Additional Authors: Prashant Sankaran, Moises Sudit, Payam Khazaelpour, Katie McConky, Alvaro Velasquez, Liise Kayler; Rochester Institute of Technology, Department of Industrial & Systems Engineering

 

Veracity Analysis: Current Methods and Limitations §
Le Nguyen, Graduate Student, Rochester Institute of Technology

In the modern digital age, we are constantly inundated with information and news stories, both fake and real. This problem is so pervasive that the World Health Organization has declared the world is currently in an "infodemic," in which so much conflicting information makes it hard to be certain about anything. In response, the field of veracity analysis, which develops automated techniques to determine which information is fake and which is real, is growing quickly. In this talk we will go over current methods used in veracity analysis and their limitations.

 

Hybrid Heating Optimization: Balancing Comfort and Cost with Electrical Heat Pumps and Natural Gas Furnaces **
Sydney Pendelberry, Research Computing Facilitator III, Rochester Institute of Technology

The integration of electrical heat pumps with natural gas furnaces presents a sustainable and cost-effective solution for residential heating. This paper explores the optimization of a hybrid heating system that leverages the efficiency of heat pumps for moderate conditions and the high-energy output of natural gas furnaces for extreme cold. Through the development of an algorithm that considers external temperature, internal comfort parameters, and energy prices, we propose a model that dynamically switches between heat sources to minimize operational costs while maintaining occupant comfort. Our study utilizes historical weather data, energy price fluctuations, and thermodynamic principles to simulate the performance of the proposed system across diverse climatic conditions. The results demonstrate significant cost savings and reduced carbon footprint without compromising thermal comfort. This research contributes to the evolving field of smart energy management in buildings, providing valuable insights for homeowners, architects, and urban planners aiming to enhance sustainability and efficiency in residential heating systems.
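The core switching logic can be conveyed in a few lines: compare the cost of delivering a unit of heat with each source and pick the cheaper one for the current hour. The prices, furnace efficiency, and COP curve below are placeholders, not the paper's calibrated model.

```python
# Toy hybrid-heating switching rule (hypothetical prices and COP curve).
def heat_pump_cop(outdoor_temp_c: float) -> float:
    """Rough placeholder: heat pump efficiency degrades as it gets colder."""
    return max(1.5, 4.0 + 0.08 * outdoor_temp_c)      # COP ~4 at 0 C, floor of 1.5

def cheaper_source(outdoor_temp_c: float,
                   electricity_price_per_kwh: float,
                   gas_price_per_kwh_heat: float,
                   furnace_efficiency: float = 0.95) -> str:
    cost_hp = electricity_price_per_kwh / heat_pump_cop(outdoor_temp_c)  # $ per kWh of heat
    cost_gas = gas_price_per_kwh_heat / furnace_efficiency
    return "heat pump" if cost_hp <= cost_gas else "gas furnace"

for temp in (10, 0, -10, -20):
    print(temp, "C ->", cheaper_source(temp, electricity_price_per_kwh=0.14,
                                       gas_price_per_kwh_heat=0.045))
```

A full optimization would add comfort constraints and look ahead over forecast weather and prices rather than deciding hour by hour.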

 

Data Quality Evaluation for Federated Learning (FL) OR 
What to Look at in Order to Improve FL? **
Leon Reznik, Professor, Department of Computer Science, Rochester Institute of Technology

Many science and technology applications employ (1) distributed structures to collect and communicate data from multiple, possibly multimodal sources, and (2) AI and machine learning (ML) techniques to process these data and derive discovery models. Federated Learning (FL) introduces distributed ML training that aims at privacy and security enhancement, by keeping training data and deriving models locally, and at resource consumption reduction, by not transferring training data to the aggregation unit. However, conventional FL does not consider the quality of the local models: in each training round it collects all local models and sends the aggregated model updates back to the local units without considering local unit data quality and security (DQS), which may constitute a major vulnerability when a local unit is already compromised and/or not working properly. We will discuss our approach to DQS calculation and its application in FL, in particular the composition of the DQS metrics to be used, ranging from accuracy to security, and their calculation and integration into total DQS indicators that could be employed to improve FL procedures. We will present the software tools and applications we produced to evaluate data quality and security, as well as the data we collected and made available to the community.

 

Assessment of the effectiveness of required weekly COVID-19 surveillance antigen testing at a university
Christopher W. Ryan, Professor, SUNY Upstate Medical University, and Agency Statistical Consulting, LLC

To mitigate the COVID-19 pandemic, many institutions implemented a regimen of periodic required testing, irrespective of symptoms. The sustained effectiveness of this "surveillance testing" is uncertain. I fit a zero-inflated negative binomial model to testing and case investigation data between 1 November 2020 and 15 May 2021, from young adult subjects with COVID-19 in one community. I compared the duration of symptoms at time of specimen collection in those diagnosed via (1) surveillance testing at a university, (2) the same university's student health services, and (3) all other testing venues. The data comprised 2934 records: 394 surveillance testing, 493 student health service, and 2047 from other venues. Predicted mean duration of pre-testing symptoms was 1.7 days (95% CI 1.59 to 1.83) for the community, 1.8 days (95% CI 1.61 to 1.98) for surveillance, and 2 days (95% CI 1.84 to 2.16) for student health service. The modelled "inflated" proportions of asymptomatic subjects from the surveillance stream and the other/community stream were comparable (odds ratio 0.95, P = 0.77); that proportion was significantly higher at student health service compared to surveillance (Chi-square = 12.32, P < 0.001). Surveillance testing at a university detected several hundred people with COVID-19, but no earlier in their trajectory than similar-aged people detected in the broader community. This casts some doubt on the public health value of such programs, which tend to be labor-intensive and expensive.
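The model class used here can be sketched as follows; the data, venue effects, and sample sizes are simulated stand-ins, not the study's data or estimates.

```python
# Zero-inflated negative binomial regression of symptom duration on testing venue
# (simulated illustration of the model class only).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

rng = np.random.default_rng(0)
venue = rng.choice(["surveillance", "student_health", "community"], size=1500)
asymptomatic = rng.random(1500) < np.where(venue == "student_health", 0.5, 0.35)
mu = np.where(venue == "community", 2.0, 1.7)
days = rng.negative_binomial(5, 5 / (5 + mu))              # overdispersed symptom days
symptom_days = np.where(asymptomatic, 0, days)

X = pd.get_dummies(pd.Series(venue), drop_first=True).astype(float)  # community = reference
X = sm.add_constant(X)

# The same covariates drive the count part and the "inflated" zero (asymptomatic) part.
model = ZeroInflatedNegativeBinomialP(symptom_days, X, exog_infl=X, p=2)
result = model.fit(method="bfgs", maxiter=500, disp=0)
print(result.summary())
```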

 

QuadratiK: A Comprehensive Tool for Goodness-of-Fit Tests and Spherical Data Clustering
Giovanni Saraceno, Postdoctoral Researcher, University at Buffalo

The advent of complex data structures in various disciplines necessitates the development of sophisticated analytical tools that bridge the gap between traditional statistical methodologies and contemporary data analysis needs. This work introduces the QuadratiK package, a novel tool implemented in both R and Python, that advances the field of data analysis through the application of kernel-based quadratic distances for goodness-of-fit tests and clustering techniques. Particularly, QuadratiK addresses the challenges posed by spherical data, offering researchers a robust framework for assessing the compatibility of data with assumed probability distributions and for exploring data clusters on the sphere.

Our package provides an efficient approach for conducting one, two, and k-sample tests for goodness of fit, facilitating the examination of data across multiple groups and conditions. Moreover, it incorporates a unique clustering algorithm specifically designed for spherical data, leveraging Poisson kernel-based densities to enhance the interpretability and usability of clustering results. The inclusion of graphical functions further aids users in validating and visualizing their analyses, making QuadratiK a versatile tool for comprehensive data investigation.

QuadratiK stands at the forefront of integrating rigorous statistical tests with practical data analysis techniques, serving as a valuable asset for both researchers and practitioners aiming to draw useful insights from their data. By providing a suite of powerful analytical tools, the package supports a wide array of applications, from biostatistics to machine learning, and opens new avenues for robust data exploration and interpretation.

 

Implementing Empirical Likelihood Within the Causal Inference Framework to Study Causal Effects of Air Pollution on Birth Outcomes
Sima Sharghi, Postdoctoral Researcher, University of Rochester Medical Center

Many prior studies have found adverse associations between air pollution and birth outcomes. In the absence of randomized trials to study the effects of air pollution on human health, observational data have been utilized, in which researchers estimate the causal associations between air pollution and health outcomes. Many of these studies rely on parametric assumptions that may not be realistic. In this work, we implement the nonparametric Empirical Likelihood Algorithm within the causal inference framework of both a classic methodology and a newer method that uses Machine Learning tools. We show the competitive results of the assumption-free Empirical Likelihood in simulations and apply the methods to study the causal association between PM2.5, NO2, and anogenital distance at birth in a study across four sites in the US.

 

AI assisted analysis of survey data on adoption, use, and consequences of body-worn cameras in the Monroe County Sheriff’s Office **
Owen Shedden, Graduate Student, Department of Criminal Justice, R.I.T.
John McCluskey, Professor, Department of Criminal Justice, R.I.T.

Surveys of police have probed orientations towards the adoption, use, and consequences of body-worn cameras (BWC) among U.S. police agencies for more than a decade. Sheriff’s offices have been relatively understudied during this period as organizations that have adopted BWC. Data collected from road patrol deputies in the Monroe County Sheriff’s Office (MCSO) is analyzed using techniques that draw on AI-assisted Python programming. This paper will discuss AI as an application for enhancing visualization and for creating comparative analyses of respondents in this organization as compared to peers in other organizations that have adopted BWC.

 

MELD: A Bayesian Framework to Accelerate Biomolecular Simulations Using External Data § **
Alfonso Sierra Uran, Graduate Student, RIT Kate Gleason College of Engineering 
Emiliano Brini, Assistant Professor, Rochester Institute of Technology, School of Chemistry & Materials Science

Physics-based simulations like molecular dynamics (MD) can illuminate the molecular mechanisms of many biological processes. However, plain MD is often computationally too expensive for practical use. The conformational space available to proteins, DNA, and other biological objects is too vast to be sampled effectively by vanilla MD simulations. Most approaches in the field get around this limitation by defining a path connecting several interesting conformations (states) and allowing the simulation to sample only along this pre-defined path. The problems with these approaches are that (1) such a path is sometimes extremely hard to define, and (2) states might also be hard to pinpoint. MELD (Modelling Employing Limited Data) introduces a different approach to this problem, eliminating the need to define a path and states. MELD leverages external information to limit the system's conformational space to only conformations that agree with external data and uses a replica exchange algorithm to move between states. MELD can leverage sparse, ambiguous, and uncertain data, which enables the use of information from experiments, AI predictions, bioinformatics tools, and general knowledge. This is possible thanks to the Bayesian framework of MELD, which, after converting the external information into sets of restraints, allows fractions of these sets to be active. This talk shows how the MELD approach can be applied to predicting protein structures and their stabilities, leveraging AlphaFold and experimental data.

 

A hierarchical Bayesian model for the identification of technical length variants in miRNA sequencing data §
Hannah K. Swan, Graduate Student, University of Rochester

MicroRNAs (miRNAs) are small, single-stranded non-coding RNA molecules with important gene regulatory function. MiRNA biogenesis is a multi-step process, and certain steps of the pathway, such as cleavage by Drosha and Dicer, can result in miRNA isoforms that differ from the canonical miRNA sequence in nucleotide sequence and/or length. These miRNA isoforms, called isomiRs, which may differ from the canonical sequence by as few as one or two nucleotides, can have different mRNA targets and stability from the corresponding canonical miRNA. As the body of research demonstrating the role of isomiRs in disease grows, so does the need for differential expression analysis of miRNA data at a scale finer than the miRNA level. Unfortunately, errors during the amplification and sequencing processes can result in technical miRNA isomiRs identical to biological isomiRs, making it challenging to resolve variation at this scale. We present a novel algorithm for the identification and correction of technical miRNA length variants in miRNA sequencing data. The algorithm assumes that the transformed degradation rate of canonical miRNA sequences in a sample follows a hierarchical normal Bayesian model. The algorithm then draws from the posterior predictive distribution and constructs 95% posterior predictive intervals to determine if the observed counts of degraded sequences are consistent with our error model. We present the theory underlying the model and assess the performance of the model using an experimental benchmark data set.
Additional Author: Matthew N. McCall, University of Rochester
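A much-simplified, non-hierarchical stand-in for the flagging step is sketched below: learn the distribution of transformed degradation rates from canonical sequences, then flag candidate length variants that fall outside a 95% predictive interval. The talk's model is fully hierarchical with proper posterior inference; all numbers here are simulated.

```python
# Toy stand-in for a posterior predictive check on transformed degradation rates.
import numpy as np

rng = np.random.default_rng(3)

# Transformed degradation rates of canonical sequences (used to learn the error model)
canon = rng.normal(-2.0, 0.5, size=300)
# Candidate length variants: mostly technical artifacts consistent with the error
# model, plus a few simulated biological isomiRs with much higher rates.
candidates = np.concatenate([rng.normal(-2.0, 0.5, size=45), rng.normal(0.0, 0.3, size=5)])

mu, tau = canon.mean(), canon.std(ddof=1)        # crude estimates standing in for posteriors
draws = rng.normal(mu, tau, size=20000)          # predictive draws under the error model
lo, hi = np.percentile(draws, [2.5, 97.5])

flagged = np.flatnonzero((candidates < lo) | (candidates > hi))
print(f"95% predictive interval under the error model: ({lo:.2f}, {hi:.2f})")
print("candidates inconsistent with technical error:", flagged)
```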

 

Evaluating Joint Confidence Region of Hypervolume under ROC Manifold & Generalized Youden Index §
Jia Wang, Graduate Student, University at Buffalo

In biomarker evaluation and diagnostic studies, the hypervolume under the ROC manifold (HUMK) and the generalized Youden index (JK) are the most popular measures for assessing classification accuracy under multiple classes. While HUMK is frequently used to evaluate overall accuracy, JK provides a direct measure of accuracy at the optimal cut-points. Simultaneous evaluation of HUMK and JK provides a comprehensive picture of the classification accuracy of the biomarker or diagnostic test under consideration. This paper studies both parametric and non-parametric approaches for estimating the joint confidence region of HUMK and JK for a single biomarker. The performance of the proposed methods is investigated in an extensive simulation study, and the methods are applied to a real data set from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
Additional Authors: Jia Wang, Jingjing Yin, and Lili Tian, University at Buffalo
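For readers new to these measures, the empirical hypervolume under the ROC manifold for three ordered classes is simply the fraction of correctly ordered triples; the sketch below computes it on simulated marker values. The joint confidence region construction studied in the paper is not reproduced here.

```python
# Empirical HUM for three ordered classes = P(X1 < X2 < X3), estimated by counting triples.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 80)      # e.g., marker values in the healthiest class
x2 = rng.normal(1.0, 1.0, 60)      # intermediate class
x3 = rng.normal(2.0, 1.0, 50)      # most severe class

correct = (x1[:, None, None] < x2[None, :, None]) & (x2[None, :, None] < x3[None, None, :])
hum = correct.mean()
print(f"empirical HUM: {hum:.3f}  (chance level for three classes is 1/6)")
```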

 

How Can We Leverage Large Language Models to Engage Students in Software Quality Improvement?
Mohamed Wiem Mkaouer, PhD, Assistant Professor, Software Engineering, Rochester Institute of Technology

Static analysis tools are frequently used to scan the source code and detect deviations from the project coding guidelines. Yet, their adoption is challenged by their high false positive rate, which makes them not suitable for students and novice developers. However, Large Language Models (LLMs), such as ChatGPT, have gained widespread popularity and usage in various software engineering tasks, including testing, code review, and program comprehension. Such models represent an opportunity to reduce the ambiguity of static analysis tools and support their adoption. Yet, the effectiveness of using static analysis (i.e., PMD) to detect coding issues, and relying on LLMs (i.e., ChatGPT) to explain and recommend fixes, has not yet been explored. In this talk, we aim to shed light on our experience in teaching the use of ChatGPT to cultivate a bugfix culture and leverage LLMs to improve software quality in educational settings. We share our findings to support educators in teaching students better code review strategies, increase students’ awareness of LLMs, and promote software quality in education.

 

Forecasting Epidemiology: Time Series Modeling of the West Nile Virus §
Kiersten Winter, Undergraduate Student, Rochester Institute of Technology

Arthropod-borne viruses (arboviruses) are those that are transmitted to humans and/or other vertebrates via arthropod vectors, such as mosquitoes and ticks. Recent outbreaks of several epidemic arboviral diseases have posed serious global public health risks. The West Nile virus (WNV), in particular, is the most prevalent arbovirus in the United States. Consequently, forecasting this virus is a public health priority, but its occurrence has high spatiotemporal variation. It also has a complex ecology, and several environmental factors influence mosquito populations, making it difficult to forecast the long-term temporal trend of WNV. The purpose of this work is therefore to compare the performance of several time series models that include factors such as precipitation, temperature, and population. Further potential methods for improvement will also be discussed to better forecast West Nile virus incidence.
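One candidate model class for such a comparison is a seasonal ARIMA with exogenous climate covariates; the sketch below fits one with statsmodels on simulated monthly data (the covariate effects and counts are invented, not WNV surveillance data).

```python
# Seasonal ARIMA with exogenous covariates on simulated monthly data (illustration only).
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
months = pd.date_range("2010-01-01", periods=144, freq="MS")
temp = 15 + 10 * np.sin(2 * np.pi * (months.month - 6) / 12) + rng.normal(0, 1, 144)
precip = rng.gamma(2.0, 2.0, 144)
cases = rng.poisson(np.exp(0.05 * temp + 0.02 * precip))       # toy case counts

exog = pd.DataFrame({"temp": temp, "precip": precip}, index=months)
model = SARIMAX(pd.Series(cases, index=months), exog=exog,
                order=(1, 0, 1), seasonal_order=(1, 0, 0, 12))
fit = model.fit(disp=False)

# Forecast 12 months ahead, reusing last year's covariates as a simple placeholder
future_index = pd.date_range(months[-1] + pd.offsets.MonthBegin(), periods=12, freq="MS")
future_exog = pd.DataFrame(exog.iloc[-12:].to_numpy(), columns=exog.columns, index=future_index)
print(fit.get_forecast(steps=12, exog=future_exog).predicted_mean.round(1))
```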

 

Image Processing with Optimally Designed Parabolic Partial Differential Equation §
Qiuyi Wu, Graduate Student, University of Rochester

In image denoising tasks, brain imaging data like functional magnetic resonance imaging (fMRI) or positron emission tomography (PET) scans often contain noise and artifacts. Kernel smoothing techniques are essential for smoothing these images and play a pivotal role in brain imaging analysis. While kernel smoothing has been extensively studied in statistics, certain challenges remain, especially in the multi-dimensional landscape. Many existing methods lack adaptive smoothing capabilities and numerical flexibility in high-dimensional settings, hindering the achievement of optimal results. To address this, we present an efficient adaptive General Kernel Smoothing-Finite Element Method (GKS-FEM). This method exploits the equivalence between GKS and the general second-order parabolic partial differential equation (PDE) in high dimensions. Utilizing the Finite Element Method (FEM), we discretize the PDE, leading to efficient and robust numerical smoothing approaches. This study establishes a bridge between statistics and mathematics. It applies mathematical techniques to address statistical challenges, including using the finite element method to develop efficient and robust kernel smoothing techniques. Conversely, it employs statistical methods to tackle mathematical issues, such as optimizing the bias-variance trade-off and leveraging functional principal component analysis for accelerated design of partial differential equations.

 

Interval-specific censoring set adjusted Kaplan–Meier estimator
Yaoshi Wu, PhD, Statistics, Director, Biostatistics/Biometrics, Cytokinetics

The interval-specific censoring set adjusted Kaplan–Meier estimator (WKE) is a non-parametric approach to reduce the overestimation of the Kaplan–Meier estimator (KME) when the event and censoring times are independent. The article is published in the Journal of Applied Statistics1. We adjusted the KME based on a collection of intervals where censored data are observed between two adjacent event times. WKE is superior to KME as it substantially reduces the overestimation of the survival rate and median survival time in the presence of censored data. When there are no censored observations, or the sample size goes to infinity, WKE reduces to KME. We proved theoretically that WKE reduces overestimation compared to KME and provided a mathematical formula to estimate the variance of the proposed estimator based on Greenwood’s approach. We performed four simulation studies to compare WKE with KME when the failure rate is constant, decreasing, increasing, and based on the flexible hazard method. The bias reduction in median survival time and survival rate using WKE is considerably large, especially when the censoring rate is high. The standard deviations are comparable between the two estimators. We applied WKE and KME to nonalcoholic fatty liver disease patients from a well-conducted population study. The results based on the actual data also show that WKE substantially reduces the overestimation in the presence of a high observed censoring rate.

1 Yaoshi Wu & John Kolassa (25 Dec 2023): Interval-specific censoring set adjusted Kaplan–Meier estimator, Journal of Applied Statistics, DOI: 10.1080/02664763.2023.2298795
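For context, the snippet below computes only the standard Kaplan–Meier estimate that WKE adjusts, on simulated data with a constant hazard and heavy independent censoring (the interval-specific censoring-set adjustment itself is defined in the cited paper and is not implemented here).

```python
# Baseline Kaplan-Meier fit under heavy independent censoring (simulated data).
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
event_time = rng.exponential(scale=10.0, size=300)        # constant-hazard scenario
censor_time = rng.exponential(scale=6.0, size=300)        # heavy independent censoring
observed_time = np.minimum(event_time, censor_time)
event_observed = event_time <= censor_time
print(f"observed censoring rate: {1 - event_observed.mean():.2f}")

kmf = KaplanMeierFitter()
kmf.fit(observed_time, event_observed=event_observed)
print("KME median survival time:", round(kmf.median_survival_time_, 2),
      "(true median is", round(10.0 * np.log(2), 2), ")")
```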

 

Data integration and subsampling techniques in distribution estimation for event times with missing origins
Yi Xiong, Assistant Professor, Department of Biostatistics, SPHHP, University at Buffalo

Time-to-event data with missing origins often arise when the occurrence of the event is silent. For example, records of wildfires can only be collected after a fire has been reported, and thus the exact time when the fire starts is unknown. To tackle this issue, Xiong et al. (2021) proposed an approach that synthesizes auxiliary longitudinal measures to aid the inference on the unobserved time origins via the first-hitting-time model. In this work, we consider using alternative auxiliary data, collected prior to the occurrence of the event, to tackle the issue of missing time origins. Motivated by the example of estimating the distribution of wildfire duration, we propose to use the preceding records of lightning strikes to aid inference on the ignition time and start time of a fire. We first integrate the lightning strikes data with the fire data via kernel smoothing and then provide a distribution estimator for a fire’s ignition time. By viewing a fire’s start time as censored within the interval between the ignition time and the report time, we further adjust the Turnbull estimator for interval-censored data to estimate the distribution of the missing origin. Driven by the large volume of lightning strikes data, we also adapt the proposed estimation procedures to sub-samples of the lightning strikes data. The proposed approach potentially has many applications.

 

Current Trends in Federated Learning: Do they Align with Real-world Application Requirements? §
Raman Zatsarenko, Undergraduate Student, Rochester Institute of Technology

Federated learning (FL) is revolutionizing machine learning by training models on distributed devices and servers while keeping sensitive data local. This privacy-by-design approach makes it well suited to healthcare, finance, and telecommunications. Here, we explore real-world aggregation algorithms, their security implications, and broader FL security trends. We'll also discuss the practicality, benefits, and limitations of current methods. While challenges remain, FL's potential to safeguard privacy and leverage decentralized data makes it a fascinating area of research with significant real-world applications. Advancements in security and aggregation algorithms are crucial for its wider adoption and impact.