UpStat 2024 Poster Abstracts

Ordered by presenter last name; § indicates eligibility for student awards.

 

Brewing Insights: An Analysis of Coffee Consumption Behavior in America §
Ella Adams, Undergraduate Student, University of Rochester

Utilizing preference data collected through popular coffee YouTuber James Hoffman’s Great American Taste Test, we analyze his concluding claim that commercial businesses should not sell fermented coffee roasts on a large scale. Fermenting coffee beans prior to brewing creates a fruity flavor, which Hoffman claims “should not be a kind of default option” for businesses because preference for this flavor is highly polarized across gender and age. This claim raises two questions: who is the ideal consumer of fermented coffees, and is preference so strongly affected that we can confidently conclude businesses should steer away from the newest coffee trend? Through multiple regression and categorical data analysis, we show the impact of demographic data on coffee preference and consumption behaviors.
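
A minimal sketch of the kind of regression and categorical analysis described, in Python with statsmodels; the file and column names (fermented_rating, preferred_fermented, etc.) are hypothetical stand-ins for the survey fields:

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

# Hypothetical file/columns standing in for the taste-test survey fields.
df = pd.read_csv("taste_test_responses.csv")

# Multiple regression: rating of the fermented roast on demographics,
# with a gender-by-age interaction to capture the claimed polarization.
fit = smf.ols("fermented_rating ~ C(gender) * C(age_group)", data=df).fit()
print(fit.summary())

# Categorical analysis: is preferring the fermented roast independent of gender?
table = pd.crosstab(df["gender"], df["preferred_fermented"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```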

 

Methods to Quantify Uncertainties of Parameters in a Bayesian Inverse Problem in the Biomechanical Imaging Field §
Michael Bennett, Graduate Student, Rochester Institute of Technology

The elastic shear modulus is a parameter used to quantify tissue stiffness in the biomechanical imaging field. Tissue stiffness is an important indicator of pathologies in the soft tissue of the body. Obtaining the elastic shear modulus from ultrasound deformation data can be framed as an inverse problem, for which several solution methods exist. One such method is to formulate the problem using Bayes’ theorem, which allows us to sample from a posterior distribution using Markov Chain Monte Carlo (MCMC) methods and obtain a parameter value together with its uncertainty. Our research compares several MCMC methods to one another using diagnostic tests, and compares them to an efficient Laplace approximation of the posterior. We implement these methods on a simple quadratic model and a 1-D elasticity model governed by PDEs. In future research, we plan to extend the elasticity model to higher dimensions and apply it to image data.
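
A minimal random-walk Metropolis sketch on a toy quadratic model (my own toy setup, not the authors' elasticity code), illustrating how MCMC yields both a parameter estimate and its uncertainty:

```python
import numpy as np

# Toy quadratic model y = a*x^2 + noise; `a` is the unknown parameter.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y_obs = 2.0 * x**2 + rng.normal(0, 0.1, x.size)

def log_posterior(a, sigma=0.1):
    # Gaussian likelihood with a flat prior on `a`.
    return -0.5 * np.sum((y_obs - a * x**2) ** 2) / sigma**2

samples, a = [], 0.0
for _ in range(10_000):
    prop = a + rng.normal(0, 0.05)                     # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(prop) - log_posterior(a):
        a = prop                                       # accept
    samples.append(a)

post = np.array(samples[2_000:])                       # discard burn-in
print(post.mean(), post.std())                         # estimate and uncertainty
```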

 

Assessment of Different Imputation Methods on Optimal Predictors of Child Development §
Sreekar Challa, Undergraduate Student, University of Rochester

Missing data is a major issue when working with survey data. Unforeseen barriers to data collection can prevent all data from being recorded across all countries, necessitating ways of dealing with the resulting gaps. Imputation addresses this issue by estimating values for countries in the dataset of interest for which data do not exist. In this review, I examine how different imputation methods perform in the specific case of cross-country survey nonresponse. I perform analysis and imputation on a dataset of predictors for the ECDI (Early Childhood Development Index), as defined by UNICEF, which contains variables with varying amounts of missingness. In particular, I choose two established imputation methods, hot-deck imputation and multiple imputation by chained equations (MICE) with the Heckman selection model, to fill in the missing predictor values. I then run a backward model selection procedure using the Akaike Information Criterion (AIC) on the datasets generated by each method to determine whether the choice of imputation method affects the resulting model and its fit.
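
A minimal sketch of the MICE-style step using scikit-learn's IterativeImputer; the filename is hypothetical, and the Heckman selection extension named in the abstract requires specialized software and is omitted here:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("ecdi_predictors.csv")  # hypothetical predictor table

# Generate m = 5 completed datasets, MICE-style: chained-equation
# imputation with posterior sampling, one dataset per random seed.
completed = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=m).fit_transform(df),
        columns=df.columns,
    )
    for m in range(5)
]
```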

 

Locally Adaptive Random Walk Stochastic Volatility §
Jason B. Cho, Graduate Student, Cornell University

This paper introduces a novel Bayesian framework for estimating time-varying volatility by extending Random Walk Stochastic Volatility (RWSV) with shrinkage priors. Unlike traditional models such as Stochastic Volatility or GARCH-type models, which often rely on restrictive assumptions, our framework offers smooth yet dynamically adaptive estimates of evolving volatility and its uncertainty. Notably, we demonstrate that our method, Adaptive Stochastic Volatility with Dynamic Shrinkage Processes (ASV-DSP), exhibits remarkable resilience, with low prediction error across various data-generating scenarios. Furthermore, our method’s capacity to yield smooth and interpretable estimates facilitates a clearer comprehension of underlying patterns and trends. This attribute makes our method a robust tool applicable across a wide range of disciplines, including finance, environmental science, and epidemiology, among others.
Authors: David S. Matteson, Department of Statistics and Data Science, Cornell University
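
For reference, the baseline RWSV model being extended can be written schematically (assuming the conventional formulation, which the abstract does not spell out):

```latex
% Baseline RWSV: observations y_t with log-volatility h_t following a
% random walk; the paper's contribution is the dynamic shrinkage prior
% placed on the log-volatility innovations eta_t.
\begin{align}
y_t &= e^{h_t/2}\,\varepsilon_t, & \varepsilon_t &\sim N(0, 1),\\
h_{t+1} &= h_t + \eta_t, & \eta_t &\sim N(0, \sigma_\eta^2).
\end{align}
```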

 

Redistricting New York State with Information Maximization
Miguel Dominguez, Machine Intelligence Engineer, VisualDx

Redistricting is an important but politically contentious task in a modern democracy: as the population of a state changes and moves, the state must occasionally be "redistricted" so that every elected official represents as close to an equal number of people as possible. The task is fraught with potential for corruption, as political operatives may draw district boundaries to their own advantage. While redistricting algorithms exist in the literature, making use of methods such as genetic algorithms and graph cuts, we argue that the task can be solved fairly and elegantly with a neural network optimized with an Information Maximization criterion. For each member of the population x, we predict a probability distribution y over D districts. We want to minimize the conditional entropy H(y|x) for each member while maximizing the overall entropy H(y). We demonstrate that, in the case of New York State, we can draw a more equitable districting than currently exists, without any supervision, purely by optimizing an InfoMax objective. We speculate that other discrete optimization tasks could be accomplished with carefully defined neural network loss functions whose local minima necessarily produce the desired results.
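
A minimal PyTorch sketch of the InfoMax objective as described above; the network producing `logits` is assumed and not shown:

```python
import torch

def infomax_loss(logits):
    """InfoMax objective: minimize H(y|x) while maximizing H(y).

    logits: (N, D) network outputs, one row per person, D districts.
    """
    p = torch.softmax(logits, dim=1)
    log_p = torch.log_softmax(logits, dim=1)
    cond_entropy = -(p * log_p).sum(dim=1).mean()                    # H(y|x)
    marginal = p.mean(dim=0)                                         # aggregate assignment
    marg_entropy = -(marginal * torch.log(marginal + 1e-12)).sum()   # H(y)
    # Low H(y|x): each person is confidently assigned to one district.
    # High H(y): districts receive (near-)equal shares of the population.
    return cond_entropy - marg_entropy
```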

 

Clustering and Isolation Forests for Block Copolymer Phase Discovery §
Michael J. Grant, Graduate Student, Microsystems Engineering, Rochester Institute of Technology 

Block copolymers (BCPs) are a class of soft materials leveraged heavily across a diverse range of scientific and engineering disciplines. This ubiquity stems from the covalent linkage between two dissimilar polymer chains, which hinders macrophase separation. Instead, the polymer blocks microphase separate, resulting in a rich phase diagram of varying morphologies. Interestingly, when one block is made completely chiral, a helical shape emerges that perturbs the thermodynamics, allowing access to novel phases such as the helical and single gyroid mesophases. Yet there has been minimal exploration of the chiral BCP phase space since. One hindrance to this endeavor is the vast chemical space one can explore and the risk that a given starting point would yield only already established phases; such a “guess and check” approach would be highly inefficient. To address this, we developed an algorithm that leverages clustering with isolation forests on the radial distribution functions generated from Dissipative Particle Dynamics (DPD) simulations. We demonstrated this novel approach on a traditional BCP system and found that it recovers the already established phase diagram while also elucidating a possible regime for the onset of diamond network phases. In the future, this process will be applied to helical block copolymers in hopes of unraveling undiscovered chiral phases.
Additional Authors: Poornima Padmanabhan, Rochester Institute of Technology, Microsystems Engineering
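
A minimal scikit-learn sketch of the clustering-plus-isolation-forest idea; the RDF array and file are hypothetical, and the cluster count is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

# rdfs: (n_simulations, n_bins) radial distribution functions, one row
# per DPD run (hypothetical array standing in for the simulation output).
rdfs = np.load("dpd_rdfs.npy")

# Cluster RDFs into candidate phases...
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(rdfs)

# ...and flag structurally unusual RDFs that may signal an unseen phase.
scores = IsolationForest(random_state=0).fit(rdfs).decision_function(rdfs)
candidates = np.argsort(scores)[:10]   # most anomalous simulations
print(labels[candidates], candidates)
```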

 

Zero Inflated Spatial Precipitation Data §
Daniel Illera, Undergraduate Student, University of Rochester

For this poster, we consider a Bayesian hierarchical spatial model with three components: the mean process, which determines the functional form and the variables used to describe the response; the spatial process, which describes how the data vary over a spatial surface; and a random noise term, which captures small amounts of variation at each point.
Thanks to the flexibility of this model, we explore different ways to model precipitation across the continental US on a single day. Naturally, these data are zero-inflated because there are many locations where no rain was observed. We examine log-normal and Poisson spatial models in which the zeroes are dropped from the data. Based on the posterior predictive distributions, the estimates from the log-normal model were less variable and therefore appeared better than those from the Poisson model. We also considered zero-inflated Poisson and negative binomial models to see whether the extra computation time was worthwhile.
Additional Authors: Joseph Ciminelli, Associate Professor, University of Rochester
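
A minimal PyMC sketch of the log-normal variant under simulated stand-in data (site coordinates and positive rainfall with zeroes already dropped), with the three model components marked in comments; the priors and covariance choice are illustrative assumptions:

```python
import numpy as np
import pymc as pm

# Stand-in data: (n, 2) site locations and (n,) positive rainfall amounts.
coords = np.random.default_rng(0).uniform(0, 10, (50, 2))
y = np.exp(np.random.default_rng(1).normal(0, 1, 50))

with pm.Model() as lognormal_spatial:
    beta0 = pm.Normal("beta0", 0, 10)            # mean process (intercept)
    ls = pm.HalfNormal("ls", 2.0)                # spatial range
    eta = pm.HalfNormal("eta", 1.0)              # spatial scale
    gp = pm.gp.Latent(cov_func=eta**2 * pm.gp.cov.ExpQuad(2, ls=ls))
    w = gp.prior("w", X=coords)                  # spatial process
    sigma = pm.HalfNormal("sigma", 1.0)          # random noise term
    pm.Normal("log_y", mu=beta0 + w, sigma=sigma, observed=np.log(y))
    idata = pm.sample(500, tune=500)
```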

 

An Analysis of the Total Number of Goals Scored Using Poisson Stepwise Regression §
Kirstyn Loftus, Undergraduate Student, Rochester Institute of Technology

In today’s world of sports, teams are constantly analyzing data and using statistical techniques to try to improve. The sport of soccer is no different, but there is admittedly less public research on it than on sports such as baseball and football. I therefore explore one of the statistics most influential to a soccer team’s success, the total number of goals it scores, and ask whether Poisson stepwise regression can be used to predict the total number of goals scored by the final whistle.
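
A minimal sketch of backward AIC selection for a Poisson GLM in statsmodels; the match-level file and predictor names are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("matches.csv")  # hypothetical match-level data
predictors = ["shots", "shots_on_target", "possession", "corners"]

def fit(cols):
    return smf.glm(f"goals ~ {' + '.join(cols)}", data=df,
                   family=sm.families.Poisson()).fit()

# Backward stepwise: drop a predictor whenever removal improves AIC.
model, improved = fit(predictors), True
while improved and len(predictors) > 1:
    improved = False
    for p in list(predictors):
        trial_cols = [q for q in predictors if q != p]
        trial = fit(trial_cols)
        if trial.aic < model.aic:
            model, predictors, improved = trial, trial_cols, True
            break

print(model.summary())
```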

 

Epileptic Network Identification: Insights from Dynamic Mode Decomposition of sEEG Data
Alejandro Nieto Ramos, Post-doctoral Researcher, Cleveland Clinic Foundation

Neuronal dynamical behavior is characterized by high-dimensional data that are difficult to quantify. In particular, investigating the spatiotemporal dynamics governing transitory states is crucial in the study of epilepsy. Stereoelectroencephalography (sEEG) is a technique used in patients with refractory epilepsy to determine the spatiotemporal organization of seizures within the brain and identify regions for possible resection. We employ an unsupervised data-driven algorithm, Dynamic Mode Decomposition (DMD), to find a linear approximation between consecutive voltage snapshots that describes the nonlinear dynamics of the signals found in sEEG recordings. Each mode obtained with DMD permits extraction of a temporal frequency, a growth rate, and a spatial structure. We used a subdivision of five frequency sub-bands (delta, theta, alpha, beta, and gamma) and created plots of the two highest frequency sub-bands along with their associated modes. These plots help to visualize and identify the onset and early propagation of epileptiform network activity, enhancing sEEG analysis. We also created a higher-frequency mode-related norm index, consisting of a preictal-to-ictal ratio of values from the plots, to summarize the contacts involved in the seizure. We applied the technique to three patients and found that higher frequencies are more informative and possibly representative of mode-specific network changes underlying epileptiform activity. The technique identified the epileptogenic zone, and our results coincided with clinical findings. The technique was developed as a proof of concept and is ready to be applied to a larger cohort of patients.
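
A minimal NumPy sketch of exact DMD on a channels-by-time snapshot matrix, assuming uniformly sampled data and a rank-r truncation:

```python
import numpy as np

def dmd(X, r, dt):
    """Exact DMD: fit a rank-r linear map between consecutive snapshots.

    X: (channels, time) voltage snapshots; dt: sampling interval.
    """
    X1, X2 = X[:, :-1], X[:, 1:]                   # consecutive snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]
    A_tilde = U.conj().T @ X2 @ Vh.conj().T / s    # reduced linear operator
    eigvals, W = np.linalg.eig(A_tilde)
    modes = X2 @ Vh.conj().T / s @ W               # spatial structures
    omega = np.log(eigvals) / dt                   # continuous-time exponents
    freqs = omega.imag / (2 * np.pi)               # temporal frequencies (Hz)
    growth = omega.real                            # growth rates
    return modes, freqs, growth
```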

 

Visualizing and modeling a music scene with graphs §
Cas Savage, Undergraduate Student, Rochester Institute of Technology

Social subcultures or “scenes” have increasingly become part of our lives and identities as social media and the internet as a whole make it easier to connect with people who share our interests. The live music scene in particular offers both in-person entertainment and a stage for self-expression. Scenes can be more tight-knit or dispersed, centralized or decentralized, and graph theory offers tools to understand these connections. This poster gathers data on band membership and the associations between bands via shared musicians, primarily at the Rochester Institute of Technology but also in the wider Rochester music scene. The data gathered include 68 bands, 196 musicians, and 319 links. From these data, conclusions can be drawn about connectivity inside a music scene and how it relates to other metrics, such as the number of shows a band plays or its social media following. I also explore methods of visualizing these data using principles of physics and computer simulation, which can display large and highly interconnected graphs, making it easier to draw conclusions about complex networks such as other social scenes. Numerically, the average number of members per band is µ = 4.651 with σ = 1.950, and the number of bands per musician has µ = 1.340 with σ = 0.651.
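
A minimal networkx sketch of the physics-based (force-directed) layout idea, with a toy edge list standing in for the musician-band data:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Toy (musician, band) links in the shape of the gathered data.
edges = [("Ana", "Band A"), ("Ana", "Band B"), ("Ben", "Band B"),
         ("Cal", "Band B"), ("Cal", "Band C")]
G = nx.Graph(edges)

# spring_layout is a force-directed embedding: edges act as springs,
# nodes repel one another, and the simulation settles into a layout.
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_size=500)
plt.show()

print(nx.degree_centrality(G))  # per-node connectivity metric
```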

 

Graphical Analysis of Proteomic Data from the Archaeal Proteome Project §
Josiah Shaffer, Undergraduate Student, Rochester Institute of Technology

Proteomics involves the identification, post-translational modification, localization, and quantification of proteins, the end products of the central dogma of molecular biology. In this work, we performed graphical and statistical analysis on a large dataset of mass spectrometer readings, which help to identify and quantify different proteins. We were given peptide quantities, statistical significance values, and fold changes with which to perform our analysis. We formed the following research questions:

  • Which peptides have the highest average intensities? There are over 2000 different peptides that need to be organized.
  • Which protein groups are the least stable and most affected by condition changes? Stability refers to the difference within fold changes; a large difference indicates a less stable protein group.
  • Do certain condition comparisons result in more negative or positive values? Negative and positive fold change values represent the initial or resulting phase, respectively.
  • How is each condition comparison distributed (PEP)? The PEP values indicate the significance of a protein identification in the mass spectrometer readings.
  • How is each condition comparison distributed (log2 fold change)? Specific peptides’ quantities can be controlled under certain fold change conditions.

These visualizations can offer conclusions that complement results discovered in laboratory testing. We can identify proteins that survive better under particular fold change conditions and adjust production of significant proteins of interest. Such proteins can benefit the healthcare industry, supporting vaccine development and other molecular-level disease prevention.
Authors: Dr. Nonhle Channon Mdziniso, Dr. Stefan Schulze, Faith Wong, Andy Kang; RIT
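
One way to view the last two research questions together is a volcano-style plot of log2 fold change against -log10(PEP) per condition comparison; a minimal matplotlib sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical columns mirroring the quantities described in the abstract.
df = pd.read_csv("peptide_results.csv")  # log2_fold_change, pep, comparison

for name, grp in df.groupby("comparison"):
    plt.scatter(grp["log2_fold_change"], -np.log10(grp["pep"]),
                s=8, alpha=0.5, label=name)

plt.xlabel("log2 fold change")   # sign: initial vs. resulting phase
plt.ylabel("-log10(PEP)")        # higher values = more significant
plt.legend()
plt.show()
```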

 

Improving Exploration-Exploitation Trade-offs in Bayesian Optimization Using Conformal Prediction §
Marzieh Amiri Shahbazi, Graduate Student, Rochester Institute of Technology

Abstract coming...

 

Analyzing students’ perceptions of racial equity in college using sentiment analysis techniques §
Hannah Sheets, Graduate Student, Rochester Institute of Technology

Sentiment analysis is one of the most commonly used methods for analyzing responses to unstructured survey questions. Our survey focused on racial equity in the mathematical sciences at RIT and contained three open-ended questions for students to answer. Using sentiment analysis, we gain a deeper understanding of students’ feelings and emotions about how they are treated throughout their early mathematics classes. We were specifically focused on gaining insight into any noticeable differences between AALANA (African American, Latinx American, and Native American) and non-AALANA students. Our first question concerned limitations on access to resources and opportunities within their classes. Many emotions were very similar between the two groups; however, non-AALANA students had higher counts of positive emotions such as trust and joy. In the second question, which asked about students’ perceptions of whether their professors believed in them, AALANA students had higher counts of joy and trust, and markedly higher counts of anticipation. Our final unstructured question regarded unique elements of their math classes that helped lead to their success at RIT. These results were very similar to those of the second question, with AALANA students showing higher levels of trust and joy. To extend this analysis, we are now adding elements of grounded theory to our work and comparing the themes that arise between the two groups. Thus far, the two groups share similar themes within their responses, indicating there may be a low level of racial inequity in RIT math classes.
Additional Authors: Dr. Nonhle Channon and Dr. Teresa Gibson
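
A minimal sketch of lexicon-based emotion counting in the style of the NRC word-emotion lexicon; the toy lexicon and example responses below are invented for illustration, and the real analysis would use a full lexicon:

```python
from collections import Counter

# Toy emotion lexicon (a few illustrative entries, not the real NRC lexicon).
LEXICON = {
    "supportive": ["trust"], "happy": ["joy"], "excited": ["anticipation", "joy"],
    "unfair": ["anger", "sadness"], "worried": ["fear", "anticipation"],
}

def emotion_counts(responses):
    """Tally emotion labels for every lexicon word found in the responses."""
    counts = Counter()
    for text in responses:
        for word in text.lower().split():
            counts.update(LEXICON.get(word.strip(".,!?"), []))
    return counts

aalana = ["My professor was supportive and I felt happy in class."]
non_aalana = ["I was worried at first but excited by the end."]
print(emotion_counts(aalana), emotion_counts(non_aalana))
```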

 

A Review & Simulation Study on Comparison of Modern Bayesian Mixture Model Clustering Methods §
Shuliang Yu, Graduate Student, State University of New York at Buffalo

Model-based clustering is widely used in various domains, including genomic data analysis and topic modelling, for example in subtyping patients with similar genomic profiles. Traditional finite mixture models necessitate a pre-specified number of clusters, posing challenges when the cluster count is unknown or grows with data size. The sparse finite mixture introduced by Malsiner Walli et al. and the Dirichlet process introduced by Thomas S. Ferguson enable mixture models in which the number of clusters is obtained a posteriori from the data. We present an extensive review and comparison of the Bayesian estimation of all three model classes, with the random hyperpriors for the weight distribution estimated using slice sampling. Through simulations on univariate and multivariate Gaussian mixture data, we show that the choice of hyperpriors largely determines the effectiveness of clustering for both infinite mixture model classes.
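
As an accessible illustration of the Dirichlet process idea, a minimal scikit-learn sketch on simulated Gaussian mixture data (using variational inference rather than the slice-sampling MCMC studied in the review):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Simulated 3-component bivariate Gaussian mixture data.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(m, 0.5, (200, 2)) for m in (-3, 0, 3)])

dpm = BayesianGaussianMixture(
    n_components=10,                       # truncation level, not cluster count
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,        # concentration hyperparameter
    random_state=0,
).fit(X)

# The effective number of clusters emerges a posteriori: most of the
# 10 components receive negligible weight.
print(np.round(dpm.weights_, 2))
```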