Friday, April 21st
3:00-4:00pm Registration [Carlson, outside of Room 1125]
3:30-4:00 Welcome! [Carlson 1125]
4:00-5:00 Tutorials
T1: Basic Intro to R and Python Libraries for Data Science [Gleason 2149]
Gregory Babbitt, Associate Professor, Rochester Institute of Technology
gabsbi@rit.edu
We will spend the hour introducing common tasks and basic syntax differences in both languages. We will review some important packages/libraries/modules in both languages used for data management, statistics, machine learning, and data visualization. We will then have a live coding demonstration with example code available from my RIT course website. Please have R/RStudio, Python, and a decent multi-language code editor such as VS Code or Komodo installed before you arrive. It would also be helpful to install the following packages/libraries/modules. R: ggplot2, class, MASS, e1071, neuralnet, caret, kernlab, ada, randomForest. Python: pandas, scikit-learn, matplotlib, plotnine, seaborn, numpy, scipy.
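A minimal Python sketch of the kind of workflow covered (illustrative only, not part of the tutorial materials; the dataset and model are stand-ins):

    # Load a dataset, fit a classifier, and evaluate it with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    iris = load_iris(as_frame=True)          # data as a pandas DataFrame
    X, y = iris.data, iris.target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))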
T2: Introduction to Natural Language Processing [Gleason 2159]
Le Nguyen, Graduate Student, Rochester Institute of Technology
ln8378@g.rit.edu
In this tutorial, we will cover the fundamentals of Natural Language Processing. We will start with processing natural language. Then we will use NLP techniques to analyze our language data to gain key insights. Finally we will use models to derive usable information from processed language.
Topics covered: natural language cleaning, stemming, and tokenization; language analysis techniques, including term frequency, embeddings, and part-of-speech tagging; and using language models for sentiment analysis, predictive text, and language classification.
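A minimal Python sketch of the first of these steps (illustrative only; the tutorial does not prescribe a specific library, and NLTK is used here as an assumption):

    # Clean, tokenize, stem, and count term frequencies for a short text.
    import re
    from collections import Counter
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize   # needs nltk.download("punkt") once

    text = "Natural Language Processing turns raw text into usable data."
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())   # keep letters and spaces only

    stemmer = PorterStemmer()
    tokens = [stemmer.stem(tok) for tok in word_tokenize(cleaned)]

    print(Counter(tokens))   # term frequencies of the stemmed tokens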
5:00-6:00 Tutorials
T3: Community Detection in Complex Networks [Gleason 2149]
Nishant Malik, Assistant Professor, Rochester Institute of Technology
nxmsma@rit.edu
Many natural and social systems organize as networks; a few well-known examples include the animal brain, electrical power grids, the internet, online social networks such as Facebook and Twitter, the relationships between genes and diseases, collaboration and citation among scientists, trade among countries, and interactions between financial markets. Most of these networks have nontrivial structural properties, hence the name complex networks. Mathematical analysis of complex networks has led to many successes, such as improving our understanding of how the human brain works and developing novel intervention and vaccination strategies to stop the spread of diseases.
Numerous biological, social, and technological networks have modular structures: they consist of modules of nodes, called communities, within which connectivity is dense. During this tutorial we will learn various algorithms for detecting community structure in networks. We will use Python's NetworkX package and apply these algorithms to several real-world data sets.
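A minimal NetworkX sketch of the kind of analysis covered (illustrative only; the tutorial's data sets and algorithms may differ):

    # Detect communities in a benchmark graph by greedy modularity maximization.
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.karate_club_graph()               # classic small social network
    communities = greedy_modularity_communities(G)
    for i, community in enumerate(communities):
        print(f"community {i}: {sorted(community)}")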
T4: Exploration of Deep Learning [Gleason 2159]
Ernest Fokoué, Professor, Rochester Institute of Technology
epfeqa@rit.edu
This tutorial is intended to be a walk in the park through which I plan to share my initiation into prompt engineering around various themes. In particular, I will share a wide variety of my sessions with ChatGPT, including advanced statistical theory, computer programming, statistical programming, poetry, basic clerical letter writing, religion, fitness, comedy, tragedy, songwriting, planning my soccer practice sessions, and basic proofs in mathematics.
Note on ChatGPT: If you haven't yet done so, you are strongly encouraged to create your own OpenAI account so that you can interact directly within ChatGPT during the tutorial.
6:30-8:00 Panel Discussion [Carlson 1125]
Light hors d'oeuvres served.
Saturday, April 22nd
8:00-8:30 Breakfast/registration [Carlson]
8:30-9:00 Welcome [Carlson 1125]
9:00-9:55 Parallel sessions
Session 1A: Clinical Trials and the Use of Statistics [Gleason 1139]
Session 1B: Statistics in Climate Science 1 [Gleason 1149]
10:00-10:55 Parallel sessions
Session 2A: Statistics in Climate Science 2 [Gleason 1139]
Session 2B: Statistics in Undergraduate Education [Gleason 1149]
11:00-11:55 Parallel sessions
Session 3A: Large Language Models [Gleason 1139]
Session 3B: Statistics in Space [Gleason 1149]
12:00-2:00 Lunch with Poster Session [Carlson 1125]
See abstracts on page XX
2:30-3:25 Parallel sessions
Session 4A: Intelligence: From humans to animals to concepts to silicon chips [Gleason 1139]
Session 4B: Solving Practical Problems [Gleason 1149]
3:30-4:00 Data Competition Presentations [Carlson 1125]
Three 10-minute presentations from the top 3 papers!
4:00-4:30 Break & Student Awards Judging
4:30-5:00 Awards & Closing Remarks [Carlson 1125]
UP-STAT 2023
DETAILED PARALLEL TALK PROGRAM FOR SATURDAY, APRIL 22
SESSION 1A Gleason 1139
Clinical Trials and the Use of Statistics
Session Chair:
9:00-9:15 A Pattern Discovery Algorithm for Pharmacovigilance Signal Detection §
Anran Liu, Graduate Student, University at Buffalo
9:20-9:35 Sensitivity Analysis for Constructing Optimal Treatment Regimes in the Presence of Non-compliance and Two Active Treatment Options §
Cuong Pham, Graduate Student, University of Rochester Medical Center
9:40-9:55 Decreased Respiratory-Related Absenteeism Among Pre-School Students After Installation of Upper-Room Germicidal Ultraviolet Light: Analysis of Newly Discovered Historical Data
Christopher Ryan, Associate Professor,
SUNY Upstate Medical University, Broome County Health Department
SESSION 1B Gleason 1149
Statistics in Climate Science 1
Session Chair:
9:00-9:15 A Conditional Approach for Joint Estimation of Wind Speed and Direction §
Qiuyi Wu, Graduate Student, University of Rochester Medical Center
9:20-9:35 Quantifying the Nexus of Climate, Economy and Health: A State-of-the-Art Time Series Approach §
Kameron Kinast, Graduate Student, Rochester Institute of Technology
SESSION 2A Gleason 1139
Statistics in Climate Science 2
Session Chair: Dr. Tony E. Wong, Professor, Rochester Institute of Technology
10:00-10:15 Impacts of Warming Thresholds on Uncertainty in Future Coastal Adaptation Costs §
Selorm Dake, Undergraduate Student, Rochester Institute of Technology
10:20-10:35 Assessing Sensitivity of Coastal Adaptation Costs to Sea Level Rise Across Different Future Scenarios with Random Forests §
Prasanna Ponfilio Rodrigues, Graduate Student, Rochester Institute of Technology
10:40-10:55 Evaluating the Impacts of Structural Uncertainty in Sea Level Rise Models on Coastal Adaptation §
Kelly Feke, Undergraduate Student, Rochester Institute of Technology
SESSION 2B Gleason 1149
Statistics in Undergraduate Education
Session Chair: Dr. Susan E. Mason, Niagara University
10:00-10:15 Statistically Motivated
Elizabeth Reid, Assistant Professor of Mathematics, Marist College
10:20-10:35 Statistical Literacy
Susan E. Mason, Professor of Psychology, Niagara University
10:40-10:55 The Statistics Behind Casinos and Risks to the United States and Canada §
Joseph Martino, Undergraduate Student, Niagara University
SESSION 3A Gleason 1139
Statistical Modeling of Language
Session Chair:
11:00-11:15 Large Language Models Applied to the Identification of Social Determinants of Health §
Raktim Mukhopadhyay, Graduate Student, University at Buffalo
11:20-11:35 Graph-based Approach to Studying the Spread of Radical Online Sentiment §
Le Nguyen, Graduate Student, Rochester Institute of Technology
SESSION 3B Gleason 1149
Statistics in Space
Session Chair:
11:00-11:15 Poisson Kernel-based Tests for Uniformity on the d-dimensional Torus and Sphere
Giovanni Saraceno, Professional, University at Buffalo
11:20-11:35 Spectral Classification Discrepancy in Young Stars §
Alex Jermyn, Undergraduate Student, Rochester Institute of Technology
SESSION 4A Gleason 1139
Intelligence: From humans to animals to concepts to silicon chips
Session Chair: Dr. Ernest Fokoué, Rochester Institute of Technology
2:30-2:45 On the Emerging Platonic View of Statistical Learning Theory
Ernest Fokoué, Professor, Rochester Institute of Technology
2:50-3:05 The Evolution of Animal (+ Human) Intelligence
Gregory Babbitt, Associate Professor, Rochester Institute of Technology
3:10-3:25 On the Ubiquity of the Bayesian Paradigm in Statistical Machine Learning and Data Science
Ernest Fokoué, Professor, Rochester Institute of Technology
SESSION 4B Gleason 1149
Solving Practical Problems
Session Chair:
2:30-2:45 Data-Driven Optimization of Austin Police Staffing §
Adam Giammarese, Graduate Student, Rochester Institute of Technology
2:50-3:05 Identifying Opportunities for University COVID-19 Model Improvements Using Bayesian Model Calibration §
Meghan Rowan Childs, Graduate Student, Rochester Institute of Technology
3:10-3:20 New York Has An Energy Problem
Alex Elchev, Undergraduate Student, University of Rochester
§ Indicates presentations that are eligible for the student presentation awards.
Attendees may supply their evaluation here:
(All attendees? Select attendees? QR code?)
PARALLEL TALK ABSTRACTS
Ordered by speaker last name
The Evolution of Animal (+ Human) Intelligence
Gregory Babbitt, Associate Professor, Rochester Institute of Technology
The recent rapid development of machine intelligence opens currently unaddressed problems regarding inherent risk and moral ethics. It might be helpful to look towards biology, not only for inspiration regarding algorithm development, but also for lessons regarding how and why intelligence evolves in the first place. As a wildlife ecologist and computational biologist, I will review the evolutionary history of the independent rise of intelligence in very different animal groups including cephalopods, spiders, social insects, reptiles, birds, and a variety of mammals (including cetaceans, primates and humans). All of these independent evolutionary events share a common
Impacts of Warming Thresholds on Uncertainty in Future Coastal Adaptation Costs
Selorm Dake, Undergraduate Student, Rochester Institute of Technology
Greenhouse gas emissions cause a rise in temperature, which increases sea level around the world. Since sea level rise (SLR) is linked directly to changes in temperature, we can assess the effect of temperature change on it by setting thresholds. Within our given data set we could see sea level rise accrue from year to year. We set thresholds by choosing a year and an amount of temperature change by which to sort the data. By analyzing the components of the SLR data for scenarios that stay below a specific threshold, we can estimate the costs of coastal damages associated with those levels of change. When we set the threshold to zero, we are able to understand the amount of sea level rise we are already committed to, as well as the damages we will incur regardless of the steps we take to mitigate sea level rise. As we extrapolate further into the future, the amount of uncertainty in the change of temperature increases, directly influencing the adaptation costs of areas affected by sea level rise. Here we show that as the threshold for temperature change increases, the range of possible adaptation costs increases. We used an ensemble of temperature change simulations as our data set for the project. We split our simulations into two subsets: those that stayed below 2 °C and those that stayed below 3 °C at the year 2100. We used the corresponding sea level data from these subsets to assess the costs of damages for New York City. Our results corroborate the idea that the more we fail to limit warming, the larger the marginal increases in damages and coastal adaptation costs. This research shows the true costs of failing to adhere to our target temperature warming limits.
Authors: Selorm Dake, Tony Wong and Kelly Feke
New York Has An Energy Problem
Alex Elchev, Undergraduate Student, University of Rochester
The New York state electricity generation and distribution grid functions as a complex on-demand delivery supply chain. Consumer electricity purchases totaled $22.8 billion for 141.4 million megawatt hours (MWh) of electricity in 2021. This unit refers to the amount of electricity consumed or demanded over t=1 hour. Reducing this year-long demand to hourly units gives an hourly demand value that must be met with sufficient electricity generation capacity.
In the same year, New York state electricity generation had a total nameplate capacity of 376.3 million MWh, meaning the system functioned with a capacity factor of 33.2%. This unitless ratio indicates how much potential production was lost at generation sites due to sub-optimal operation and downtime. Hourly capacity carries expected costs in the form of fuel, facility upkeep, and recurring expenses, varying based on the facility’s age and method of generation.
Increasing the network's cumulative capacity factor is a direct approach to increasing the reliability and sustainability of utility-scale electricity generation in a given state or region. Nuclear power has an absolute advantage over both fossil-fueled generators and renewable generation methods in terms of capacity factor, emissions, and ability to meet consumer demands.
Increasing New York's nuclear generation capacity by 19,300 MWh of hourly capacity by 2050 will: i) replace all existing fossil-fueled generators; ii) meet all consumer demands; iii) reduce utility-scale gaseous emissions almost entirely; and iv) optimize the state's electricity grid in accordance with the principles of supply chain management.
Evaluating the Impacts of Structural Uncertainty in Sea Level Rise Models on Coastal Adaptation
Kelly Feke, Undergraduate Student, Rochester Institute of Technology
Sea level rise (SLR) is a major consequence of climate change, posing significant threats to coastal communities and infrastructure worldwide. The area of impacted communities expands as the local sea level baseline increases. Adaptation actions must be taken to protect these communities from SLR. The costs of these actions depend on global and local mean sea level rise (GMSL and LMSL) calculated from sea-level models; however, the variety of models and their associated structural uncertainties yield different sea level rise predictions. This influences the actions that should be taken to avoid the negative impacts of local sea level rise. Although the uncertainties of SLR models have been studied, the variability of adaptation costs has not received the same attention, yet it is these cost predictions that dictate the actions taken to protect coastal areas. Here we characterize structural uncertainties within different SLR models and their impacts on future SLR predictions and predicted coastal damages. Using the integrated modeling framework Mimi and the MimiCIAM coastal impacts model, we compute the distributions of adaptation costs using 20 different models from the Coupled Model Intercomparison Project (CMIP6). These distributions characterize the uncertainties in future adaptation costs stemming from model structural uncertainty. Uncertainties can lead to poor estimates of risk and costs, which drives the need to quantify the uncertainties associated with SLR models and evaluate their impacts on adaptation decisions.
Authors: Kelly Feke, Tony Wong (RIT)
On the Ubiquity of the Bayesian Paradigm in Statistical Machine Learning and Data Science
Ernest Fokoue, Professor, Rochester Institute of Technology
This talk explores the myriad ways in which the Bayesian paradigm permeates the entire landscape of statistical machine learning and data science. Despite some of the major challenges underlying its practical use, the Bayesian paradigm has proven to be ubiquitous, appearing directly or indirectly in virtually every aspect of statistical machine learning, data science, and artificial intelligence. This presentation highlights some of the emerging ways in which the Bayesian paradigm is playing an impactful role in the data science revolution.
On the Emerging Platonic View of Statistical Learning Theory
Ernest Fokoue, Professor, Rochester Institute of Technology
Learning Using Statistical Invariants (LUSI) is a relatively recent incarnation in the world of statistical learning theory paradigms. In their effort to propose what they hope to be a complete statistical theory of learning, Vapnik and Izmailov (2019) develop the LUSI framework, partly using their earlier tool known as the V-matrix but crucially drawing heavily on Plato's philosophical teachings on ideas and things (forms) to extend classical statistical learning theory from its purely empirical nature (sometimes seen as brute-force learning) to a learning theory based on predicates that minimize the true error. This talk will review the merits and promises of LUSI and explore the ways in which Plato's philosophical teachings have the potential to help usher in a new era in statistical learning theory.
Data-Driven Optimization of Austin Police Staffing
Adam Giammarese, Graduate Student, Rochester Institute of Technology
Spectral Classification Discrepancy in Young Stars
Alex Jermyn, Undergraduate Student, Rochester Institute of Technology
Proper spectral classification allows for the determination of many stellar properties, including age and mass. Many young stars are located in regions of dense interstellar dust that necessitate observations in the near-infrared (NIR). This introduces an issue: several systems have shown a different spectral class in NIR than in optical observations. I am looking at a small group of nearby young stars to characterize the consistency and magnitude of this discrepancy. This presentation shows the initial data analysis and fitting attempts. Further data gathering and characterization methods are also discussed, along with possible reasons for the observed discrepancies.
Quantifying the Nexus of Climate, Economy, and Health: A State-of-the-Art Time Series Approach
Kameron Kinast, Graduate Student, Rochester Institute of Technology
Extreme weather events pose significant threats to human life, the economy, agriculture, and various other socio-economic aspects. This thesis presents a comprehensive analysis of the patterns of climate factors and their impact on the economy and human health using state-of-the-art and emerging statistical machine learning techniques. This research consists of two parts: exploring and comparing the effectiveness of statistical models with respect to climate time series forecasting and analyzing the effects on the economy and human health. The study employs a predominantly computational approach, leveraging R, Python, and Julia to demonstrate the role of statistical computing in understanding climate change and its impacts. This thesis aims to construct powerful statistical models that establish a functional relationship between climate measurements, economic indicators, and human health. Furthermore, we speculate on potential causal relationships within the data to contribute to a deeper understanding of the causes and consequences of extreme weather events. By providing insights into the complex interplay of climate factors, economy, and health, this research seeks to inform evidence-based policy decisions that help mitigate the adverse effects of extreme weather events and foster resilience in the face of dangerous climate change.
A Pattern Discovery Algorithm for Pharmacovigilance Signal Detection
Anran Liu, Graduate Student, University at Buffalo
Safety of medical products continues to be a major public health concern worldwide. Spontaneous Reporting Systems (SRS), such as the FDA Adverse Event Reporting System (FAERS), are critical tools in the post-marketing evaluation of medical product safety. A variety of approaches have been developed for identification of adverse events using data that reside in FAERS and other SRS databases.
In this talk, we propose a pattern discovery algorithm, named the Modified Detecting Deviating Cells (MDDC) algorithm, for the identification of adverse events when the database is represented as an I x J contingency table. The MDDC procedure is based on the standardized Pearson residuals of the pairs of potential adverse event-drug combinations, allowing the change of scale from categorical to interval/ratio. The method 1) is easy to compute; 2) considers the relationships between the different adverse events; and 3) depends on a data-driven cutoff. We study the performance of our method via simulation and through an application to a specific drug class dataset downloaded from FAERS.
Authors: Anran Liu, Marianthi Markatou
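The abstract above describes the procedure as built on standardized Pearson residuals of an I x J contingency table. The sketch below (illustrative only; it is not the authors' MDDC implementation and omits the data-driven cutoff) shows that building block on toy counts:

    # Standardized Pearson residuals for an adverse-event-by-drug count table.
    import numpy as np

    counts = np.array([[20, 5, 3],
                       [4, 30, 2],
                       [6, 7, 25]], dtype=float)   # toy report counts

    n = counts.sum()
    row_p = counts.sum(axis=1, keepdims=True) / n   # row marginal proportions
    col_p = counts.sum(axis=0, keepdims=True) / n   # column marginal proportions
    expected = n * row_p * col_p                    # expected counts under independence

    # Large residuals flag potential adverse event-drug signals.
    residuals = (counts - expected) / np.sqrt(expected * (1 - row_p) * (1 - col_p))
    print(np.round(residuals, 2))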
The Statistics Behind Casinos and Risks to the United States and Canada
Joseph Martino, Undergraduate Student, Niagara University
This presentation will go over the statistics that are used in casinos and compare betting patterns in the United States and Canada. Statistics are found everywhere within a casino, putting bettors at a mathematical disadvantage in the games they play. The house edge is the way casinos bias games in their favor, structuring the rules to give themselves a statistical advantage over the players and allowing them to be confident that they will win money in the long run. Naturally, casinos do not want a negative house edge, and only two games can produce one. The first is blackjack, but only if one uses a card-counting strategy and is covert enough not to be caught doing it. The second is select online poker games, although the limit on how much one can bet on them results in a very small player profit. The top five favorite games in Canada and the United States reveal how both countries have their strengths and weaknesses in being careful with their money, as well as how online betting can bring its own set of risks. The presentation will conclude with a discussion of strategies one can take to have an enjoyable yet safe experience on a trip to a casino.
Authors: Joseph Martino, Kylee Healy, and Dr. Susan Mason.
Statistical Literacy
Susan E. Mason, Professor, Niagara University
In introductory-level statistics courses we teach statistical knowledge and skills, but are students also acquiring the critical thinking skills necessary for statistical literacy? As expressed by Milo Schield (2005), who has written extensively on the topic, “Statistical literacy is for data consumers while statistical competence is for data producers.” Not all statistics students will go on to be data producers, but all will be data consumers. We consume statistical information daily. It is embedded in news reports, sports write-ups, advertisements and so on. If we are not prepared to critically evaluate the information we receive, we can be easily influenced and easily misled. Statistical illiteracy can have a detrimental effect on both the individual and society. Consider, for example, the impact of data misinformation and data misinterpretation on political behavior and on the pandemic response. In this presentation we will discuss the importance of statistical literacy, and we will review effective methods for teaching students to be critical consumers of statistical information.
Authors: Susan E. Mason, Kylee A. Healy, and Joseph D. Martino
Large Language Models Applied to the Identification of Social Determinants of Health
Raktim Mukhopadhyay, Graduate Student,
CDSE Program & Department of Biostatistics, University at Buffalo
Identifying social determinants of health and understanding their impact on the well-being of individuals is an important step in facilitating positive changes in health outcomes. Collection, integration, and effective use of clinical data for this purpose present a variety of challenges. In this talk, we will discuss work that uses large language models to aid in the identification of social determinants of health in people with opioid use disorder. We outline the challenges of collecting, integrating, and using social determinants of health data, as well as the challenges associated with the use of large language models. We also present our solutions along with the reproducible workflows that allow data integration from disparate clinical sources. If time permits, we will discuss the creation of the meta-form that is used to obtain the relevant data for statistical analysis.
This work is a collaboration with Dr. AH Talal, Dr. O Kennedy, Dr. A Dharia, and Mr. M Brachmann, Jacobs School of Medicine, CSE, and Breadcrumb Company.
Authors: Raktim Mukhopadhyay and Marianthi Markatou
Graph-based Approach to Studying the Spread of Radical Online Sentiment
Le Nguyen, Graduate Student, Rochester Institute of Technology
The spread of radicalization through the Internet is a growing problem. We are witnessing a rise in online hate groups, inspiring the impressionable and vulnerable population towards extreme actions in the real world. In this paper, we study the spread of hate sentiments in online forums by collecting 1,973 long comment threads (30+ comments) posted on dark-web forums and containing a combination of benign posts and radical comments on the Islamic religion. This framework allows us to leverage network analysis tools to investigate sentiment propagation through a social network. By combining sentiment analysis with Large Language Models, social network analysis, and graph theory, we aim to shed light on the propagation of hate speech in online forums and the extent to which such speech can influence individuals.
Sensitivity Analysis for Constructing Optimal Treatment Regimes in the Presence of Non-compliance and Two Active Treatment Options
Cuong Pham, Graduate Student, University of Rochester Medical Center
Existing literature on constructing optimal regimes often focuses on intention-to-treat analyses that completely ignore the compliance behavior of individuals. Instrumental variable-based methods have been developed for learning optimal regimes under endogeneity. However, when there are two active treatment arms, the average causal effects of treatments cannot be identified using a binary instrument, and thus the existing methods will not be applicable. To fill this gap, we provide a procedure that identifies an optimal regime and the corresponding value function as a function of a vector of sensitivity parameters. We also derive the canonical gradient of the target parameter and propose a multiply robust classification-based estimator of the optimal regime. Our simulations highlight the need for and usefulness of the proposed method in practice. We implement our method on the Adaptive Treatment for Alcohol and Cocaine Dependence randomized trial.
Assessing Sensitivity of Coastal Adaptation Costs to Sea Level Rise Across Different Future Scenarios with Random Forests.
Prasanna Ponfilio Rodrigues, Graduate Student, Rochester Institute of Technology
Sea level rise (SLR) is a crucial effect of climate change that is already affecting low-lying coastal areas throughout the world, causing economic, social, and environmental losses. As the global mean sea level is likely to rise by up to 1.6 meters by 2100, it is critical to examine the drivers of coastal damages and adaptation costs. This will provide policymakers with a quantified estimate of the benefits and costs of adaptation activities, allowing them to assess the viability and efficacy of various adaptation strategies. However, uncertainties in the physical processes of ice-sheet dynamics and ocean circulation, as well as different future pathways for socioeconomic development and greenhouse gas emissions, can greatly affect coastal adaptation costs. Here, we focus on four SSP (Shared Socioeconomic Pathways) - RCP (Representative Concentration Pathways) scenarios and examine the range of probable future coastal impacts. We use random forests to determine the most important model parameters that contribute to the adaptation costs under different SSP-RCP scenarios, and on different time-scales. These results highlight key processes and parameters for mitigating the dangerous and costly impacts of sea level rise under different socioeconomic and greenhouse gas scenarios.
Authors: Prasanna Ponfilio Rodrigues, Tony E. Wong and Carolina Estevez Loza
Statistically Motivated
Elizabeth Reid, Assistant Professor, Marist College
Introduction to Statistics is a required course for many college students. One of the hardest parts about teaching statistics is motivating students to care about the material and to think about the class as more than just another hurdle they need to clear. In this talk we will discuss ways to get students to take a greater interest in the course by connecting it to their major, future jobs, and situations that they encounter in everyday life. By doing so, students get the most out of the class and are able to answer for themselves the all-too-common question, "Why do I need to know this?"
Identifying Opportunities for University COVID-19 Model Improvements Using Bayesian Model Calibration
Meghan Rowan Childs, Graduate Student, Rochester Institute of Technology
From the beginning of the COVID-19 pandemic, universities have experienced unique challenges due to their multifaceted nature as a place of education, residence, and employment. Current research has used mathematical models to explore non-pharmaceutical approaches to combating COVID-19 and leveraged model parameters calibrated to local contexts, such as hospitals. However, key questions remain regarding the impacts of a model calibration that uses a university's complete semester of COVID-19 data on model performance and parameter inference. We use an adapted SEIR compartment model that represents a semi-enclosed campus population. We use surveillance testing data from Rochester Institute of Technology’s (RIT) Fall 2020 semester to leverage a formal Bayesian model calibration to quantify uncertainty in model parameters and identify and diagnose model shortcomings. Based on surveillance testing data we define a formal likelihood function and use Markov chain Monte Carlo for sampling. We use this model calibration to compare modeled positive tests and isolation population to the RIT data. From this comparison we diagnose the model's inaccuracy in representing false positive test results and the contribution of community transmission to campus infections. This diagnosis highlights the need for further model developments that better reflect false positive test results and the effects of outside community transmission. Our results demonstrate the discovery of model inaccuracies that will inform model developments to produce a more accurate model and improve inferred parameters.
Additional Authors: Dr. Tony E. Wong, Rochester Institute of Technology
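A toy illustration of the calibration approach described above (illustrative only; not the authors' RIT model, data, or likelihood): a simplified SEIR simulator, a Poisson likelihood, and a random-walk Metropolis sampler for two parameters.

    # Calibrate (beta, gamma) of a toy SEIR model to synthetic daily case counts.
    import numpy as np

    def seir(beta, gamma, sigma=1/3, N=10000, I0=10, days=100):
        """Discrete-time SEIR model; returns daily new infectious cases."""
        S, E, I, R = N - I0, 0.0, float(I0), 0.0
        new_cases = []
        for _ in range(days):
            new_exposed = beta * S * I / N
            new_infectious = sigma * E
            new_recovered = gamma * I
            S -= new_exposed
            E += new_exposed - new_infectious
            I += new_infectious - new_recovered
            R += new_recovered
            new_cases.append(new_infectious)
        return np.array(new_cases)

    def log_post(theta, observed):
        beta, gamma = theta
        if not (0 < beta < 2 and 0 < gamma < 1):       # flat priors with bounds
            return -np.inf
        lam = np.maximum(seir(beta, gamma), 1e-9)
        return np.sum(observed * np.log(lam) - lam)    # Poisson log-likelihood (up to a constant)

    rng = np.random.default_rng(0)
    observed = rng.poisson(seir(beta=0.5, gamma=0.2))  # synthetic "surveillance" data

    theta, samples = np.array([0.3, 0.1]), []
    for _ in range(5000):                              # random-walk Metropolis
        proposal = theta + rng.normal(scale=0.02, size=2)
        if np.log(rng.uniform()) < log_post(proposal, observed) - log_post(theta, observed):
            theta = proposal
        samples.append(theta)
    print("posterior mean (beta, gamma):", np.mean(samples[2500:], axis=0))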
Decreased respiratory-related absenteeism among pre-school students after installation of upper-room germicidal ultraviolet light: analysis of newly discovered historical data
Christopher W. Ryan, XX, SUNY Upstate Medical University and Broome County Health Department
The COVID-19 pandemic has brought renewed urgency to issues of air disinfection. Upper-room germicidal ultraviolet light (GUV) disinfects room air very efficiently. Its effect on practical outcomes in public settings is difficult to study and remains unclear, but history may provide some insights. I fit an interrupted time series model to a newly discovered dataset of attendance records from a preschool in the US mid-Atlantic region between 1941 and 1949, where GUV was installed in December 1945. GUV was associated with a sizable reduction in child absenteeism due to respiratory illnesses of any cause. Odds ratios for the effect ranged from 0.41 to 0.75, depending on season. In all but high summer, model-predicted absenteeism rates were reduced by a third to a half with GUV. In summary, installation of upper-room germicidal ultraviolet light in a preschool was followed by a significant and operationally meaningful reduction in absenteeism due to respiratory illness of any cause. Wider use of upper-room GUV systems in schools and preschools may be worthwhile to reduce absenteeism due to illness and the educational, social, and economic consequences that ensue.
Poisson kernel-based tests for uniformity on the d-dimensional torus and sphere
Giovanni Saraceno, XX, University at Buffalo
Spherical and toroidal data arise in various applications, such as neuroscience, computer vision, and natural language processing. Uniformity tests on spherical or toroidal data are potentially applicable in natural language processing, where they could be used to evaluate the quality of topic models by providing a quantitative measure of a model's performance through an assessment of the distribution of the generated topics.
We propose a new approach for testing uniformity of distribution for data vectors on the d-dimensional hypersphere. Our tests rely on U-statistic and V-statistic estimates of the kernel-based quadratic distance between the hypothesized uniform distribution on the sphere and the empirical cumulative distribution function. We introduce a class of diffusion kernels and focus on the Poisson kernel, which forms the basis of our proposed uniformity tests. We obtain the Karhunen-Loève decomposition of the kernel, connect it with its degrees of freedom, and hence determine the power of the test via a tuning parameter, the diffusion parameter. We present an algorithm to optimize the choice of the tuning parameter such that maximum power is achieved. We then study the performance of the proposed tests in terms of level and power, for a number of alternative distributions. Our simulations demonstrate the superior performance of our method compared to other test procedures, such as the Rayleigh, Giné, Ajne and Bingham test procedures, in the case of multimodal alternatives. We apply our tests to real-world data on the orbits of comets obtained from the NASA website.
A conditional approach for joint estimation of wind speed and direction
Qiuyi Wu, Graduate Student, University of Rochester
This study develops a statistical conditional approach to evaluate climate model performance in wind speed and direction and to project their future changes under the Representative Concentration Pathway (RCP) 8.5 scenario over inland and offshore locations across the continental United States (CONUS). The proposed conditional approach extends the scope of existing studies by a combined characterization of the wind direction distribution and conditional distribution of wind on the direction, hence enabling an assessment of the joint wind speed and direction distribution and their changes. A von Mises mixture distribution is used to model wind directions across models and climate conditions. Wind speed distributions conditioned on wind direction are estimated using two statistical methods, i.e., a Weibull distributional regression model and a quantile regression model, both of which enforce the circular constraint to their resultant estimated distributions. The proposed conditional approach enables a combined characterization of the wind speed distributions conditioned on direction and wind direction distributions, which offers a flexible alternative that can provide additional insights for the joint assessment of speed and direction.
Authors: Qiuyi Wu (speaker), Julie Bessac, Whitney Huang, Jiali Wang, and Rao Kotamarthi
POSTER ABSTRACTS
Ordered by presenter last name
How Do Changes in Bitcoin Prices Impact Inflation?
Nidhi Baindur and Sylvia Eisenberg, Undergraduate Students, Rochester Institute of Technology
Cryptocurrencies such as Bitcoin have gained significant attention in recent years, with their impact on the global economy being a topic of controversy and research. While some experts argue that Bitcoin can act as a viable safeguard against inflation, others argue that Bitcoin is a speculative investment and is not suitable as a transaction currency due to its limited transaction capacity. To address the debate on whether Bitcoin can be considered a hedge against inflation, we conducted a data-driven study using time series analysis and Granger causality tests.
Our analysis focused on examining the relationship between Bitcoin prices and forward inflation rates. We used daily time series created from the percent change in daily Bitcoin prices and the 5-year forward inflation expectation rates provided by the Federal Reserve Bank of St. Louis between 2016 and 2022. We found Granger causality between Bitcoin and forward inflation rates at the first lag order in 2016 and at the third and fifth lag orders in 2020. However, we found no Granger causality between the values in the rest of this time period. Our research could provide valuable insights for financial analysts, investors, traders, and the scientific community who are exploring the use of decentralized electronic forms of currency. Moving forward, we aim to continue our research by utilizing mathematical approaches to gain a deeper understanding of the viability of cryptocurrencies like Bitcoin as transactional currencies.
Authors: Dr. Mary Lynn Reed, Nidhi Baindur, Sylvia Eisenberg
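A minimal sketch of the kind of test used in the study above (illustrative only; toy data rather than the FRED and Bitcoin series):

    # Granger causality test: do lags of "bitcoin" help predict "inflation"?
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(1)
    n = 500
    btc = rng.normal(size=n)                                        # stand-in % changes
    infl = 0.3 * np.roll(btc, 1) + rng.normal(scale=0.5, size=n)    # partially lags the first series

    data = pd.DataFrame({"inflation": infl, "bitcoin": btc})
    # The second column is tested as a Granger cause of the first column.
    grangercausalitytests(data[["inflation", "bitcoin"]], maxlag=5)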
Enhancing Federated Learning Security with Reputation and Trust-Based Indicators
Sergei Chuprov, Graduate Student, Rochester Institute of Technology
In our work, we investigate training Data Quality (DQ) degradation in Federated Learning (FL) due to malicious attacks against the training data or FL clients, or due to technological factors such as software/hardware failures. We develop and propose a Reputation- and Trust-based technique that allows detecting local clients who produce anomalous local models, which might be the result of a malicious attack. Using unsupervised clustering of model parameters (K-means), we analyze the local models transferred to the aggregation server. We estimate the distance between the major cluster centers to detect anomalous models. Based on this distance, we calculate our Reputation indicator, which we update in each training iteration. We employ our Trust indicator to exclude untrustworthy local clients from the aggregation. We demonstrate how this helps to enhance the privacy and performance of the produced global model.
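A highly simplified sketch of the clustering idea described above (illustrative only; not the authors' system, and the reputation/trust updates are reduced here to a single distance threshold):

    # Flag clients whose model updates sit far from the majority cluster center.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    clients = np.vstack([rng.normal(0.0, 0.1, size=(9, 20)),    # benign updates
                         rng.normal(3.0, 0.1, size=(1, 20))])   # anomalous update

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(clients)
    major = np.argmax(np.bincount(kmeans.labels_))               # majority cluster = benign reference
    dist = np.linalg.norm(clients - kmeans.cluster_centers_[major], axis=1)

    threshold = dist.mean() + 2 * dist.std()                     # reputation penalty cutoff
    print("suspect clients:", np.where(dist > threshold)[0])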
Contextual Understanding of Cybersecurity Exploits/Vulnerability Descriptions through Self-Supervised Learning
Reza Fayyazi, Graduate Student, Rochester Institute of Technology
With the rise and development of the Internet, many systems around the world are susceptible to severe security threats. The volume, variety, and velocity of change in vulnerabilities and exploits have made incident and threat analyses challenging with human expertise and experience alone. Many Security Information and Event Management (SIEM) systems have been developed to produce and correlate intrusion logs and threat intelligence reports to assist security analysts. The descriptions in these logs and reports, however, can be cryptic and not easy to interpret. Therefore, this research aims to assess the complexity and evolution of cybersecurity vulnerabilities and the challenges associated with categorizing them. There is a need to process evolving vulnerabilities to gain an understanding of the tactics and techniques used by adversaries when they are targeting a system. In our preliminary results, we saw that supervised learning models do not reliably identify specific vulnerabilities and exploits. Therefore, to identify useful representations of new/unseen vulnerabilities effectively, we propose a continuous self-supervised learning technique that improves the ability to identify and mitigate cybersecurity threats and enhances overall security posture. We propose developing metrics to measure the degree of technicality of cybersecurity-specialized sentences. The metrics include: domain-specific word frequency, surprisal to capture rare words in the sentences, and the average distance between the predictions of a language model fine-tuned on cybersecurity data and a regular language model. Finally, we propose an ensemble evaluation with these metrics to rank the importance of cybersecurity sentences effectively.
Authors: Reza Fayyazi (rf1679@rit.edu), Dr. Shanchieh Jay Yang (jay.yang@rit.edu)
Deepfake Bias: Analysis and Balanced Dataset Generation
Bryce Gernon, Undergraduate Student, Rochester Institute of Technology
Deepfakes are spreading faster than ever, but the detectors made to counter them continue to suffer from systemic bias. The datasets they work with are unbalanced, the training methodologies they use are inherently biased, and the models are often structured without bias in mind.
We present a tool that automatically analyzes how balanced a dataset is, along with an analysis of many popular deepfake datasets using that tool. We also present experimental results in analyzing model bias and in the usage of multiple specialized models to explicitly account for the visual differences in the perceived race and gender of different faces so that bias can be tracked and prevented.
In addition, due to high levels of perceived bias in popular deepfake datasets, we will show an early version of a new deepfake dataset that is created specifically to minimize imbalance in perceived race and gender. This allows us to analyze the differences in model performance and bias when trained on balanced datasets vs unbalanced datasets.
Additional authors: Saniat Javid Sohrawardi, Matthew Wright
Predicate powered learning with DeepONet
Yang Liu, Graduate Student, Rochester Institute of Technology
The current trend of data-driven learning can be extremely data hungry. The new paradigm of learning aims to reduce the amount of data needed by providing useful predicates. Inspired by Vapnik's work on learning using statistical invariants, this work aims to provide a practical and effective framework for utilizing predicates in deep learning. Building upon DeepONet, we propose that by using a universal operator approximator, we can effectively incorporate predicates into the model. We also propose an alternative interpretation of predicates and their function in the learning problem.
Evaluating Wildfire Detection Sensors with Support Vector Machines
Megan Marra, Undergraduate Student, Rochester Institute of Technology
The support vector machine (SVM) is a supervised machine learning algorithm for classification and regression. My study will use the radial basis function kernel to transform a non-linearly separable dataset into one that is linearly separable. When applied to Earth coverage data (primarily gathered by the MODIS sensor), SVM can evaluate the sensor's accuracy in detecting active wildfires. With increasing progress in this field, researchers can track the conditions related to wildfire occurrence and notify local fire departments before the actual event of a wildfire, reducing damage to the environment and property as well as the number of fire-related casualties.
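A minimal sketch of the classifier described above (illustrative only; synthetic two-dimensional data rather than MODIS observations):

    # RBF-kernel SVM on a dataset that is not linearly separable in its raw features.
    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=500, factor=0.4, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The RBF kernel implicitly maps points to a space where a linear separator exists;
    # C and gamma control regularization and kernel width.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))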
Adapting Transformer Networks for Improved Website Fingerprinting Classification Performance
Nate Mathews, Graduate Student, Rochester Institute of Technology
Website fingerprinting (WF) is a privacy attack that aims to infer a user's browsing behavior by analyzing encrypted network traffic. It is commonly used by surveillance agencies to monitor user activities on the internet. Currently, the most popular state-of-the-art WF attacks use relatively unsophisticated Convolutional Neural Network (CNN) architectures to perform website classification.
In this ongoing project, our goal is to improve the classification performance of website fingerprinting by leveraging the strengths of Transformer networks, which have shown impressive performance on a wide range of natural language processing and image classification tasks. We adapt the state-of-the-art Transformer models used for vision classification to be applicable to our website traffic. Furthermore, we are examining the use of masked input modeling loss to allow for self-supervised training of a WF model on unlabeled traffic samples. Preliminary results have so far shown that these Transformer architectures can achieve competitive performance when compared to the prior CNN architectures.
Overall, our work aims to contribute to the field of website fingerprinting by proposing a novel approach based on Transformer networks that can potentially improve the accuracy and robustness of website fingerprinting classification.
An Automated Post-Mortem Analysis of Vulnerability Relationships using Natural Language Word Embeddings
Benjamin S. Meyers, Graduate Student, Rochester Institute of Technology
The daily activities of cybersecurity experts and software engineers--code reviews, issue tracking, vulnerability reporting--are constantly contributing to a massive wealth of security-specific natural language. In the case of vulnerabilities, understanding their causes, consequences, and mitigations is essential to learning from past mistakes and writing better, more secure code in the future. Many existing vulnerability assessment methodologies, like CVSS, rely on categorization and numerical metrics to glean insights into vulnerabilities, but these tools are unable to capture the subtle complexities and relationships between vulnerabilities because they do not examine the nuanced natural language artifacts left behind by developers. In this work, we want to discover unexpected relationships between vulnerabilities with the goal of improving upon current practices for post-mortem analysis of vulnerabilities. To that end, we trained word embedding models on two corpora of vulnerability descriptions from Common Vulnerabilities and Exposures (CVE) and the Vulnerability History Project (VHP), performed hierarchical agglomerative clustering on word embedding vectors representing the overall semantic meaning of vulnerability descriptions, and derived insights from vulnerability clusters based on their most common bigrams. We found that (1) vulnerabilities with similar consequences and based on similar weaknesses are often clustered together, (2) clustering word embeddings identified vulnerabilities that need more detailed descriptions, and (3) clusters rarely contained vulnerabilities from a single software project. Our methodology is automated and can be easily applied to other natural language corpora. We release all of the corpora, models, and code used in our work.
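A toy sketch of the embedding-and-clustering pipeline described above (illustrative only; a four-sentence corpus rather than the CVE/VHP corpora, and not the authors' trained models):

    # Embed short vulnerability descriptions and cluster them hierarchically.
    import numpy as np
    from gensim.models import Word2Vec
    from scipy.cluster.hierarchy import fcluster, linkage

    descriptions = [
        "buffer overflow in image parser allows remote code execution",
        "heap overflow in media parser allows remote code execution",
        "sql injection in login form exposes user credentials",
        "improper input validation in login form leaks credentials",
    ]
    tokenized = [d.split() for d in descriptions]

    w2v = Word2Vec(sentences=tokenized, vector_size=50, min_count=1, seed=0)
    doc_vectors = np.array([w2v.wv[toks].mean(axis=0) for toks in tokenized])

    Z = linkage(doc_vectors, method="ward")            # agglomerative (Ward) clustering
    print(fcluster(Z, t=2, criterion="maxclust"))      # cluster label per description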
Accurately Simulating a Medical Search Engine API with GPT
Pranav Nair, Graduate Student, Company: VisualDx | University: Rochester Institute of Technology
Large Language Models (LLMs) promise to replace hand-engineered pipelines and task-specific ML models with “prompt engineering.” Supposedly, a well-designed textual prompt, in concert with an LLM API, can lower the engineering cost of new Natural Language Processing (NLP) services. We investigate a specific use case: extracting ICD10 codes from medical search engine queries. ICD10 codes are a standard ontology for medical diagnoses (Lyme Disease has a code of A69.2). We prompt an LLM to behave like a JSON API that takes in a user query and returns the closest ICD10 code. This is a challenging task because it requires that the LLM not only conform to a pattern, but systematically return the correct answer (historically a challenge for language models). We input 394 real (anonymous) user queries from the VisualDx search engine into GPT 3.5 and GPT 4.0 to see how many of the outputs match our hand-checked ICD codes. We find GPT 3.5 and GPT 4.0 have an exact-match accuracy of 14.7% and 62.7%, respectively. However, a special feature of ICD10 codes is that “partial matches” are informative: If the initial characters of two different codes match, they are related in some way. If we check for exact matches of the portion of the code to the left of the decimal point, GPT 3.5 and GPT 4.0 achieve 47.5% and 84.5% accuracy, respectively. GPT 4.0 with a single prompt is accurate enough to plausibly replace a traditional NLP pipeline for this task.
Authors: Miguel Dominguez, Pranav Nair
Evaluating Deepfake Detector Robustness and Exploring Countermeasures Against Adversarial Deepfakes
Shaikh Akib Shahriyar, Graduate Student, Rochester Institute of Technology
Deepfake (DF) videos are getting better in quality and can be used for dangerous disinformation campaigns. To detect DFs, researchers have designed various models, and sequence-based models that utilize temporal information are more effective at detection than ones that only detect intra-frame discrepancies. Unfortunately, DF detection is a perfect target for adversarial examples that could fool models and undermine attempts to curb disinformation. Thus, improving the robustness of DF detection models is paramount. We explore whether we can generate adversarial examples that fool sequence-based DF detectors to better understand the threat they pose. Additionally, we explore the effects of different augmentation techniques and adversarial training on DF detectors in improving generalization capability and robustness, respectively.
Authors: Shaikh Akib Shahriyar (RIT), Dr. Matthew Wright (RIT)
Exploring User-friendly Explanations for Deepfake Detection
Kelly Wu, Graduate Student, Rochester Institute of Technology
There has been a growing concern about the use of manipulated media. Deepfakes, AI-generated media meant to fool human eyes, have become central to the discussion as the supporting technology has improved. To help people in the battle with deepfakes, many detection techniques have been developed with promising results in the laboratory. Due to the black-box nature of the detection models, however, users may have a hard time understanding the models’ decisions. To bridge the gap between them, we need to provide user-friendly model explanations for deepfake detection. In this work, we explore the performance of existing model explanation methods in explaining the classification results of a deep-learning-based deepfake detection model and identify some insights for future works in explainability for deepfake detection.
Authors: Kelly Wu (RIT), Matthew Wright (RIT), Andrea Hickerson (Ole Miss), Yu Kong (Michigan State)
Applying Linear Probability Models for Binary Outcomes to Epidemiology and the Public Health Sciences
Ann Yao Zhang, Graduate Student, University of Rochester
The convention when analyzing dichotomous outcome variables in the public health sciences is to use the logit, and less commonly, probit models. The linear probability model (LPM) is the application of ordinary least squares to binary dependent variables, and it is rarely used in epidemiology. This paper examines methodological criticisms of the LPM in epidemiological textbooks and foundational literature and addresses common areas of concern raised by epidemiologists, including nonlinearity between exposure and outcome, conditional heteroscedasticity, and predicted probabilities that are less than 0 or greater than 1. Advantages of the LPM are discussed in the context of applications to epidemiological research, including the direct interpretation of parameter estimates as mean marginal effects, computational efficiency, and the LPM as the first stage of instrumental variable analysis. Based on applications in econometrics and other fields using large data, a framework for ascertaining the appropriateness of using the LPM for modeling binary outcomes in observational studies in the public health sciences is proposed.
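A small simulated example of the comparison discussed above (illustrative only; simulated data, not an epidemiological study):

    # Linear probability model by OLS with robust (HC3) errors, next to a logit fit.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 2000
    exposure = rng.binomial(1, 0.4, size=n)
    age = rng.normal(50, 10, size=n)
    p = np.clip(0.15 + 0.10 * exposure + 0.002 * (age - 50), 0.01, 0.99)
    outcome = rng.binomial(1, p)                       # binary outcome

    X = sm.add_constant(np.column_stack([exposure, age]))

    lpm = sm.OLS(outcome, X).fit(cov_type="HC3")       # LPM: coefficient is a risk difference
    print("LPM exposure coefficient:", lpm.params[1])

    logit = sm.Logit(outcome, X).fit(disp=False)       # logit comparison: average marginal effects
    print(logit.get_margeff().summary())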