Friday, April 21st
3:00-4:00pm Registration [Carlson, outside of Room 1125]
3:30-4:00 Welcome! [Carlson 1125]
4:00-5:00 Tutorials
T1: Basic Intro to R and Python Libraries for Data Science [Gleason 2149]
Gregory Babbitt, Associate Professor, Rochester Institute of Technology
gabsbi@rit.edu
We will spend the hour introducing common tasks and basic syntax differences in both languages. We will review some important packages/libraries/modules in both languages used for data management, statistics, machine learning, and data visualization. We will then have a live coding demonstration with example code available from my RIT course website. Please have R/RStudio, Python, and a decent multi-language code editor such as VS Code or Komodo installed before you arrive. It would also be helpful to install the following packages/libraries/modules. R: ggplot2, class, MASS, e1071, neuralnet, caret, kernlab, ada, randomForest. Python: pandas, scikit-learn, matplotlib, plotnine, seaborn, numpy, scipy.
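A minimal Python sketch of the kind of workflow covered (illustrative only, not part of the tutorial materials; the dataset and model are stand-ins):

    # Load a dataset, fit a classifier, and evaluate it with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    iris = load_iris(as_frame=True)          # data as a pandas DataFrame
    X, y = iris.data, iris.target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))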
T2: Introduction to Natural Language Processing [Gleason 2159]
Le Nguyen, Graduate Student, Rochester Institute of Technology
ln8378@g.rit.edu
In this tutorial, we will cover the fundamentals of Natural Language Processing. We will start with processing natural language. Then we will use NLP techniques to analyze our language data to gain key insights. Finally we will use models to derive usable information from processed language.
Topics covered: natural language cleaning, stemming, and tokenization; language analysis techniques, including term frequency, embeddings, and part-of-speech tagging; and using language models for sentiment analysis, predictive text, and language classification.
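A minimal Python sketch of the first of these steps (illustrative only; the tutorial does not prescribe a specific library, and NLTK is used here as an assumption):

    # Clean, tokenize, stem, and count term frequencies for a short text.
    import re
    from collections import Counter
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize   # needs nltk.download("punkt") once

    text = "Natural Language Processing turns raw text into usable data."
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())   # keep letters and spaces only

    stemmer = PorterStemmer()
    tokens = [stemmer.stem(tok) for tok in word_tokenize(cleaned)]

    print(Counter(tokens))   # term frequencies of the stemmed tokens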
5:00-6:00 Tutorials
T3: Community Detection in Complex Networks [Gleason 2149]
Nishant Malik, Assistant Professor, Rochester Institute of Technology
nxmsma@rit.edu
Many natural and social systems organize as networks; a few well-known examples include the animal brain, electrical power grids, the internet, online social networks such as Facebook and Twitter, the relationships between genes and diseases, collaboration and citation among scientists, trade among countries, and interactions between financial markets. Most of these networks have nontrivial structural properties, hence the name complex networks. Mathematical analysis of complex networks has led to many successes, such as improving our understanding of how the human brain works and developing novel intervention and vaccination strategies to stop the spread of diseases.
Numerous biological, social, and technological networks have modular structures: they consist of modules of nodes, called communities, within which connectivity is dense. During this tutorial we will learn various algorithms for detecting community structure in networks. We will use Python's NetworkX package and apply these algorithms to several real-world data sets.
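A minimal NetworkX sketch of the kind of analysis covered (illustrative only; the tutorial's data sets and algorithms may differ):

    # Detect communities in a benchmark graph by greedy modularity maximization.
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.karate_club_graph()               # classic small social network
    communities = greedy_modularity_communities(G)
    for i, community in enumerate(communities):
        print(f"community {i}: {sorted(community)}")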
T4: Exploration of Deep Learning [Gleason 2159]
Ernest Fokoué, Professor, Rochester Institute of Technology
epfeqa@rit.edu
This tutorial is intended to be a walk in the park through which I plan to share my initiation into prompt engineering around various themes. In particular, I will share a wide variety of my sessions with ChatGPT, including advanced statistical theory, computer programming, statistical programming, poetry, basic clerical letter writing, religion, fitness, comedy, tragedy, songwriting, planning my soccer practice sessions, and basic proofs in mathematics.
Note on ChatGPT: If you haven't yet done so, you are strongly encouraged to create your own OpenAI account so that you can interact directly within ChatGPT during the tutorial.
6:30-8:00 Panel Discussion [Carlson 1125]
Light hors d'oeuvres served.
Saturday, April 22nd
8:00-8:30 Breakfast/registration [Carlson]
8:30-9:00 Welcome [Carlson 1125]
9:00-9:55 Parallel sessions
Session 1A: Clinical Trials and the Use of Statistics [Gleason 1139]
Session 1B: Statistics in Climate Science 1 [Gleason 1149]
10:00-10:55 Parallel sessions
Session 2A: Statistics in Climate Science 2 [Gleason 1139]
Session 2B: Statistics in Undergraduate Education [Gleason 1149]
11:00-11:55 Parallel sessions
Session 3A: Large Language Models [Gleason 1139]
Session 3B: Statistics in Space [Gleason 1149]
12:00-2:00 Lunch with Poster Session [Carlson 1125]
See abstracts on page XX
2:30-3:25 Parallel sessions
Session 4A: Intelligence: From humans to animals to concepts to silicon chips [Gleason 1139]
Session 4B: Solving Practical Problems [Gleason 1149]
3:30-4:00 Data Competition Presentations [Carlson 1125]
Three 10-minute presentations from the top 3 papers!
4:00-4:30 Break & Student Awards Judging
4:30-5:00 Awards & Closing Remarks [Carlson 1125]
UP-STAT 2023
DETAILED PARALLEL TALK PROGRAM FOR SATURDAY, APRIL 22
SESSION 1A Gleason 1139
Clinical Trials and the Use of Statistics
Session Chair:
9:00-9:15 A Pattern Discovery Algorithm for Pharmacovigilance Signal Detection §
Anran Liu, Graduate Student, University at Buffalo
9:20-9:35 Sensitivity Analysis for Constructing Optimal Treatment Regimes in the Presence of Non-compliance and Two Active Treatment Options §
Cuong Pham, Graduate Student, University of Rochester Medical Center
9:40-9:55 Decreased Respiratory-Related Absenteeism Among Pre-School Students After Installation of Upper-Room Germicidal Ultraviolet Light: Analysis of Newly Discovered Historical Data
Christopher Ryan, Associate Professor,
SUNY Upstate Medical University, Broome County Health Department
SESSION 1B Gleason 1149
Statistics in Climate Science 1
Session Chair:
9:00-9:15 A Conditional Approach for Joint Estimation of Wind Speed and Direction §
Qiuyi Wu, Graduate Student, University of Rochester Medical Center
9:20-9:35 Quantifying the Nexus of Climate, Economy and Health: A State-of-the-Art Time Series Approach §
Kameron Kinast, Graduate Student, Rochester Institute of Technology
SESSION 2A Gleason 1139
Statistics in Climate Science 2
Session Chair: Dr. Tony E. Wong, Professor, Rochester Institute of Technology
10:00-10:15 Impacts of Warming Thresholds on Uncertainty in Future Coastal Adaptation Costs §
Selorm Dake, Undergraduate Student, Rochester Institute of Technology
10:20-10:35 Assessing Sensitivity of Coastal Adaptation Costs to Sea Level Rise Across Different Future Scenarios with Random Forests §
Prasanna Ponfilio Rodrigues, Graduate Student, Rochester Institute of Technology
10:40-10:55 Evaluating the Impacts of Structural Uncertainty in Sea Level Rise Models on Coastal Adaptation §
Kelly Feke, Undergraduate Student, Rochester Institute of Technology
SESSION 2B Gleason 1149
Statistics in Undergraduate Education
Session Chair: Dr. Susan E. Mason, Niagara University
10:00-10:15 Statistically Motivated
Elizabeth Reid, Assistant Professor of Mathematics, Marist College
10:20-10:35 Statistical Literacy
Susan E. Mason, Professor of Psychology, Niagara University
10:40-10:55 The Statistics Behind Casinos and Risks to the United States and Canada §
Joseph Martino, Undergraduate Student, Niagara University
SESSION 3A Gleason 1139
Statistical Modeling of Language
Session Chair:
11:00-11:15 Large Language Models Applied to the Identification of Social Determinants of Health §
Raktim Mukhopadhyay, Graduate Student, University at Buffalo
11:20-11:35 Graph-based Approach to Studying the Spread of Radical Online Sentiment §
Le Nguyen, Graduate Student, Rochester Institute of Technology
SESSION 3B Gleason 1149
Statistics in Space
Session Chair:
11:00-11:15 Poisson Kernel-based Tests for Uniformity on the d-dimensional Torus and Sphere
Giovanni Saraceno, Professional, University at Buffalo
11:20-11:35 Spectral Classification Discrepancy in Young Stars §
Alex Jermyn, Undergraduate Student, Rochester Institute of Technology
SESSION 4A Gleason 1139
Intelligence: From humans to animals to concepts to silicon chips
Session Chair: Dr. Ernest Fokoué, Rochester Institute of Technology
2:30-2:45 On the Emerging Platonic View of Statistical Learning Theory
Ernest Fokoué, Professor, Rochester Institute of Technology
2:50-3:05 The Evolution of Animal (+ Human) Intelligence
Gregory Babbitt, Associate Professor, Rochester Institute of Technology
3:10-3:25 On the Ubiquity of the Bayesian Paradigm in Statistical Machine Learning and Data Science
Ernest Fokoué, Professor, Rochester Institute of Technology
SESSION 4B Gleason 1149
Solving Practical Problems
Session Chair:
2:30-2:45 Data-Driven Optimization of Austin Police Staffing §
Adam Giammarese, Graduate Student, Rochester Institute of Technology
2:50-3:05 Identifying Opportunities for University COVID-19 Model Improvements Using Bayesian Model Calibration §
Meghan Rowan Childs, Graduate Student, Rochester Institute of Technology
3:10-3:20 New York Has An Energy Problem
Alex Elchev, Undergraduate Student, University of Rochester
§ Indicates presentations that are eligible for the student presentation awards.
Attendees may supply their evaluation here:
(All attendees? Select attendees? QR code?)
PARALLEL TALK ABSTRACTS
Ordered by speaker last name
The Evolution of Animal (+ Human) Intelligence
Gregory Babbitt, Associate Professor, Rochester Institute of Technology
The recent rapid development of machine intelligence opens currently unaddressed problems regarding inherent risk and moral ethics. It might be helpful to look towards biology, not only for inspiration regarding algorithm development, but also for lessons regarding how and why intelligence evolves in the first place. As a wildlife ecologist and computational biologist, I will review the evolutionary history of the independent rise of intelligence in very different animal groups including cephalopods, spiders, social insects, reptiles, birds, and a variety of mammals (including cetaceans, primates and humans). All of these independent evolutionary events share a common
Impacts of Warming Thresholds on Uncertainty in Future Coastal Adaptation Costs
Selorm Dake, Undergraduate Student, Rochester Institute of Technology
Greenhouse gas emissions cause a rise in temperature, which increases sea level around the world. Since sea level rise (SLR) is linked directly to changes in temperature, we can assess the effect of temperature change on it by setting thresholds. Within our given data set we could see sea level rise accrue from year to year. We set thresholds by choosing a year and an amount of temperature change by which to sort the data. By analyzing the components of the SLR data for scenarios that stay below a specific threshold, we can estimate the costs of coastal damages associated with those levels of change. When we set the threshold to zero, we are able to understand the amount of sea level rise we are already committed to, as well as the damages we will incur regardless of the steps we take to mitigate sea level rise. As we extrapolate further into the future, the amount of uncertainty in the change of temperature increases, directly influencing the adaptation costs of areas affected by sea level rise. Here we show that as the threshold for temperature change increases, the range of possible adaptation costs increases. We used an ensemble of temperature change simulations as our data set for the project. We split our simulations into two subsets: those that stayed below 2 °C and those that stayed below 3 °C at the year 2100. We used the corresponding sea level data from these subsets to assess the costs of damages for New York City. Our results corroborate the idea that the more we fail to limit warming, the larger the marginal increases in damages and coastal adaptation costs. This research shows the true costs of failing to adhere to our target temperature warming limits.
Authors: Selorm Dake, Tony Wong and Kelly Feke
New York Has An Energy Problem
Alex Elchev, Undergraduate Student, University of Rochester
The New York state electricity generation and distribution grid functions as a complex on-demand delivery supply chain. Consumer electricity purchases totaled $22.8 billion for 141.4 million megawatt hours (MWh) of electricity in 2021. This unit refers to the amount of electricity consumed or demanded over t=1 hour. Reducing this year-long demand to hourly units gives an hourly demand value that must be met with sufficient electricity generation capacity.
In the same year, New York state electricity generation had a total nameplate capacity of 376.3 million MWh, meaning the system functioned with a capacity factor of 33.2%. This unitless ratio indicates how much potential production was lost at generation sites due to sub-optimal operation and downtime. Hourly capacity carries expected costs in the form of fuel, facility upkeep, and recurring expenses, varying based on the facility’s age and method of generation.
Increasing the network's cumulative capacity factor is a direct approach to increasing the reliability and sustainability of utility-scale electricity generation in a given state or region. Nuclear power has an absolute advantage over both fossil-fueled generators and renewable generation methods in terms of capacity factor, emissions, and ability to meet consumer demands.
Increasing New York's nuclear generation capacity by 19,300 MWh of hourly capacity by 2050 will: i) replace all existing fossil-fueled generators; ii) meet all consumer demands; iii) reduce utility-scale gaseous emissions almost entirely; and iv) optimize the state's electricity grid in accordance with the principles of supply chain management.
Evaluating the Impacts of Structural Uncertainty in Sea Level Rise Models on Coastal Adaptation
Kelly Feke, Undergraduate Student, Rochester Institute of Technology
Sea level rise (SLR) is a major consequence of climate change, posing significant threats to coastal communities and infrastructure worldwide. The area of impacted communities expands as the local sea level baseline increases. Adaptation actions must be taken to protect these communities from SLR. The costs of these actions depend on global and local mean sea level rise (GMSL and LMSL) calculated from sea-level models; however, the variety of models and their associated structural uncertainties yield different sea level rise predictions. This influences the actions that should be taken to avoid the negative impacts of local sea level rise. Although the uncertainties of SLR models have been studied, the variability of adaptation costs has not received the same attention, yet it is these cost predictions that dictate the actions taken to protect coastal areas. Here we characterize structural uncertainties within different SLR models and their impacts on future SLR predictions and predicted coastal damages. Using the integrated modeling framework Mimi and the MimiCIAM coastal impacts model, we compute the distributions of adaptation costs using 20 different models from the Coupled Model Intercomparison Project (CMIP6). These distributions characterize the uncertainties in future adaptation costs stemming from model structural uncertainty. Uncertainties can lead to poor estimates of risk and costs, which drives the need to quantify the uncertainties associated with SLR models and evaluate their impacts on adaptation decisions.
Authors: Kelly Feke, Tony Wong (RIT)
On the Ubiquity of the Bayesian Paradigm in Statistical Machine Learning and Data Science
Ernest Fokoue, Professor, Rochester Institute of Technology
This talk explores the myriad ways in which the Bayesian paradigm permeates the entire landscape of statistical machine learning and data science. Despite some of the major challenges underlying its practical use, the Bayesian paradigm has proven to be ubiquitous, appearing directly or indirectly in virtually every aspect of statistical machine learning, data science, and artificial intelligence. This presentation highlights some of the emerging ways in which the Bayesian paradigm is playing an impactful role in the data science revolution.
On the Emerging Platonic View of Statistical Learning Theory
Ernest Fokoue, Professor, Rochester Institute of Technology
Learning Using Statistical Invariants (LUSI) is a relatively recent incarnation in the world of statistical learning theory paradigms. In their effort to propose what they hope to be a complete statistical theory of learning, Vapnik and Izmailov (2019) develop the LUSI framework, partly using their earlier tool known as the V-matrix but crucially drawing heavily on Plato's philosophical teachings on ideas and things (forms) to extend classical statistical learning theory from its purely empirical nature (sometimes seen as brute-force learning) to a learning theory based on predicates that minimize the true error. This talk will review the merits and promises of LUSI and explore the ways in which Plato's philosophical teachings have the potential to help usher in a new era in statistical learning theory.
Data-Driven Optimization of Austin Police Staffing
Adam Giammarese, Graduate Student, Rochester Institute of Technology
Spectral Classification Discrepancy in Young Stars
Alex Jermyn, Undergraduate Student, Rochester Institute of Technology
Proper spectral classification allows for the determination of many stellar properties, including age and mass. Many young stars are located in regions of dense interstellar dust that necessitate observations in the near-infrared (NIR). This introduces an issue: several systems have shown a different spectral class in NIR than in optical observations. I am looking at a small group of nearby young stars to characterize the consistency and magnitude of this discrepancy. This presentation shows the initial data analysis and fitting attempts. Further data gathering and characterization methods are also discussed, along with possible reasons for the observed discrepancies.
Quantifying the Nexus of Climate, Economy, and Health: A State-of-the-Art Time Series Approach
Kameron Kinast, Graduate Student, Rochester Institute of Technology
Extreme weather events pose significant threats to human life, the economy, agriculture, and various other socio-economic aspects. This thesis presents a comprehensive analysis of the patterns of climate factors and their impact on the economy and human health using state-of-the-art and emerging statistical machine learning techniques. This research consists of two parts: exploring and comparing the effectiveness of statistical models with respect to climate time series forecasting and analyzing the effects on the economy and human health. The study employs a predominantly computational approach, leveraging R, Python, and Julia to demonstrate the role of statistical computing in understanding climate change and its impacts. This thesis aims to construct powerful statistical models that establish a functional relationship between climate measurements, economic indicators, and human health. Furthermore, we speculate on potential causal relationships within the data to contribute to a deeper understanding of the causes and consequences of extreme weather events. By providing insights into the complex interplay of climate factors, economy, and health, this research seeks to inform evidence-based policy decisions that help mitigate the adverse effects of extreme weather events and foster resilience in the face of dangerous climate change.
A Pattern Discovery Algorithm for Pharmacovigilance Signal Detection
Anran Liu, Graduate Student, University at Buffalo
Safety of medical products continues to be a major public health concern worldwide. Spontaneous Reporting Systems (SRS), such as the FDA Adverse Event Reporting System (FAERS), are critical tools in the post-marketing evaluation of medical product safety. A variety of approaches have been developed for identification of adverse events using data that reside in FAERS and other SRS databases.
In this talk, we propose a pattern discovery algorithm, named the Modified Detecting Deviating Cells (MDDC) algorithm, for the identification of adverse events when the database is represented as an I x J contingency table. The MDDC procedure is based on the standardized Pearson residuals of the pairs of potential adverse event-drug combinations, allowing the change of scale from categorical to interval/ratio. The method 1) is easy to compute; 2) considers the relationships between the different adverse events; and 3) depends on a data-driven cutoff. We study the performance of our method via simulation and through an application to a specific drug class dataset downloaded from FAERS.
Authors: Anran Liu, Marianthi Markatou
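The abstract above describes the procedure as built on standardized Pearson residuals of an I x J contingency table. The sketch below (illustrative only; it is not the authors' MDDC implementation and omits the data-driven cutoff) shows that building block on toy counts:

    # Standardized Pearson residuals for an adverse-event-by-drug count table.
    import numpy as np

    counts = np.array([[20, 5, 3],
                       [4, 30, 2],
                       [6, 7, 25]], dtype=float)   # toy report counts

    n = counts.sum()
    row_p = counts.sum(axis=1, keepdims=True) / n   # row marginal proportions
    col_p = counts.sum(axis=0, keepdims=True) / n   # column marginal proportions
    expected = n * row_p * col_p                    # expected counts under independence

    # Large residuals flag potential adverse event-drug signals.
    residuals = (counts - expected) / np.sqrt(expected * (1 - row_p) * (1 - col_p))
    print(np.round(residuals, 2))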
The Statistics Behind Casinos and Risks to the United States and Canada
Joseph Martino, Undergraduate Student, Niagara University
This presentation will go over the statistics that are used in casinos and compare betting patterns in the United States and Canada. Statistics are found everywhere within a casino, putting bettors at a mathematical disadvantage in the games they play. The house edge is the way casinos bias games in their favor, structuring the rules to give themselves a statistical advantage over the players and allowing them to be confident that they will win money in the long run. Naturally, casinos do not want a negative house edge, and only two games can produce one. The first is blackjack, but only if one uses a card-counting strategy and is covert enough not to be caught doing it. The second is select online poker games, although the limit on how much one can bet on them results in a very small player profit. The top five favorite games in Canada and the United States reveal how both countries have their strengths and weaknesses in being careful with their money, as well as how online betting can bring its own set of risks. The presentation will conclude with a discussion of strategies one can take to have an enjoyable yet safe experience on a trip to a casino.
Authors: Joseph Martino, Kylee Healy, and Dr. Susan Mason.
Statistical Literacy
Susan E. Mason, Professor, Niagara University
In introductory-level statistics courses we teach statistical knowledge and skills, but are students also acquiring the critical thinking skills necessary for statistical literacy? As expressed by Milo Schield (2005), who has written extensively on the topic, “Statistical literacy is for data consumers while statistical competence is for data producers.” Not all statistics students will go on to be data producers, but all will be data consumers. We consume statistical information daily. It is embedded in news reports, sports write-ups, advertisements and so on. If we are not prepared to critically evaluate the information we receive, we can be easily influenced and easily misled. Statistical illiteracy can have a detrimental effect on both the individual and society. Consider, for example, the impact of data misinformation and data misinterpretation on political behavior and on the pandemic response. In this presentation we will discuss the importance of statistical literacy, and we will review effective methods for teaching students to be critical consumers of statistical information.
Authors: Susan E. Mason, Kylee A. Healy, and Joseph D. Martino
Large Language Models Applied to the Identification of Social Determinants of Health
Raktim Mukhopadhyay, Graduate Student,
CDSE Program & Department of Biostatistics, University at Buffalo
Identifying social determinants of health and understanding their impact on the well-being of individuals is an important step in facilitating positive changes in health outcomes. Collection, integration, and effective use of clinical data for this purpose present a variety of challenges. In this talk, we will discuss work that uses large language models to aid in the identification of social determinants of health in people with opioid use disorder. We outline the challenges of collecting, integrating, and using social determinants of health data, as well as the challenges associated with the use of large language models. We also present our solutions along with the reproducible workflows that allow data integration from disparate clinical sources. If time permits, we will discuss the creation of the meta-form that is used to obtain the relevant data for statistical analysis.
This work is a collaboration with Dr. AH Talal, Dr. O Kennedy, Dr. A Dharia, and Mr. M Brachmann, Jacobs School of Medicine, CSE, and Breadcrumb Company.
Authors: Raktim Mukhopadhyay and Marianthi Markatou
Graph-based Approach to Studying the Spread of Radical Online Sentiment
Le Nguyen, Graduate Student, Rochester Institute of Technology
The spread of radicalization through the Internet is a growing problem. We are witnessing a rise in online hate groups, inspiring the impressionable and vulnerable population towards extreme actions in the real world. In this paper, we study the spread of hate sentiments in online forums by collecting 1,973 long comment threads (30+ comments) posted on dark-web forums and containing a combination of benign posts and radical comments on the Islamic religion. This framework allows us to leverage network analysis tools to investigate sentiment propagation through a social network. By combining sentiment analysis with Large Language Models, social network analysis, and graph theory, we aim to shed light on the propagation of hate speech in online forums and the extent to which such speech can influence individuals.
Sensitivity Analysis for Constructing Optimal Treatment Regimes in the Presence of Non-compliance and Two Active Treatment Options
Cuong Pham, Graduate Student, University of Rochester Medical Center
Existing literature on constructing optimal regimes often focuses on intention-to-treat analyses that completely ignore the compliance behavior of individuals. Instrumental variable-based methods have been developed for learning optimal regimes under endogeneity. However, when there are two active treatment arms, the average causal effects of treatments cannot be identified using a binary instrument, and thus the existing methods will not be applicable. To fill this gap, we provide a procedure that identifies an optimal regime and the corresponding value function as a function of a vector of sensitivity parameters. We also derive the canonical gradient of the target parameter and propose a multiply robust classification-based estimator of the optimal regime. Our simulations highlight the need for and usefulness of the proposed method in practice. We implement our method on the Adaptive Treatment for Alcohol and Cocaine Dependence randomized trial.
Assessing Sensitivity of Coastal Adaptation Costs to Sea Level Rise Across Different Future Scenarios with Random Forests.
Prasanna Ponfilio Rodrigues, Graduate Student, Rochester Institute of Technology
Sea level rise (SLR) is a crucial effect of climate change that is already affecting low-lying coastal areas throughout the world, causing economic, social, and environmental losses. As the global mean sea level is likely to rise by up to 1.6 meters by 2100, it is critical to examine the drivers of coastal damages and adaptation costs. This will provide policymakers with a quantified estimate of the benefits and costs of adaptation activities, allowing them to assess the viability and efficacy of various adaptation strategies. However, uncertainties in the physical processes of ice-sheet dynamics and ocean circulation, as well as different future pathways for socioeconomic development and greenhouse gas emissions, can greatly affect coastal adaptation costs. Here, we focus on four SSP (Shared Socioeconomic Pathways) - RCP (Representative Concentration Pathways) scenarios and examine the range of probable future coastal impacts. We use random forests to determine the most important model parameters that contribute to the adaptation costs under different SSP-RCP scenarios, and on different time-scales. These results highlight key processes and parameters for mitigating the dangerous and costly impacts of sea level rise under different socioeconomic and greenhouse gas scenarios.
Authors: Prasanna Ponfilio Rodrigues, Tony E. Wong and Carolina Estevez Loza
Statistically Motivated
Elizabeth Reid, Assistant Professor, Marist College
Introduction to Statistics is a required course for many college students. One of the hardest parts about teaching statistics is motivating students to care about the material and to think about the class as more than just another hurdle they need to clear. In this talk we will discuss ways to get students to take a greater interest in the course by connecting it to their major, future jobs, and situations that they encounter in everyday life. By doing so, students get the most out of the class and are able to answer for themselves the all-too-common question, "Why do I need to know this?"
Identifying Opportunities for University COVID-19 Model Improvements Using Bayesian Model Calibration
Meghan Rowan Childs, Graduate Student, Rochester Institute of Technology
From the beginning of the COVID-19 pandemic, universities have experienced unique challenges due to their multifaceted nature as a place of education, residence, and employment. Current research has used mathematical models to explore non-pharmaceutical approaches to combating COVID-19 and leveraged model parameters calibrated to local contexts, such as hospitals. However, key questions remain regarding the impacts of a model calibration that uses a university's complete semester of COVID-19 data on model performance and parameter inference. We use an adapted SEIR compartment model that represents a semi-enclosed campus population. We use surveillance testing data from Rochester Institute of Technology’s (RIT) Fall 2020 semester to leverage a formal Bayesian model calibration to quantify uncertainty in model parameters and identify and diagnose model shortcomings. Based on surveillance testing data we define a formal likelihood function and use Markov chain Monte Carlo for sampling. We use this model calibration to compare modeled positive tests and isolation population to the RIT data. From this comparison we diagnose the model's inaccuracy in representing false positive test results and the contribution of community transmission to campus infections. This diagnosis highlights the need for further model developments that better reflect false positive test results and the effects of outside community transmission. Our results demonstrate the discovery of model inaccuracies that will inform model developments to produce a more accurate model and improve inferred parameters.
Additional Authors: Dr. Tony E. Wong, Rochester Institute of Technology
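A toy illustration of the calibration approach described above (illustrative only; not the authors' RIT model, data, or likelihood): a simplified SEIR simulator, a Poisson likelihood, and a random-walk Metropolis sampler for two parameters.

    # Calibrate (beta, gamma) of a toy SEIR model to synthetic daily case counts.
    import numpy as np

    def seir(beta, gamma, sigma=1/3, N=10000, I0=10, days=100):
        """Discrete-time SEIR model; returns daily new infectious cases."""
        S, E, I, R = N - I0, 0.0, float(I0), 0.0
        new_cases = []
        for _ in range(days):
            new_exposed = beta * S * I / N
            new_infectious = sigma * E
            new_recovered = gamma * I
            S -= new_exposed
            E += new_exposed - new_infectious
            I += new_infectious - new_recovered
            R += new_recovered
            new_cases.append(new_infectious)
        return np.array(new_cases)

    def log_post(theta, observed):
        beta, gamma = theta
        if not (0 < beta < 2 and 0 < gamma < 1):       # flat priors with bounds
            return -np.inf
        lam = np.maximum(seir(beta, gamma), 1e-9)
        return np.sum(observed * np.log(lam) - lam)    # Poisson log-likelihood (up to a constant)

    rng = np.random.default_rng(0)
    observed = rng.poisson(seir(beta=0.5, gamma=0.2))  # synthetic "surveillance" data

    theta, samples = np.array([0.3, 0.1]), []
    for _ in range(5000):                              # random-walk Metropolis
        proposal = theta + rng.normal(scale=0.02, size=2)
        if np.log(rng.uniform()) < log_post(proposal, observed) - log_post(theta, observed):
            theta = proposal
        samples.append(theta)
    print("posterior mean (beta, gamma):", np.mean(samples[2500:], axis=0))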
Decreased respiratory-related absenteeism among pre-school students after installation of upper-room germicidal ultraviolet light: analysis of newly discovered historical data
Christopher W. Ryan, XX, SUNY Upstate Medical University and Broome County Health Department
The COVID-19 pandemic has brought renewed urgency to issues of air disinfection. Upper-room germicidal ultraviolet light (GUV) disinfects room air very efficiently. Its effect on practical outcomes in public settings is difficult to study and remains unclear, but history may provide some insights. I fit an interrupted time series model to a newly discovered dataset of attendance records from a preschool in the US mid-Atlantic region between 1941 and 1949, where GUV was installed in December 1945. GUV was associated with a sizable reduction in child absenteeism due to respiratory illnesses of any cause. Odds ratios for the effect ranged from 0.41 to 0.75, depending on season. In all but high summer, model-predicted absenteeism rates were reduced by a third to a half with GUV. In summary, installation of upper-room germicidal ultraviolet light in a preschool was followed by a significant and operationally meaningful reduction in absenteeism due to respiratory illness of any cause. Wider use of upper-room GUV systems in schools and preschools may be worthwhile to reduce absenteeism due to illness and the educational, social, and economic consequences that ensue.
Poisson kernel-based tests for uniformity on the d-dimensional torus and sphere
Giovanni Saraceno, XX, University at Buffalo
Spherical and toroidal data arise in various applications, such as neuroscience, computer vision, and natural language processing. Uniformity tests on spherical or toroidal data are potentially applicable in natural language processing, where they could be used to evaluate the quality of topic models by providing a quantitative measure of a model's performance through an assessment of the distribution of the generated topics.
We propose a new approach for testing uniformity of distribution for data vectors on the d-dimensional hypersphere. Our tests rely on U-statistic and V-statistic estimates of the kernel-based quadratic distance between the hypothesized uniform distribution on the sphere and the empirical cumulative distribution function. We introduce a class of diffusion kernels and focus on the Poisson kernel, which forms the basis of our proposed uniformity tests. We obtain the Karhunen-Loève decomposition of the kernel, connect it with its degrees of freedom, and hence determine the power of the test via a tuning parameter, the diffusion parameter. We present an algorithm to optimize the choice of the tuning parameter such that maximum power is achieved. We then study the performance of the proposed tests in terms of level and power, for a number of alternative distributions. Our simulations demonstrate the superior performance of our method compared to other test procedures, such as the Rayleigh, Giné, Ajne and Bingham test procedures, in the case of multimodal alternatives. We apply our tests to real-world data on the orbits of comets obtained from the NASA website.
A conditional approach for joint estimation of wind speed and direction
Qiuyi Wu, Graduate Student, University of Rochester
This study develops a statistical conditional approach to evaluate climate model performance in wind speed and direction and to project their future changes under the Representative Concentration Pathway (RCP) 8.5 scenario over inland and offshore locations across the continental United States (CONUS). The proposed conditional approach extends the scope of existing studies by a combined characterization of the wind direction distribution and conditional distribution of wind on the direction, hence enabling an assessment of the joint wind speed and direction distribution and their changes. A von Mises mixture distribution is used to model wind directions across models and climate conditions. Wind speed distributions conditioned on wind direction are estimated using two statistical methods, i.e., a Weibull distributional regression model and a quantile regression model, both of which enforce the circular constraint to their resultant estimated distributions. The proposed conditional approach enables a combined characterization of the wind speed distributions conditioned on direction and wind direction distributions, which offers a flexible alternative that can provide additional insights for the joint assessment of speed and direction.
Authors: Qiuyi Wu (speaker), Julie Bessac, Whitney Huang, Jiali Wang, and Rao Kotamarthi
POSTER ABSTRACTS
Ordered by presenter last name
How Do Changes in Bitcoin Prices Impact Inflation?
Nidhi Baindur and Sylvia Eisenberg, Undergraduate Students, Rochester Institute of Technology
Cryptocurrencies such as Bitcoin have gained significant attention in recent years, with their impact on the global economy being a topic of controversy and research. While some experts argue that Bitcoin can act as a viable safeguard against inflation, others argue that Bitcoin is a speculative investment and is not suitable as a transaction currency due to its limited transaction capacity. To address the debate on whether Bitcoin can be considered a hedge against inflation, we conducted a data-driven study using time series analysis and Granger causality tests.
Our analysis focused on examining the relationship between Bitcoin prices and forward inflation rates. We used daily time series created from the percent change in daily Bitcoin prices and the 5-year forward inflation expectation rates provided by the Federal Reserve Bank of St. Louis between 2016 and 2022. We found Granger causality between Bitcoin and forward inflation rates at the first lag order in 2016 and at the third and fifth lag orders in 2020. However, we found no Granger causality between the values in the rest of this time period. Our research could provide valuable insights for financial analysts, investors, traders, and the scientific community who are exploring the use of decentralized electronic forms of currency. Moving forward, we aim to continue our research by utilizing mathematical approaches to gain a deeper understanding of the viability of cryptocurrencies like Bitcoin as transactional currencies.
Authors: Dr. Mary Lynn Reed, Nidhi Baindur, Sylvia Eisenberg
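A minimal sketch of the kind of test used in the study above (illustrative only; toy data rather than the FRED and Bitcoin series):

    # Granger causality test: do lags of "bitcoin" help predict "inflation"?
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(1)
    n = 500
    btc = rng.normal(size=n)                                        # stand-in % changes
    infl = 0.3 * np.roll(btc, 1) + rng.normal(scale=0.5, size=n)    # partially lags the first series

    data = pd.DataFrame({"inflation": infl, "bitcoin": btc})
    # The second column is tested as a Granger cause of the first column.
    grangercausalitytests(data[["inflation", "bitcoin"]], maxlag=5)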
Enhancing Federated Learning Security with Reputation and Trust-Based Indicators
Sergei Chuprov, Graduate Student, Rochester Institute of Technology
In our work, we investigate training Data Quality (DQ) degradation in Federated Learning (FL) due to malicious attacks against the training data or FL clients, or due to technological factors such as software/hardware failures. We develop and propose a Reputation- and Trust-based technique that allows detecting local clients who produce anomalous local models, which might be the result of a malicious attack. Using unsupervised clustering of model parameters (K-means), we analyze the local models transferred to the aggregation server. We estimate the distance between the major cluster centers to detect anomalous models. Based on this distance, we calculate our Reputation indicator, which we update in each training iteration. We employ our Trust indicator to exclude untrustworthy local clients from the aggregation. We demonstrate how this helps to enhance the privacy and performance of the produced global model.
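A highly simplified sketch of the clustering idea described above (illustrative only; not the authors' system, and the reputation/trust updates are reduced here to a single distance threshold):

    # Flag clients whose model updates sit far from the majority cluster center.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    clients = np.vstack([rng.normal(0.0, 0.1, size=(9, 20)),    # benign updates
                         rng.normal(3.0, 0.1, size=(1, 20))])   # anomalous update

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(clients)
    major = np.argmax(np.bincount(kmeans.labels_))               # majority cluster = benign reference
    dist = np.linalg.norm(clients - kmeans.cluster_centers_[major], axis=1)

    threshold = dist.mean() + 2 * dist.std()                     # reputation penalty cutoff
    print("suspect clients:", np.where(dist > threshold)[0])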
Contextual Understanding of Cybersecurity Exploits/Vulnerability Descriptions through Self-Supervised Learning
Reza Fayyazi, Graduate Student, Rochester Institute of Technology
With the rise and development of the Internet, many systems around the world are susceptible to severe security threats. The volume, variety, and velocity of change in vulnerabilities and exploits have made incident and threat analyses challenging with human expertise and experience alone. Many Security Information and Event Management (SIEM) systems have been developed to produce and correlate intrusion logs and threat intelligence reports to assist security analysts. The descriptions in these logs and reports, however, can be cryptic and not easy to interpret. Therefore, this research aims to assess the complexity and evolution of cybersecurity vulnerabilities and the challenges associated with categorizing them. There is a need to process evolving vulnerabilities to gain an understanding of the tactics and techniques used by adversaries when they are targeting a system. In our preliminary results, we saw that supervised learning models do not reliably identify specific vulnerabilities and exploits. Therefore, to identify useful representations of new/unseen vulnerabilities effectively, we propose a continuous self-supervised learning technique that improves the ability to identify and mitigate cybersecurity threats and enhances overall security posture. We propose developing metrics to measure the degree of technicality of cybersecurity-specialized sentences. The metrics include: domain-specific word frequency, surprisal to capture rare words in the sentences, and the average distance between the predictions of a language model fine-tuned on cybersecurity data and a regular language model. Finally, we propose an ensemble evaluation with these metrics to rank the importance of cybersecurity sentences effectively.
Authors: Reza Fayyazi (rf1679@rit.edu), Dr. Shanchieh Jay Yang (jay.yang@rit.edu)
Deepfake Bias: Analysis and Balanced Dataset Generation
Bryce Gernon, Undergraduate Student, Rochester Institute of Technology
Deepfakes are spreading faster than ever, but the detectors made to counter them continue to suffer from systemic bias. The datasets they work with are unbalanced, the training methodologies they use are inherently biased, and the models are often structured without bias in mind.
We present a tool that automatically analyzes how balanced a dataset is, along with an analysis of many popular deepfake datasets using that tool. We also present experimental results in analyzing model bias and in the usage of multiple specialized models to explicitly account for the visual differences in the perceived race and gender of different faces so that bias can be tracked and prevented.
In addition, due to high levels of perceived bias in popular deepfake datasets, we will show an early version of a new deepfake dataset that is created specifically to minimize imbalance in perceived race and gender. This allows us to analyze the differences in model performance and bias when trained on balanced datasets vs unbalanced datasets.
Additional authors: Saniat Javid Sohrawardi, Matthew Wright
Predicate powered learning with DeepONet
Yang Liu, Graduate Student, Rochester Institute of Technology
The current trend of data-driven learning can be extremely data hungry. The new paradigm of learning aims to reduce the amount of data needed by providing useful predicates. Inspired by Vapnik's work on learning using statistical invariants, this work aims to provide a practical and effective framework for utilizing predicates in deep learning. Building upon DeepONet, we propose that by using a universal operator approximator, we can effectively incorporate predicates into the model. We also propose an alternative interpretation of predicates and their function in the learning problem.
Evaluating Wildfire Detection Sensors with Support Vector Machines
Megan Marra, Undergraduate Student, Rochester Institute of Technology
The support vector machine (SVM) is a supervised machine learning algorithm for classification and regression. My study will use the radial basis function kernel to transform a non-linearly separable dataset into one that is linearly separable. When applied to Earth coverage data (primarily gathered by the MODIS sensor), SVM can evaluate the sensor's accuracy in detecting active wildfires. With increasing progress in this field, researchers can track the conditions related to wildfire occurrence and notify local fire departments before the actual event of a wildfire, reducing damage to the environment and property as well as the number of fire-related casualties.
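A minimal sketch of the classifier described above (illustrative only; synthetic two-dimensional data rather than MODIS observations):

    # RBF-kernel SVM on a dataset that is not linearly separable in its raw features.
    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=500, factor=0.4, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The RBF kernel implicitly maps points to a space where a linear separator exists;
    # C and gamma control regularization and kernel width.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))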
Adapting Transformer Networks for Improved Website Fingerprinting Classification Performance
Nate Mathews, Graduate Student, Rochester Institute of Technology
Website fingerprinting (WF) is a privacy attack that aims to infer a user's browsing behavior by analyzing encrypted network traffic. It is commonly used by surveillance agencies to monitor user activities on the internet. Currently, the most popular state-of-the-art WF attacks use relatively unsophisticated Convolutional Neural Network (CNN) architectures to perform website classification.
In this ongoing project, our goal is to improve the classification performance of website fingerprinting by leveraging the strengths of Transformer networks, which have shown impressive performance on a wide range of natural language processing and image classification tasks. We adapt the state-of-the-art Transformer models used for vision classification to be applicable to our website traffic. Furthermore, we are examining the use of masked input modeling loss to allow for self-supervised training of a WF model on unlabeled traffic samples. Preliminary results have so far shown that these Transformer architectures can achieve competitive performance when compared to the prior CNN architectures.
Overall, our work aims to contribute to the field of website fingerprinting by proposing a novel approach based on Transformer networks that can potentially improve the accuracy and robustness of website fingerprinting classification.
An Automated Post-Mortem Analysis of Vulnerability Relationships using Natural Language Word Embeddings
Benjamin S. Meyers, Graduate Student, Rochester Institute of Technology
The daily activities of cybersecurity experts and software engineers--code reviews, issue tracking, vulnerability reporting--are constantly contributing to a massive wealth of security-specific natural language. In the case of vulnerabilities, understanding their causes, consequences, and mitigations is essential to learning from past mistakes and writing better, more secure code in the future. Many existing vulnerability assessment methodologies, like CVSS, rely on categorization and numerical metrics to glean insights into vulnerabilities, but these tools are unable to capture the subtle complexities and relationships between vulnerabilities because they do not examine the nuanced natural language artifacts left behind by developers. In this work, we want to discover unexpected relationships between vulnerabilities with the goal of improving upon current practices for post-mortem analysis of vulnerabilities. To that end, we trained word embedding models on two corpora of vulnerability descriptions from Common Vulnerabilities and Exposures (CVE) and the Vulnerability History Project (VHP), performed hierarchical agglomerative clustering on word embedding vectors representing the overall semantic meaning of vulnerability descriptions, and derived insights from vulnerability clusters based on their most common bigrams. We found that (1) vulnerabilities with similar consequences and based on similar weaknesses are often clustered together, (2) clustering word embeddings identified vulnerabilities that need more detailed descriptions, and (3) clusters rarely contained vulnerabilities from a single software project. Our methodology is automated and can be easily applied to other natural language corpora. We release all of the corpora, models, and code used in our work.
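A toy sketch of the embedding-and-clustering pipeline described above (illustrative only; a four-sentence corpus rather than the CVE/VHP corpora, and not the authors' trained models):

    # Embed short vulnerability descriptions and cluster them hierarchically.
    import numpy as np
    from gensim.models import Word2Vec
    from scipy.cluster.hierarchy import fcluster, linkage

    descriptions = [
        "buffer overflow in image parser allows remote code execution",
        "heap overflow in media parser allows remote code execution",
        "sql injection in login form exposes user credentials",
        "improper input validation in login form leaks credentials",
    ]
    tokenized = [d.split() for d in descriptions]

    w2v = Word2Vec(sentences=tokenized, vector_size=50, min_count=1, seed=0)
    doc_vectors = np.array([w2v.wv[toks].mean(axis=0) for toks in tokenized])

    Z = linkage(doc_vectors, method="ward")            # agglomerative (Ward) clustering
    print(fcluster(Z, t=2, criterion="maxclust"))      # cluster label per description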
Accurately Simulating a Medical Search Engine API with GPT
Pranav Nair, Graduate Student, Company: VisualDx | University: Rochester Institute of Technology
Large Language Models (LLMs) promise to replace hand-engineered pipelines and task-specific ML models with “prompt engineering.” Supposedly, a well-designed textual prompt, in concert with an LLM API, can lower the engineering cost of new Natural Language Processing (NLP) services. We investigate a specific use case: extracting ICD10 codes from medical search engine queries. ICD10 codes are a standard ontology for medical diagnoses (Lyme Disease has a code of A69.2). We prompt an LLM to behave like a JSON API that takes in a user query and returns the closest ICD10 code. This is a challenging task because it requires that the LLM not only conform to a pattern, but systematically return the correct answer (historically a challenge for language models). We input 394 real (anonymous) user queries from the VisualDx search engine into GPT 3.5 and GPT 4.0 to see how many of the outputs match our hand-checked ICD codes. We find GPT 3.5 and GPT 4.0 have an exact-match accuracy of 14.7% and 62.7%, respectively. However, a special feature of ICD10 codes is that “partial matches” are informative: If the initial characters of two different codes match, they are related in some way. If we check for exact matches of the portion of the code to the left of the decimal point, GPT 3.5 and GPT 4.0 achieve 47.5% and 84.5% accuracy, respectively. GPT 4.0 with a single prompt is accurate enough to plausibly replace a traditional NLP pipeline for this task.
Authors: Miguel Dominguez, Pranav Nair
Evaluating Deepfake Detector Robustness and Exploring Countermeasures Against Adversarial Deepfakes
Shaikh Akib Shahriyar, Graduate Student, Rochester Institute of Technology
Deepfake (DF) videos are getting better in quality and can be used for dangerous disinformation campaigns. To detect DFs, researchers have designed various models, and sequence-based models that utilize temporal information are more effective at detection than ones that only detect intra-frame discrepancies. Unfortunately, DF detection is a perfect target for adversarial examples that could fool models and undermine attempts to curb disinformation. Thus, improving the robustness of DF detection models is paramount. We explore whether we can generate adversarial examples that fool sequence-based DF detectors to better understand the threat they pose. Additionally, we explore the effects of different augmentation techniques and adversarial training on DF detectors in improving generalization capability and robustness, respectively.
Authors: Shaikh Akib Shahriyar (RIT), Dr. Matthew Wright (RIT)
Exploring User-friendly Explanations for Deepfake Detection
Kelly Wu, Graduate Student, Rochester Institute of Technology
There has been a growing concern about the use of manipulated media. Deepfakes, AI-generated media meant to fool human eyes, have become central to the discussion as the supporting technology has improved. To help people in the battle with deepfakes, many detection techniques have been developed with promising results in the laboratory. Due to the black-box nature of the detection models, however, users may have a hard time understanding the models’ decisions. To bridge the gap between them, we need to provide user-friendly model explanations for deepfake detection. In this work, we explore the performance of existing model explanation methods in explaining the classification results of a deep-learning-based deepfake detection model and identify some insights for future works in explainability for deepfake detection.
Authors: Kelly Wu (RIT), Matthew Wright (RIT), Andrea Hickerson (Ole Miss), Yu Kong (Michigan State)
Applying Linear Probability Models for Binary Outcomes to Epidemiology and the Public Health Sciences
Ann Yao Zhang, Graduate Student, University of Rochester
The convention when analyzing dichotomous outcome variables in the public health sciences is to use the logit, and less commonly, probit models. The linear probability model (LPM) is the application of ordinary least squares to binary dependent variables, and it is rarely used in epidemiology. This paper examines methodological criticisms of the LPM in epidemiological textbooks and foundational literature and addresses common areas of concern raised by epidemiologists, including nonlinearity between exposure and outcome, conditional heteroscedasticity, and predicted probabilities that are less than 0 or greater than 1. Advantages of the LPM are discussed in the context of applications to epidemiological research, including the direct interpretation of parameter estimates as mean marginal effects, computational efficiency, and the LPM as the first stage of instrumental variable analysis. Based on applications in econometrics and other fields using large data, a framework for ascertaining the appropriateness of using the LPM for modeling binary outcomes in observational studies in the public health sciences is proposed.
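A small simulated example of the comparison discussed above (illustrative only; simulated data, not an epidemiological study):

    # Linear probability model by OLS with robust (HC3) errors, next to a logit fit.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 2000
    exposure = rng.binomial(1, 0.4, size=n)
    age = rng.normal(50, 10, size=n)
    p = np.clip(0.15 + 0.10 * exposure + 0.002 * (age - 50), 0.01, 0.99)
    outcome = rng.binomial(1, p)                       # binary outcome

    X = sm.add_constant(np.column_stack([exposure, age]))

    lpm = sm.OLS(outcome, X).fit(cov_type="HC3")       # LPM: coefficient is a risk difference
    print("LPM exposure coefficient:", lpm.params[1])

    logit = sm.Logit(outcome, X).fit(disp=False)       # logit comparison: average marginal effects
    print(logit.get_margeff().summary())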