The SDNS Section of the ASA sponsors student prizes at the Conference on Data Analysis (CoDA). This year, 36 student posters were evaluated. Judges awarded two honorable mentions as well as a second prize of $100 and a first prize of $400. See below for pictures of the winners and more details about their posters.
First Prize: John Dagdelen (Lawrence Berkeley National Laboratory)
Natural Language Processing for Materials Discovery and Design
The majority of all materials data is currently scattered across the text, tables, and figures of millions of scientific publications.
This poster presents the work of our team at Lawrence Berkeley National Laboratory on the use of natural language processing
(NLP) to extract and discover materials knowledge through textual analysis of the abstracts of several million journal articles.
With this data we are exploring new avenues for materials discovery and design such as how functional materials like
thermoelectrics can be identified by using only unsupervised word embeddings for materials. To date, we have used advanced
techniques for named entity recognition to extract more than 100 million mentions of materials, structures, properties,
applications, synthesis methods, and characterization techniques from our database of over 3 million materials science
abstracts. This poster will also present some of the details on how we are making all of this data freely available to the materials
research community through our public-facing website (matscholar.com) and our open-access API.
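As a rough illustration of the embedding-based idea mentioned in this abstract, the sketch below ranks candidate materials by the cosine similarity between their word vectors and the vector for the word "thermoelectric". The three-dimensional vectors, the placeholder material names, and the function names are all invented for this example; real embeddings would be learned from the abstract corpus and have far more dimensions. It is not the team's actual pipeline.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hand-written toy vectors (assumption): stand-ins for learned word embeddings.
thermoelectric_vec = [1.0, 0.0, 0.0]
material_vecs = {
    "material_a": [0.9, 0.1, 0.0],  # points almost the same way as the query
    "material_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "material_c": [0.5, 0.5, 0.0],  # halfway between
}

ranking = rank_by_similarity(thermoelectric_vec, material_vecs)
print(ranking)
```

In this toy setup, materials whose vectors point in nearly the same direction as the query word rank first, which is the sense in which an unsupervised embedding can suggest candidates without any labeled property data.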
Second Prize: Kaitlyn Martinez (Colorado School of Mines)
Large Environmental and Demographic Data Sets in Models for Mosquito Borne Disease Risk in Brazil
The spread of mosquito-borne diseases is complex, and direct measurement of the fundamental mechanisms of spread is an onerous task.
Therefore, we must turn to proxy data to gain insights into these systems. Fortunately, fewer limitations in data collection and storage have
resulted in an abundance of rich, diverse, and dynamic data sources that measure various aspects of mosquito-borne disease spread.
Weather station measurements and remote sensing of vegetation health can elucidate the status of the mosquito habitat, while demographic
indicators can provide information about the impact of the man-made infrastructure on mosquito spread. Despite the abundance of proxy
data, there are many challenges due to the heterogeneity in these data streams. For example, it is difficult to determine which variables, or
collection of variables, are useful given a large number of observations and potential redundancy. In order to address these challenges, we
developed an iterative dimension reduction method using hierarchical clustering to decrease the number of variables, while maintaining the
intrinsic information of the full data set. Our method is applied to dengue transmission in Brazil and Ecuador and is tailored to maintain
the biological interpretability of the variable sets, such as demographic and environmental data, that drive mosquito-borne diseases. We find that
socioeconomic factors, temperature, and levels of healthy vegetation are highly predictive of dengue incidence. Our results are consistent
with previous studies that have shown that each of these factors is impacted by climate change in ways that will further increase mosquito-borne
disease incidence around the world. Our study results can inform short-term prevention strategies as well as long-term public health
campaigns focused on reducing the overall burden of mosquito-borne diseases.
Joint work with Sara Del Valle, Geoffrey Fairchild, Amanda Ziemann, Nidhi Parikh, and Carrie Manore (Los Alamos National Laboratory) and Amir Said Siraj (University of Notre Dame).
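To make the idea of clustering-based variable reduction concrete, here is a minimal sketch of one generic approach: group variables by average-linkage agglomerative clustering on the distance 1 - |Pearson correlation|, then keep one representative per group. The variable names, the toy values, the 0.3 distance cutoff, and the "keep the first member" rule are assumptions made for illustration; the poster's iterative method may differ in every one of these choices.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_variables(columns, max_distance):
    """Average-linkage agglomerative clustering of variables.

    The distance between two variables is 1 - |Pearson correlation|.
    Merging stops once the closest pair of clusters is farther apart
    than max_distance.
    """
    names = list(columns)
    dist = {frozenset((a, b)): 1.0 - abs(pearson(columns[a], columns[b]))
            for a, b in combinations(names, 2)}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > max_distance:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Toy data (assumption): two nearly redundant temperature variables and
# one vegetation variable that is only weakly related to them.
columns = {
    "temp_mean": [20, 22, 25, 27, 30],
    "temp_max": [25, 27, 31, 32, 36],
    "vegetation_index": [0.5, 0.2, 0.6, 0.2, 0.5],
}

clusters = cluster_variables(columns, max_distance=0.3)
# Simplest possible rule: keep the first variable of each cluster.
representatives = [cluster[0] for cluster in clusters]
print(clusters, representatives)
```

Here the two temperature variables end up in one cluster and the vegetation variable stays on its own, so three variables are reduced to two while each retained variable still has a direct physical meaning.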
Honorable Mention: J. Jake Nichol (Sandia National Laboratories)
Using Machine Learning to Compare Simulated and Observed Arctic Climate Data
The extent of sea ice in the Arctic has been declining for decades. Coupled physics-based models (CPMs) forecast less Arctic
decadal sea ice extent (km²/decade) than observed in data, leading to conservative predictions. It is important to identify why
CPMs are too conservative so that we can revise the CPMs and effectively plan for our future climate. To do this, we train
machine learning models on observed data and simulated data separately, then compare the feature importances between the
different models. The approach identifies areas in which the CPMs place different emphasis on features or variables from the
observational model, which may indicate candidate areas for revision. Observed data comes from satellite sea ice concentration
data, along with atmosphere and ocean reanalysis products, in contrast to the simulated data that is generated from the DOE’s
Energy Exascale Earth System Model (E3SM). For both data sets, machine learning models are fit to predict the minimum sea
ice extent in September. Input features from prior months are air temperatures, solar radiation, sea surface temperature, surface
pressure, and wind speeds. The major contribution of this work is to show that feature importances in the E3SM model are
inconsistent between runs and inconsistent with the observed data.
Joint work with Matthew Peterson (Sandia National Laboratories), Kara Peterson (Sandia National Laboratories), and David Stracuzzi (Sandia National Laboratories).
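The comparison strategy in this abstract (fit one model to observed data, another to simulated data, then compare feature importances) can be sketched as follows. Everything specific here is an assumption made for illustration: the data are synthetic, the model is a plain least-squares linear fit, importance is taken to be the normalized absolute coefficient, and the two toy data sets are deliberately built to disagree. The sketch says nothing about E3SM or real observations, and the authors' models and importance measure may be quite different.

```python
import random

def fit_linear(X, y):
    """Least-squares coefficients (no intercept) via the normal equations."""
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def normalized_importances(X, y, names):
    """Absolute coefficients rescaled to sum to 1 (a simple importance proxy)."""
    coefs = fit_linear(X, y)
    total = sum(abs(c) for c in coefs)
    return {name: abs(c) / total for name, c in zip(names, coefs)}

def make_toy_data(weights, n, seed):
    """Synthetic standard-normal features and a noisy linear target."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in weights] for _ in range(n)]
    y = [sum(w * x for w, x in zip(weights, row)) + rng.gauss(0, 0.1)
         for row in X]
    return X, y

features = ["air_temperature", "wind_speed"]
# Hypothetical weights chosen so that the two data sets disagree.
X_obs, y_obs = make_toy_data([3.0, 1.0], n=200, seed=0)
X_sim, y_sim = make_toy_data([1.0, 3.0], n=200, seed=1)

imp_obs = normalized_importances(X_obs, y_obs, features)
imp_sim = normalized_importances(X_sim, y_sim, features)
# Positive gap: the "simulated" model leans on the feature more than the "observed" one.
gaps = {name: imp_sim[name] - imp_obs[name] for name in features}
print(imp_obs, imp_sim, gaps)
```

Features with large gaps are the ones where the two fitted models place different emphasis, which is the kind of signal the abstract describes as pointing to candidate areas for revision.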