CoDA Student Poster Prizes

The SDNS Section of the ASA sponsors student prizes at the Conference on Data Analysis (CoDA). This year, 36 student posters were evaluated. Judges awarded two honorable mentions, a second prize of $100, and a first prize of $400. See below for pictures of the winners and more details about their posters.

First Prize: John Dagdelen (Lawrence Berkeley National Laboratory)

Natural Language Processing for Materials Discovery and Design

The majority of all materials data is currently scattered across the text, tables, and figures of millions of scientific publications. This poster presents the work of our team at Lawrence Berkeley National Laboratory on the use of natural language processing (NLP) to extract and discover materials knowledge through textual analysis of the abstracts of several million journal articles. With this data we are exploring new avenues for materials discovery and design such as how functional materials like thermoelectrics can be identified by using only unsupervised word embeddings for materials. To date, we have used advanced techniques for named entity recognition to extract more than 100 million mentions of materials, structures, properties, applications, synthesis methods, and characterization techniques from our database of over 3 million materials science abstracts. This poster will also present some of the details on how we are making all of this data freely available to the materials research community through our public-facing website (matscholar.com) and our open-access API.
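As a rough illustration of how word embeddings could be used to shortlist candidate materials, the sketch below ranks material tokens by cosine similarity to an application word such as "thermoelectric". It is not the team's model: the three-dimensional vectors are made-up toy values, and the names `cosine_similarity`, `rank_by_similarity`, and `material_A` through `material_C` are hypothetical placeholders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumption): real embeddings would be learned from text
# and have hundreds of dimensions.
toy_query = [1.0, 0.0, 1.0]  # stands in for the vector of "thermoelectric"
toy_materials = {
    "material_A": [0.9, 0.1, 0.8],
    "material_B": [0.0, 1.0, 0.0],
    "material_C": [0.5, 0.5, 0.5],
}

ranking = rank_by_similarity(toy_query, toy_materials)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice the vectors would come from an embedding model trained on the abstract corpus, and the top-ranked tokens would only be candidates for further study.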

Second Prize: Kaitlyn Martinez (Colorado School of Mines)

Large Environmental and Demographic Data Sets in Models for Mosquito Borne Disease Risk in Brazil

The spread of mosquito-borne diseases is complex, and direct measurement of the fundamental mechanisms of spread is an onerous task. Therefore, we must turn to proxy data to gain insights into these systems. Fortunately, fewer limitations in data collection and storage have resulted in an abundance of rich, diverse, and dynamic data sources that measure various aspects of mosquito-borne disease spread. Weather station measurements and remote sensing of vegetation health can elucidate the status of the mosquito habitat, while demographic indicators can provide information about the impact of the man-made infrastructure on mosquito spread. Despite the abundance of proxy data, there are many challenges due to the heterogeneity in these data streams. For example, it is difficult to determine which variables, or collections of variables, are useful given a large number of observations and potential redundancy. In order to address these challenges, we developed an iterative dimension reduction method using hierarchical clustering to decrease the number of variables, while maintaining the intrinsic information of the full data set. Our method is applied to dengue transmission in Brazil and Ecuador and is tailored to maintain biological interpretation of variable sets driving mosquito-borne diseases such as demographic and environmental data. We find that socioeconomic factors, temperature, and levels of healthy vegetation are highly predictive of dengue incidence. Our results are consistent with previous studies that have shown that each of these factors is impacted by climate change in ways that will further increase mosquito-borne disease incidence around the world. Our study results can inform short-term prevention strategies as well as long-term public health campaigns focused on reducing the overall burden of mosquito-borne diseases.
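To make the idea of reducing variables with hierarchical clustering concrete, here is a minimal sketch under stated assumptions. It is a generic technique, not the authors' iterative method: variables are grouped by average-linkage agglomerative clustering with distance 1 - |Pearson r|, and one representative is kept per group. The function names, the `max_distance` threshold, and the toy variables are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_variables(data, max_distance=0.3):
    """Average-linkage clustering of variables; distance = 1 - |r|.

    data maps a variable name to its list of observations. Merging stops
    once the closest pair of clusters is farther apart than max_distance.
    """
    names = list(data)
    dist = {(a, b): 1.0 - abs(pearson(data[a], data[b]))
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = min(pairs)
        if best > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def pick_representatives(clusters):
    """Keep the first variable of each cluster as its representative."""
    return [cluster[0] for cluster in clusters]


# Toy data (assumption): two redundant temperature series and one
# unrelated vegetation index.
toy = {
    "temp_mean": [20, 22, 24, 26, 28],
    "temp_max": [25, 27, 29, 31, 33],
    "vegetation": [0.5, 0.2, 0.6, 0.1, 0.4],
}
clusters = cluster_variables(toy)
print(clusters)
print(pick_representatives(clusters))
```

Keeping a named variable from each group, rather than a synthetic component, is one way to preserve interpretability of the reduced set.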

Joint work with Sara Del Valle, Geoffrey Fairchild, Amanda Ziemann, Nidhi Parikh, and Carrie Manore (Los Alamos National Laboratory) and Amir Said Siraj (University of Notre Dame).

Honorable Mention: J. Jake Nichol (Sandia National Laboratories)
Using Machine Learning to Compare Simulated and Observed Arctic Climate Data
The extent of sea ice in the Arctic has been declining for decades. Coupled physics-based models (CPMs) forecast less Arctic
decadal sea ice extent (km²/decade) than observed in data, leading to conservative predictions. It is important to identify why
CPMs are too conservative so that we can revise the CPMs and effectively plan for our future climate. To do this, we train
machine learning models on observed data and simulated data separately, then compare the feature importances between the
different models. The approach identifies areas in which the CPMs place different emphasis on features or variables from the
observational model, which may indicate candidate areas for revision. Observed data comes from satellite sea ice concentration
data, along with atmosphere and ocean reanalysis products, in contrast to the simulated data that is generated from the DOE’s
Energy Exascale Earth System Model (E3SM). For both data sets, machine learning models are fit to predict the minimum sea
ice extent in September. Input features from prior months are air temperatures, solar radiation, sea surface temperature, surface
pressure, and wind speeds. The major contribution of this work is to show that feature importances in the E3SM model are
inconsistent between runs and inconsistent with the observed data.
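One simple way to quantify whether two models rank their input features consistently is a Spearman rank correlation between their importance vectors. The sketch below assumes that approach; the poster does not say how the comparison was made. The importance values are placeholder numbers chosen for illustration, not results from the poster, and the function names are hypothetical.

```python
def ranks(values):
    """Rank values so that rank 1 is the largest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two tie-free importance vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


features = ["air_temperature", "solar_radiation", "sea_surface_temperature",
            "surface_pressure", "wind_speed"]

# Placeholder importances (assumption): invented to show the two extremes.
importance_model_1 = [0.40, 0.25, 0.20, 0.10, 0.05]
importance_model_2 = [0.05, 0.10, 0.20, 0.25, 0.40]

same = spearman(importance_model_1, importance_model_1)
opposite = spearman(importance_model_1, importance_model_2)
print(same, opposite)
```

A value near 1 means the two models order the features the same way, while values near 0 or below indicate the kind of disagreement that might flag an area for closer inspection.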

Joint work with Matthew Peterson (Sandia National Laboratories), Kara Peterson (Sandia National Laboratories), David Stracuzzi
(Sandia National Laboratories).

Honorable Mention: Ravi Brannon Ponmalai (UC Irvine)
Self-Organizing Maps and Their Applications to Data Analysis
Self-Organizing Maps (SOMs) are a form of unsupervised neural network used for visualization and exploratory data analysis
of high-dimensional datasets. Our goal was to understand how we can use a SOM to gain insights about datasets. We do this by first
understanding the initialization, training, error metrics, and convergence properties of the SOM. Next, we discuss the ways to interpret
and visualize a Self-Organizing Map. Finally, we used real datasets to understand what the Self-Organizing Map can tell us about
labeled and unlabeled data. Based on experiments with our datasets, we found that the Self-Organizing Map can tell us about the
spacing and position of high-dimensional clusters, help us find non-linear patterns, and give us insight into the shape of our data.
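The initialization, training, and error-metric steps mentioned above can be sketched with a very small one-dimensional SOM. This is a minimal illustration under stated assumptions, not the authors' implementation: random initialization in the unit cube, linearly decaying learning rate and Gaussian neighborhood width, and the mean quantization error as the metric. All names, hyperparameters, and the toy data are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))


def train_som(data, n_nodes=4, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a 1-D SOM (a chain of nodes) on data scaled to [0, 1]."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialization: random weights in the unit cube.
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.1)  # shrinking neighborhood
        for x in rng.sample(data, len(data)):    # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighborhood measured along the chain of nodes.
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


def quantization_error(weights, data):
    """Mean distance from each sample to its best matching unit."""
    total = sum(math.dist(weights[best_matching_unit(weights, x)], x)
                for x in data)
    return total / len(data)


# Toy data (assumption): two small groups of 2-D points.
toy_data = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1),
            (0.8, 0.9), (0.9, 0.85), (0.85, 0.8)]
som_weights = train_som(toy_data)
print(som_weights)
print(quantization_error(som_weights, toy_data))
```

Inspecting which samples share a best matching unit, and how far apart neighboring node weights end up, is one way to read cluster position and spacing off the trained map.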

Joint work with Chandrika Kamath, Lawrence Livermore National Laboratory. LLNL-ABS-800086. This work was performed under the
auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344.
Lawrence Livermore National Security, LLC.