Virtual Workshop with SSS and ASA Professional Development


The Virtual Workshops on Usage of Blended Data and Combining Alternative Source Data: Methods, Challenges, and Considerations in the Era of Big Data were launched in 2019 by GSS and SSS as part of the ASA’s Professional Development (PD) Series.  These free workshops are targeted at audiences who may not be able to travel to conferences but are interested in continuing education opportunities.

2019/2020 Virtual Workshop Materials are now on GitHub!

In Fall 2020, GSS, SSS, and ASA PD will offer a Virtual Workshop Practicum.  Sessions focusing on user applications will highlight work from career professionals and seasoned practitioners in the area of blended data.  In addition, Practicum offerings will include a Student Showcase.  Students and professionals are invited to submit projects putting blended data techniques into practice for the opportunity to present in the upcoming Virtual Workshop Practicum by supplying a title and brief description (200-word abstract) via Dropbox.  The initial Practicum sessions will be September 24 from 1:00 p.m. - 3:00 p.m. and October 22 from 1:00 p.m. - 3:00 p.m.  Full details are available on the Virtual Workshop Practicum website.

Topics, Presenters, and Materials:

Overview by Frauke Kreuter, Director of Joint Program in Survey Methodology, University of Maryland
 (slides) (Video accessible via Dropbox)

  • Broad introduction to subject, showing how correctly blended data can lead to enhanced inference.
  • Emphasis on applications in surveys and censuses.
  • Discussion of challenges and some pitfalls/dangers of analyses on blended data, including privacy considerations.

How Rare is Rare? The Importance of Validation, presented by Professor Aric LaBarr, North Carolina State University's Institute for Advanced Analytics
 (slides) (R code) (Video accessible via Dropbox)

Abstract: In the growing world of data science and analytics, data is becoming more prevalent and used by organizations from every field. However, we must be careful that the pendulum doesn't swing too far to one side and we start torturing data into saying whatever we want. Much like in clinical trials where we have placebos, we need to properly validate our results and models to better understand whether we have a "placebo effect" in our model results and aren't just getting lucky. This talk highlights the technique of target shuffling, which tries to answer that exact question: what is the probability that my results occurred due to random chance? First used in hedge fund strategy validation, this simulation-based technique answers this question for any cross-sectional data problem in an easily interpretable way.
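The target-shuffling idea described above can be sketched in a few lines: score the model on the real data, then repeatedly permute the target to destroy any genuine relationship, re-score, and count how often the shuffled runs do as well as the real one. The following is a minimal illustration on synthetic data with a simple squared-correlation score, not the presenter's actual code:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic cross-sectional data with a genuine signal (illustrative only)
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

def score(x, y):
    """Model 'performance': squared correlation (R^2 of a simple linear fit)."""
    return np.corrcoef(x, y)[0, 1] ** 2

actual = score(x, y)

# Target shuffling: break the x-y link by permuting y, then re-score many times
shuffled_scores = np.array([score(x, rng.permutation(y)) for _ in range(1000)])

# Empirical p-value: chance of doing this well with no real relationship
p_value = (shuffled_scores >= actual).mean()
print(f"actual R^2 = {actual:.3f}, shuffled p-value = {p_value:.4f}")
```

The same loop works with any model and any performance metric; only the `score` function changes.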

Bio: Aric is passionate about helping people solve and analyze problems with their data. Aric is a faculty member at the Institute for Advanced Analytics, the nation's first master of science in analytics degree program. Aric helps innovate the current structure of education to better prepare a modern workforce ready to communicate and handle a data-driven future. He develops and teaches courses in statistics, mathematics, finance, risk management, and operations research. His focus is to teach, mentor, and serve students as well as businesses on modern analytic techniques for making data-driven decisions.

Aric's NC ASA webinar "What Data Science Means to an Executive":

Interpretability vs. Explainability in Machine Learning for High Stakes Decisions, presented by Professor Cynthia Rudin, Duke University 
(Video accessible via Dropbox)

Abstract: With widespread use of machine learning, there have been serious societal consequences from using black box models for high-stakes decisions, including flawed bail and parole decisions in criminal justice. Explanations for black box models are not reliable, and can be misleading. If we use interpretable machine learning models, they come with their own explanations, which are faithful to what the model actually computes. In this talk, I will discuss some of the reasons that black boxes with explanations can go wrong, whereas using inherently interpretable models would not have these same problems. I will give an example of where an explanation of a black box model went wrong: in particular, I will discuss ProPublica's analysis of the COMPAS model used in the criminal justice system. ProPublica’s explanation of the black box model COMPAS was flawed because it relied on wrong assumptions to identify the race variable as being important. Luckily, in recidivism prediction applications, black box models are not needed because inherently interpretable models exist that are just as accurate as COMPAS. I will give examples of such interpretable models in criminal justice and also in healthcare.

Bio: Cynthia Rudin is an Associate Professor of computer science, electrical and computer engineering, and statistics at Duke University, and directs the Prediction Analysis Lab.  
Previously, Prof. Rudin held positions at MIT, Columbia, and NYU. She is the recipient of the 2013 and 2016 INFORMS Innovative Applications in Analytics Awards and an NSF CAREER award, was named one of the "Top 40 Under 40" by Poets and Quants in 2015, and was named one of the 12 most impressive professors at MIT in 2015. Work from her lab has won 10 best paper awards.

Cynthia is past chair of the INFORMS Data Mining Section, and is currently chair of the Statistical Learning and Data Science section of the American Statistical Association. Her research focuses on machine learning tools that help humans make better decisions. This includes the design of algorithms for interpretable machine learning, interpretable policy design, variable importance measures, causal inference methods, new forms of decision theory, ranking methods that assist with prioritization, uncertainty quantification, and methods that can incorporate domain-based constraints and other types of domain knowledge into machine learning. These techniques are applied to critical societal problems in criminology, healthcare, and energy grid reliability.

Introduction to Big Data and Machine Learning for Survey Researchers, by Trent Buskirk, Novak Professor of Data Science, Bowling Green State University

  • Overview of Big Data terminology and concepts
  • Introduction to common data generating processes
  • Primary issues in linking Big Data to survey data
  • Pitfalls in inference with Big Data
  • Detailed applications/examples using R and Python
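One of the primary issues listed above, linking Big Data to survey data, typically begins with a record-level join on a shared identifier, where unmatched records must be tracked explicitly. A minimal, hypothetical sketch in Python with pandas (the datasets, identifiers, and field names are invented for illustration and are not from the workshop materials):

```python
import pandas as pd

# Hypothetical survey records and an auxiliary "big data" source
survey = pd.DataFrame({"id": [1, 2, 3], "income": [40, 55, 61]})
bigdata = pd.DataFrame({"id": [2, 3, 4], "web_visits": [120, 45, 300]})

# Link on the shared identifier; indicator=True flags unmatched records,
# which matter for assessing coverage of the auxiliary source
linked = survey.merge(bigdata, on="id", how="left", indicator=True)
print(linked)
```

The `_merge` column produced by `indicator=True` makes linkage coverage easy to audit before any inference is attempted on the blended file.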

Introduction to Python for Data Science, by Hunter Glanz, Assistant Professor, Department of Statistics, California Polytechnic State University

  • Introduction to Python for data manipulation in preparation for machine learning.
  • Examples using open source government data.
  • Emphasis on data wrangling and exploratory data analysis including visualizations.
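To give a flavor of the wrangling steps the session emphasizes, here is a small sketch on a hypothetical table standing in for an open government dataset (the columns and values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for an open government dataset
df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX", "NY"],
    "year":  [2018, 2019, 2018, 2019, 2019],
    "value": [10.0, 12.0, None, 9.0, 7.5],
})

# Typical wrangling steps: drop missing values, then aggregate by group
clean = df.dropna(subset=["value"])
by_state = clean.groupby("state", as_index=False)["value"].mean()
print(by_state)
```

From a tidy summary like `by_state`, exploratory visualizations (e.g., bar charts of the group means) follow directly.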
Bio: Hunter Glanz is an Assistant Professor of Statistics and Data Science at California Polytechnic State University (Cal Poly, San Luis Obispo). He received a BS in Mathematics and a BS in Statistics from Cal Poly, San Luis Obispo followed by an MA and PhD in Statistics from Boston University. He maintains a passion for machine learning and statistical computing, and enjoys advancing education efforts in these areas. In particular, Cal Poly’s courses in R, SAS, and Python give him the opportunity to connect students with exciting data science topics amidst a firm grounding in communication of statistical ideas. Hunter serves on numerous committees and organizations dedicated to delivering cutting edge statistical and data science content to students and professionals alike. In particular, the ASA’s DataFest event at UCLA has been an extremely rewarding experience for the teams of Cal Poly students Hunter has had the pleasure of advising. 

Differential Privacy, Presented by Matthew Graham, Center for Economic Studies, U.S. Census Bureau
(materials) (Jupyter notebook)

  • Introduction to differential privacy methods, including key vocabulary (what is the privacy budget?)
  • Examples from censuses
  • Privacy considerations when using blended data
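At the core of many differential privacy methods is the addition of calibrated random noise to a released statistic, with the privacy budget epsilon controlling the noise scale: smaller epsilon means more noise and stronger privacy. A minimal sketch of the Laplace mechanism for a count query (a count has sensitivity 1); this is a textbook illustration, not the Census Bureau's production mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(true_count, epsilon, rng):
    """Release a count with Laplace noise; the sensitivity of a count is 1,
    so the noise scale is sensitivity / epsilon = 1 / epsilon."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller privacy budget epsilon -> larger noise -> stronger privacy
true_count = 1000
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_count(true_count, eps, rng)
    print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

Running the loop shows the trade-off directly: at epsilon = 0.1 the released count can be off by tens, while at epsilon = 10 it is typically within a fraction of a unit.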
Bio: Matthew Graham is the Lead of the Development and Applications Innovation Group within the LEHD Program in the Center for Economic Studies at the U.S. Census Bureau. Over the last decade he has led teams to develop and implement new confidentiality protection systems, new public-use datasets such as the LEHD Origin-Destination Employment Statistics (LODES) that use those privacy mechanisms, and web-based data dissemination/exploration tools such as OnTheMap, which won the U.S. Department of Commerce’s Gold Medal for Scientific/Engineering Achievement in 2010. Matthew has an MA in Urban Planning from UCLA as well as an MS in Mechanical Engineering and a BS in Physics from MIT.