
ABOUT THE ANNUAL DATA CHALLENGE EXPO

The Annual Data Challenge Expo is jointly sponsored by three American Statistical Association (ASA) Sections – Statistical Computing, Statistical Graphics, and Government Statistics. The 2025 Data Challenge Expo will be held in conjunction with JSM 2025 in Nashville, Tennessee from August 2 - 7, 2025.

PARTICIPATION

The challenge is open to students and professionals from the private or public sector. Using statistical and visualization tools and methods, contestants will analyze the given data set(s). 

 

AWARD CATEGORIES

There will be two award categories:

  • Professional (one level with a $500 award) 
  • Student (three levels with awards at $1,500, $1,000, and $500)


To enter, contestants must do the following by February 3, 2025:

  • Submit an abstract for a contributed Speed Poster session to the JSM 2025 website. Specify the Statistical Computing Section as the primary sponsor.
  • Note: The period for submitting contributed abstracts is December 2, 2024 to February 3, 2025.
  • Forward the JSM abstract submission email with abstract number, title, and authors to Wendy Martinez (wendy.l.martinez@census.gov).

The abstract is a placeholder to ensure the contestant is included in the JSM 2025 program. Contestants will present their work in a speed poster session and judging will be based on the results of the analysis presented at the JSM in August 2025. 

Presenters are responsible for their own JSM registration and travel costs, and any other costs associated with JSM attendance. Group submissions are acceptable. Following JSM, contestants may submit a paper describing their analysis and results to Chance Magazine.



The sky’s the limit!

Participants in the 2025 Data Expo Challenge will develop a research question to explore, analyze, and visualize flight arrival and departure data for all commercial flights across the USA. Uncover new insights in the skies as we revisit the 2009 Data Expo, a past favorite. With 16 more years of data, this rich dataset offers endless possibilities for creative exploration. This is your opportunity to uncover insights and tell a compelling story through data visualization and innovative analysis. Participants must use the flight data and at least one additional dataset.

 

You can download the files one at a time by clicking many times on https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr,

or,

thanks to our friends at DuckDB, you can download all 11 GB of the data as parquet files with:

R Code

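base_url <- "https://blobs.duckdb.org/flight-data-partitioned/"
# One parquet file per year, stored under Year=YYYY/
files <- paste0("Year=", 1987:2024, "/data_0.parquet")
# Create the local Year=YYYY directories
for (dir in dirname(files)) dir.create(dir, showWarnings = FALSE)
# Download all files concurrently; resume = TRUE continues interrupted downloads
out <- curl::multi_download(paste0(base_url, files), files, resume = TRUE)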

Python Code

import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

base_url = "https://blobs.duckdb.org/flight-data-partitioned/"
files = [f"Year={year}/data_0.parquet" for year in range(1987, 2025)]

def download_file(f):
    # Create the local Year=YYYY directory if it does not exist yet.
    os.makedirs(os.path.dirname(f), exist_ok=True)
    req = urllib.request.Request(base_url + f, headers={'User-Agent': 'Mozilla/5.0'})
    # Stream the response to disk instead of holding the whole file in memory.
    with urllib.request.urlopen(req) as response, open(f, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

with ThreadPoolExecutor() as executor:
    # Consume the results so that a failed download raises instead of passing silently.
    list(executor.map(download_file, files))
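A download of this size can be interrupted, so it may help to confirm that every yearly file arrived. The sketch below is only an illustration, not part of the official instructions: it assumes the Year=YYYY/data_0.parquet layout used in the download snippets, and the helper names `expected_files` and `missing_or_empty` are hypothetical.

```python
from pathlib import Path

# Year range taken from the download snippets (1987 through 2024 inclusive).
FIRST_YEAR, LAST_YEAR = 1987, 2024


def expected_files(root="."):
    """Return the local path expected for each yearly parquet file."""
    return [
        Path(root) / f"Year={year}" / "data_0.parquet"
        for year in range(FIRST_YEAR, LAST_YEAR + 1)
    ]


def missing_or_empty(root="."):
    """List the expected files that are absent or have zero bytes."""
    return [
        path
        for path in expected_files(root)
        if not path.is_file() or path.stat().st_size == 0
    ]
```

Calling `missing_or_empty()` from the download directory returns an empty list when all files are present and non-empty. It does not verify that a file is a complete, valid parquet file.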

Once downloaded, you can use DuckDB, arrow, polars, duckplyr, or any other tool you choose to efficiently work with the ~210 million rows.


You can consult the data dictionary to learn the definition of each field.
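As one possible starting point, the sketch below builds a DuckDB query that averages a delay column per group across all of the yearly files. It rests on several assumptions: the column names `DepDelay` and `DayOfWeek` are guesses that should be checked against the data dictionary, the helper names are hypothetical, and the files are assumed to sit in the Year=YYYY/data_0.parquet layout produced by the download step. Running the query requires the third-party `duckdb` package.

```python
def build_delay_query(root=".", delay_column="DepDelay", group_column="DayOfWeek"):
    """Build SQL that averages a delay column per group over all yearly files.

    The default column names are assumptions; verify them in the data dictionary.
    """
    # The glob matches the Year=YYYY/data_0.parquet layout from the download step.
    pattern = f"{root}/Year=*/data_0.parquet"
    return (
        f'SELECT "{group_column}" AS grp, '
        f'AVG("{delay_column}") AS mean_delay, '
        f"COUNT(*) AS n_flights "
        f"FROM read_parquet('{pattern}') "
        f"GROUP BY 1 ORDER BY 1"
    )


def run_query(sql):
    """Run the SQL with DuckDB and return the rows as a list of tuples."""
    import duckdb  # third-party; imported lazily so the builder works without it

    return duckdb.sql(sql).fetchall()
```

For example, `run_query(build_delay_query())` would return one row per group with its mean delay and flight count, provided the assumed columns exist. DuckDB scans the parquet files directly, so the full dataset does not need to fit in memory.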

To get started here are some of the potential questions from the 2009 challenge:

  • When is the best time of day/day of week/time of year to fly to minimize delays?
  • Do older planes suffer more delays?
  • How does the number of people flying between different locations change over time?
  • How well does weather predict plane delays?
  • Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system? If so, what are they and how do we find them?

As well as a few new ones we brainstormed this year:

  • Can you predict the probability that your flight to the JSM is delayed?
  • How long did it take for flights to recover from the pandemic? Were there any structural changes in flight routes?
  • Can you detect changes in estimated flight times to see if airlines are reducing the appearance of delays by adding some padding to flight times?
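To make the schedule-padding question concrete, one simple approach is to track the average scheduled duration on a fixed route from year to year: if it rises while the route itself is unchanged, padding is one possible explanation. The sketch below illustrates the idea on made-up toy records. The airport codes, durations, and function name are all hypothetical, and a real analysis would read scheduled elapsed times from the flight data and control for other causes such as congestion or aircraft changes.

```python
from collections import defaultdict
from statistics import mean

# Made-up toy records for illustration only:
# (year, origin, destination, scheduled duration in minutes).
TOY_FLIGHTS = [
    (2000, "AAA", "BBB", 120),
    (2000, "AAA", "BBB", 124),
    (2010, "AAA", "BBB", 130),
    (2010, "AAA", "BBB", 134),
    (2010, "CCC", "DDD", 60),
]


def mean_scheduled_by_year(flights, origin, dest):
    """Average scheduled duration per year for one origin-destination pair."""
    by_year = defaultdict(list)
    for year, o, d, minutes in flights:
        if (o, d) == (origin, dest):
            by_year[year].append(minutes)
    # Sort by year so the trend reads chronologically.
    return {year: mean(values) for year, values in sorted(by_year.items())}


trend = mean_scheduled_by_year(TOY_FLIGHTS, "AAA", "BBB")
print(trend)
```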


We encourage you to collaborate by sharing resources on identifying other interesting datasets to use as a part of your exploration. Join the Data Expo Challenge Slack workspace and share your questions and answers!

CONTACTS

For questions on the ASA Data Challenge Expo, please reach out to Donna LaLonde (donnal@amstat.org) or Wendy Martinez (wendy.l.martinez@census.gov).