The Computing, Government, and Graphics sections of the ASA are proud to sponsor the annual Data Challenge Expo at JSM 2025, held in Nashville, Tennessee, August 2-7, 2025 (community.amstat.org/dataexpo/home).
Challenge
The sky’s the limit! Participants in the 2025 Data Expo Challenge were tasked with developing a research question to explore, analyze, and visualize flight arrival and departure data for all commercial flights across the USA. Uncover new insights in the skies as we revisit the 2009 Data Expo, a past favorite. With 16 more years of data, this rich dataset offers endless possibilities for creative exploration. The challenge was an opportunity to uncover insights and tell a compelling story through data visualization and innovative analysis. Participants were required to use the flight data and at least one additional dataset.
You can download the files one at a time (with many clicks) from https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr, or, thanks to our friends at DuckDB, you can download all 11 GB of the data as parquet files with this R code:
base_url <- "https://blobs.duckdb.org/flight-data-partitioned/"
files <- paste0("Year=", 1987:2024, "/data_0.parquet")
for (dir in dirname(files)) dir.create(dir, showWarnings = FALSE)
out <- curl::multi_download(paste0(base_url, files), files, resume = TRUE)
Or with this Python code:
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

base_url = "https://blobs.duckdb.org/flight-data-partitioned/"
files = [f"Year={year}/data_0.parquet" for year in range(1987, 2025)]

def download_file(f):
    # Create the Year=.../ directory, then stream the file into it
    os.makedirs(os.path.dirname(f), exist_ok=True)
    req = urllib.request.Request(base_url + f, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as response, open(f, 'wb') as out_file:
        out_file.write(response.read())

# Download the yearly files in parallel
with ThreadPoolExecutor() as executor:
    executor.map(download_file, files)
Once downloaded, you can use DuckDB, arrow, polars, duckplyr, or any other tool you choose to work efficiently with the ~210 million rows.
You can consult the data dictionary to learn the definition of each field.
To get started, here are some of the potential questions from the 2009 challenge:
When is the best time of day/day of week/time of year to fly to minimize delays?
Do older planes suffer more delays?
How does the number of people flying between different locations change over time?
How well does weather predict plane delays?
Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system? If so, what are they and how do we find them?
As well as a few new ones we brainstormed this year:
Can you predict the probability that your flight to the JSM is delayed?
How long did it take for flights to recover from the pandemic? Were there any structural changes in flight routes?
Can you detect changes in estimated flight times to see whether airlines are reducing the appearance of delays by adding padding to flight times?
Student Winners
FIRST: Producing Estimates of International Migration for U.S. States
Andrew Forrester and Srijeeta Mitra, both from the University of Maryland, College Park
SECOND: Demystify Flight Data
Melinda Combs and Bao Anh Maddux, Winston-Salem State University
Professional Winner
The Effect of Delays on Airline Flight Pattern
Sherry Zhang, Sarah Coleman, Lydia Lucchesi, and Saptarshi Roy, University of Texas at Austin