Resources for Text Analysis

Welcome to the ASA Section on Text Analysis's curated Resource Page, a hub for statisticians interested in text analysis. It is a living repository of academic papers, tools, tutorials, datasets, educational resources, and community links, all aimed at promoting responsible, effective, and innovative use of text data in statistical research and practice.

Content on this page is community driven. You can help us crowdsource high-quality links and references that can benefit researchers, educators, students, and practitioners across a variety of domains. To contribute, please submit via our Google form: 👉 https://forms.gle/5ibxAN1WChkcbbMV9 

Academic Papers

  • Recent Advances in Text Analysis
    • This paper reviews popular methods for text analysis, ranging from topic modeling to recent neural language models. In particular, it reviews Topic-SCORE, a statistical approach to topic modeling, and discusses how to use it to analyze the Multi-Attribute Data Set on Statisticians (MADStat), a data set on statistical publications. Applying Topic-SCORE and other methods to MADStat leads to interesting findings, such as 11 representative topics in statistics, a journal ranking, and a topic ranking.

Tools

  • Hugging Face
    • Hugging Face is an open-source platform that provides state-of-the-art pretrained models, datasets, and tools for natural language processing and other machine learning tasks. Its Transformers library enables statisticians to fine-tune or apply modern language models for classification, topic modeling, summarization, embedding generation, and more using Python (portable to R with other tools). It also hosts a large public model and dataset hub, facilitating reproducible research and rapid experimentation.
  • tidytext
    • tidytext is an R package that brings text mining into the tidyverse by representing text as tidy data (one-token-per-row). It provides tools for tokenization, stop-word removal, sentiment analysis, n-gram construction, and integration with dplyr, ggplot2, and other familiar workflows. This design makes it especially accessible for statisticians who want to apply text analysis using standard data manipulation and modeling pipelines in R.
  • tidylda
    • tidylda is an R package for Latent Dirichlet Allocation that is compatible with the "tidyverse" dialect of R programming.
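
The one-token-per-row layout mentioned under tidytext can be illustrated in a few lines. The sketch below is not tidytext itself (which is an R package); it is a rough Python analogue using only the standard library. The toy documents, the stop-word list, and the helper names `tidy_tokens`, `remove_stop_words`, and `count_tokens` are all hypothetical and exist only for this illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = {
    "doc1": "Text analysis is a growing area of statistics.",
    "doc2": "Topic models summarize the text of a large corpus.",
}
STOP_WORDS = {"a", "is", "of", "the"}


def tidy_tokens(docs):
    """Return one (doc_id, token) row per lowercase word token."""
    rows = []
    for doc_id, text in docs.items():
        # Simple regex tokenizer: runs of letters and apostrophes.
        for token in re.findall(r"[a-z']+", text.lower()):
            rows.append((doc_id, token))
    return rows


def remove_stop_words(rows, stop_words):
    """Drop rows whose token appears in the stop-word set."""
    return [(doc_id, token) for doc_id, token in rows if token not in stop_words]


def count_tokens(rows):
    """Count how often each token occurs across all documents."""
    return Counter(token for _, token in rows)


rows = remove_stop_words(tidy_tokens(DOCS), STOP_WORDS)
counts = count_tokens(rows)

for doc_id, token in rows:
    print(doc_id, token)
```

Because every row is a (document, token) pair, filtering, grouping, and counting reduce to ordinary table operations, which is the design idea behind the tidy text format.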

Tutorials & Workshops

  • Primer on LLMs and generative AI
    • This is a book about large language models. As indicated by the title, it primarily focuses on foundational concepts rather than comprehensive coverage of all cutting-edge technologies. The book is structured into five main chapters, each exploring a key area: pre-training, generative models, prompting, alignment, and inference. It is intended for college students, professionals, and practitioners in natural language processing and related fields, and can serve as a reference for anyone interested in large language models.

Educational Resources

  • The Data Science and Predictive Analytics (DSPA) Platform
    • The Data Science and Predictive Analytics (DSPA) platform (at SOCR UMich) includes learning modules and complete end-to-end electronic R Markdown notebooks for text mining, NLP, statistical learning, and AI prediction (including text, images, and quantitative data).
  • A collection of resources for text analysis (click through and request access)
    • A collection of main resources for text analysis, including foundational academic papers, leading NLP software tools, tutorials, text datasets, and open-access educational resources to support research, teaching, and practice.

Data Sets

  • Multi-attribute data set on statistics journals
    • This data set contains the text abstracts of 83,331 papers published in 36 statistics-related journals from 1975 to 2015. [Re: Ke, Ji, Jin, and Li (2023). Recent Advances in Text Analysis. Annual Review of Statistics and Its Application.]
  • Regulations.gov
    • Regulations.gov is a U.S. federal portal that hosts public comments submitted in response to proposed rules, requests for information (RFIs), and regulatory reviews (e.g., under EGRPRA). Agencies use these comments, alongside other evidence, when revising proposed rules into final regulations. The platform provides large-scale, real-world corpora of policy-related text that are well suited for statistical and computational text analysis to support research and regulatory decision-making.

Communities (Forums, etc.)