Big Data and the Role of Statistics

By Steve Pierson posted 03-28-2012 10:35

This piece is co-authored with Ron Wasserstein, Executive Director for the American Statistical Association (ASA).

The White House Office of Science and Technology Policy is hosting a “Big Data event” tomorrow, March 29, at AAAS, featuring representatives from NSF, NIH, NIST, DOD, DARPA, and DOE. Beyond the list of speakers, little information is available about what will be announced. We’ll keep you informed as details emerge. In the meantime, we want your input.

There is no question Big Data has hit the business, government and scientific sectors. Unfortunately, the role of statistics seems too often to be undervalued. Instead, computer science, applied math or other fields are frequently mentioned as the pertinent scientific discipline while statistics is often left out.

The ASA is exploring how it can ensure statistics fulfills its potential in addressing the big data challenges in the industry, government, and science sectors. We’d like your input. Please use the comment space below to give us your ideas for the role of statisticians and the ASA in Big Data (or post your own blog entry).

Some ASA members have already expressed their opinion in Amstat News. In a September 2010 article, “Statistics Ready for a Revolution,” Mark van der Laan and Sherri Rose say the next generation of statisticians must build tools for massive data sets. Their article begins: “The statistics profession has reached a tipping point. The need for valid statistical tools is greater than ever; data sets are massive, often measuring hundreds of thousands of measurements for a single subject. The field is ready for a revolution, one driven by clear, objective benchmarks by which tools can be evaluated. The new generation of statisticians must be ready to take on this challenge.”

ASA Presidents have also been active. Over the past year, ASA President Bob Rodriguez has been giving numerous talks on big data and statistics. His keynote address for the inaugural ASA Conference on Statistical Practice featured the big data challenges for the business community. Earlier this month he gave a presentation at Arizona State University titled, “Business Analytics and Big Data: Is the Statistics Profession Ready?”

In her March 16 AAAS article, “Cutting Edge: Emerging trends in biostatistics,” 2013 President Marie Davidian provides examples of the big data challenges for biomedical and biological science and the opportunities for biostatisticians “to collaborate with the scientists generating the data to develop innovative new theory and methods to tackle problems never envisioned by the biostatisticians of yesterday.”

2010 President Sastry Pantula repeatedly challenged (and continues to do so) the statistical community to play a central role in the data tsunami. In his April 2010 Amstat News column, “Be a Proud Statistician,” he writes, “Data warehousing, retrieving, and mining important information out of the large data sets pose many challenges for the future,” and then poses the question, “Are we training newer statisticians with appropriate analytical, computational, and communication skills as well as new measurement theory and applications?”

As the lead coordinator of this year’s Math Awareness Month, the ASA promoted the theme “Mathematics, Statistics, and the Data Deluge.” Please visit the Math Awareness Month website and follow it on Twitter: @MathAware.

Financial services, retail, internet searches, and social media are among the largest big-data drivers in the private sector. One bright spot from the business community was last year’s McKinsey report, “Big data: The next frontier for innovation, competition, and productivity,” which discusses the role of statistics prominently. The following quote appears in the executive summary:

A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data …. we project that demand for deep analytical positions in a big data world could exceed the supply being produced on current trends by 140,000 to 190,000 positions (Exhibit 4).

In the scientific community, Nature and Science magazines have both dedicated issues to data, and both neglected to mention statistics. On the other hand, the following quote from Robert Tibshirani in the January 26 New York Times “Bits” piece, “What Are the Odds That Stats Would Be This Popular?”, indicates statisticians are playing a central role: “Most of my life I went to parties and heard a little groan when people heard what I did. Now they’re all excited to meet me.”

The Big Data era also has implications for the federal statistical system as noted in this year’s Economic Report of the President:

The growing integration of technology in our daily lives has created an abundance of new possibilities for producing better and more timely data based on nontraditional sources of information. As Census Bureau Director Robert Groves has written, “(t)he volume of data generated outside the government statistical systems is increasing much faster than the volume of data collected by the statistical systems; almost all of these data are digitized in electronic files” (Groves 2012). Nontraditional sources of information include both digital administrative data (e.g., tax records and records related to participation in government transfer programs) and records generated in the private sector (e.g., data from Internet searches, scanner data and social media data).

Post your comments below or as a blog entry on the role of statisticians in Big Data and what the ASA can be doing for statistics to fulfill its potential to help solve the Big Data challenges. We’d also welcome your favorite articles (and/or quotes) on the topic. Watch also for Tweets on the topic: @Ron_Wasserstein; @ASA_SciPol; @AmstatNews; and @MathAware. You can also email comments to Ron Wasserstein and Steve Pierson.



04-20-2012 10:16

Note from Julian Champkin, Editor, Significance Magazine:
Significance Magazine is devoting its August 2012 issue entirely to Big Data. We have many expert authors contributing, including some of those mentioned in Steve and Ron’s blog above; but if you have a view you’d like to share - one that is relevant to general readers as well as to statistically-expert ones - or an example of an unexpected or interesting use of Big Data, then e-mail me at – depending on what happens I may make a digest of contributions, or feature one or more as an article or on our website.
And if you have an idea for a ‘Toolkit’ article, on a technique connected to Big Data that students should know about, let me know as well! Our deadline is the last week in May.
Significance is the ASA’s big outlet to the public, which means that it is your voice also. Help us to tell non-statisticians how our profession is central to what is happening in Big Data.
Looking forward to hearing from you,
Julian Champkin,
Editor, Significance, the magazine of the American Statistical Association and Royal Statistical Society

03-29-2012 07:00

For 15+ years, the Association for Computing Machinery and the IEEE Computer Society have held conferences and workshops dealing with big data. Statisticians such as Ed Wegman (among others) have been at the forefront of several aspects of big data.
Two big issues are the quality of the data and the confidentiality of the data. If there are errors (often very severe), then the data cannot be used for building models or even for computing simple totals. If there are duplicates in the files, or if a file's coverage needs to be improved by adding in additional information, then record linkage is needed. If the files are used, then individual entities (persons or businesses) need to be assured that the privacy of their personal information is preserved.
The issue is whether the ‘cleaned-up’ original, non-public microdata is of sufficient quality to assure valid analyses for public-policy purposes or simple data mining. Statistical agencies have typically been very good at the methods of data capture (effective form design, good web forms, methods of assuring that information on paper forms is accurately transferred to computer files, etc.). Additional methods of assuring data quality involve modeling/edit/imputation and record linkage. Various outlier-detection methods (or variants such as Hidiroglou-Berthelot 1986) can identify unusual values in records. Whether the changes to the values of the ‘outliers’ can be performed in a systematic, valid manner is still a research issue. The Fellegi-Holt model of editing (JASA 1976) also includes elementary imputation methods based on hot-deck procedures. The statistical agencies have implemented generalized systems based on the Fellegi-Holt model but have not generally connected them with methods of preserving joint distributions, such as those in Chapter 13 of Little and Rubin (2002). Record linkage (e.g., Fellegi and Sunter, JASA 1969) removes duplicates within files and accurately merges information from external sources to improve the coverage of a given file.
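To make the outlier-detection step concrete, here is a minimal sketch of the Hidiroglou-Berthelot (1986) ratio-edit idea mentioned above: period-to-period ratios are centered symmetrically around their median, weighted by unit size, and flagged when they fall far outside the interquartile range. The function name and the parameter defaults (U, A, C) are illustrative choices for this sketch, not the paper's recommended settings.

```python
import statistics

def hb_outliers(x_prev, x_curr, U=0.5, A=0.05, C=4.0):
    """Flag suspicious period-to-period ratios using a
    Hidiroglou-Berthelot-style transformation. Defaults are
    illustrative, not recommendations."""
    ratios = [c / p for p, c in zip(x_prev, x_curr)]
    r_med = statistics.median(ratios)
    # Centre the ratios symmetrically: a ratio half the median and one
    # twice the median get transformed scores of equal magnitude.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]
    # Weight by unit size so large units, whose errors matter more
    # for estimated totals, are scrutinised more closely.
    effects = [si * max(p, c) ** U
               for si, p, c in zip(s, x_prev, x_curr)]
    q1, e_med, q3 = statistics.quantiles(effects, n=4)
    # Quartile distances, floored at a fraction A of the median to
    # avoid over-flagging when the effects are tightly clustered.
    d_low = max(e_med - q1, abs(A * e_med))
    d_high = max(q3 - e_med, abs(A * e_med))
    return [e < e_med - C * d_low or e > e_med + C * d_high
            for e in effects]
```

For example, if nine units report values near 100 in both periods while one jumps from 100 to 500, only the jumping unit is flagged; the symmetric transformation ensures that a comparably large drop would be flagged just as readily.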
There is much ongoing research on assuring the confidentiality of microdata and on whether ‘masked’ or ‘synthetic’ microdata can support one or two analyses that correspond to those on the original microdata.
Still, the primary issue is the quality of the original microdata files. The statistical agencies (and other groups disseminating large data sets) can efficiently apply valid methods in generalized software.
The issue of adjusting statistical analyses for linkage error is still a research problem, touched on in the paper cited above.