In commemoration of Big Data Week (April 22-28), I'm sharing a short draft document I'm working on regarding the role of statistics in big data. Clearly, progress in big data, and progress derived from it, will be advanced by many scientific disciplines, ranging from data mining and computational science to statistics and machine learning.
As American Statistical Association (ASA) Executive Director Ron Wasserstein and I wrote in a joint March 2012 blog entry, Big Data and the Role of Statistics, the scientific discipline of statistics can sometimes be left out of big data discussions. Numerous people have discussed this omission, and the discussion has extended to the related term "data science," as evidenced by the recent blog entries by Larry Wasserman and Jeff Leek of SimplyStatistics.
With statistics being the science of collecting, analyzing, and understanding data, as well as accounting for the relevant uncertainties and making decisions in the presence of that uncertainty, there is no doubt statistics should play a key role in big data and data science.
I frequently hear from ASA members about the low visibility and/or appreciation of statistics. It's clear that we in the statistics community must do a better job of highlighting what statisticians bring to the table and how statistics makes the science better. One ASA member emailed suggesting a document that presents a series of studies/experiments, setting the original design side by side with the statistician-refined design and showing such things as gains in precision and reductions in sample size. I love the idea of such a document to show the value added by statisticians in the scientific process. I've heard another member comment that achieving a higher visibility for statistics is the equivalent of a culture change.
The bottom line, of course, is that it is the statistical community's responsibility to achieve higher visibility for our scientific discipline by emphasizing that statistics and statisticians make the science better.
To help achieve this higher visibility for statistics with big data policymakers, I've developed this draft Statistics/Big Data one-pager with input from a variety of sources and ASA members. The document starts with brief definitions of big data and statistics and then goes into the many big data issues that statistics can help address: missing data, data quality, multiple sources of data, observational nature of data, uncertainty quantification, etc. The document also explains the scientific approach that statisticians bring to bear and the many scientific skills that statisticians possess: assessing and correcting for bias; measuring uncertainty; designing studies and sampling strategies; assessing the quality of data; enumerating limitations of studies; dealing with issues such as missing data and other sources of non-sampling error; developing models for the analysis of complex data structures; creating methods for causal inference and comparative effectiveness; eliminating redundant and uninformative variables; combining information from multiple sources; and determining effective data visualization techniques.
I'd welcome any comments or suggestions to make this a stronger document. (I'm most interested in substantive issues but I certainly won't ignore design comments.)
[10/25/13 See also these subsequent blog posts:
See other ASA Science Policy blog entries. For ASA science policy updates, follow @ASA_SciPol on Twitter.