National Academies' Report Emphasizes Importance of Inference in Addressing Big Data Challenges

By Steve Pierson posted 07-23-2013 14:41

"Frontiers in Massive Data Analysis," a newly released report from the National Academies, makes a strong case for the role of statistics, together with other disciplines, in meeting the challenges of big data. The report was supported by the National Security Agency.

Here are a few key excerpts from the report's summary (the full report can be downloaded for free here):
  • The challenges for massive data go beyond the storage, indexing, and querying that have been the province of classical database systems (and classical search engines) and, instead, hinge on the ambitious goal of inference. Inference is the problem of turning data into knowledge, where knowledge often is expressed in terms of entities that are not present in the data per se but are present in models that one uses to interpret the data. Statistical rigor is necessary to justify the inferential leap from data to knowledge, and many difficulties arise in attempting to bring statistical principles to bear on massive data. Overlooking this foundation may yield results that are, at best, not useful, or harmful at worst. In any discussion of massive data and inference, it is essential to be aware that it is quite possible to turn data into something resembling knowledge when actually it is not. Moreover, it can be quite difficult to know that this has happened.
  • The research and development necessary for the analysis of massive data goes well beyond the province of a single discipline, and one of the main conclusions of this report is the need for a thoroughgoing interdisciplinarity in approaching problems of massive data. Computer scientists involved in building big-data systems must develop a deeper awareness of inferential issues, while statisticians must concern themselves with scalability, algorithmic issues, and real-time decision-making. Mathematicians also have important roles to play, because areas such as applied linear algebra and optimization theory (already contributing to large-scale data analysis) are likely to continue to grow in importance. Also, as just mentioned, the role of human judgment in massive data analysis is essential, and contributions are needed from social scientists and psychologists as well as experts in visualization. Finally, domain scientists and users of technology have an essential role to play in the design of any system for data analysis, and particularly so in the realm of massive data, because of the explosion of design decisions and possible directions that analyses can follow. 
  • Indeed, many issues impinge on the quality of inference. A major one is that of “sampling bias.” Data may have been collected according to a certain criterion (for example, in a way that favors “larger” items over “smaller” items), but the inferences and decisions made may refer to a different sampling criterion. This issue seems likely to be particularly severe in many massive data sets, which often consist of many subcollections of data, each collected according to a particular choice of sampling criterion and with little control over the overall composition. Another major issue is “provenance.” Many systems involve layers of inference, where “data” are not the original observations but are the products of an inferential procedure of some kind. This often occurs, for example, when there are missing entries in the original data. In a large system involving interconnected inferences, it can be difficult to avoid circularity, which can introduce additional biases and can amplify noise. Finally, there is the major issue of controlling error rates when many hypotheses are being considered. Indeed, massive data sets generally involve growth not merely in the number of individuals represented (the “rows” of the database) but also in the number of descriptors of those individuals (the “columns” of the database).
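The last point in the excerpt above, controlling error rates when many hypotheses are considered, is easy to see in a small simulation (a sketch of my own, not from the report): if every null hypothesis is true and each of 1,000 tests is run at the usual 0.05 level, roughly 50 "discoveries" appear by chance alone, while a Bonferroni-corrected threshold suppresses nearly all of them.

```python
import random
import statistics

random.seed(0)

n_tests = 1000   # number of hypotheses (the "columns" of the database)
n_obs = 50       # observations per test
# Every null is true: each sample is pure noise, so any
# "significant" result is a false positive.
false_pos_raw = 0
false_pos_bonf = 0
for _ in range(n_tests):
    sample = [random.gauss(0, 1) for _ in range(n_obs)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / n_obs ** 0.5
    z = mean / se
    # Crude two-sided test at alpha = 0.05 (normal critical value 1.96).
    if abs(z) > 1.96:
        false_pos_raw += 1
    # Bonferroni correction: alpha / n_tests = 0.00005 per test,
    # whose two-sided normal critical value is about 3.89.
    if abs(z) > 3.89:
        false_pos_bonf += 1

print(false_pos_raw, false_pos_bonf)
```

With no correction the false-positive count lands near 50 (about 5% of 1,000 tests); with the Bonferroni threshold it is almost always zero. This is the simplest possible illustration, and the report's point is that the problem is far harder when tests are dependent and data arrive in heterogeneous subcollections.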

The report was written by the Committee on the Analysis of Massive Data (chaired by Berkeley's Michael I. Jordan), under the auspices of the Committee on Applied and Theoretical Statistics and the Board on Mathematical Sciences and Their Applications. The committee listed seven conclusions in its summary. Since each conclusion is rather lengthy, I'll attempt to paraphrase them using select excerpts. I apologize to the authors if my excerpts misinterpret the conclusions, and I encourage readers to see the full text.

  1. "the goals of massive data analysis go beyond the computational and representational issues .... to tackling the challenges of statistical inference, where the goal is to turn data into knowledge and to support effective decision-making. Assertions of knowledge require control over errors, and a major part of the challenge of massive data analysis is that of developing statistically well-founded procedures that provide control over errors in the setting of massive data ..."
  2. "There are many sources of potential error in massive data analysis, many of which are due to the interest in “long tails” that often accompany the collection of massive data. ... the assumptions underlying many classical data analysis methods are likely to be broken in massive data sets."
  3. "Massive data analysis is not the province of any one field, but is rather a thoroughly interdisciplinary enterprise."
  4. "While there are many sources of data that are currently fueling the rapid growth in data volume, a few forms of data create particularly interesting challenges.... human language and speech ... video and image data ... geo-spatial and temporal tags ... networks and graphs ..."  
  5. "Massive data analysis creates new challenges at the interface between humans and computers."
  6. "These temporal issues [data sources operate in real time and the desire to make decisions rapidly] provide a particularly clear example of the need for further dialog between statistical and computational researchers."
  7. "There is a major need for the development of “middleware”—software components that link high-level data analysis specifications with low-level distributed systems architectures... The development of massive data analysis systems needs to proceed in parallel with a major effort to educate students and the workforce in statistical thinking and computational thinking." [emphasis mine]
As the first report/white paper I'm aware of that emphasizes the importance of statistics to big data, this is an important document, and I encourage readers to share it with their networks. It's incumbent upon the statistical community to highlight the role that statistics can play in big data and data science. As others have noted (e.g., ASA President Marie Davidian in her July Amstat News column), this is a two-way street: it is also the statistical community's responsibility to learn how best to work with the broader big data/data science community. Clearly, many in our community have done so (e.g., the statisticians on the NAS Committee on the Analysis of Massive Data), and we should learn from them and others.

In addition to sharing this document with your networks, I welcome your comments on the report and its conclusions/recommendations (either by email or in the comment space below).

See other ASA Science Policy blog entries. For ASA science policy updates, follow @ASA_SciPol on Twitter.
