I think there are three other issues at play as well that make a direct comparison very tricky:
1. The methods for "big data" problems and more conventional analyses are often quite different. In a lot of cases, with truly "big" data, you're restricted to streaming/online/sketching approaches that somehow summarize the data prior to modelling it, to fairly simple models (often some sort of reduction to binary classification with a linear model), or to both (a minimal sketch of the streaming style appears just after this list). There's also a marked preference for distribution-free approaches with uniform bounds that hold regardless of the data-generating distribution. This makes it extremely hard to compare against more complex and expressive models that use all of the data at once without significant summarization and that often make strong distributional assumptions.
2. Most of the "big data" is locked away; Google and Microsoft, for instance, sit on mountains of it that rarely see the light of day except in broad summary. When they publish research, it often suggests that they're using big data/machine learning approaches, and presumably they've thought to compare those approaches to other 'medium data' approaches, but without being able to see the same data and actually compare methods, it's hard to make a concrete determination. See, e.g., Agrawal et al., "Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages" (2013).
3. A lot of "big data" problems ask different questions than conventional statistical analyses; they're often phrased as optimization problems rather than questions of significance (although there's some interesting work going on in causality that begins to bridge the gap: Bottou et al., "Counterfactual reasoning and learning systems: the example of computational advertising," JMLR 2013, is a good read, even if they too skimp on the data).
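To make the first point concrete, here's a minimal sketch of the streaming style in Python. It's my own toy construction on synthetic data, not taken from any of the papers above: a linear model for binary classification updated one example at a time by stochastic gradient descent, so the full dataset never sits in memory.

import numpy as np

rng = np.random.default_rng(0)
d = 20                       # number of features
w = np.zeros(d)              # weights of the linear model
true_w = rng.normal(size=d)  # synthetic "ground truth" behind the stream

def stream(n):
    # Yield (x, y) pairs one at a time, as if from an unbounded source.
    for _ in range(n):
        x = rng.normal(size=d)
        y = 1.0 if x @ true_w + rng.normal() > 0 else 0.0
        yield x, y

for t, (x, y) in enumerate(stream(100_000), start=1):
    p = 1.0 / (1.0 + np.exp(-(w @ x)))      # predicted probability
    w += (0.05 / np.sqrt(t)) * (y - p) * x  # SGD step on the log-loss

print("correlation with true weights:", np.corrcoef(w, true_w)[0, 1])

Each update touches one example and then discards it, which is exactly why comparisons against expressive, all-the-data-at-once models are so awkward.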
Altogether, I think this makes meaningful comparisons of big data to small sample counterparts very hard; it's applying different methods to (often) proprietary datasets to answer different kinds of questions.
As a final thought, there's the paper "Classifier technology and the illusion of progress" (Hand, 2006), which makes the point that a law of severely diminishing returns kicks in fairly rapidly with respect to the complexity of the method being used. I don't think he addressed the question with respect to the size of the data set, but I know there's a lot of work being done in sublinear learning suggesting something similar is going on: given some stable generating distribution, at some point you've basically learned it up to some constant, and you get little or nothing from analyzing further examples (a toy illustration follows below). As storage and CPU costs continue to drop, the threshold for "big" is going to keep rising, and so I'd expect many 'small sample' methods to come back into favor.
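Here's a toy illustration of that saturation effect, again my own synthetic construction rather than anything from Hand's paper: a fixed generating distribution, a logistic regression fit at increasing sample sizes, and a large held-out set from the same distribution.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d = 10
true_w = rng.normal(size=d)

def sample(n):
    # Draw n examples from one fixed, stable generating distribution.
    X = rng.normal(size=(n, d))
    y = (X @ true_w + rng.normal(size=n) > 0).astype(int)
    return X, y

X_test, y_test = sample(50_000)  # large held-out set
for n in [100, 1_000, 10_000, 100_000]:
    X, y = sample(n)
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X_test, y_test)
    print(f"n = {n:>7,}: held-out accuracy = {acc:.3f}")

Past a few thousand examples the held-out accuracy flattens out near the noise ceiling; the extra two orders of magnitude of data buy essentially nothing.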
-------------------------------------------
Richard Harang
U.S. Army Research Laboratory
-------------------------------------------
Original Message:
Sent: 11-13-2014 09:10
From: Robert Obenchain
Subject: Strong evidence of better predictive performance when using Big Data
Strong evidence that Big Data analyses can give more realistic answers to key questions is only just beginning to accumulate. To date, application hype and payoff claims have developed much faster than the new statistical thinking needed to get our science there.
Being observational, Big Data typically contain unknown biases. For example, in comparative effectiveness research in healthcare, such biases necessitate focus on non-traditional estimands as well as on their (nonparametric) estimates. Some foundational concepts on local estimation, sensitivity analyses, and predictions that confirm treatment heterogeneity are provided in "Fair Treatment Comparisons in Observational Research" by Kenny Lopiano, Bob Obenchain and Stan Young, which recently appeared in Statistical Analysis and Data Mining, Vol. 7 (2014), 376-384.
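For readers who want a feel for the local-estimation idea, here is a deliberately crude cartoon in Python. To be clear, this is NOT the Lopiano-Obenchain-Young method (see their paper for the real machinery); it only illustrates why within-stratum treatment comparisons can expose heterogeneity that a single global contrast averages away.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n = 20_000
X = rng.normal(size=(n, 5))     # patient covariates
t = rng.integers(0, 2, size=n)  # treatment indicator
# Synthetic outcome: the treatment effect varies with the first covariate.
y = X @ rng.normal(size=5) + t * (0.5 + X[:, 0]) + rng.normal(size=n)

labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(X)
local_effects = []
for c in range(50):
    mask = labels == c
    treated, control = y[mask & (t == 1)], y[mask & (t == 0)]
    if treated.size > 10 and control.size > 10:  # need both arms locally
        local_effects.append(treated.mean() - control.mean())
print(f"{len(local_effects)} local effects, ranging from "
      f"{min(local_effects):.2f} to {max(local_effects):.2f}")

Note that each local cell needs enough patients of both kinds, which is one reason Volume matters here.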
It is the Volume and Variety of Big Data that make these new concepts practical; the Velocity of Big Data will also help future patient registry databases accumulate faster.
To realize the potential of Big Data, considerable sound research is needed to give statisticians the tools they will need to collaborate effectively with subject matter experts.
-------------------------------------------
Robert Obenchain
Principal Consultant
Risk Benefit Statistics LLC
-------------------------------------------
Original Message:
Sent: 11-12-2014 10:00
From: Peter Lachenbruch
Subject: Strong evidence of better predictive performance when using Big Data
Stan Young has been concerned about similar problems in Epidemiology. He would be a good one to comment on this.
-------------------------------------------
Peter Lachenbruch
-------------------------------------------
Original Message:
Sent: 11-11-2014 15:51
From: Roberto Rivera
Subject: Strong evidence of better predictive performance when using Big Data
We've all seen how Big Data has become a big buzzword of late. And some have also heard about the misuse of Big Data prediction methods, such as Google Flu Trends. Last week I went on a quest to find strong scientific evidence that Big Data prediction methods work, that is, that they perform much better than their small sample counterparts when performing predictions (or maybe inference?). In theory, Big Data should be better than small data. But it turns out there are few publications of STRONG success stories of Big Data prediction. Sadly, at my school I don't have access to Web of Science or Scopus, but I'm still surprised to find so few examples with the school database and online.
My question for guests is as follows: can you provide references to scientific evidence that Big Data prediction methods perform better than their counterparts? Because the term Big Data is fairly vague, what counts as evidence needs to be carefully defined.
1. By Big Data I mean data that cannot be analyzed with traditional methods. Hence, thousands of observations count as Big Data only if they can't be analyzed with traditional methods (e.g., it is difficult to perform spatial prediction with more than 10,000 locations due to the inversion of the covariance matrix; see the timing sketch after this list).
2. Strong evidence is key. For example, one publication uses Twitter feeds to predict crime, but the AUCs of most models in the paper are below 0.75 (i.e., a randomly chosen positive case outranks a randomly chosen negative one less than 75% of the time), representing poor to fair performance. Furthermore, potential use shouldn't count as evidence.
3. Good journals, of course, are given preference.
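To illustrate the first criterion, here is a quick self-contained timing sketch in Python (timings are machine dependent, and this is my own example, not from any cited paper): the dense Cholesky factorization that kriging relies on scales roughly as the cube of the number of locations, which is what makes 10,000+ locations painful without approximation.

import time
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(3)
for n in [500, 1_000, 2_000, 4_000]:
    pts = rng.uniform(size=(n, 2))  # random spatial locations
    K = np.exp(-cdist(pts, pts))    # exponential covariance model
    K[np.diag_indices(n)] += 1e-6   # small nugget for numerical stability
    start = time.perf_counter()
    np.linalg.cholesky(K)           # the O(n^3) step kriging needs
    print(f"n = {n:>5}: {time.perf_counter() - start:.3f} s")

Each doubling of n costs roughly 8x the time (and 4x the memory for the dense matrix), so this is a case where the data outgrow the traditional method rather than the hardware.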
-------------------------------------------
Roberto Rivera
Associate professor
University of Puerto Rico Mayaguez
-------------------------------------------