ASA Connect


Strong evidence of better predictive performance when using Big Data

  • 1.  Strong evidence of better predictive performance when using Big Data

    Posted 11-11-2014 15:52
    We've all seen how Big Data has become a big buzzword of late, and some have also heard about the misuse of Big Data prediction methods, such as Google Flu Trends. Last week I went on a quest to find strong scientific evidence that Big Data prediction methods work; that is, that they perform much better than their small-sample counterparts when performing prediction (or maybe inference?). In theory, Big Data should be better than small data. But it turns out there are few published STRONG success stories of Big Data prediction. Sadly, at my school I don't have access to Web of Science or Scopus, but I'm still surprised to find so few examples through the school database and online.

    My question is as follows: Can you provide references to scientific evidence that Big Data prediction methods perform better than their counterparts? Because the term Big Data is fairly vague, what counts as evidence needs to be carefully defined.

    1. By Big Data I mean data that cannot be analyzed with traditional methods. Hence, thousands of observations count as Big Data only if they can't be analyzed with traditional methods (e.g., it is difficult to perform spatial prediction with more than 10,000 locations because of the inversion of the covariance matrix; see the sketch after this list).
    2. Strong evidence is key. For example, one publication uses Twitter feeds to predict crime. The AUC of most models in that paper is below 0.75, representing poor to fair performance. Furthermore, potential use shouldn't count as evidence.
    3. Good journals, of course, are given preference.
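
    For concreteness on point 1, here is a rough sketch (in Python; the exponential covariance and simulated locations are just placeholders, not from any particular study) of why the covariance matrix inversion becomes the bottleneck in spatial prediction as the number of locations grows:

        # Illustrative only: why naive kriging-style spatial prediction gets hard as the
        # number of locations n grows.  The dense n x n solve costs O(n^3) time and
        # O(n^2) memory, and is repeated inside likelihood optimization.
        import numpy as np

        def kriging_weights(locations, new_location, range_param=1.0):
            """Prediction weights for one new site under an assumed exponential covariance."""
            d = np.linalg.norm(locations[:, None, :] - locations[None, :, :], axis=-1)
            K = np.exp(-d / range_param)                              # dense n x n covariance
            k0 = np.exp(-np.linalg.norm(locations - new_location, axis=1) / range_param)
            return np.linalg.solve(K, k0)                             # the O(n^3) bottleneck

        rng = np.random.default_rng(0)
        locs = rng.uniform(size=(2_000, 2))                           # already noticeably slow
        w = kriging_weights(locs, np.array([0.5, 0.5]))
        print(w.shape)
        for n in (10_000, 50_000, 100_000):
            print(f"n = {n:>7,}: dense covariance alone needs ~{n * n * 8 / 1e9:.1f} GB")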

    -------------------------------------------
    Roberto Rivera
    Associate professor
    University of Puerto Rico Mayaguez
    -------------------------------------------


  • 2.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-12-2014 10:01
    Stan Young has been concerned about similar problems in Epidemiology.  He would be a good one to comment on this.

    -------------------------------------------
    Peter Lachenbruch
    -------------------------------------------




  • 3.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-13-2014 09:11

    Strong evidence that Big Data analyses can give more realistic answers to key questions is currently just beginning to accumulate.  To date, the applications hype and payoff claims are developing much faster than the new statistical thinking needed to get our science there.

    Being observational, Big Data typically contain unknown biases. For example, in comparative effectiveness research in healthcare, such biases necessitate focus on non-traditional estimands as well as on their (nonparametric) estimates. Some foundational concepts on local estimation, sensitivity analyses, and predictions that confirm treatment heterogeneity are provided by: "Fair Treatment Comparisons in Observational Research" by Kenny Lopiano, Bob Obenchain and Stan Young, which just recently appeared in Statistical Analysis and Data Mining, Vol. 7 (2014), 376-384.

    It is the Volume and Variety of Big Data that make these new concepts practical; the Velocity of Big Data will also help future patient registry databases accumulate faster.

    To realize the potential of Big Data, considerable, sound research is needed to give statisticians the tools they will need to effectively collaborate with subject matter experts.

    -------------------------------------------
    Robert Obenchain
    Principal Consultant
    Risk Benefit Statistics LLC
    -------------------------------------------




  • 4.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-13-2014 10:25
    I think there are three other issues at play as well that make a direct comparison very tricky:

    1.  The methods for "big data" problems and more conventional analyses are often quite different.  In a lot of cases, with truly "big" data, you're restricted to either streaming/online/sketching approaches that somehow try to summarize the data prior to modelling it, or you're using fairly simple models (often some sort of reduction to binary classification with a linear model), or both (see the sketch after this list).  There's also a marked preference for distribution-free approaches with uniform bounds that hold regardless of the data-generating distribution.  This makes it extremely hard to compare to more complex and expressive models that use all of the data at once without significant summarization and often make strong distributional assumptions.

    2. Most of the "big data" is locked away; Google and Microsoft, for instance, sit on mountains of it that rarely see the light of day except in broad summary.  When they publish research it often suggests that they're using big data/machine learning type approaches, and presumably they've thought to compare those approaches to other 'medium data' approaches, but without being able to see the same data and actually compare methods, it's hard to make a concrete determination.  See, e.g., "Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages" (Agrawal et al., 2013).

    3. A lot of "big data" problems are asking different questions than conventional statistical analyses; they're often phrased as optimization problems rather than questions of significance (although there's some interesting work going on in causality that begins to bridge the gap: Bottou et al. "Counterfactual reasoning and learning systems: the example of computational advertising"; JMLR 2013 is a good read, even if they too skimp on the data).
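
    To give a rough, hypothetical flavor of the streaming style mentioned in point 1 (simulated data, plain NumPy, not any particular production system): the model is updated one mini-batch at a time by stochastic gradient descent and never holds the full dataset in memory, which is part of what makes it hard to compare against a model fit to all of the data at once.

        # Hypothetical sketch of a streaming/online analysis: a logistic-regression-style
        # linear classifier updated by stochastic gradient descent, one simulated
        # mini-batch at a time, so the full dataset never has to sit in memory.
        import numpy as np

        rng = np.random.default_rng(42)
        true_w = np.array([1.5, -2.0, 0.5])

        def stream_batches(n_batches=400, batch_size=500):
            """Simulate mini-batches arriving from a data stream."""
            for _ in range(n_batches):
                X = rng.normal(size=(batch_size, 3))
                y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_w)))
                yield X, y

        w = np.zeros(3)
        learning_rate = 0.5
        for X, y in stream_batches():
            p_hat = 1.0 / (1.0 + np.exp(-X @ w))
            grad = X.T @ (p_hat - y) / len(y)      # gradient of the mean log-loss
            w -= learning_rate * grad              # update using only the current batch

        print("streaming estimate:", np.round(w, 2))   # should land near true_w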

    Altogether, I think this makes meaningful comparisons of big data to small-sample counterparts very hard; it's applying different methods to (often) proprietary datasets to answer different kinds of questions.

    Although as a final thought, there's the paper "Classifier technology and the illusion of progress." (Hand, 2006), that makes the point that a law of severely diminishing returns begins to kick in fairly rapidly with respect to the complexity of the method that's being used.  I don't think he addressed the question with respect to the size of the data set, but I know that there's a lot of work being done in sublinear learning that suggests that there's a similar thing going on; given some stable generating distribution, at some point you've basically learned it up to some constant, and you get little or nothing from analyzing further examples.  As storage and CPU costs continue to drop, the threshold for "big" is going to keep rising, and so I'd expect many 'small sample' methods to come back into favor.

    -------------------------------------------
    Richard Harang
    U.S. Army Research Laboratory
    -------------------------------------------




  • 5.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-13-2014 12:51
    In my opinion, the one advantage of Big Data is the ability, due to the sample size, to identify niches within which profitable or efficient solutions can be found.

    For example, you may have a sample of 100K within which you find a 0.1% subgroup with a unique property (100% response, high profitability, consistent reaction to meds, ...); however, that is only 100 individuals, who may or may not be profitable or efficient to treat. With an equivalent sample of 10 million, that subgroup may or may not hold up, but if it does you have 10K individuals, and that may well be a profitable or treatable segment.
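
    A rough back-of-the-envelope version of that point (the 0.1% niche rate and sample sizes are just the illustrative numbers from above):

        # Back-of-the-envelope version of the niche argument (illustrative numbers only).
        import math

        niche_rate = 0.001  # 0.1% of customers/patients fall in the niche

        for n in (100_000, 10_000_000):
            expected = n * niche_rate                        # expected niche members
            se = math.sqrt(n * niche_rate * (1 - niche_rate))
            print(f"n = {n:>10,}: ~{expected:,.0f} niche members "
                  f"(+/- {1.96 * se:,.0f} at 95%)")
        # With 100K you expect ~100 niche members (+/- ~20); with 10M you expect
        # ~10,000 (+/- ~196), enough to characterize and act on the niche reliably.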

    -------------------------------------------
    Michael Mout
    MIKS
    -------------------------------------------




  • 6.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-13-2014 16:35
    I'm guessing, but I'm wondering if you are using cross-validation to test your models. Use a small representative sample and validate on the larger portion. Simple significance is worthless because everything is significant when you are using big data. I'm not sure about problems with inversion of the covariance matrix; the matrix is based on the number of variables, not the number of observations. But the technique of validation should help with the problem.
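
    A rough sketch of what I mean (simulated data; the split sizes are arbitrary): fit on a small random subsample and judge the model on the large remainder, rather than on p-values, which are automatically tiny at this scale.

        # Sketch of "fit on a small representative sample, validate on the large rest"
        # (simulated data; the split proportions are arbitrary placeholders).
        import numpy as np

        rng = np.random.default_rng(1)
        n = 1_000_000
        X = rng.normal(size=(n, 5))
        y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=n)

        idx = rng.permutation(n)
        train, valid = idx[:5_000], idx[5_000:]          # small training set, large validation set

        # Ordinary least squares fit on the small training subsample
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

        resid = y[valid] - X[valid] @ beta
        print("validation RMSE:", round(float(np.sqrt(np.mean(resid**2))), 3))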

    Also, traditional methods such as regression really don't work as well as, say, the bootstrap.

    Trevor Hastie is at the leading edge. Some of his work is difficult for me to follow, but he covers the case where you have more predictor variables than observations.

    http://web.stanford.edu/~hastie/lectures.htm

    I hope this helps. Big data is a different world; it is a cross between computer science and statistics. More statisticians need to be in the fray, because engineers think only of methods, not of basics like a representative sample: what population do you wish to make inferences about?

    I'm not sure where you are getting the 0.75. Is that a p-value?

    I hate to write in an open forum and I hope this goes to you.

    Barbara




  • 7.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-13-2014 14:17
    You may be interested in this note posted by Kesten Green on some other lists (this version is from the INFORMS list but it's not as though Dr. Green is trying to keep this secret).
    ---------------------------------------------------------------------

    Subject: Last call for evidence that complex forecasting methods are better

    My new article with Scott Armstrong presents evidence that complexity increases forecast error. Have we missed any evidence that might challenge that conclusion?

    The article, "Simple Forecasting: Avoid Tears Before Bedtime," proposes that simplicity in forecasting requires that (1) method, (2) representation of cumulative knowledge, (3) relationships in models, and (4) relationships among models, forecasts, and decisions are all sufficiently uncomplicated as to be easily understood by decision makers. Our review of studies comparing simple and complex methods found 93 comparisons in 28 papers. Complexity beyond the sophisticatedly simple failed to improve accuracy in all of the studies and increased forecast error by an average of 32 percent in the 21 studies with quantitative comparisons.

    The effects are so consistent and substantial that we are concerned that we might have overlooked disconfirming evidence. If you know this area, please look at the references in the paper to see if we have overlooked any key studies, and send me your suggestions. The paper is available on the simple-forecasting.com page.
    ---------------------------------------

    And yes, this is the same Kesten Green (and J. Scott Armstrong) who authored the (in)famous paper Andrew Gelman blogs about here:  http://andrewgelman.com/2013/11/25/interesting-flawed-attempt-apply-general-forecasting-principles-contextualize-attitudes-toward-risks-global-warming/

    -------------------------------------------
    Michael Kruger
    Information Resources Inc
    -------------------------------------------




  • 8.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-15-2014 07:12
    I believe that we statisticians must exploit the trendiness of Big Data, Data Analytics, and Data Science to promote further involvement of statisticians in the decision-making process. The current applications of Big Data methods that seem most obviously and truly beneficial to me are the creation and quick use of massive databases (e.g., the Sloan Digital Sky Survey). But it is not obvious to me that the benefits will generally occur in terms of inference (prediction and understanding of processes). Admittedly, an intriguing field within Big Data is the analysis of unstructured data.

    Some threads in this discussion have already brought up important points. For example, there has been so much focus on Big Data that the need for Good Data has been forgotten. There have been proposals to use proprietary data, such as Google Trends data, without knowing how the data are gathered (Google only states that query data are based on 'samples' of total queries). Also, there is much evidence of simple statistical methods outperforming complex methods, though this is not always the case.

    By the definition of Big Data being used here (data that cannot be analyzed with traditional methods), it is not possible to compare the performance of Big Data prediction methods with that of traditional methods using the common tools (forecasting errors, etc.). Yet it is often argued that Big Data methods are better than traditional methods because they rely on more data. Hence, we still must find ways to compare performance, perhaps through return-on-investment metrics or earnings growth in the case of business problems.

    The Big Data revolution is occurring while, for the most part, we statisticians sit on the sidelines. However, if the Big Data trend fails, and the way things are going there is a 'good chance' of this at least partially occurring, statisticians will probably be blamed.


    -------------------------------------------
    Roberto Rivera
    Associate professor
    University of Puerto Rico Mayaguez
    -------------------------------------------




  • 9.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-17-2014 10:30
    All of us with statistical training have been disturbed by the rise of "BIG DATA" as a cure-all. Often, the proponents have little or no understanding of statistical theory. But they have to be answered. D.J. Finney once wrote about the statistician whose client comes in and says, "Here is my mountain of trash. Find the gems that lie therein." Finney's advice was not to throw him out of the office but to attempt to find out what he considers "gems". After all, if the trained statistician does not help, he will find someone who will. Frank Anscombe once warned that all large data sets contain "will-o'-the-wisps", strange coincidences that have no predictive power for similar sets of data.

    One solution to all this was proposed by R.A. Fisher and later by John Tukey: divide the data at random into two subsets, use one subset to generate hypotheses, and use the other subset to test them. When I taught elementary statistics, I'd present the students with a table of random numbers and watch while they found clear "non-random" patterns in it. Regarding this request for studies showing that non-statistical methods are useful, I fear that the hunt for them will run up against publication bias: only the success stories will be publicized.
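
    A small hypothetical sketch of that split-sample discipline (simulated, noise-only data): screen many candidate predictors on one random half, then test only the apparent "gems" on the other half, where most of them vanish.

        # Sketch of the split-sample idea: generate hypotheses on one random half of the
        # data, test them on the other.  The data here are pure noise, so apparent
        # "discoveries" on the exploration half should mostly vanish on the test half.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(7)
        n, p = 2_000, 500
        X = rng.normal(size=(n, p))          # 500 candidate predictors, all pure noise
        y = rng.normal(size=n)               # outcome unrelated to any of them

        order = rng.permutation(n)
        explore, confirm = order[: n // 2], order[n // 2 :]

        def screen(rows, cols):
            """p-values from simple correlation tests of y against each candidate."""
            return np.array([stats.pearsonr(X[rows, j], y[rows])[1] for j in cols])

        candidates = np.where(screen(explore, range(p)) < 0.05)[0]   # "gems" found
        replicated = candidates[screen(confirm, candidates) < 0.05]  # survive the test

        print(f"{len(candidates)} apparent discoveries, {len(replicated)} replicate")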

    -------------------------------------------
    David Salsburg
    -------------------------------------------




  • 10.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-18-2014 08:38
    David: This is very interesting. I emphasize the importance of hold-out sample validation in my recent book Willful Ignorance. I was unaware (but am not surprised) that Fisher and Tukey would have taken this position. Can you please send citations to any writing they may have done about this? Thank you.

    -------------------------------------------
    Herbert Weisberg
    President
    Causalytics, LLC
    -------------------------------------------




  • 11.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-20-2014 07:43
    David - I second Herbert's comment - I would be very interested in seeing what Fisher wrote about this -

    -------------------------------------------
    Peter Bruce
    Statistics.com
    -------------------------------------------




  • 12.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-18-2014 10:52
    David Draper proposes going a step further than David Salsburg's suggestion -- what he calls "calibrated cross-validation" (CCV). CCV involves partitioning the data into three sets: M for modeling, V for validation, and C for calibration. M is used to explore plausible models and V to test them, iterating the explore/test process as needed. Then fit the best model (or use Bayesian model averaging) using M ∪ V, reporting both the inferences from this fit and the quality of predictive calibration of this model on C.

    See http://www.ma.utexas.edu/blogs/mks/2014/04/28/david-draper-on-bayesian-model-specification-toward-a-theory-of-applied-statistics/ for a brief summary of this and other suggestions of Draper's, with a link to his lecture notes on these suggestions.
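
    A minimal, hypothetical sketch of that workflow (simulated data; the proportions and candidate models are placeholders, not Draper's, and the RMSE on C stands in for a proper calibration check):

        # Hypothetical sketch of a calibrated-cross-validation style partition:
        # M (modeling) to explore models, V (validation) to compare them, and C
        # (calibration) held back to report how the chosen model actually predicts.
        import numpy as np

        rng = np.random.default_rng(3)
        n = 30_000
        X = rng.normal(size=(n, 4))
        y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

        idx = rng.permutation(n)
        M, V, C = idx[:15_000], idx[15_000:24_000], idx[24_000:]

        def fit_ols(rows, cols):
            """Least-squares fit of y on an intercept plus the chosen columns of X."""
            A = np.column_stack([np.ones(len(rows)), X[np.ix_(rows, cols)]])
            beta, *_ = np.linalg.lstsq(A, y[rows], rcond=None)
            return cols, beta

        def rmse(model, rows):
            cols, beta = model
            A = np.column_stack([np.ones(len(rows)), X[np.ix_(rows, cols)]])
            return float(np.sqrt(np.mean((y[rows] - A @ beta) ** 2)))

        # Explore candidate models on M, compare them on V ...
        candidates = [fit_ols(M, cols) for cols in ([0], [0, 1], [0, 1, 2, 3])]
        best = min(candidates, key=lambda m: rmse(m, V))

        # ... refit the winner on M union V, then report its performance on C.
        final = fit_ols(np.concatenate([M, V]), best[0])
        print("calibration-set RMSE:", round(rmse(final, C), 3))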

    -------------------------------------------
    Martha Smith
    University of Texas
    -------------------------------------------




  • 13.  RE: Strong evidence of better predictive performance when using Big Data

    Posted 11-19-2014 16:24
    Dear Roberto,

    Data Science is much ballyhooed but should not be dismissed as just hype, IMO.  FYI, here are some sources that I have found very useful:

    - Introduction to Data Science (Stanton and De Graaf)
    - Field Guide to Data Science (Booz Allen Hamilton)
    - Data Mining Techniques (Linoff and Berry)
    - Handbook of Statistical Analysis and Data Mining Applications (Nisbet et al.)
    - Data Mining (Witten et al.)
    - An Introduction to Statistical Learning (James et al.)
    - Elements of Statistical Learning (Hastie et al.)

    The last four are widely-cited in the data mining community and Hastie et al. is almost regarded as a bible.  (I think it's superb.)

    Regards,

    -------------------------------------------
    Kevin Gray
    Cannon Gray LLC
    -------------------------------------------