ASA Connect

  • 1.  Your Thoughts on this Topic - "Data Science: The End of Statistics?"

    Posted 08-21-2017 10:16
    Edited by Kelly Zou 08-23-2017 09:54

    Interesting topic in recent years: "Data Science: The End of Statistics?"

    http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A64495

     

    ASA's Statement:

    http://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science

     

    "50 Years of Data Science" by Prof. David Donoho:

    https://www.r-statistics.com/2016/01/50-years-of-data-science-by-david-donoho

     

    "A recent and growing phenomenon is the emergence of 'Data Science' programs at major universities, including UC Berkeley, NYU, MIT, and most recently the Univ. of Michigan, which on September 8, 2015 announced a $100M "Data Science Initiative" that will hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments."

     

    Your thoughts on this topic?



  • 2.  RE: Your Thoughts on this Topic - "Data Science: The End of Statistics?"

    Posted 08-22-2017 12:00
    Before anyone reads anything by Vincent Granville (the author of "Data
    Science: The End of Statistics?", a web link posted by Dr. Zou), you
    should look at Andrew Gelman's comments about Dr. Granville:

    http://andrewgelman.com/2014/12/13/dont-dont-dont-dont-brothers-mind-unblind/

    As for the emergence of "Data Science" programs, the more the
    merrier. They may be a bit too enthusiastic about their methods and a bit
    too critical of traditional statistical methods, but we can handle that.

    There is a spot, though, where we should draw the line. There are some
    people selling "snake oil" versions of data science. Thankfully, it's
    pretty easy to spot them. They make grandiose and unsupportable claims
    and fail to address the limitations of their methods. It's kind of like
    those "doctors" who peddle cure-alls that work for everyone with no side
    effects and no contraindications.

    Most data scientists are not selling snake oil so to them I say (quoting
    Bruce Willis from the first Die Hard movie) "Welcome to the party, pal!"

    --
    Steve Simon, mail@pmean.com
    I'm blogging now! blog.pmean.com




  • 3.  RE: Your Thoughts on this Topic - "Data Science: The End of Statistics?"

    Posted 08-22-2017 13:28
    I found the 2013 piece lacks a certain thoughtfulness.  Contrasting statistics as 'slow and methodical' v. big data as 'fast and flashy' tells me that the writer has conflated science with marketing.  While statistics as a science has its roots elsewhere, the core of scientific tradition from Francis Bacon to the present is based on the question: "How do you know?"  Statistics is in that tradition, not the art of the deal.





  • 4.  RE: Your Thoughts on this Topic - "Data Science: The End of Statistics?"

    Posted 08-23-2017 02:07
    This sounds like something the panel I was on discussed at JSM2017.

    A lot of statisticians don't like "data science" because "sound" statistical theories are different from, and sometimes contradictory to, data science theories. Both have flaws. Let's get over ourselves and work together.

    A lot of statisticians will point out how data science has messed up analyses. Data scientists can do the same for traditional statistical models. Data scientists can "prove" statisticians are wrong. Statisticians can "prove" data scientists are wrong. How about we admit we can do better and work together? (On a personal note, I have gone back and reanalyzed a lot of the data we used in my stats classes using "data science techniques" and found data science did either the same or better in explaining the data.... unless I reject some of the ideas of "good modeling techniques" and make models with interactions, etc. Then they get better.)

    Take missing data, for example. How does statistics handle it, and why does it do that? For a data scientist, a lot of the methods used couldn't care less about missing data points or bad data. Whose theories are "correct"? Let's stop fighting and work to make things better.

    Take the type of modeling done with data. A statistician will essentially hand select variables and terms that belong in the model. A statistician will try to make the model as simple as possible. Why? Is the world a simple place? If so, why isn't cancer cured? Why don't we know exactly why all diseases occur? Why can't we predict with great accuracy when someone will get the disease? (My mother's doctors couldn't tell the difference between a viral infection and stage 4 cancer. Even her highly rated oncologist would have thought virus.)

    The data scientists will let the data and the algorithms decide what belongs and what doesn't. If a simple model is sufficient, then the algorithm will tell you. If not, it will tell you. I think George Box said it best, "Listen to the data!"

    If you look in some data mining textbooks, they discuss traditional statistical techniques. I've searched through multiple stats books. I don't remember any mentions of CART and Random Forest models in my regression textbooks. Why?   

    One of the biggest differences between "data science" and traditional statistics is the use of technology. Statisticians use Proc SQL or something similar to manipulate data on their desktop and use it as a database system. (It isn't one.) A data scientist will use actual database systems to manipulate data (and does it far faster and better). A statistician will use data sets that "fit" on their desktops. The data scientist will use a tablet to talk to the server and use all the data possible.... and do it faster and better.

    Data Scientists can learn from statisticians. Statisticians can learn from data scientists. How about we work together? 



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 5.  RE: Your Thoughts on this Topic - "Data Science: The End of Statistics?"

    Posted 08-23-2017 17:44
    Edited by Angel D'az 08-23-2017 17:55
    This is a very interesting topic and there are great replies already. I have a limited perspective, only having spent a few years in the statistics+data world and as an undergrad.

    I think that a lot of "data science", and of what a data scientist does (computational mathematics), is not mathematically rigorous enough, maybe for communication's sake. Communicating rigorous computational mathematics is difficult. As an example: how do you explain hypothesis testing without falling back on a binary "Reject - Do Not Reject" outcome? How do you simplify concepts without dumbing them down? I think this strikes at the biggest flaws in data science, as described by the older statistical world here. Statistics has largely been in a bubble, according to Tukey in one of your articles, and there has not been a need for a lot of statisticians to simplify without dumbing down. Across the board, statisticians are trained to communicate with statisticians, students with teachers. Of course there are consultancy programs, but I don't think general audience consumption is an integral value in the pedagogy - as John Chambers, Bill Cleveland and Leo Breiman say. I may be completely wrong, having very little experience and only being an undergrad.

    Having said that, I completely agree with another reply here: "The more, the merrier". It's awesome that statistics is expanding, even if not in the most perfect way according to the old guard, or whatever they/you would like to be called/call them.

    ------------------------------
    Angel D'az
    Metrics Analyst Intern
    Micron Technology
    ------------------------------



  • 6.  RE: Your Thoughts on this Topic - "Data Science: The End of Statistics?"

    Posted 08-24-2017 20:31
    This is an interesting topic. The tension is real, and statisticians do feel territorial concern about their field. Statisticians can learn from big data, new technologies, and emerging fields. Statisticians should view data science as a new opportunity instead of a threat.

    This report from AAPOR about big data is a nice piece written from the perspective of survey statistics. It clarifies how the survey world actually needs more research to take advantage of big data and data science techniques. There are new opportunities presented by big data.
    http://www.aapor.org/Education-Resources/Reports/Big-Data.aspx

    It is clear that Data Science is a multi-disciplinary field, and anyone relevant will benefit from collaboration across the disciplines. However, it takes a lot of effort to embrace the differences and work together towards something bigger. 

    A news item today shows that there are such efforts going on to bring everyone together.

    NSF Awards $17.7 Million in Funding for 12 Transdisciplinary Research in Principles of Data Science (TRIPODS) projects.
    New NSF awards will bring together cross-disciplinary science communities to develop foundations of data science | NSF - National Science Foundation





    ------------------------------
    Yueyan Wang
    ------------------------------



  • 7.  RE: Your Thoughts on this Topic - "Data Science: The End of Statistics?"

    Posted 08-26-2017 14:20
    The new issue of Impact (a magazine of the Operational Research Society) includes an article by Geoff Royston entitled "Small Data". Geoff notes that in analytical and management circles there is much talk about Big Data nowadays. As he explains, the landscape of the digital world features vast ranges of data mountains thrown up by business transactions, public services and social communication. Computers and analytical techniques allow these to be mixed, matched and mined rapidly and extensively in order to find trends, patterns and connections in such areas as customer purchases, population health or even popular culture.

    But are more data always an answer?
    To answer the above question, let's look at this story:

    The chief economist of the Bank of England, Andy Haldane, has recently reported that the 'Michael Fish moment' of failing to predict the bank crash of 2008 highlighted a crisis in economic forecasting, and that only big data could bring about a transformation of economic forecasting in the same way it has improved weather forecasting. But could it? The answer is no, because it was not a data problem. What was needed to gauge the risk of a financial storm was a better understanding of some basic statistical concepts and a proper, realistic financial model. And, as in any financial bubble, there were behavioral factors at play too, which always make economic forecasting an even more uncertain business than predicting the weather. Big data, valuable though it undoubtedly is, will not be enough.

    It is a shame, but I do not have time to summarize the whole of the article, as much as I would love to.

    But here are main points made by Geoff:

    The story of big data is an inversion of the story of statistics, because a key concept in statistics is that it is not necessary to measure all of a large population in order to establish its key features; a sample should generally suffice.
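
    As a rough illustration of the "a sample should generally suffice" point (this sketch is not from the article; the population, its size, and the function name are made up for the example), one can simulate a large population and compare its mean with the mean of a small random sample:

```python
import random
import statistics


def sample_mean_vs_population(pop_size=200_000, sample_size=1_000, seed=0):
    """Return (population mean, mean of a simple random sample).

    The population is synthetic (normal, mean 50, sd 10); these values
    are assumptions chosen only for illustration.
    """
    rng = random.Random(seed)
    population = [rng.gauss(50.0, 10.0) for _ in range(pop_size)]
    # Simple random sample without replacement.
    sample = rng.sample(population, sample_size)
    return statistics.fmean(population), statistics.fmean(sample)


if __name__ == "__main__":
    pop_mean, samp_mean = sample_mean_vs_population()
    print(f"population mean: {pop_mean:.2f}")
    print(f"sample mean:     {samp_mean:.2f}")
```

    With 1,000 draws the standard error of the sample mean is roughly 10 / sqrt(1000), about 0.3, so the sample lands close to the full-population value while measuring 0.5% of it. This only holds because the sample is drawn at random, which is exactly what "found" big data often is not.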

    Some work has been devoted in recent years to how to make the best use of very small, cost-effective samples.

    The advent of big data has sometimes been taken to indicate that, as huge volumes of data of all varieties can now be so easily and cheaply collected, small data, and maybe even the discipline of statistics, are no longer important. But there are a lot of small data problems that occur in big data:
    - irrelevance (much big data is passively found, whereas small data is actively sought)
    - errors (of collection or recording)
    - noise (finding a needle in a haystack)
    - sampling bias (another thing to do with the 'found' nature of much big data; even if your data comes from the usage records of 50 million smartphones, you are still sampling only smartphone users)
    - false positives (while all car owners may buy things at garages, not everybody who buys at a garage owns a car)
    - historical bias (the past is not always a good basis on which to predict the future, especially in turbulent times)
    - multiple-comparison hazards (test a big data set for enough relationships and some associations will come up eventually)
    - risk of confusing correlation with causation (increases in autism are correlated with increases in vaccination, but there is no causal link)
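
    The multiple-comparison hazard in the list above is easy to see in a small simulation. This is only a sketch with made-up settings (100 observations, 200 pure-noise variables, hypothetical function names), not anything taken from the article. Every variable is generated independently of the outcome, yet some still cross the usual 5% significance cutoff for a correlation:

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious_hits(n_obs=100, n_vars=200, r_crit=0.197, seed=0):
    """Count noise variables whose |r| with the outcome exceeds r_crit.

    r_crit = 0.197 is approximately the two-sided 5% critical value of r
    for n_obs = 100; it would need recomputing for other sample sizes.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_vars):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        if abs(pearson_r(noise, outcome)) > r_crit:
            hits += 1
    return hits


if __name__ == "__main__":
    print("spurious 'significant' correlations:", count_spurious_hits())
```

    With 200 unrelated variables tested at the 5% level, around 10 "discoveries" are expected by chance alone, which is why screening a big data set for relationships needs some correction for the number of tests.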

    I took a photo of the article for those who want to read it. See attachments.




    ------------------------------
    Robert Pieczykolan
    ------------------------------