ASA Connect

Defining Data Science

  • 1.  Defining Data Science

    Posted 12-07-2017 08:43
    The University of Delaware is focusing on Data Science by adding an Institute of Data Science and creating an M.S. in Data Science.  The Statistics Program is cooperating in both endeavors, but with some concerns.  In that process I have spent a lot of time reading on Data Science and seeking a good working definition.  I am amazed that defining this new thing is not of much concern to others, so almost everything about data is considered data science.  These are my working bullet points about Data Science, and I would appreciate comments from others.

    What makes Data Science different from Statistics? Here is my list of key components of Data Science:

    • There usually is not strong theory or prior knowledge driving the analysis.
    • The data are primarily collected for other purposes, and what is needed is to obtain, reorganize, and manage them.
    • Data manipulation and exploration are often the main focus.
    • Data often arrive as a continuous stream rather than being collected at discrete time period(s).
    • Data may be numbers, but also pictures, objects or text.
    • There are often many, many variables that may not be well defined or understood.
    • The objective is most often to predict something or make a specific decision, rather than theory testing or understanding relationships or associations.
    • Inference may not be an interest of the analysis, nor well understood.
    • It is often business driven, though the military, law enforcement, urban development, health care, and nonprofits are interested.
    • There is a growing social science component.


    ------------------------------
    Thomas W. Ilvento
    University of Delaware
    ------------------------------


  • 2.  RE: Defining Data Science

    Posted 12-08-2017 06:59
    Thomas, I would suggest you review some of the previous threads in this community.  Very similar questions about statistics vs data science have been discussed over the last several months, and I think you will glean some good insights from those discussions. There are many points of view on this topic!

    ------------------------------
    Fred Hulting
    Director, Global Knowledge Services
    General Mills, Inc.
    ------------------------------



  • 3.  RE: Defining Data Science

    Posted 12-12-2017 01:58
    Fred,
    Thank you.  I did read previous threads and also looked at a number of panel discussions.  They were helpful.  I wanted to move toward a definition.  I know that is difficult, but if we are offering a degree in this new thing called Data Science, we should have a good idea of what it is and what it isn't.

    ------------------------------
    Thomas Ilvento
    University of Delaware
    ------------------------------



  • 4.  RE: Defining Data Science

    Posted 12-13-2017 07:55

    Thomas,


    One suggestion is to first define "statistics" or at least "applied statistics".  At our panel discussion on Data Science at JSM (see proceedings) we approached it this way, and I think it is instructive.  Applied Statistics is not well-defined either; it is very heterogeneous, with lots of unique characteristics depending on the application area, the size of data sets, the desired outcomes, etc.  If you can define that, then defining data science might be easier.


    Fred






  • 5.  RE: Defining Data Science

    Posted 12-08-2017 08:25

    So it sounds like you are putting statistics into a subcategory (more refined) of data science.  Is that what you mean to say?

     

    Susan E. Spruill, PStat®
    Statistical Consultant, President
    Applied Statistics and Consulting
    828-467-9184 (phone)
    Professional Statistician accredited by the American Statistical Association
    www.appstatsconsulting.com

     






  • 6.  RE: Defining Data Science

    Posted 12-11-2017 10:40
    Data Science includes many things that are not related to stats: database management and programming (for data, not analysis), to name a couple.

    ------------------------------
    Michael Mout
    MIKS
    ------------------------------



  • 7.  RE: Defining Data Science

    Posted 12-12-2017 10:57
    • "The objective is most often to predict something or make a specific decision, rather than theory testing or understanding relationships or associations."

    How can valid prediction or decision be made without understanding relationships or associations?

    Cheng Cheng, Ph.D.
    Member
    Department of Biostatistics
    St. Jude Children's Research Hospital
    262 Danny Thomas Place, MS 768
    Memphis, TN 38105
    Tel. 901-595-2935; FAX 901-595-8843

    Email disclaimer: http://www.stjude.org/emaildisclaimer

     






  • 8.  RE: Defining Data Science

    Posted 12-13-2017 07:23
    Cheng,
    I fully appreciate your point.  My training emphasized the use of theory and literature to guide the process of testing and model building.
    My goal is to better understand what this new field is and what it entails.  So far I have great confidence that the field of Statistics (Biostatistics) will always have a future and a role to play in data analysis.  I started out feeling far more worried.  But I also recognize that the Data Science Train is leaving the station and our program needs to be on that train.  I want to better understand this new field and that is why I am reaching out in this thread.

    ------------------------------
    Thomas Ilvento
    University of Delaware
    ------------------------------



  • 9.  RE: Defining Data Science

    Posted 12-15-2017 10:58

    Hi Thomas,

     

    Indeed, I share the same concern. About 20+ years ago there was "Data and Knowledge Mining", and in the early 2000s there was "Machine Learning for Microarray Data"; we worried about whether statisticians had missed the train. When the hype died down somewhat, the importance of our profession became more obvious (at least among my collaborators). BUT this definitely does not mean that we can simply sit on the sideline and see where "Data Science" goes. Many talented statisticians have engaged deeply in the technology and data revolution of the past 20+ years, and we must keep at least the same level of engagement going forward.

     

    Cheng

     




    Email Disclaimer: www.stjude.org/emaildisclaimer
    Consultation Disclaimer: www.stjude.org/consultationdisclaimer





  • 10.  RE: Defining Data Science

    Posted 12-13-2017 08:15
    I think the problem is the word 'decision' to describe the response to data science predictions.  The dictionary defines decision as "a determination arrived at after consideration".  We need a better word to describe the small actions that result from data science algorithms, because they are not considered decisions. Rather, they are algorithmic actions where the contents of the independent variables, often reduced to the word "feature", are not of much interest.

    For example, I just went to the New York Times website and was fed an ad for a pack of gift steak knives. This ad was the result of a sequence of bidding, optimizing, and actions between servers, including my own. The advertiser spent money on an Internet campaign with some targets but has little control over the details. It is not the same as a classic advertising strategy, followed by a decision, to buy physical print space in a newspaper.

     




    ------------------------------
    Georgette Asherman
    ------------------------------



  • 11.  RE: Defining Data Science

    Posted 12-14-2017 14:09
    For a great deal of science, there often is no theory or the theory is just plain wrong. 

    Let's take toxicology, for example. Toxicologists don't test for interactions among medications because they believe the interactions don't occur or you can't test for them. Most of their theories assume Design of Experiments doesn't exist because "Everybody knows you can't change more than one thing at a time..." Thus you never see warnings on your meds... oh wait, you do.

    If you require theory to determine whether some effect is real, especially if it's not straightforward, people will die. People have died because no theory existed.

    Some things to think about: use the theory available at the time to prove
    1) Mold is an antibiotic. (Penicillin)
    2) There's a hole in the ozone layer.
    3) Why we sleep.
    4) How coffee works.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 12.  RE: Defining Data Science

    Posted 12-15-2017 10:37

    The point is whether a prediction or a decision generated without understanding any relationship or association can be trustworthy.

     

    A substantial portion of scientific studies do not have a theory; in fact, they generate theory by properly, effectively, and efficiently exploring the data (doing a data exploration well requires some study design too), and the theory so generated is then tested with further experiments.

     

    I do not work with toxicologists, but I work with pharmacogenomics investigators who routinely study chemo toxicities a great deal, where drug-drug and drug-host interactions are always critical factors, which are handled by proper experimental designs and statistical modeling. Sometimes, discovering interactions is the goal of the study. "Everybody knows you can't change more than [one] thing at a time..." was already outdated when I was in grad school.
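
    As an illustration of the interaction point (a minimal sketch, not taken from the thread), the code below compares a one-factor-at-a-time comparison with a full 2x2 factorial design. The response function, its coefficients, and the names `response`, `runs`, and `ofat_prediction` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import product


def response(drug_a: int, drug_b: int) -> int:
    """Hypothetical toxicity response for two drugs coded 0 (absent) / 1 (present).

    The 5 * drug_a * drug_b term is an assumed drug-drug interaction.
    """
    return 10 + 2 * drug_a + 3 * drug_b + 5 * drug_a * drug_b


# Full 2x2 factorial design: every combination of the two factors is run.
runs = {(a, b): response(a, b) for a, b in product((0, 1), repeat=2)}

# One-factor-at-a-time: change each drug alone, starting from the baseline (0, 0).
effect_a_alone = runs[(1, 0)] - runs[(0, 0)]
effect_b_alone = runs[(0, 1)] - runs[(0, 0)]

# Adding the single-factor effects assumes there is no interaction.
ofat_prediction = runs[(0, 0)] + effect_a_alone + effect_b_alone

# The factorial design observes the combined run, so the interaction contrast
# can be estimated directly from the four runs.
interaction = runs[(1, 1)] - runs[(1, 0)] - runs[(0, 1)] + runs[(0, 0)]

print("Predicted from one-at-a-time runs:", ofat_prediction)
print("Observed with both drugs:", runs[(1, 1)])
print("Interaction contrast:", interaction)
```

    Under these assumed numbers, the one-at-a-time runs never include the combination of both drugs, so the gap between the additive prediction and the combined run is invisible to them; only the factorial design exposes it.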

     




    Email Disclaimer: www.stjude.org/emaildisclaimer
    Consultation Disclaimer: www.stjude.org/consultationdisclaimer





  • 13.  RE: Defining Data Science

    Posted 12-16-2017 03:23
    Cheng and Andrew,

    Use of statistics for prediction, decision, or any inference depends on some model, which hopefully is based on some understanding of the context.  A well-known statistician (I don't recall which one; I think he was a Bayesian) once pointed out that all inference is conditional on the model.

    Of course, predictions and decisions can be made statistically without hypothesis testing.  On the other hand, hypothesis testing has been cast in terms of statistical decision theory.  I currently have much sympathy for the view that statistical theory is statistical decision theory.  There are, of course, statisticians who disagree with this approach.

    But I think my real point is not to confuse any of this conceptually with theories in science, although there rightly is overlap in practice.

    ------------------------------
    Thomas M. Davis
    ------------------------------



  • 14.  RE: Defining Data Science

    Posted 12-18-2017 10:40
    When designing an experiment, you can go a few different ways. 

    If you want to improve a system or method, you can question every step in a process and ask, "Why is this factor set at this level? Why do we do this step?" Let's just say there is a reason why Lean and Six Sigma are popular in industry.

    You can say, "These might be realistic levels for a natural system. Let's see what happens when we start making these changes...."

    You can say, "I want to make Product W and start with materials X, Y and Z. Let's see what happens." 

    In science, you tend to develop theory from data. You change your theory when new data contradict it. You might go into an experiment thinking there is some sort of relationship. You might just be interested to see what happens when....


    Let's be honest. If you found some relationships that are very predictive but have no theoretical basis, and you can: 
    1) Make lots of money
    2) Save lots of lives
    3) Live better

    You'll use those relationships. You can develop the theory later. Until then, knowing that under a certain set of conditions, some event will occur IS the important part. Remember, science starts with data then goes to theory. Data => Theory => (Better Data => Better Theory)^N

    Also consider that, if someone is a subject matter expert and they already know what will happen with 100% certainty, then there is NO POINT in doing an experiment. There is NO NEED for statistics nor data science.... and there is no hole in the ozone layer.

    Until then, spurious correlations, odd relationships and "Hhhmmmm. That's interesting." will rule science and we will need statistics and data science to feed that curiosity. We will need statistics to help determine if an effect is "real" or "probably nothing". We will need data science to go through all the data to find those correlations and relationships. 
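
    To make the "real" versus "probably nothing" distinction concrete, here is a minimal simulation sketch (not from the thread). It screens many pure-noise variables against a pure-noise outcome and counts how many look correlated by chance alone. The sample size, the number of variables, the seed, and the 0.36 cutoff (roughly the two-sided 5% critical value of Pearson's r for 30 observations) are all assumptions made for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


random.seed(0)  # fixed seed so the sketch is reproducible
n_obs, n_vars = 30, 1000  # assumed: few observations, many candidate variables

# The outcome and every candidate variable are independent random noise,
# so any correlation found between them is spurious by construction.
outcome = [random.gauss(0, 1) for _ in range(n_obs)]
candidates = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

correlations = [pearson(variable, outcome) for variable in candidates]
strongest = max(abs(r) for r in correlations)
n_flagged = sum(abs(r) > 0.36 for r in correlations)

print(f"Strongest |r| among {n_vars} noise variables: {strongest:.2f}")
print(f"Variables with |r| > 0.36: {n_flagged}")
```

    With a 5% cutoff and 1,000 unrelated variables, something on the order of 50 of them should clear the bar by chance, which is why screening for relationships and judging whether an effect is real are separate jobs.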

    If you don't think so, ask yourself this, "Would someone from 10, 20, 50, 100, 200, 500, 1,000 years ago feel the results I have are consistent with their thinking/theory?" or "Where did the theories I use today come from?"

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 15.  RE: Defining Data Science

    Posted 12-18-2017 10:39
    Actually, the idea behind "You can only change one thing at a time..." was wrong back in the 1880s. However, it is the dominant statistical idea in academic research throughout every science discipline. It's completely wrong, but very popular.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 16.  RE: Defining Data Science

    Posted 12-19-2017 11:07

    That's exactly why scientists need to collaborate with statisticians; even the classical response surface design deals (to some extent) with changing more than one factor simultaneously. And I don't think this thought is very popular among scientists, at least not among my collaborators.

     

    Cheng

     









  • 17.  RE: Defining Data Science

    Posted 12-20-2017 09:12
    I studied Biochemistry, Physics and Environmental Science before getting my master's degree in applied mathematics and applied statistics. The sum total of my undergraduate statistics education was about 6 pages.

    I asked to take a course from the Industrial Engineering department with my MS Env Sci degree, Design of Experiments. The program advisor said, "If you don't already know how to design an experiment, you shouldn't be in this program!" 

    The issue isn't that scientists don't know much about statistics. It's that they feel the minimal amount of material they covered is more than sufficient.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 18.  RE: Defining Data Science

    Posted 12-21-2017 11:11

    The unfortunate reality is that some (many?) scientists think/believe they know enough Statistics and often confuse Statistics with other quantitative disciplines.

     




    Email Disclaimer: www.stjude.org/emaildisclaimer
    Consultation Disclaimer: www.stjude.org/consultationdisclaimer





  • 19.  RE: Defining Data Science

    Posted 12-12-2017 02:10
    Susan,
    No, I don't think statistics is a subset of data science.  I do think they overlap, and they may assist each other.

    I must confess my first reaction was, "Is data science just statistics on the cheap?"  But the more I explored, the more I realized that if it were the things in my list, then that really wasn't statistics as I was taught.  My training is very much in the social science use of statistics.  If I collect data for a purpose, to test theory or relationships, then that has been around for a long time.  However, if I have a continuous stream of data, and that data is not well understood, and I have lots of variables, and some variables are not numbers but text or pictures, well then that is something very different.  I can recognize that and appreciate it as something very different.

    I laugh when I think that if I had had over 1,000 variables when I was doing my dissertation, my committee would have called me lazy.  I would have been implored to focus on a smaller set of variables, choose based on theory (however defined), and develop hypotheses that could be tested.

    My feeling is that if Data Science is a new discipline, and we are going to train students in that discipline, then we as educators should have a good idea of what the foundations of that discipline are.  I have also looked at what other programs around the country think Data Science is.

    ------------------------------
    Thomas Ilvento
    University of Delaware
    ------------------------------



  • 20.  RE: Defining Data Science

    Posted 12-08-2017 12:12
    I have made this argument before: Data Science is like a jack-of-all-trades, covering stats, programming, data design, and, potentially, many other things. In general, a Data Scientist should know something about each of these areas, but they will more than likely not be a specialist in any of them.

    Stats includes many areas: sampling, modeling, theory, time series, ... Each of these can be its own specialty. To do each of these correctly requires study and experience. Beware of any data scientist who claims to do any one of these well. That is like asking someone with a bachelor's in Mathematics to do advanced work in real analysis or multidimensional calculus or even numerical analysis.

    If you want someone to design an effective database, then hire a database specialist. If you want someone to design a computer program to analyze and manipulate large data sets, hire an experienced programmer. If you want any advanced stats analysis (modeling, experimental design, clustering, ...), hire a stats guy with the appropriate specialty.

    Of course, there are always a few exceptions to this guidance. But, IMHO it is a good rule.

    Michael L. Mout, MS, Cstat, Csci (Ret)
    MIKS & Assoc. - Senior Consultant/Owner






  • 21.  RE: Defining Data Science

    Posted 12-11-2017 11:38

    It is quite possible for one person to be both a database expert and a statistics expert.  Perhaps, not often, that person could also be a math expert, a crack software engineer, and adept at various other things!  What's wrong with that?


    BTW statistics are very handy for optimizing database performance.  There is at least one RDBMS out there running with an embedded performance model.






  • 22.  RE: Defining Data Science

    Posted 12-12-2017 11:27
    Of course, there is the rare person who can be expert in many things, but if employers or people who hire contractors expect that from everyone who calls themselves a Data Scientist they will be sorely disappointed.

    And don't get me started on automated black box modeling tools.

    Michael L. Mout, MS, Cstat, Csci
    MIKS & Assoc. - Senior Consultant/Owner
    4957 Gray Goose Ln, Ladson, SC 29456
    804-314-5147(Mbl), 843-871-3039 (Home)





  • 23.  RE: Defining Data Science

    Posted 12-19-2017 09:12
    What makes something a "Black Box"? Is it an assumption that the user doesn't know what the software is doing or how the software is doing it? 

    If that is the case, all software is a "black box". Each piece of software is trying out different algorithms and different implementations of algorithms. Other than for small data sets, n < 1,000 rows, I can't think of a formula from a stats book that would be implemented as is. With the new technology, i.e., multiple cores on a CPU, most of the software algorithms from 10-15 years ago are obsolete. Multithreading is here and makes things run faster.... with enough data.  A good math library, like Intel's MKL, makes most of the material in a typical Numerical Analysis class obsolete. It negates the need to be a great programmer and great computational mathematician in order to do math on a computer well.

    How often does anyone use something where they don't fully understand what the item/object they are using is doing? Daily, Hourly, Minutely? If I may adapt the folly of many of my pure math professors, "Derive the metabolic pathway(s) for Oxygen/Caffeine/Alcohol. If you can't (prove) it, you can't use it." In other words, every living system is a "black box". Food + Oxygen + Caffeine + Alcohol goes in, a miracle occurs, life continues.   

    BTW, what does it take to become an "expert" in something? I can go to my local community college and take a 3-course sequence in programming: Intro to Java, Object Oriented Java and Data Structures in Java. If I take a fourth class in Comp Sci, I'll take Design of Algorithms. After that class, almost all the other classes in Comp Sci use the ideas taught in those classes. There will be several courses that cover the same "Intro to" material, but for C#, Python/Perl, C, C++, etc.

    I can take a 3-course sequence in database systems at that college too: Intro to Databases, Database Administration and NoSQL Databases. Those 3 classes cover the same number of topics found in a typical database sequence from an Info Systems department.

    Once you understand algorithmic thinking, especially with multi-threading and high quality math libraries, the combined experience of people with dozens of years of "expertise" becomes obsolete.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 24.  RE: Defining Data Science

    Posted 12-20-2017 16:28

    R package documentation often contains a comprehensive list of technical references, and the source code can be consulted as well.  SAS procedures also have technical reference lists.  So "white box" algorithms are readily available.  And so are resources for growing your own algorithms.

     

    Which is not to say that "black box" usage is acceptable, or uncommon.  What percentage of drivers can replace a flat tire with their spare, for example?






  • 25.  RE: Defining Data Science

    Posted 12-12-2017 02:14
    Michael,
    Good points.  I have heard that for many companies, data science is done in groups with several disciplines represented.

    ------------------------------
    Thomas Ilvento
    University of Delaware
    ------------------------------



  • 26.  RE: Defining Data Science

    Posted 12-14-2017 04:20
    Dear all, 

    I spent 3 weeks at the Park City Math Institute in the summer of 2016 with a group of mathematics, computer science and statistics educators trying to decide what Data Science means for undergraduates. We came up with a set of guidelines (published in the Annual Review of Statistics), which you can access here (or attached):

    http://www.annualreviews.org/doi/suppl/10.1146/annurev-statistics-060116-053930

    and a set of possible courses here:

    http://www.annualreviews.org/doi/suppl/10.1146/annurev-statistics-060116-053930/suppl_file/st04_de_veaux_supmat.pdf

    These are not intended as "the answer" to what Data Science might mean, but we'd love feedback on the suggestions.

    Thx!






    ------------------------------
    Dick De Veaux
    Williams College
    Past Chair, SLDS section
    ------------------------------




  • 27.  RE: Defining Data Science

    Posted 12-15-2017 09:01
    Thanks, Richard.  These are very useful.

    ------------------------------
    Thomas Ilvento
    University of Delaware
    ------------------------------



  • 28.  RE: Defining Data Science

    Posted 12-20-2017 14:45
    Interesting read. 

    I like that the group decided that they need to include language for courses in outside disciplines.

    The way it's laid out, it looks like a BS stats degree with a minor in comp sci. If that were to happen, how would it affect the way upper level stats classes are taught? If students entering the upper level classes already have good knowledge of programming methods, will stats profs still spend as much time discussing what the code does, as opposed to handing over the program knowing the students can figure out what the lines of code do?

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 29.  RE: Defining Data Science

    Posted 12-15-2017 11:33
    I tend to think of Data Science as a pyramid with statistics as the foundation, computational tools in the middle and domain knowledge/communication skills at the top. One of the things that I have found in looking for data science positions is that hiring managers (who generally are not statisticians) tend to view data scientists in the same vein as software engineers.  The interview process is the same for a data scientist as it is for a software engineer (phone screen, take-home exam, on-site technical/fit interview with whiteboard problems). This can lead to massive mismatches where everyone loses. Companies get programmers who don't really understand the assumptions behind the statistical methods (to them, they are just algorithms) and wonder why they are not getting the return on investment they seek.  People well trained in statistics can't get positions because companies are too busy testing to see if you can code in Python, R, SQL, or Scala like a software developer. Some companies also expect you to know Spark and Hadoop as well.

    At the end of the day, statisticians in academia and industry need to be involved in developing data science curricula, teaching DS courses, hiring degreed statisticians and making their voices heard in public forums.  Right now, what you hear from the stats community is a relative silence and that is not good for the profession.  Put it this way, I get more alerts on LinkedIn for data scientist positions than I do for statistician positions, yet the job descriptions are very similar.


    ------------------------------
    Brian Cocolicchio

    ------------------------------