Reaching out to Data Scientists

  • 1.  Reaching out to Data Scientists

    Posted 07-11-2016 16:02

    Hello Everyone,

    There's been a lot of discussion on the net about what a data scientist is, whether we're data scientists, whether statistics is dead, and so on. One of the things that's clear when you read posts by data scientists and for data scientists is that, depending on their background, many data scientists are relearning (and sometimes ignoring) what Statistics teaches. There are also many data scientists who, even if they know their statistics, are not involved in the American Statistical Association or like societies. The ASA published a statement on the Role of Statistics in Data Science (http://www.amstat.org/misc/DataScienceStatement.pdf) last October. The final line says “The ASA aims to facilitate collaboration between statisticians and other data scientists and thus enable them to achieve more than they could on their own.”

    Ron Wasserstein, Executive Director of ASA, discussed the statement in his blog back then (http://community.amstat.org/blogs/ronald-wasserstein/2015/10/01/the-role-of-statistics-in-data-science-an-asa-statement) and outlined some of the ASA’s efforts to “facilitate further collaboration between statisticians and other data scientists.” As Chair-Elect for our section and as a member of the Committee on Applied Statistics, I’m interested in what you do to facilitate this collaboration.

    • Do you collaborate with others who call themselves data scientists?
    • How successful is that collaboration?
    • What makes it successful?
    • Is there anything specific to the statistician / data scientist collaboration that you would not find in other collaborations?
    • How well do they know their statistics?
    • Do you try to raise their statistical capabilities?
    • Do you learn anything from them?
    • Do any of the ASA’s initiatives impact you in your collaboration? How?
    • What other ways could we reach out to data scientists, individually and as a Section, to strengthen the relationship between us in mutually beneficial ways?

    I look forward to your responses. Thank you in advance for sharing your experiences and ideas.

    Chuck

    ------------------------------
    Chuck Kincaid
    Engagement Director
    Experis Business Analytics
    chuck.kincaid@experis.com
    ------------------------------


  • 2.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 09:53

    Perhaps a better idea would be to have an open discussion about the differences and similarities in approaches between traditional statisticians and "data scientists".

    As someone that started with "traditional" statistics and moved into "Data Mining", I can tell you that a lot of the methods "Data Scientists" use are statistical methods. A lot of the methods are based upon multivariate statistical methods, though they are not normally taught in a stats class. Some of the methods are too new, like from the 70's and 80's;-)

    One of the big things data scientists do differently, especially if they are coming from a computer science background, is use more efficient methods for data collection and analysis. To a data scientist, a large data set involves millions of tuples of data. Big data is trillions of tuples of data. We statisticians tend to be limited to data sets that are small enough to fit on our desktops. Data Scientists use servers for their analyses. So, no data set is "too big" unless it's too big to fit on their servers.

    Oftentimes, too, data scientists want to predict outcomes. They don't really care about a "one unit change in X" and its effect on Y. Data Science tends to force you to put things in perspective. Suppose that the odds ratio for some factor is 1000. The data scientist would say, 1000 times what? If there is a 1 in a billion chance of something happening and increasing a value by 1 multiplies the odds by 1000, you now have roughly a 1 in a million chance. If that increase of 1 rarely happens, or is some type of coding for "gender", race, or the like, then if you are not of that gender or race, it has no effect for you. If you are of that gender or race, there is still only a small chance of it being a meaningful change. So who cares?
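
    The arithmetic in that example can be checked in a few lines. This is only a sketch: the function name apply_odds_ratio is made up for illustration, and the baseline and odds ratio are the hypothetical figures from the paragraph above. It converts the baseline probability to odds, multiplies by the odds ratio, and converts back.

```python
from fractions import Fraction

def apply_odds_ratio(baseline_prob, odds_ratio):
    """Return the probability after multiplying the baseline odds by odds_ratio."""
    odds = baseline_prob / (1 - baseline_prob)   # probability -> odds
    new_odds = odds * odds_ratio                 # an odds ratio acts on odds, not on probability
    return new_odds / (1 + new_odds)             # odds -> probability

baseline = Fraction(1, 1_000_000_000)   # "1 in a billion" (hypothetical figure from the post)
after = apply_odds_ratio(baseline, 1000)

print("baseline probability:", float(baseline))
print("after odds ratio of 1000:", float(after))
```

    For an event this rare, odds and probability are nearly identical, so the result stays very close to 1 in a million. The absolute risk is still tiny, which is the point of asking "1000 times what?"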

    If you look at a lot of the traditional stats methods, they have been around for decades and tend to be based upon making hand calculations easier to do. Data Mining methods like CART models and Random Forests are computationally expensive. Statisticians try to make simple (linear) models. Random Forests allow the data to speak. RF models can be very complex. They create dozens, if not hundreds, of models and average them together, just like some robust regression models. They also involve a lot of "interactions", just like designed experiments. But many statisticians try to hold true to parsimony. Which raises the question: if everything really were simple enough that simple equations could model it, why didn't physics stop after Newton? Why would someone need to invent stochastic partial differential equations? Or even calculus, for that matter? Unfortunately, we can't model all behavior with F=ma, PV=nRT, and Y = mX + b.

    While statisticians are trying to stick with parsimony, data scientists want sufficiency. 

    I'll even lay out a challenge for the debate between traditional statistics and "data science". Take 10 textbook data sets. (I've already done this with a lot of my logistic regression data sets.) Split the data 70%/30%. Use that 70% to train a logistic regression model and the 30% to test the model's predictions. Then use RF and Neural Networks on the same data. Tune your models as needed and see which method(s) are best. Of the 50 or so data sets I have done this with, RF and NN beat logistic regression by a lot. Then do the same thing with non-parametric methods.
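
    For anyone who wants to try the challenge, the 70%/30% protocol can be sketched as below. Everything here is an assumption made for illustration: the helper names, the synthetic two-cluster data set, and above all the two classifiers, which are deliberately trivial stand-ins (a majority-class baseline and a 1-nearest-neighbor rule) and not logistic regression, RF, or NN. The part that matters is the harness: every model is fit on the same 70% and scored on the same held-out 30%.

```python
import random

def holdout_split(rows, train_frac=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

class MajorityClass:
    """Baseline stand-in: always predicts the most common training label."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self
    def predict(self, X):
        return [self.label for _ in X]

class OneNearestNeighbor:
    """Flexible stand-in: predicts the label of the closest training point."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, X):
        def nearest(p):
            dists = [sum((a - b) ** 2 for a, b in zip(p, q)) for q in self.X]
            return self.y[dists.index(min(dists))]
        return [nearest(p) for p in X]

def accuracy(model, train, test):
    """Fit on the training rows only, then score on the held-out rows."""
    X_tr, y_tr = [r[:-1] for r in train], [r[-1] for r in train]
    X_te, y_te = [r[:-1] for r in test], [r[-1] for r in test]
    preds = model.fit(X_tr, y_tr).predict(X_te)
    return sum(p == t for p, t in zip(preds, y_te)) / len(y_te)

# Synthetic two-cluster data set (hypothetical, not one of the textbook sets).
# Each row is (x1, x2, label); the label is the last element.
rng = random.Random(1)
rows = [(c * 5 + rng.uniform(-1, 1), c * 5 + rng.uniform(-1, 1), c)
        for c in (0, 1) for _ in range(50)]

train, test = holdout_split(rows)
results = {type(m).__name__: accuracy(m, train, test)
           for m in (MajorityClass(), OneNearestNeighbor())}
print(results)
```

    To run the actual comparison, replace the stand-ins with real implementations that expose the same fit/predict interface, and repeat over several data sets and random seeds so that one lucky split does not decide the winner.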

    I would also ask the traditional biostatistician, "How important is it to know about the effect of a 'one unit increase' in X versus being able to accurately predict the outcome of a patient?" Keep in mind, you can use your RF or NN model with lots of different combinations of factors and levels to see what happens when...... 

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 3.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 10:09
    My perspective may be a little skewed as I shifted careers into biostatistics from neuroscience, and lately I'm under well-meaning pressure to present myself as a data scientist.

    I think the distinction between explanatory and predictive models is a very important one. While clinicians want relatively simple tools (such as the Charlson Comorbidity Score), they also want to be able to know what the tools mean in context. 

    But I'm loath to make that the dividing line between statistician and data scientist. 

    I think I agree with the view that statistical analysis is one component of "data science," others being database management and graphical presentation.

    Yet I'm still blocked on whether to start calling myself a data scientist.   This may be petty prejudice, but I'd rather explain than predict.  So who am I, besides a very silly person who plays with computers?










  • 4.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 11:46

    Hello Mitchell,

    Your journey is interesting. You can be an ambassador! :-)

    I don't want to get sidetracked, so let me just say briefly that I don't think we can characterize data scientists as a homogeneous population any more than we can do that for statisticians. I think that as a Data Scientist you can both explain and predict, however you'd like, unless your boss says different. :-)

    Back to the original question, since you are making that journey, albeit a bit hesitantly, how do you see us reaching out to data scientists? What methods, venues, communication means, etc, can we use?

    Thanks Mitchell. I look forward to your input.

    Chuck

    ------------------------------
    Charles Kincaid
    Engagement Director
    Experis Business Analytics
    chuck.kincaid@experis.com



  • 5.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 12:03

    "a very silly person who plays with computers" may be the very best description of what I do for money that I've ever heard.

    Thank you for this.

    ------------------------------
    Jason Brinkley
    American Institutes for Research



  • 6.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 12:22
    Based on my entirely personal angle, there is an obstacle to collaboration between "us" and "them": it is that the very term "Data Scientist" sounds ostentatious and even offensive. Apparently, the great majority of people who are true scientists (those who do some innovative, groundbreaking research) do not have the word "scientist" in their title. Yesterday a fellow PhD in Statistics suggested that I should change the title in my LinkedIn profile to "Data Scientist" because that's "the new name for statisticians who can program" (yes, I work in the industry and I can program quite well). Of course, I am not changing anything: even though I do some research at work, it's not enough to deserve the title of scientist. I am sure that the great majority of "them" deserve it even less.

    Is there a solution, given that the buzzword is already entrenched? Well, I guess the Data Science community (if there is one) could issue a memorandum to let the world know that they consider the term "Data Science" an unfortunate misnomer and, even though they are forced to display "Data Scientist" on their business cards, they can't help being mildly ashamed about that. That will go a long way towards productive collaboration.

    Regards,
    Nik Tuzov, PhD



  • 7.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 12:34
    Although the original question was "how do statisticians make contact with data scientists" I think we've converged on the point that it's not useful to open a dialogue between "us" and "them" without knowing what defines both groups. Otherwise, we may find ourselves in the situation Walt Kelly summed up as "We have met the enemy and he is us."

    But getting back to the original point, my first thought is that a useful approach would be to find an interesting problem that makes substantial demands on technical skills, so that it couldn't be defined as purely a statistician's or a data scientist's problem.  But then how do we keep statisticians from being overshadowed by the data science umbrella?  (I'm fairly sure that is not a mixed metaphor.)







  • 8.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 11:34

    Hi Andrew,

    Thanks for the response. Good thoughts. Yes, it will be beneficial at some point to understand the differences and similarities between the approaches. However, the first step is to be able to engage in that conversation. In what ways can we reach out to Data Scientists, particularly those who are not part of the ASA or a similar organization right now?

    It seems to me that we have to reach "across the aisle," so to speak, and invite them to the table. This has to be done in a non-confrontational, collaborative way, as allies. Then there has to be a reason for them to be interested in talking to us, some mutually beneficial reason for the discussion.

    It seems to me that we can learn some tips for this if we understand ways that statisticians and data scientists work together now. What ways do they collaborate now? Of course, that's assuming that they do collaborate now.

    Do you collaborate with data scientists? What makes that successful?

    If you don't, how would you reach out to them to begin the discussion?

    Thanks again. I look forward to your thoughts.

    Chuck

    ------------------------------
    Charles Kincaid
    Engagement Director
    Experis Business Analytics
    chuck.kincaid@experis.com



  • 9.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 10:29

    The Ann Arbor ASA group has quite a few folks that would consider themselves Data Scientists. They also have groups of traditional biostatisticians and applied statisticians. It's interesting when we get into our monthly discussions. I fit in with the applied statisticians group and am making headway into the Data Science realm. 

    I would think within statistics itself, there is a bit of a division between "mathematical statisticians", "biostatisticians" and "applied statisticians". For example, I took Design of Experiments and Adv Design of Experiments. I allow my data to talk to me. Quite often, I find "significant" interactions among my factors. Usually, there are more interactions than main effects. To the other applied statisticians and the data scientists, interactions are normal, everyday occurrences. To the biostatisticians, interactions are appalling. The "mathematical statisticians" question whether or not you really need all those interactions.
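
    To make "interaction" concrete, here is a small worked example for a two-factor, two-level (2x2) factorial design. The responses are invented for illustration and the helper name effect is hypothetical. Each effect is the average response where its contrast is +1 minus the average where it is -1.

```python
# Hypothetical responses from a 2x2 factorial design, keyed by the
# coded levels (-1 or +1) of factors A and B. The numbers are invented.
runs = {(-1, -1): 10.0, (+1, -1): 20.0, (-1, +1): 30.0, (+1, +1): 20.0}

def effect(contrast):
    """Average response where the contrast is +1 minus the average where it is -1."""
    hi = [y for levels, y in runs.items() if contrast(*levels) > 0]
    lo = [y for levels, y in runs.items() if contrast(*levels) < 0]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

main_a = effect(lambda a, b: a)        # main effect of A
main_b = effect(lambda a, b: b)        # main effect of B
inter_ab = effect(lambda a, b: a * b)  # A x B interaction contrast

print(f"A: {main_a}, B: {main_b}, AxB: {inter_ab}")
```

    In this made-up data set, factor A has no main effect at all, yet its effect depends entirely on the level of B. A main-effects-only model would conclude that A does not matter.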

    When we have had some presentations on biomedical data analysis (patient outcomes, etc.), the biostatisticians tend to make simple models that intentionally leave out some factors (like doctor, hospital, lab, # surgeries the doctor had before the patient, time of surgery, etc.). The QC/Reliability statisticians and industrial engineers in the group ask, why didn't you include.....? For an IE, human factors plays an important role in the quality of a product. Ignoring HF is a cardinal sin.

    The mathematical statisticians tend to see things from their perspective, and usually teach the "proofs" courses in the stats curriculum. I've had some interesting conversations with these types of statisticians. They feel "proof is truth". When you confront them with real data based on reality, and you can show they are wrong, they tend to get mad. If you use an alternate perspective for the same problem, they get mad too.

    When you add "data scientist" to the mix, along with the common notion that they are "statisticians that can program", you add another layer to that discussion. A lot of the companies in the area want "data scientists", and they claim a data scientist knows how to collect and manage the data they need from a server (IT) and analyze the data using various predictive analytics (statistics). They want someone with a degree in stats and a minor in MIS/IT/Comp Sci, or a major in Comp Sci and a minor in stats.

    I think we do need to define what a data scientist really is. A "statistician that can program" is a bit too vague. I think all statisticians use R, SAS or some other language to write programs and analyze data. Based upon the local companies, they would not hire someone with these skills as a data scientist.

    I think a better definition for a data scientist is someone that understands:

    1) the computer algorithms the software is using

    2) the concepts of databases and data warehousing

    3) the statistical methods available for analysis

    4) the need for a full toolbox of methods, not a limited selection of "popular" methods

    A traditionally trained statistician can fulfill item 3 and part of item 4. (I don't know many statisticians that took Data Mining, Data Warehousing or a course on Database systems.) A computer scientist can generally handle items 1 and 2 and part of item 3. (I know that a 5-page section on linear regression in a Data Mining textbook is not sufficient.) An MIS/IT person will be knowledgeable about items 1 and 2.

    While there is some overlap between a computer scientist and a statistician or a comp sci and IT, I know no one group is best. But, there are areas where all 3 groups can get much better. Statisticians need to get better at handling and analyzing large data sets. Comp Sci needs to understand there is more to analysis than Random Forests and Neural Networks. IT needs to understand there is more to analysis than taking a mean and a standard deviation.   

    That might be how we can all come together. We can work with each other to make each other better. 

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 10.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 14:02
    Hi Andrew,

    I really appreciate your comments because I think they highlight where the "distinction that makes a difference" is between data science and statistics. You have pointed out that data science has done a lot of work in developing algorithmic methods for prediction. That is, data science is concerned with taking a data set (of any size) and efficiently coming up with predictions from it. I think these are wonderful tools, and many of them knock the socks off the broken null-hypothesis-significance-testing framework. But really this just seems like a bit of a rehashing of the Frequentist vs Bayesian debate, only now it's Frequentist vs Bayesian vs Data Science. But that's not a fair fight because only Frequentists and Bayesians have thought deeply about the concept of probability.

    You see, when I think of statistics I don't necessarily think of a body of methods. Statistics is the science of uncertainty. We emphasize data at our peril. Data is not what we should care about, or rather it is a means to an end, not an end in itself. From my point of view, whether inference ("explanation") or prediction is your goal, if you start thinking about your problem after you already have data, then you're probably in trouble.

    The strength of Statistics as a field is not in its methods for analyzing data, but rather in considering the pitfalls in the collection of data and the design of experiments. Where can statisticians and data scientists come together? How about when it comes to a conversation on sampling? If your sample is biased because of how you collected your data (no matter how massive) then it is useless for either prediction or inference about the wider population. Yes, you can build a wonderful prediction model with a neural net based on a massive database of Twitter users, but your scope of inference is...Twitter users. Good luck making out-of-sample predictions.

    So as a semi-rejoinder to your question, I would ask the data scientist: "How important is it to predict whether a Twitter user has the flu, when your goal is to predict the spread of flu in the general population?"
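
    A quick simulation illustrates the sampling point. All of the numbers below are invented assumptions (the share of Twitter users, both flu rates, the sample sizes), and the helper name flu_rate is hypothetical. The sketch compares a very large sample made up only of Twitter users with a much smaller simple random sample of the whole population.

```python
import random

rng = random.Random(42)

# Hypothetical population: 20% are Twitter users with a 5% flu rate,
# everyone else has a 15% flu rate. All of these numbers are invented.
population = []
for _ in range(100_000):
    on_twitter = rng.random() < 0.20
    has_flu = rng.random() < (0.05 if on_twitter else 0.15)
    population.append((on_twitter, has_flu))

def flu_rate(people):
    """Share of the given people who have the flu."""
    return sum(has_flu for _, has_flu in people) / len(people)

truth = flu_rate(population)

# Massive but biased sample: every Twitter user, and nobody else.
twitter_sample = [p for p in population if p[0]]
# Small simple random sample drawn from the whole population.
random_sample = rng.sample(population, 1_000)

biased_error = abs(flu_rate(twitter_sample) - truth)
random_error = abs(flu_rate(random_sample) - truth)
print(f"true rate {truth:.3f}")
print(f"Twitter-only sample (n={len(twitter_sample)}): off by {biased_error:.3f}")
print(f"random sample (n={len(random_sample)}): off by {random_error:.3f}")
```

    Under these assumptions the Twitter-only estimate is precise but centered on the wrong number, and collecting more Twitter users would not move it any closer to the population rate.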

    --
    Dalton Hance, M.S.
    (541) 231-9474

    "That would be a good world, free and out-doors.
    But the vast hungry spirit of the time
    Cries to his chosen that there is nothing good
    Except discovery, experiment and experience and discovery: To look
    truth in the eyes,
    To strip truth naked, let our dogs do our living for us
    But man discover.
    It is a fine ambition,
    But the wrong tools. Science and mathematics
    Run parallel to reality, they symbolize it, they squint at it,
    They never touch it: consider what an explosion
    Would rock the bones of men into little white fragments and unsky the world
    If any mind for a moment touch truth."
    -robinson jeffers
    "The Silent Shepherds"




  • 11.  RE: Reaching out to Data Scientists

    Posted 07-18-2016 10:40
    Hi Andrew,
    I am glad this is being discussed because I have been having issues with those who believe that data science is separate from statistics. I believe that data science is a sub-specialty of statistics. First, I want to mention that my doctorate is in statistics and, due to my position at a large university where, of course, there is big data, I have had to learn data mining techniques to create predictive models for various uses at my institution. I have spent a good deal of time reading and studying in order to get up to speed on data mining methods, which were not part of the curriculum when I was a student.

    The data mining methods use statistical tests, many of them traditional, to arrive at the models. There are chi square tests, F tests, use of correlations, eigenvalues, Bonferroni adjustments, and on and on, depending on the methods being used. These are "behind the scenes" and the user can often pick and choose to fit the circumstances. With the models come a multitude of diagnostic measures to evaluate the fit of the models. When I was studying all of this (and my studies continue because the data mining methods are many), I was clearly studying statistical methods. I found that many decisions need to be made in configuring the models. I do not believe that I could have navigated all of this to do justice to the data mining models I have been working on without my advanced degree in statistics.

    The big difference is that the focus of data science is using large datasets to create predictive models that are accurate enough to be used to predict future outcomes, as opposed to focusing on hypothesis-driven results. However, just because data mining methods are not hypothesis driven and are used on big datasets does not mean that it is not statistics and data scientists are not statisticians. The data mining methods all utilize statistics in the background, as I have pointed out.

    I have attended some workshops where the instructor focused on writing code to get out the results, and their poor modeling results showed what happens when someone does not have a clear understanding of the choices one must make to configure the modeling methods and properly use diagnostic tests to evaluate the results. You cannot just push a button, let it rip, and expect a wonderful model. That the use of big data has necessitated the cooperation of computer programmers or data professionals does not mean that those data professionals are data scientists, unless they really have a grasp of the statistical underpinnings of the data mining methods to the extent that they can use that knowledge in the modeling process. With the ability to process large amounts of data and the advent of methods capable of utilizing it, the statistics profession has broadened to include the new statistics-based data mining methods.


    ......................................................


    Nora Galambos, PhD

    Senior Data Scientist

    Institutional Research, Planning & Effectiveness

    Stony Brook University

    Office: 631.632.1591

    nora.galambos@stonybrook.edu





    ------Original Message------

    Hi Andrew,

    I really appreciate your comments because I think they highlight where the "distinction that makes a difference" is between data science and statistics. You have pointed out that what data science has done a lot of work in developing are algorithmic methods for prediction. That is, data science is concerned with taking a data set (of any size) and efficiently coming up with predictions from it. I think these are wonderful tools, and many of them knock the socks off the broken null-hypothesis-significance-testing framework. But really this kind of just seems like a bit of a rehashing of the Frequentist vs Bayesian debate, only now it's Frequentist vs Bayesian vs Data Science. But that's not a fair fight because only Frequentists and Bayesians have thought deeply about the concept of probability.

    You see, when I think of statistics I don't necessarily think of a body of methods. Statistics is the science of uncertainty. We emphasize data at our peril. Data is not what we should care about, or rather it is a means to an end, not an end in itself. From my point of view, whether inference ("explanation") or prediction is your goal, if you start thinking about your problem after you already have data, then you're probably in trouble.

    The strength of Statistics as a field is not in its methods for analyzing data, but rather in considering the pitfalls in the collection of data and the design of experiments. Where can statistics and data scientists come together? How about when it comes to a conversation on sampling? If your sample is biased because of how you collected your data (no matter how massive) then it is useless for either prediction or inference. Yes you can build a wonderful prediction model with a neural net based on a massive database of Twitter users, but your scope of inference is...Twitter users. Good luck making out-of-sample predictions. 

    So as a semi-rejoinder to your question, I would ask the data scientist: "How important is it to predict whether a Twitter user has the flu, when your goal is to predict the spread of flu in the general population?"

    --
    Dalton Hance, M.S.
    (541) 231-9474

    "That would be a good world, free and out-doors.
    But the vast hungry spirit of the time
    Cries to his chosen that there is nothing good
    Except discovery, experiment and experience and discovery: To look
    truth in the eyes,
    To strip truth naked, let our dogs do our living for us
    But man discover.
    It is a fine ambition,
    But the wrong tools. Science and mathematics
    Run parallel to reality, they symbolize it, they squint at it,
    They never touch it: consider what an explosion
    Would rock the bones of men into little white fragments and unsky the world
    If any mind for a moment touch truth."
    -robinson jeffers
    "The Silent Shepherds"




  • 12.  RE: Reaching out to Data Scientists

    Posted 07-19-2016 13:21

    Hey Nora,

    One of the interesting things I have found in dealing with computer scientists vs. statisticians when it comes to analytics is that we have different theories on data and how to handle data. That means we have different perspectives.

    For example, if you had a messy 2,000,000-tuple data set and could "clean" it up and get 50,000 tuples of "good data", most statisticians would take the 50,000 tuples of data. We learned about GIGO. The computer scientists I spoke to would take the 2,000,000 tuples of data. Their theory says more data=>better results. To some extent, I agree with this idea more than GIGO. Depending upon how you "clean" the data, you could be getting rid of valuable information that is actually important in the discussion.

    It's all about the perspective you bring to the analysis.

    I will say that when I saw how computer scientists used linear regression and the like, I was appalled. In the course of analyzing "textbook" data using standard stats methods and data mining techniques, I think I prefer many of the data mining techniques. Though, I'm not happy with how the data mining methods are used either.

    Over the course of my studies and practice, I've had to develop my own rules for data analysis. Like, using multiple "cross-validation" or "training and validation" sets. If you have 1,000 tuples of data, you can split that data 60/40 or 70/30. I'll create 5-10 models using different subsets of training and validation data and see what each model tells me. If they are all in agreement, or nearly so, I'm happy. If not, I'm still happy because I just learned something and have some interesting data to deal with. Most folks won't bother with the multiple models. If I am analyzing data from a designed experiment, I look at R^2(predicted) and use the model I have to predict the outcome of the data I have and look at how they fit the confidence interval for each design point. I usually let this be the tie-breaker if I have several models I could use for the data. 
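
    A minimal sketch of the multiple-split check described above, for readers who want to try it. It assumes a simple straight-line model and synthetic data; the function names (fit_line, repeated_splits) and the generated data are illustrative only and are not from the post. It fits one model per random 70/30 split and then compares the fitted slopes and validation errors across splits.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def repeated_splits(rows, n_models=5, train_frac=0.7, seed=0):
    """Fit one model per random train/validation split.

    Returns a list of (intercept, slope, validation RMSE) tuples.
    """
    rng = random.Random(seed)
    n_train = int(len(rows) * train_frac)
    results = []
    for _ in range(n_models):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        train, valid = shuffled[:n_train], shuffled[n_train:]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        sq_err = sum((y - (a + b * x)) ** 2 for x, y in valid)
        results.append((a, b, (sq_err / len(valid)) ** 0.5))
    return results


# Synthetic stand-in for "1,000 tuples of data": y = 2 + 0.5*x + noise.
data_rng = random.Random(42)
rows = []
for _ in range(1000):
    x = data_rng.uniform(0, 10)
    rows.append((x, 2.0 + 0.5 * x + data_rng.gauss(0, 1)))

results = repeated_splits(rows, n_models=5, train_frac=0.7)
slopes = [b for _, b, _ in results]
spread = max(slopes) - min(slopes)  # small spread means the models agree

for a, b, rmse in results:
    print(f"intercept={a:.3f} slope={b:.3f} validation RMSE={rmse:.3f}")
print(f"slope spread across splits: {spread:.3f}")
```

    If the slopes and validation errors are close across splits, the models are "in agreement"; a large spread is the signal that something in the data deserves a closer look.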

    Should I have the need to optimize my model, that brings about a whole new set of theories from Operations Research which are generally ignored or unknown to statisticians. The big issue is that OR assumes the coefficients are deterministic. In statistics, we know the coefficients are stochastic. So, statisticians violate a main assumption of OR. On the other hand, OR assumes that only one coefficient can change at a time. (During WW2, this assumption made calculations with pencil and paper much easier. Now, it's just silly. Statisticians know that OFAT is bad.)

    Again, it's all about the perspective you bring to the analysis. 

    With greater breadth of knowledge comes greater depth of knowledge. 

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 13.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 13:42

    Hiya everyone,

     

    Just some comments from the peanut gallery before I go back and answer the bullets this evening…

     

    Looking over the bulleted items, I see a place for data science and statisticians to interact in areas where there are key differences.

    Specifically, one can turn around these bullets

    • How well do they know their statistics?
    • Do you try to raise their statistical capabilities?

    into these from the data science perspective:

     

    • How well do they [statisticians] know their data wrangling skills?
    • Do you try to raise their data wrangling capabilities?

     

    Many statistical educators at the undergrad level use Excel or base R, and never go beyond canned pre-cleaned data.  It's a disservice to students, just as much as teaching statistical learning to "data scientists" without any comment on assumptions and theory.

     

    My path through this wordsmithing of "data scientist, data miner and statistician" is this:  Data science is about creating valuable products from data.  Notice that the definition has nothing to do with statistics, although the valuable product could be a regression analysis to build a "secret sauce" formula for real estate prices.  It could also be a concise database of users who access a particular website over a 3-month period, based on log files.

     

    Just a few $0.02 in a good thread.

    -Mark





  • 14.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 16:24

    So right now I'm working with a handful of data scientists on a big project for our organization. 

    They have web-scraped some data together mostly from online government records and it is a massive administration/geolocation dataset that is clunky and full of holes, typos and mistakes.

    I personally want to have no part in the cleaning, management, storage, and guided access of this data.  But our leadership knows that we have something very useful on our hands if we can make it work.

    My data science colleagues have brought me on to advise them on how to best work with the data.  They look to me as a biostatistician who works with our health researchers on similar data with which the web-scraped stuff overlaps.

    I just became the 'subject matter' expert.  That is to say that they look to me on how our researchers may need to use the data.

    So far the collaboration has worked out well.  Everyone has their job and as long as everyone does their tasks then it should be fine.

    I heard someone say recently 'the days of projects having only one 'data guy' are over'.  I agree with that.  Maybe not everyone is facing that reality which is why we are having growing pains.

    As for the idea that data science shouldn’t actually use the term science and that people who do actual knowledge exploration don’t usually call themselves scientists, I want to provide a reminder of this thing called ‘social science’, which has long faced the same pains of being thought of as a ‘softer’ science than the natural or ‘hard’ sciences. I think the statistics community wants to be very careful to not fall into that sort of trap.  Experimenters who work in sterile laboratory settings aren’t the only individuals who do science and we should take offense at anyone who takes that point of view.

    ------------------------------
    Jason Brinkley
    American Institutes for Research



  • 15.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 16:34
    You might want to look for O'Reilly's "Bad Data Handbook".  It's got helpful tools.



    --
    Sent from Gmail Mobile




  • 16.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 17:28
    All,

    Applied statisticians in the field are data scientists.  Many of us call ourselves 'statistical data scientists' (for short) and we call those in IT, 'IT data scientists.'  We analyze data; they manage it.  Applied statisticians are all about problems, that IS our tradition; academic statisticians focus on tool building.  In the field, we must use a problem-based definition of statistics.  Statistics problems have statistics assumptions and require statistical thinking (http://goo.gl/Wod3gk).  Furthermore, if your problem is a statistics one, then whatever tool you are using is a statistics tool.   

    The massive marketing resources of IT want to carve out part of statistics.  IT is looking for growth areas and they have partially embraced statistical denial (https://goo.gl/8u31Ok) to accomplish their expansion.  Again, statistics problems do not belong in IT and their intrusion is one of the factors in the coming flood of statistical malfeasance (http://goo.gl/rZ7ys).

    There is no way we can keep statistics problems out of a term like 'data science,' which is our term by the way.  To hand over this terminology, 'data scientist,' is to allow them to take our problems and diminish our profession.  Many of us embrace the terms, 'data scientist' and 'statistical data scientist,' and we call them 'IT data scientists' to position them with data management problems.   

    Randy Bartlett

    LinkedIn Group: About Data Analysis
    Website: http://www.BlueSigmaAnalytics.com
    Analytics Magazine:
    http://goo.gl/Wod3gk , http://goo.gl/rZ7ys
    Datafloq Blog: https://goo.gl/8u31Ok
    Please visit the book and 'agree' to the reviews if they are helpful at:
    http://amzn.to/YGhXzv 






  • 17.  RE: Reaching out to Data Scientists

    Posted 07-17-2016 00:33
    All,

    What I see in the field is dramatically different from many other people's descriptions.  I understand that most/many of these observations regard what is happening in academia and should not be interpreted to pertain to the field. 

    I see two applied problems: data analysis and data management. 

    The only people I see analyzing Big Data are those I would characterize as applied statisticians. 
    They embrace the science; and probably have a statistics degree or similar.  My colleagues have been working on Big Data all along--as data sizes have increased.  About half of our statistics training occurs after college.  We are mastering techniques that we did not learn from our statistics departments.  This includes greater study of predictive modeling, data mining, nonparametrics, etc. 

    You cannot get an accurate picture of applied statistics in the field from Amstat News, ASA conferences, IT journals or blogs, or IT talking heads.  IT wants to expand their value proposition beyond their vital role of data management.  I happen to have a computer science degree (and statistics degrees), along with training and experience doing IT things like building datasets, programming, etc.  It is great stuff and very valuable.  However, I am not an IT expert and I have never met anyone who is expert at both data analysis and data management. You would not think this if you listened to the bravado of certain IT talking heads.  Here is one of their countless ridiculous claims: '"One big reason [why statistics does not work for Big Data]… is that everything passes statistical tests with significance," he says. "If you have a million records, everything looks like it's good [significant]."'  According to the same person, 'there's a difference between statistical significance and what he calls operational [consequential] significance.'  https://datafloq.com/read/Statistical-Significance-Does-Work-Big-Data/1385. They do not know what they are talking about, and so you cannot take their claims that they are performing data analysis on Big Data at face value.  It is balderdash.  Again, I am talking about data analysis in the field.

    That ASA members have these disjointed conversations is a tribute to a lack of 1. Agreed upon terminology; 2. Surveys of professional activities performed by the membership; and 3. Surveys of applied statisticians in the field (most of whom are not ASA members).  Thousands of people graduate with a statistics degree every year and we do not track where they go or what they do. 

    Randy Bartlett
    LinkedIn Group: About Data Analysis
    Website: http://www.BlueSigmaAnalytics.com
    Analytics Magazine:
    http://goo.gl/Wod3gk , http://goo.gl/rZ7ys
    Datafloq Blog: https://goo.gl/8u31Ok
    Please visit the book and 'agree' to the reviews if they are helpful at:
    http://amzn.to/YGhXzv 






  • 18.  RE: Reaching out to Data Scientists

    Posted 07-18-2016 09:49

    Hey Randy,

    You bring up a good point. There is a lack of agreed terminology in the various fields of statistics. There is great disagreement about terminology between statistics and other fields too. The folks at the Center for Open Science are trying to define terms. I wrote Brian Nosek about the difficulties COS will face.

    For example, my econometrics prof thinks a mixture design has fixed and random effects. Those that deal with Bayesian models think mixture models have multiple distributions. Quality engineers know a mixture design belongs in a design of experiments textbook.

    What is a covariate? My DOE classes have a very different definition than the psychology stats profs and some biostatisticians I know use. What I would call a covariate, others call a moderating variable.

    When it comes to data science, there is poor terminology for what is big data. Let alone data science. Let alone what makes a data scientist. 

    I spoke with the computer science dept head at U of Mich-Dearborn. They have new degrees in data science. The BS degree starts this fall. The MS degree starts next fall, maybe. The dept head thinks all data scientists use Python. I said R and SAS. We discussed why we had a difference of opinion. In my stats classes, we use SAS and R. In computer science, they use everything else but SAS and R.

    Statisticians (and those that use statistics often) tend to think big data is a few hundred thousand to a couple million rows of data. A computer scientist thinks Big Data has millions if not billions of tuples of data. (Note the change in terminology.) 

    A lot of the statisticians I know (and interviewed with) will use PROC SQL or DATA steps in SAS for data manipulation. They also insist on importing the data onto their desktops. The comp scientists will use some sort of DBMS program. They also use parallel processing and the servers to do all the work. (I know a few statisticians at Ford were fired for refusing to use Hadoop and Oracle. Now, they won't hire anyone without Hadoop experience in their global analytics department.) The comp sci methods work for all sizes of data. The statisticians have a very limited size and scope of data they deal with. (It took one statistics group 5 years to figure out I was right in one of my interviews. Needless to say, I didn't get hired any of the 5 times I interviewed with this group or half a dozen other groups that felt the same way. Btw, these are not all academic groups.)

    Perhaps someone at ASA could convince a group to work on terminology and appropriate methods for data manipulation.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 19.  RE: Reaching out to Data Scientists

    Posted 07-21-2016 17:25

    Randy,

    There are some firms I run into who break the area down into two parts, which I think is appropriate.  The first are Statisticians or Data Scientists.  This is what you talk about as an Applied Statistician.  Then they have Data Engineers.  These are the DBAs (Data Base Administrators) of Big Data.  The task is a little more complicated than the DBA role in the past because we have moved to a world of heterogeneous data base management systems.  In this new world, there is a lot more emphasis on data quality and data scrubbing because the sources we typically work with are not the operational systems, but include web logs (the original Big Data source), social media and unstructured data, etc.  In my career I have moved between both sides of this and am currently moving back to the Applied Statistics side.  Both are interesting and they are interrelated, or should I say interdependent.

     

    One could also think about Data Science in the IT sense as an extension of Computer Science.  The reason it is called Computer Science (or the Computing Sciences) is that it is concerned with the understanding of computing as an activity.  Many with computer science degrees go on to be programmers, which really does not require the degree.  Many schools have Software Engineering programs as well, and these are what most people should be taking.  On the other hand, there is a lot of overlap.  Extending these ideas to understanding the data and then coming up with the best ways to organize and manipulate it is really what the CS Data Scientists do.  There is lots of science, mathematics and statistics involved in coming up with efficient algorithms, but this is not statistics in the sense of what an Applied Statistician would be doing.

     

    One recruiting firm I have talked to has a different take on this.  They split the world into analytics/statistics and data science.  The former is the more traditional statistics environment where the data is collected and then analyzed, while the latter deals with real-time data analysis.  I do not subscribe to this concept, and think it is artificial at best.

     

    LOUIS W. GIOKAS
    Student, M. S. in Applied Statistics
    DePaul University
    Chicago, IL 60614
    Phone: +1-630-596-6019

     







  • 20.  RE: Reaching out to Data Scientists

    Posted 07-27-2016 20:07

    You are among the few who have insight into both sides of the house.

    RE: They split the world into analytics/statistics and data science.  ...  I do not subscribe to this concept, and think it is artificial at best. 

    RESP: This is an arbitrary split as DS has data analysis in it.  So if a statistics problem is labeled DS and another exact same problem is labeled analytics/statistics, then do we use different statistical techniques?  Do we make different statistical assumptions?  Do we use different statistical thinking? 

    Louis, how do you think that applied statisticians/statistical data scientists can reach out to statisticians? 

    ------------------------------
    Randy Bartlett



  • 21.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 10:20

    Hi Jason,

    Thanks for the post. We have another Ambassador! :-)

    That's great that they saw where they were lacking and brought you in to help them. It sounds like your "team" has well defined roles and that each respects the others in what they do. That's great to hear.

    Chuck

    ------------------------------
    Charles Kincaid
    Engagement Director
    Experis Business Analytics
    chuck.kincaid@experis.com



  • 22.  RE: Reaching out to Data Scientists

    Posted 07-12-2016 17:39
    I previously sent this note by email to Chuck; given the many posts from the group, the text below is likely to be of broader interest.
     
     
    When I was chapter president (www.sfasa.org), I put a lot of effort into outreach to the ACM (Association for Computing Machinery), the professional organization for computer scientists.  ACM would typically have 50 or more people at their seminars. ACM could also afford lots of pizza and soda pop, apparently a big draw.
     
    There was some overlap in the monthly seminars of the two groups. ACM in Silicon Valley would occasionally (once or twice a year) have a statistics related seminar.
    One seminar I recall in particular was David Draper (from UCSC) giving a talk on Bayesian statistics by invitation at the ACM chapter meeting. In the Bay Area the local ACM sponsors "data mining boot camps" and a "hacker dojo": http://www.hackerdojo.com/
     
     
    I've been to one "data mining boot camp", held at eBay, with about 200 data-scientist types. Also in the Bay Area is "Datakind", a "data science group for social good": www.datakind.com
    Datakind has meetings, which I've attended. I deliberately attended to meet the founder, Jake Porway (I believe Jake is also an ASA member).
    The ACM people were not particularly interested in "networking" and didn't want to co-sponsor a seminar unless our chapter contributed some money (I assume to pay for the pizza).
     
    There is also a very active R meetup group, meeting monthly with occasional "big data" talks.
     
    The data scientists aren't the "only show in the Valley": the American Society for Quality (ASQ) has reliability engineers, and they sponsor monthly meetings on statistics/reliability topics: http://asq.org/service/body-of-knowledge/tools-data-mining
    There is also the IEEE, which occasionally cosponsors seminars with ASQ. I wasn't able to get in contact with them while I was chapter president.
    There is also the Silicon Valley Data Academy: http://siliconvalleydataacademy.com/
     
     
    • The list of "data science"-related events in Silicon Valley is increasing.
     
    • Mainly, the first regular interface/network/outreach with data scientists is through local seminars and workshops.
    ------------------------------
    Chris Barker, Ph.D.
    Consultant and
    Adjunct Associate Professor of Biostatistics


    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy



  • 23.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 09:43

    Good Morning Chris,

    Thanks for the post. Your experiences are very helpful. Our biggest challenge may be how to help people who don't want to be helped or don't think they need it. Thinking about it, hasn't this always been the challenge for Statisticians with the rest of the world, whether it's clients, politicians, pseudo-statisticians in the social sciences, or now data scientists?

    I agree that the meetings, seminars and presentations are good ways to reach out to practitioners who are not members of the ASA. Whether they do data science, reliability engineering, marketing analytics, business analytics, web analytics, etc. doesn't matter. What David Draper did was great. If people giving those seminars can entice attendees to learn more about statistics and its proper use, then we all benefit.

    We have to approach it right, though. We have to respond to data scientists in ways that will be of interest and of use to them.

    Maybe there is also a way that we can invite them to ASA chapter meetings, if we can get announcements into their e-mailings. If we are welcoming and non-judgmental, then they may want to come back and learn more.

    Thank you again for your comments and suggestions.

    Chuck

    ------------------------------
    Charles Kincaid
    Engagement Director
    Experis Business Analytics
    chuck.kincaid@experis.com



  • 24.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 02:59

    Hiya everyone,

     

    For quick background, I'm a statistician at Northern Kentucky University and one of three faculty who teach the core data science courses for our bachelor's degree in data science.  Our first cohort graduates in spring 2017.

     

    Here's my go at the bullets:

     

    I'm interested in what you do to facilitate this collaboration.

    • Do you collaborate with others who call themselves data scientists?
    • How successful is that collaboration?
    • What makes it successful?

    Right now, our collaboration is primarily with the HR departments of other companies who are looking for new data scientists; we have regular contact with both start-ups and established companies trying to find new data scientists.  Our program requires courses in business informatics (which I think is kinda neat), so the students have some idea about program management and business processes.  So far, several of our juniors are interning in the financial, health care, and industrial sectors over the summer.

    • Is there anything specific to the statistician / data scientist collaboration that you would not find in other collaborations?
    • How well do they know their statistics?
    • Do you try to raise their statistical capabilities?
    • Do you learn anything from them?

    Most people with the label "data scientist" (who probably wouldn't have a formal degree in data science) who I've met are either formally computer scientists or statisticians (or a really good STEM-based jack-of-all-trades).  Some computer scientists know their way around R and Weka; others don't.  The same goes for statisticians who might know Java, C, Python, and how to do distributed computing with Hadoop: many don't.  Claiming to be a data scientist doesn't mean that one claims to know everything about all the fields that contribute to data science.  That's why ultimately it's a team-based sport.  It's something that I try to drill into my students: that their resume has to give an accurate representation of their specific skill set.

    Most of the time in working with other computer scientists, bioinformaticians (bioinformaticists?), etc., I've learned that the data is hard and the analysis (if any) is straightforward (or at least much easier to deal with once the data is ready).  Data extraction and clean-up can be a brutal, time-intensive, joyless process.  I think that statisticians who have honed their skills over decades to deal with bad data efficiently probably feel the most burn with new folks calling themselves data scientists and stealing their thunder.  That is where statisticians probably dropped the ball in the whole "data scientist" namecalling – not capitalizing on the immeasurable need to "wrangle" data and teaching courses toward that as part of the life of a statistician.

    • Do any of the ASA's initiatives impact you in your collaboration? How?
    • What other ways could we reach out to data scientists, individually and as a Section, to increase the relationship between us in mutually beneficial ways?

    At the undergrad level, students majoring in this new data science program are most concerned with employment, and the JSM provides a great place to bring in the movers and shakers of the field.  The ASA DataFest encapsulates the type of work that a "modern" data scientist would be expected to do, being team-based & multi-faceted (data wrangling, analysis, visualization, etc.).

     

    Just a few thoughts, before it chimes midnight on the best coast.

    -Mark

     





  • 25.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 10:53

    So shameless plug time, but related to this discussion:

    For this year's JSM I am leading a roundtable for the Section on Statistical Consulting titled "From Criticism to Curiosity: Changing Our Attitudes in Consulting with Data-Oriented Individuals in Other Fields"

    Here is the abstract:
    The idea of 'statistician as critic' has been popular in our discipline for quite a long time. Statisticians serve as the expert evaluator for methods and quality of data-driven results in a multitude of areas. At times, in taking this role to heart, many statisticians can come off as overly critical and potentially condescending in working with clients who feel they are data and method oriented. Sometimes the discussion comes down to 'what is right or wrong' instead of 'how can we do better.' The impact of this idea is clear when one looks at movements over the years in fields like psychology or computer science to develop alternate methodologies for working with data that are not as grounded in traditional statistical theory. The purpose of this roundtable is to discuss this issue and illustrate how deeply the idea of serving as critic is embedded within our culture. As an alternative, it will be suggested that statisticians approach problems with curiosity instead of evaluation and that discussions free of judgement will create more productive relationships.

    My point in the roundtable was to talk about how best to consult with and work with 'pseudo-statisticians' (Chuck's term from earlier, not mine).  My thought had been to focus mostly on experiences with economists, psychologists, epidemiologists, and computer scientists who work with data a lot but may have other focal points beyond working with data.  Oftentimes these can be some of the most challenging consults because these people have some knowledge and experience with data; they may have an approach that we question, or we may sometimes not give as much value to their existing skills.  I can't tell you how many times I've heard 'I'm not a statistician but...' and I think some of that defensive language evolves from the constant criticism that the statistics community provides in viewing their work.  I like the earlier posts about how even mathematical statisticians, applied statisticians, and biostatisticians may approach similar problems in different ways depending on the source of the data and what the research expectations are.

    So my question is, where does this community see those that call themselves data scientists?  Are they closer to the econo/epi/quant psych/analytic chemists or are they closer to the applied stat/biostat realm?  Does it depend on training?  Is the reason we are having these issues that we really don't know where to classify this group, at least in relation to the traditional statistics community?  Siblings or cousins?


    Also, the roundtable is full but I do plan to use some of the comments on this forum and will direct attendees to it.

    ------------------------------
    Jason Brinkley
    American Institutes for Research



  • 26.  RE: Reaching out to Data Scientists

    Posted 07-21-2016 09:40

    Jason - when is your round table?  I want to attend!

    ------------------------------
    Susan Spruill
    Statistical Consultant



  • 27.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 11:29

    Hello Mark,

    Thank you for your response to the questions. Congratulations on being part of a new bachelor's degree program. Good luck with the teaching. 

    I think a program like yours can be a big help in reaching out to data scientists who are not members of ASA. As you all train them, I'm sure they will see the value of statistics and have exposure to the breadth of our discipline, as well as some of the less glamorous parts, such as design of experiments (less glamorous meaning not machine learning :-) ). Then when these people go into the real world, they'll be ambassadors to other data scientists.

    Am I wrong in my thinking?

    Any new thoughts now that you've had a good night's sleep? :-)

    Thanks again.

    Chuck

    ------------------------------
    Charles Kincaid
    Engagement Director
    Experis Business Analytics
    chuck.kincaid@experis.com



  • 28.  RE: Reaching out to Data Scientists

    Posted 07-18-2016 22:37

    Hiya Chuck & the Consulting Group

     

    When the program was made, that was definitely the idea – business, computer science, and math/stat all are part of the program in pretty much equal amounts.  We are currently looking at our electives to make sure students have choices that some (but perhaps not all) will need, such as SAS programming, bootstrapping/simulation, DoE, etc.

     

    Since the topic of definitions has come up again recently, I'll throw out all of the ones that I use.  In general, if you end up creating synonyms, you end up with angry people who view you as stealing their job title.  I'm trying to go for maximal separation, so that the lanes are well-marked as to who does what.

     

    So here we go, definitions that are short and separate out the tasks most commonly used by

     

    Data Scientist:  Someone who creates valuable products from data.

     

    You never have to touch a statistical method, and you can still claim this title if you do data wrangling. Think about it this way ---  If a client wanted to create the GraceNote music database of CDs, would statisticians be chomping at the bit to take on this project, a mix of algorithm development, crowdsourcing, database management, and various IT issues?  This is a data scientist job.

     

    Data miner:  Someone who explores data in order to come up with hypotheses to test.  This would then need additional data and appropriate methods to complete the analysis.

     

    Someone who is great with algorithms and efficient programming is perfect for this; computer scientists and mathematicians have legitimate claim on this realm, especially with regard to graph-theoretic network analysis.

     

    Statistician:  Someone who characterizes the uncertainty in the data (how did the data come to be?) Building models and analyzing data is a key part.

     

    There are many statisticians who never want to touch dirty data, and many more who are quite happy creating the theory.  They have a place here.

     

    I'm going to borrow from the folks at Fort Meade, who provided one of the best definitions I've heard, since it won't change over time.

     

    Big Data:  Data that can never be completely stored, accessed, or computed on.

     

    Most people do not have big data.  They have hard-to-handle data that requires statisticians, data scientists, data miners, computer scientists, and IT to solve their problems.





  • 29.  RE: Reaching out to Data Scientists

    Posted 07-19-2016 13:01

    Mark,

    You have an interesting view on what Big Data is. If the data can never be completely stored, accessed or computed on, that sounds like a rock so large God can't even lift it.

    If you change that definition to never be computed on a typical desktop, I'd agree with you. I think true "Big Data" requires you to use banks of servers to process and store the data. 

    Looking at transaction data for a behemoth company like Amazon or Wal-Mart is Big Data. The million-plus-tuple data sets some of my friends look at are not big. They're just large compared to what we used in our stats classes. 

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 30.  RE: Reaching out to Data Scientists

    Posted 07-19-2016 13:16
    I would concur. If Big Data is going to be defined as that which can never be completely stored, accessed or computed on, then aren't we coming full circle? That is pretty close to the rationale behind sampling rather than using the population data: in most cases we would never have full access to all the data, so we sample it.

    I would also tend to agree that most people really don't use big data [volume-wise definition], even if they think they do. For example, at a bank I worked at it was nothing to pull thousands of transactions on millions of accounts [i.e., tens of billions of records] to do analytical work on a daily basis.

    I don't classify any of that as big data.

    It might just be in the eye of the beholder. Or, if you want an objective definition, it really has nothing to do with the size of the data, but more to do with which techniques you are using on it.

    In my opinion, Big Data is not a measure of volume of data but more of a descriptor of techniques used on data. Much like Light-Year is not really a measure of time, but of distance.

    Ike Eisenhauer







  • 31.  RE: Reaching out to Data Scientists

    Posted 07-19-2016 14:14

    Hiya Andrew,

     

    If you move the definition of big data to be workable "on a typical desktop," the amount of work you can do depends on the year of the equipment.  I'm looking for a timeless definition.

     

    The NSA-based definition of Big Data brings to the forefront that data science cannot do without statisticians in big data problems, in that we cannot forget sampling and design of experiments to lower the computational complexity back into the realm of what can be accomplished with "today's" devices.
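The sampling point can be made concrete with a small sketch. It assumes the data arrives as a stream too long to store and that a uniform random sample of fixed size is enough for the analysis at hand. The function name `reservoir_sample` and its parameters are illustrative only; the technique is standard reservoir sampling, not something prescribed by the definition above.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are held in memory at any time, so the full stream never
    has to be stored.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 100 values from a stream of one million without storing it.
sample = reservoir_sample(range(1_000_000), k=100, seed=1)
estimated_mean = sum(sample) / len(sample)
```

The estimate from the sample comes with sampling error, which is exactly the kind of uncertainty a statistician is trained to quantify.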

     

    -Mark





  • 32.  RE: Reaching out to Data Scientists

    Posted 07-19-2016 15:44
    Mark,

    RE:
    If you move the definition of big data to be workable "on a typical desktop," the amount of work you can do depends on the year of the equipment.  I'm looking for a timeless definition.
    RESP:
    One possible humble definition: 'We have Big Data when the Volume, Velocity, Variety, and/or Veracity are part of the problem.'    

    RE: The NSA-based definition of Big Data brings to the forefront that data science cannot do without statisticians in big data problems, in that we cannot forget sampling and design of experiments to lower the computational complexity back into the realm of what can be accomplished with "today's" devices.
    RESP: We are heading toward econometric decisions involving the value of information.  E.g., spending $X on collecting, storing, cleaning, and analyzing a billion observations by a thousand variables versus spending $Y on designing an experiment/taking a sample.  The first yields information worth $A and the second $B.  Go. 
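A minimal sketch of that comparison, assuming the decision reduces to net value (information value minus cost). The `Option` class, the `best_option` helper, and every dollar figure below are hypothetical placeholders standing in for $X, $Y, $A, and $B.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Option:
    name: str
    cost: float               # $X or $Y: what the approach costs
    information_value: float  # $A or $B: what the resulting information is worth

    @property
    def net_value(self) -> float:
        return self.information_value - self.cost


def best_option(options: Iterable[Option]) -> Option:
    """Pick the option with the highest net value."""
    return max(options, key=lambda o: o.net_value)


# Hypothetical figures, for illustration only.
collect_everything = Option("collect, store, clean, analyze all data",
                            cost=500_000, information_value=600_000)
designed_sample = Option("design an experiment / take a sample",
                         cost=50_000, information_value=400_000)

choice = best_option([collect_everything, designed_sample])
```

With these made-up numbers the designed sample wins, but the point is only that the comparison is explicit; real values of information are themselves uncertain and would need to be estimated.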

    Randy Bartlett
    LinkedIn Group: About Data Analysis
    Website: http://www.BlueSigmaAnalytics.com
    Analytics Magazine:
    http://goo.gl/Wod3gk , http://goo.gl/rZ7ys
    Datafloq Blog: https://goo.gl/8u31Ok
    Please visit the book and 'agree' to the reviews if they are helpful at:
    http://amzn.to/YGhXzv 







  • 33.  RE: Reaching out to Data Scientists

    Posted 07-19-2016 21:20

    Upon reading the voluminous email on data science versus statistics, an epiphanous image came to me. Sciences are all centered about a topic or entity. Statistics is not. It uses data; it uses probability; it uses mathematics; but it is not these disciplines. It applies to medicine, law, biology, physics, economics, psychology, business, and almost every other discipline to which a university has applied a department name, but it is not any one of those things nor is it like one of those things.

    Think for a moment about this image. Statistics is a mode of thought and action lying over and encompassing a discipline that renders understanding about the nature of that discipline. It is the explanatory umbrella for all quantifiable disciplines and their combinations. (You have to hoist aboard these ideas word by word, not as a scanned bundle.)

    We have never been able to give it a specific subject matter or name as a science because it includes them all. We have never been able to outline its boundaries as is done with other disciplines because it has none. That is why "statistician" is too vague a name to have clear meaning to scientists, journalists, or the public.

    Maybe this has been said before; I just haven't seen it. Or maybe it is just the rambling of an aged mind and I will laugh at it tomorrow.

    --Bob Riffenburgh

     








  • 34.  RE: Reaching out to Data Scientists

    Posted 07-19-2016 21:34

    This is an element of why I say that we have to fundamentally re-evaluate our role in the decision making value stream. As I said in my separate post in this thread, we exist to help others make decisions, just as data scientists do; the difference is that the "data scientists" (whatever that means) do it using different algorithmic methodologies than the "statisticians" are typically accustomed to. The common challenge is that the so-called "scientific community" (not my term) expects those with statistical skills to be just like them in understanding of the subject matter AND have the perceived methodological skill, rather than someone who can help them make the right decisions. We can talk about "collaboration" with the "scientific community" all day long, but until we come to terms with this very issue the problem will persist (I would like to insert a shameless plug here that the JSM invited panel I'm moderating is closely related to this subject). I can say from experience that the data scientists suffer from the same problem as statisticians do in this regard, and I have written a few pieces on the topic as well as devoted a good part of my client work to the topic.

    ------------------------------
    Michiko Wolcott
    Principal Consultant
    Msight Analytics



  • 35.  RE: Reaching out to Data Scientists

    Posted 07-20-2016 12:42

    Michiko and Robert bring up some issues on the (sad) state of statistics. If you go back and look at what some of the most famous statisticians did from the 1900s to the 1960s, you find people like William Gosset, a chemist who needed to work with small sample sizes in the production of beer. R.A. Fisher is considered one of the greatest biologists of the past 100 years. George Box was a chemist who turned to statistics and dealt with statistical methods for industrial production. They were scientists! They also did statistics with a purpose: to further scientific discovery. Today, we have a lot of statisticians who come from pure mathematical backgrounds and have no concept of where statistics is used outside of statistics. (Which is really sad.)

    When I was transitioning from Chemistry to Applied Mathematics/Statistics, I asked lots of statisticians about applications of Design of Experiments methods in a chemistry lab setting. I talked with a couple of consultants about the use of Mixture Designs and Optimal Response Surfaces for toxicology testing of multiple chemicals in the same experiment. The general consensus among the consultants was that "it's impossible". Not because it is impossible, but rather because no one else is doing it that way. The consultants also assumed that the consultees (chemists, biologists and toxicologists) already "know" about Design of Experiments, have tried it, and failed. The truth is, most academic scientists have no clue about statistical methods beyond simple linear regression and t-tests. (Try asking scientists who don't ask you for help what statistical methods they use, or look it up in published articles from "respected" science journals, not just medical journals.) Industrial scientists know more about statistics because of Six Sigma projects and working with Industrial Engineers.

    After I completed my 12th stats class, had my MS in Applied Mathematics and applied to "statistical consulting" jobs in my area, I was passed over for an interview most of the time. Through the local ASA groups, I got to know some of the hiring managers and other employees at these facilities. So, I got to ask them about the data analysis they perform. I asked them (10-12 people) what methods their scientists use for quality control. The answer was, "I don't know. The scientists give me the data and I analyze it." When I asked, "How do you know the data is valid?" the answer was, "The scientist only gives me valid data." When I asked, "How much invalid data is there?" they said, "I don't know." Which seems not only not scientific, but possibly anti-scientific. It seems like a lot of today's statisticians are there merely to give the impression of scientific validity. (I never knew "Because I said so!" was valid and viable scientific reasoning. Sorry, Mom, for questioning you ;-) )

    So, there could be another issue like the hole in the ozone layer. (30%-50% of the data was invalidated by scientists because it was too low and didn't conform to the scientists' expectations. Which is why it took over 10 years to "discover" when it was clear that it was happening in cycles and getting worse after only a few years.) And because only "valid" data comes through, there is a false sense that everything is fine. Yet in industry, with the emphasis on continuous improvement and good QC protocols, industrial scientists know that their instruments can fail and give biased results... with the help of Industrial Engineers.

    Meanwhile, we are discussing how to communicate between statistically savvy computer scientists and statisticians. Perhaps we should focus on communicating between scientists and statisticians first. Let's get back to the roots of good statistical practice. Let's look at the method scientists use to generate the data we analyze and gather. Let's test the theories about why X is related to Y. Let's be more like Peter Goos and Brad Jones and look at and use better design of experiments methods and discuss their uses in the field. Let's communicate how much better a Definitive Screening Design is than an OFAT method. Let's make sure we have valid data before we get involved in the analysis.

    Then we can discuss what makes a data scientist and how we can work with our scientific cousins.  

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 36.  RE: Reaching out to Data Scientists

    Posted 07-19-2016 21:42
    That is a great explanation. I will be paraphrasing or straight lifting some of that any time an interviewer asks about whether I have direct experience in XYZ where I do not. It really often just doesn't matter. Better to diversify the team instead, in my opinion. 



     





  • 37.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 11:39

    Below is a very interesting video of a panel discussion entitled "Data Science and Statistics: different worlds?" The panel was sponsored by the Royal Statistical Society and includes Chris Wiggins, Chief Data Scientist at the NY Times and Associate Professor of Applied Mathematics at Columbia; David Hand, Emeritus Professor of Mathematics at Imperial College; Francine Bennett, Founder of Mastodon-C; Patrick Wolfe, Professor of Statistics at University College London and Executive Director of the UCL Big Data Institute; and Zoubin Ghahramani, Professor of Machine Learning, University of Cambridge.

    Data Science and Statistics: different worlds?

    Chris Wiggins (Chief Data Scientist, New York Times) David Hand (Emeritus Professor of Mathematics, Imperial College) Francine Bennett (Founder, Mastodon-C) Patrick Wolfe (Professor of Statistics, UCL / Executive Director, UCL Big Data Institute) Zoubin Ghahramani (Professor of Machine Learning, University of Cambridge) Chair: Martin Goodson (Vice-President Data Science, Skimlinks) Discussant: John Pullinger (UK National Statistician) In the last few years data science has become an increasingly popular discipline.
    ------------------------------
    Charles Kincaid
    Engagement Director
    Experis Business Analytics
    chuck.kincaid@experis.com



  • 38.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 11:46

    I'd also like to ask the group to consider responding to this survey from the Committee on Applied Statistics. As explained below, the survey addresses the broader topic of collaboration. Collaborating with Data Scientists is part of that, so your answers will be helpful.

    Thank you.

    FILL OUT IN GOOGLE FORMS

     

    Dear Friend of CAS: STAND UP AND BE COUNTED! We statisticians are well aware of the current trend toward decreasing survey response rates that can bias a study’s results. Let’s practice what we preach and prove that statisticians value high response rates by completing this survey conducted on our very own population of professionals.

    Your assistance is requested for the first CAS survey of statisticians and collaborations. This research will be used to shape the committee’s new initiative focused on improving collaboration skills within ASA and statisticians. Survey results will provide an overview of current strengths and weaknesses, needs and opinions related to your collaborations.

    The questions are brief and primarily relate to your work experience. It should take a few minutes to complete. Your response is very important to provide an appropriate representation of statisticians. If you have any difficulty accessing the web survey, please contact the committee at appliedstatistcians@gmail.com.

    Your participation is voluntary and we encourage you to make this special survey a priority.

    Sincerely,

    Erin Tanenbaum, Mark Otto, and the rest of the Committee on Applied Statisticians

    To help protect your confidentiality, the surveys will not contain information that will personally identify you. Your name will not be associated with any information you provide unless you choose to provide your name. The results of this study will be used to shape our initiative and will be shared with the ASA. Summarized data may be disseminated at an upcoming JSM. Taking the survey indicates that you have read the above information and that you agree to participate. Thank you very much for your cooperation.

    ------------------------------
    Charles Kincaid
    Engagement Director
    Experis Business Analytics
    chuck.kincaid@experis.com



  • 39.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 14:01
    Hi Everyone,

    I guess I will throw my two cents in.  If the goal is more collaboration and interaction with Data Scientists, why not hold a joint meeting/workshop built around it?  We could have "Data Science" people help educate Statisticians on items 1 and 2 from the list previously proposed, and we could have Statisticians educate the "Data Science" people on the other items on the list.

    I think there is a huge area for collaboration in that many of the methods Statisticians develop are used by Data Scientists.  And I would argue that Statisticians who are working with "Big Data" are both Statisticians and Data Scientists. Hence we already have Statisticians who are prepared to engage in these sorts of collaborations.

    Maybe a joint conference/workshop held with, say, an INFORMS society as well as an ACM society would provide a forum to meet and learn from each other.  I think that is where collaboration will begin: learning what we have in common and learning skills from each other.

    If anyone would like to work on setting up a workshop/conference I would be more than willing to help put something like this together.  

    Thanks,
    Ed

    Edward L Boone
    Associate Professor of Statistics
    Department of Statistical Science and Operations Research
    Virginia Commonwealth University
    4123 Grace Harris Hall
    1015 Floyd Ave.
    Richmond, VA 23284
    Phone:  +1 804 828-4637
    Google+:  Ed Boone





  • 40.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 14:18

    Ed, great idea. In case he's not reading this thread, Jim Cochran (at UAB) is a member of both ASA and INFORMS; he's a good contact for INFORMS.

    ------------------------------
    Chris Barker, Ph.D.
    Consultant and
    Adjunct Associate Professor of Biostatistics


    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy



  • 41.  RE: Reaching out to Data Scientists

    Posted 07-13-2016 14:25

    I work as a biostatistics consultant and I have some thoughts on an aspect of "data science" as it relates to my field. In this context my (poor) understanding of data science is that it offers the ability to sift through reams of new and historical data in order to maximize pertinent information with respect to a medical condition and its treatments. In this sense data science can be of benefit in revealing patterns of efficacy responses in combination with safety and demographic outcomes. But one thing that to my mind appears to be overlooked is that, over time, medicine's understanding of diseases and their causes has often evolved greatly. As an example, a patient evaluated for entry into a clinical trial in 1990 and rejected as a screen failure, and thus never treated, might be enrolled into the same study if it were conducted in 2016. The same can be true in reverse; a patient enrolled into a study in 1990 might not qualify for entry into that study in 2016. The difference is that medical techniques, diagnostic tests and devices, and physicians' knowledge of disease progression are much better in 2016 than they were in 1990.

    But the historical data do exist. Should they be blindly included when investigating the disease and potential medical cures? What should be done to make the outcomes of interest as consistent as possible, so that a (non-data) scientist can make a statement that is valid for 2016 and possibly the next handful of years, say through 2020?

    ------------------------------
    Nestor Rohowsky
    President and Principal Consultant
    Integrated Data Consultation Svcs, Inc.



  • 42.  RE: Reaching out to Data Scientists

    Posted 07-17-2016 09:19

    I’ve been called both a statistician and a data scientist (among other things!), and have hired both “statisticians” and “data scientists.” I work a good bit with both on the non-technical front today, and help organizations figure out what all of this means to them. While much of the discussion focuses on the difference between these two, there is one thing in common: we both exist to help others answer their research/business questions, and we do so through data. The difference is primarily in the specific methodology/algorithm/approach and maybe the specific research/business questions that we’re trying to address. I agree with the point made earlier about mathematical statisticians vs. applied statisticians vs. etc. I also agree with Randy that a lot of discussions that go on among statisticians do not reflect the reality in a non-ignorable chunk of the world.

    However, more specifically within the applied statistics realm, there is a lot in common between predictive statistical modelers and data scientists who do predictive analytics. The advantage that statisticians should have, at least in theory, is that statistics is founded on the idea of variability, as someone else pointed out; that is, it is built on the fundamentals of probability. However, I’ve seen plenty of so-called statisticians to whom statistics is merely a set of algorithms. To me they are no different from data scientists as commonly perceived. Keep in mind that predictive modeling statisticians work under conditions that some statisticians find deplorable: the classical assumptions can NEVER be true--for example, you can't take a random sample of future events (at least not until they invent time travel), which obviously has design implications. I've met very few statisticians who truly understand what this means, and I have been fortunate to hire them.

    From my perspective, it is difficult for me to talk about collaboration between the two camps without understanding the differences and especially the similarities. FWIW, I’ve run into a similar discussion about Statistics without Borders' involvement with the Digital Humanitarian Network, specifically about where SWB ends and GIS Corp (an SWB-like organization of GIS folks) begins, now that SWB is more frequently involved in data visualization that often includes maps. However, the discussion in this case is much less contentious, perhaps because the goals are clear – we’re there to serve the humanitarian needs, and we have had a number of projects in which we collaborated with no issues. Take from this what you will.

    My 0.02.

    ------------------------------
    Michiko Wolcott
    Principal Consultant
    Msight Analytics



  • 43.  RE: Reaching out to Data Scientists

    Posted 07-17-2016 11:10
    All,

    Michiko and I are on the same page, except for this term, 'data science.'  The term is used to capture two disparate fields, which use different skills, software, and thinking.  IT data scientists are managing data and sometimes claiming to analyze data.  Statistical data scientists/applied statisticians are analyzing data.  This term is a Trojan horse for IT to reach for statistics problems.  When Michiko writes, '... have hired both "statisticians" and "data scientists,"' the statement is open to interpretation.  It might mean that she has hired people to analyze the data and to manage the data.  It might mean that she has hired two types of statisticians.  It might mean that she has hired two types of statisticians and IT to manage the data. 

    Elsewhere Michiko writes,
    'Keep in mind that predictive modeling statisticians work under' different conditions; 'the classical assumptions can NEVER be true.'  Again, we are on the same page.  She is referring to the fact that in the field we use statistical techniques that are not emphasized, and sometimes not taught, in grad school.  This does not mean that what we are doing is not statistics, as some have claimed.  If sampling were not taught in grad school, that would not mean that it is not part of statistics.  Our statistics problems involve uncertainty with the numbers (http://goo.gl/Wod3gk) and can always be traced back to what earlier statisticians were addressing.  Oftentimes we have massive observational data and must use statistical techniques like cross validation.  Furthermore, we engage in far more prediction and far less hypothesis testing than, say, academic statisticians. 
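
    For readers less familiar with the technique mentioned above, here is a minimal sketch of k-fold cross validation, assuming a plain straight-line model fitted by least squares. The function names (`k_fold_indices`, `fit_line`, `cross_validated_mse`) and the simulated data are hypothetical and chosen only for illustration; they are not taken from any poster's actual workflow. Note that random folds treat observations as exchangeable; when the goal is to predict future events, a split that respects time order is usually the safer design.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def cross_validated_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error across k folds."""
    fold_mse = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the training part, then score on the held-out part.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_mse.append(statistics.fmean(errors))
    return statistics.fmean(fold_mse)


if __name__ == "__main__":
    # Simulated data (an assumption for the demo): a line plus Gaussian noise.
    rng = random.Random(1)
    xs = [i / 10 for i in range(100)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    print("5-fold CV estimate of prediction MSE:", cross_validated_mse(xs, ys))
```

    The point of the design is that every observation is scored by a model that never saw it, so the resulting number estimates prediction error rather than goodness of fit to the data in hand.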

    Another misunderstanding is this idea that there is a need to teach IT how to analyze data.  No one wants that.  We want statisticians to analyze data so that it is done correctly and so that our profession will live on.  An applied statistician who teaches IT how to do their job is a fool for multiple reasons.  We want IT to hand off the statistics problems to the appropriate skill set.  IT is being told by their massive media arm that they can analyze data by just running an algorithm--'It's all in the algorithm.'  The mission of IT is to one day put all human knowledge into an algorithm.  IT thinks deductively and they do not fully appreciate the complications of dealing with uncertainty.  This has huge implications for IoT. 

    Randy Bartlett
    LinkedIn Group: About Data Analysis
    Website: http://www.BlueSigmaAnalytics.com
    Analytics Magazine:
    http://goo.gl/Wod3gk , http://goo.gl/rZ7ys
    Datafloq Blog: https://goo.gl/8u31Ok
    Please visit the book and 'agree' to the reviews if they are helpful at:
    http://amzn.to/YGhXzv 




  • 44.  RE: Reaching out to Data Scientists

    Posted 07-17-2016 13:18

    FWIW, I deliberately leave the "statisticians" and "data scientists" in quotes--they are not my terms. It's what others (people, job titles in the HR system, etc.) called them, and I think many will agree that these terms are really in the eye of the beholder. Therefore yes, it is supposed to be open to interpretation and thus I disagree with Randy's disagreement ;-) I've all but given up on the labels and simply refer to myself and my colleagues as "analytical consultants." Most people in my world don't question me any more.

    My point is simply that it is useful to understand the what vs. the how.

    ------------------------------
    Michiko Wolcott
    Principal Consultant
    Msight Analytics



  • 45.  RE: Reaching out to Data Scientists

    Posted 07-20-2016 14:29

    Fair enough, I understand you now that you clarified your use of quotes, which I thought served another purpose.  I suspect that we are both open to better ways to convey this.  Allow me to contemplate this while walking in my garden where I planted “carrots” and “vegetables” for consumption by “humans” and “animals.” 

    I am sympathetic to those who want the data science terminology to go away; most of us want statistics problems to be called statistics problems and nothing more.  The problem is that three years ago IT convinced some of our clients that data science includes everything with data, even statistics problems.  In the field, we choose to keep the work, which means embracing the vague terminology along with 'applied statistics.'  

    Ron Wasserstein nailed it when he wrote: “facilitate … between statisticians and other data scientists.”  It is the ‘and other’ that helps applied statisticians in the field to retain those statistics problems, which are labeled under data science.  We should constantly mention the fact that statistical data scientists analyze the data and IT data scientists manage it.  Instead of asking ‘how can statisticians collaborate with data scientists?’ we might ask ‘how can computer scientists collaborate with data scientists?’  Finally, I am a greater fan of ‘Statistics/Data Science’ or 'Statistics and other Data Science' than ‘Statistics and Data Science.’ 

    ------------------------------
    Randy Bartlett



  • 46.  RE: Reaching out to Data Scientists

    Posted 07-20-2016 14:51

    A recent, interesting online article from KDnuggets (a "knowledge discovery"/data science e-magazine) about Data Science and Statistics.

    (The original article written by a statistician appeared about 2 years ago) 

    Why Big Data is in Trouble: They Forgot About Applied Statistics

    http://www.kdnuggets.com/2016/07/big-data-trouble-forgot-applied-statistics.html

    ------------------------------
    Chris Barker, Ph.D.
    Consultant and
    Adjunct Associate Professor of Biostatistics


    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy



  • 47.  RE: Reaching out to Data Scientists

    Posted 07-20-2016 15:09
    Fair enough, I understand you now that you clarified your use of quotes, which I thought served another purpose.  I suspect that we are both open to better ways to convey this.  Allow me to contemplate this while walking in my garden where I planted "carrots" and "vegetables" for consumption by "humans" and "animals."

    I am sympathetic to those who want the data science terminology to go away; most of us want statistics problems to be called statistics problems and nothing more.  The problem is that three years ago IT convinced some of our clients that data science includes everything with data, even statistics problems.  In the field, we choose to keep the work, which means embracing the vague terminology along with 'applied statistics.'  

    Ron Wasserstein nailed it when he wrote: "facilitate … between statisticians and other data scientists."  It is the 'and other' that helps applied statisticians in the field to retain those statistics problems, which are labeled under data science.  We should constantly mention the fact that statistical data scientists analyze the data and IT data scientists manage it.  Instead of asking 'how can statisticians collaborate with data scientists?' we might ask 'how can computer scientists collaborate with data scientists?'  Finally, I am a greater fan of 'Statistics/Data Science' or 'Statistics and other Data Science' than 'Statistics and Data Science.'


    Randy Bartlett
    LinkedIn Group: About Data Analysis
    Website: http://www.BlueSigmaAnalytics.com
    Analytics Magazine:
    http://goo.gl/Wod3gk , http://goo.gl/rZ7ys
    Datafloq Blog: https://goo.gl/8u31Ok
    Please visit the book and 'agree' to the reviews if they are helpful at:
    http://amzn.to/YGhXzv 




  • 48.  RE: Reaching out to Data Scientists

    Posted 07-21-2016 02:47

    Dear all,

    You may also be interested in my overview of big data, data science and statistics at https://goo.gl/cE01Hs and/or http://goo.gl/dsXco1 (11th version as of April 2016).

    This corresponds to how I explain it to managers and make the connection between statistics and data science: by underlining that both approaches to 'learning from data' or 'turning data into knowledge' are complementary and should proceed side by side in order to enable proper data-driven decision making (see also http://www.statoo.com/en/datamining/).

    Greetings from Switzerland

     Diego

     

    ------------------------------
    Prof. Dr. ès sc. Diego Kuonen, CStat PStat CSci
    CEO, CAO, PhD in Statistics

    Statoo Consulting, Switzerland

    http://www.Statoo.info

    http://about.me/DiegoKuonen





  • 49.  RE: Reaching out to Data Scientists

    Posted 07-21-2016 10:33

    I am just now getting around to reading all the posts.  Most interesting and a little disturbing.  I have to admit I balked at the term "Data Scientist" when it first appeared and I balked at the term "Big Data" for the same reason...it seemed ostentatious.  I have come to understand "Big Data" as a term for an amalgamation of data from many sources, often seemingly disparate, that is "mined" for trends and correlations.  "Big Data" is never derived from controlled experimental design.  But that does not make it BAD data.  I now think of "Data Scientists" as those who STUDY DATA...just like a biologist studies biological systems and an economist studies economic systems.  The difference is that there are giant overlaps in where the data being studied by "Data Scientists" come from.  In other words, a Data Scientist may study data mined into a Big Data problem that spans biology, economics, environment, and sociology.  We can't label the data specifically.  It is necessarily biological, economic, chemical, industrial, etc.

    Now to the argument of "Data Scientist" vs "Statistician": all statisticians are data scientists.  But not all data scientists are statisticians.  Andrew made a good point.   Historically, statisticians were SCIENTISTS who took the extra step of trying to couch their science in mathematical and probabilistic terms.  However, not all scientists were statisticians.  The same is true today.  It is unfortunate that "Data Scientist" has become the label. "Data" is Latin for "information" or specifically, "something given".  The terms "Datologist" or "Informatologist" might be more accurate.  "Numerologist" is already taken!  

    The bottom line is this: DATA has come to be synonymous with "analysis".  You can't do analysis without data and you can't understand data without some form of analysis. "Statistics" is the "analysis of" while "Science" is the "study of".  So if your main focus is studying a system, you should be a scientist.  If your main focus is analyzing data from the system, you are a statistician.  Why invent new labels?

    I call myself a "Statistical Consultant" because I understand the methods of experimental design and analysis of data.  I consult with those who study systems on how they might better study (experiment or observe) and analyze their data from their field of study.

    All that said, it is also helpful to have statisticians who are pure mathematicians.  These are the people who help us figure out the theories behind our observations and analyses. They are our educators and validators.

    Vive la différence!

    OK.  I'll step off the soapbox now.

    ------------------------------
    Susan Spruill
    Statistical Consultant



  • 50.  RE: Reaching out to Data Scientists

    Posted 07-21-2016 19:29

    I once was making a similar point to a statistician who adamantly resisted the idea because "not all statisticians analyze data." Understandably there are people who research statistical methodologies, but even then, is it possible, practical, or even credible to do so without ever analyzing data, or at least keeping in mind that what is being done is so that data can be analyzed effectively? Admittedly, I am coming from a very applied perspective, so any counterargument is strongly encouraged.

    ------------------------------
    Michiko Wolcott
    Principal Consultant
    Msight Analytics



  • 51.  RE: Reaching out to Data Scientists

    Posted 07-28-2016 09:34

    With the recent rise of Data Science teams within General Mills, those of us coming from the long-standing "statistics" or "analytics" groups have been adapting to these new perspectives, and seeking collaborations.  So far:

    1) This is a positive change for our organization.  As Statisticians we have long fought for the smarter use of data to address the needs and challenges of the company.  Our leaders are listening as never before.

    2) Our Statistics folks have been welcomed into this community, and even viewed as leaders in some cases.  While I think others respect the Statistics knowledge we bring, they seem more drawn to our knowledge of the business and science problems and how to solve them. Our Statisticians have been playing this game a lot longer.

    3) We are not focusing on terminology or job titles - instead we are embracing the larger name of Data Science as a unifying theme.  The way I talk about it is that "Analytics" refers to the use of data and statistics (or ML) to support decisions and solve problems.  "Data Science" is the multidisciplinary program to bring Analytics to life.  

    4) Lots of people want to be part of "Data Science." So it needs to be a team sport. To help people see how they fit in, we highlight three main areas within Data Science: "Data" (we use "Connected Data" as our paradigm here, avoiding the connotations of big data); "Analysis" (essentially Statistics and related fields), and "Delivery" (the reports, online tools, or algorithmic implementations). Most people find they have skills and expertise in at least one of these areas, and this allows us to build the teams we need to solve problems.  We only have a few true "generalists" that span all areas.

    5) It is pushing our Statisticians to think more broadly about their skill set.  We already were pretty good with statistical programming and have created some operational tools.  But, for example, we realized the huge gaps in our ability to work with larger or more complex data sources, and in our knowledge of technologies to deliver models.

    6) Similarly, it is pushing others to learn more Statistics.  The interest in statistics and statistical thinking has never been higher.  So we are in a position to provide advice and coaching.

    We have a long way to go.  During my time at JSM, I am hoping to learn from others who are on this journey and having some success!

    Fred

    ------------------------------
    Fred Hulting
    Director, Global Knowledge Services
    General Mills, Inc.