ASA Connect

  • 1.  Terminology - Statistics vs Data Science vs Database

    Posted 03-06-2019 10:04
    Sometime in the past few months, I remember seeing a web page that showed how different areas of data analytics used different terms for the same objects. For example, a database person says "rows" for what a statistician calls "observations," and the terms "columns" and "variables" and "features" all refer to the same thing. 

    Alas, I've tried a number of ways to search for this in Google and have been unable to find it again. I'm hoping someone here has seen a similar resource  and can tell me where to find it.

    Thanks!

    Gerald Belton 
    North Carolina

    ------------------------------
    Gerald Belton
    ------------------------------


  • 2.  RE: Terminology - Statistics vs Data Science vs Database

    Posted 03-07-2019 09:37
    This is an area against which I have long protested. The problem with Data Science is that the practitioners have minimal experience in any one of the three areas that make up the field (programming, data base management, and statistics), which often results in a "jack of all trades, master of none."

    As is pointed out in the previous post, this often results in terminology confusion. Stats is generally quite exact in its terminology, while other areas are not. Columns and rows can mean any number of things in programming and data bases whereas observations and variables are well defined in statistics.

    As I have mentioned in previous publications, if an employer wants to have statistical analysis done, especially sophisticated analysis like modeling, then they should use a statistician, not a data scientist, or at a bare minimum make sure the data scientist has the appropriate background and experience in stats.

    ------------------------------
    Michael Mout
    MIKS
    ------------------------------



  • 3.  RE: Terminology - Statistics vs Data Science vs Database

    Posted 03-08-2019 08:03
    I agree wholeheartedly with Michael Mout. 

    As someone who works in databases all day, every day, and who is also in a master's program for applied statistics, I agree when Michael states "Stats is generally quite exact in <its> terminology, while other areas are not. Columns and rows can mean any number of things in programming and data bases whereas observations and variables are well defined in statistics."

    There are many database professionals who have very specific definitions of rows and columns.  Joe Celko is the one who is top of mind for me (see "Joe Celko - Wikipedia").  Here is a blog post demonstrating, however, that there is much conversation around topics like this even within the database world: "What is the difference between a "record" and a "row" in SQL Server?"


    So the answer to your question, Gerald, is: it depends. It depends on the context of the question.  It depends on the backgrounds of the practitioners discussing the problem.  It depends on the company or agency you are working for.  And so on.



    ------------------------------
    Jennifer Mahoney
    Business Intelligence Developer
    Spectra Logic
    ------------------------------



  • 4.  RE: Terminology - Statistics vs Data Science vs Database

    Posted 03-08-2019 08:09
    Thanks for all of the responses; some of the links I received in private email have been very helpful. The ensuing discussions have also been enlightening.

    All I was really after was a list I could give to my community college Data Analytics students, so I could say "In this class, we are going to call these the 'independent variables' but don't get confused if you hear someone call them 'features' or 'predictors.'"
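
    A minimal sketch of what such a list could look like as a lookup table. The groupings use only terms that come up in this thread; the names SYNONYMS and also_called are hypothetical, and the list is an illustration rather than an authoritative glossary.

```python
# Synonym groups built only from terms mentioned in this thread.
# This is an illustrative assumption, not a complete or official glossary.
SYNONYMS = [
    {"row", "observation", "tuple"},
    {"column", "variable", "feature"},
    {"independent variable", "feature", "predictor", "input variable"},
    {"dependent variable", "response variable"},
    {"coefficient", "beta", "weight"},
]


def also_called(term):
    """Return every other term that shares a group with `term`, sorted."""
    term = term.lower()
    related = set()
    for group in SYNONYMS:
        if term in group:
            related |= group
    related.discard(term)
    return sorted(related)


print(also_called("feature"))
print(also_called("observation"))
```

    A term such as "feature" lands in more than one group, which is exactly the kind of overlap students may find confusing.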



    ------------------------------
    Gerald Belton
    ------------------------------



  • 5.  RE: Terminology - Statistics vs Data Science vs Database

    Posted 03-07-2019 13:34
    I spoke with Brian Nosek at the Center for Open Science a while back. He was interested in creating a "Dictionary of Statistical Terms". He wanted everyone to follow the same terminology as everyone else. After discussing something similar to what you are talking about, but without discussing "Data Science" terminology, I think he decided against it.

    For example, if we talk about a "Mixture Model", what are we talking about? If you talk to someone dealing with probabilistic models, a mixture model is one that uses 2 or more probability distributions. If you talk to an economist, or at least the economist I had for econometrics, a mixture model has fixed and random variables. If I talk to industrial engineers and applied statisticians, a mixture model is the regression model you get from analyzing a Mixture Design. Among these realms of probability and statistics, who has the "correct" definition?

    If you talk to someone in Design of Experiments about a "covariate", do they have the same thoughts and ideas about what that means vs someone from a psychology department? I don't think so. 

    Speaking of Mixture Designs, a DoE practitioner will talk about the components they used in their experimental design. Others will discuss how they want to model data where they have a combination of factors they want to test for and factors they want to record and use in the model but that are not the primary factors of interest....

    Oh, oh, oh, did I just say "factors"? In DoE, factors are the variables we use and control in our experiments. They tend to be orthogonal or nearly orthogonal arrays of variables. How many people use "factor" to describe an "independent variable"? (Should we discuss factor analysis here?)

    Since we are on the topic of "independent variables", why do we call them independent variables? If there is a serious amount of collinearity among the "IVs", they are not independent... The dependent variable depends upon what, the assumption that independent variables have some sort of relationship to it? (Can a variable be dependent upon another variable if there is no relationship between them?) Instead of IVs and DVs, how about we call them "Input Variables" and "Response Variables". That way, we get around the dictionary definitions of independence and dependence and you don't confuse anyone.
    When it comes to rows and columns in a database, those are tuples and features.
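
    To make the collinearity point concrete, here is a small sketch with made-up data (the numbers and variable names are assumptions for illustration only). Two columns that would both be called "independent variables" in a regression are almost perfectly correlated with each other.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Made-up data: the same heights recorded in centimetres and, with a
# little measurement noise, in inches.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [59.2, 62.9, 67.0, 70.8, 74.9]

r = pearson(height_cm, height_in)
print(f"correlation between the two 'independent' variables: {r:.4f}")
```

    Both columns could be entered as "IVs", yet neither is independent of the other in the everyday sense of the word.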

    In my Intelligent Systems class, when we make a regression model, we have weights for our features. In my Stats classes, we have coefficients or betas for the terms in our regression model. There is also an idea about "standardizing" the features before using them in a regression model. Perhaps that is the difference between weights and coefficients.
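
    One possible reading of the weights-versus-coefficients remark, sketched with made-up data for a one-predictor least-squares fit (the data and the helper name slope are assumptions). Standardizing the feature does not create a different kind of quantity; it only rescales the slope by the feature's standard deviation.

```python
from statistics import mean, pstdev


def slope(x, y):
    """Least-squares slope for a simple regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# "Coefficient" (beta): slope on the raw feature.
beta = slope(x, y)

# "Weight": slope after standardizing the feature to mean 0, sd 1.
sd_x = pstdev(x)
z = [(a - mean(x)) / sd_x for a in x]
weight = slope(z, y)

print(f"beta on raw x:           {beta:.4f}")
print(f"weight on standardized x: {weight:.4f}")
print(f"beta * sd(x):            {beta * sd_x:.4f}")
```

    Under this reading, a "weight" on a standardized feature and a "coefficient" on the raw variable describe the same fitted line in different units.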

    I don't think we can criticize others for "misusing" terminology when we don't have solid definitions that are universally understood and applied. 

    This discussion also reminds me about "constants" in science. Depending upon what area of science you are in, alpha and beta have different definitions. If you try to make every constant unique, you go through the European alphabets and we all start learning Chinese characters..... then retrain every scientist to recognize those new definitions of constants.... which leads to the question, who decides which constants stay as they were and which constants change?

    Or, to the point of this discussion, suppose the ASA decides that in a regression model, you have coefficients and terms in your model and not weights for features. Why does anyone else have to listen to the ASA? Will the ASA send out goon squads to intimidate everyone who does not use their definitions? Does the ASA have a police force with international jurisdiction to arrest and/or fine those who do not conform?

    Or, we can recognize that there are different terms for the same thing in different areas of science, the same way there are different terms for the same thing in English. (If you don't think so, go write me a letter on your Davenport, then put on your toque and grab a 2 4 from the LCB and then relax on your Chesterfield or go throw some stones.) Right now, the Canadians are smiling.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 6.  RE: Terminology - Statistics vs Data Science vs Database

    Posted 03-08-2019 11:06
    Well said, Andrew!!

    ------------------------------
    William DuMouchel
    Chief Statistical Scientist
    Oracle
    ------------------------------



  • 7.  RE: Terminology - Statistics vs Data Science vs Database

    Posted 03-08-2019 11:15
    I agree that it is confusing even when you stick to traditional areas of Stats; however, when you throw in Data Base developers and Programming (as in "Data Science") things get even more confusing.

    ------------------------------
    Michael Mout
    MIKS
    ------------------------------



  • 8.  RE: Terminology - Statistics vs Data Science vs Database

    Posted 03-10-2019 16:54
    Part of the issue is that over the last 100 years or more, many problems in statistics were solved by people who started out in other disciplines.  In addition, many world-class applied statisticians started out in other disciplines.  Therefore, many names of things came from other disciplines.  I don't think that will change, because the main reason for using statistics is to solve problems in other disciplines.

    ------------------------------
    Emil M Friedman, PhD
    emilfriedman@gmail.com
    http://www.statisticalconsulting.org
    ------------------------------