I spoke with Brian Nosek at the Center for Open Science a while back. He was interested in creating a "Dictionary of Statistical Terms". He wanted everyone to follow the same terminology as everyone else. After discussing something similar to what you are talking about, but without discussing "Data Science" terminology, I think he decided against it.
For example, if we talk about a "Mixture Model", what are we talking about? If you talk to someone dealing with probabilistic models, a mixture model is one that combines 2 or more probability distributions. If you talk to an economist, or at least the economist I had for econometrics, a mixture model has fixed and random variables. If I talk to industrial engineers and applied statisticians, a mixture model is the regression model you get from analyzing a Mixture Design. Among these realms of probability and statistics, who has the "correct" definition?
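To make two of those meanings concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names (`mixture_pdf`, `scheffe_linear`) and all of the numbers are made up for this example, and the DoE version assumes a first-order Scheffe model, which is one common way to analyze a Mixture Design.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """Probabilistic sense: a weighted sum of component densities."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def scheffe_linear(proportions, betas):
    """DoE sense (assumed first-order Scheffe model): no intercept,
    and the component proportions must sum to 1."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("mixture components must sum to 1")
    return sum(b * p for b, p in zip(betas, proportions))

# Hypothetical numbers, chosen only for illustration.
density = mixture_pdf(0.0, [0.3, 0.7], [-1.0, 2.0], [1.0, 0.5])
response = scheffe_linear([0.2, 0.3, 0.5], [10.0, 20.0, 30.0])
```

Same two words, two unrelated objects: the first returns a probability density, the second returns a predicted response for a blend of ingredients.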
If you talk to someone in Design of Experiments about a "covariate", do they have the same thoughts and ideas about what that means vs someone from a psychology department? I don't think so.
Speaking of Mixture Designs, a DoE practitioner will talk about the components they used in their experimental design. Others will discuss how they want to model data where they have a combination of factors they want to test for and factors they want to record and use in the model but that are not the primary factors of interest...
Oh, oh, oh, did I just say "factors"? In DoE, factors are the variables we use and control in our experiments. They tend to form orthogonal or nearly orthogonal arrays of variables. How many people use "factor" to describe an "independent variable"? (Should we discuss factor analysis here?)
Since we are on the topic of "independent variables", why do we call them independent variables? If there is a serious amount of collinearity among the "IVs", they are not independent... The dependent variable depends upon what? The assumption that the independent variables have some sort of relationship to it? (Can a variable be dependent upon another variable if there is no relationship between them?) Instead of IVs and DVs, how about we call them "Input Variables" and "Response Variables"? That way, we get around the dictionary definitions of independence and dependence and we don't confuse anyone.
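The collinearity point can be sketched in a few lines of Python. This is a hedged illustration on simulated data, not anything from a real study: the variables `x1`, `x2`, `x3`, the seed, and the helper `pearson` are all assumptions made for the example.

```python
import random

def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
x1 = [rng.gauss(0, 1) for _ in range(500)]
# x2 is built mostly from x1, so these two "independent" variables are collinear.
x2 = [0.9 * a + 0.1 * rng.gauss(0, 1) for a in x1]
# x3 is generated separately from x1.
x3 = [rng.gauss(0, 1) for _ in range(500)]

r12 = pearson(x1, x2)
r13 = pearson(x1, x3)
# With exactly two predictors, the variance inflation factor is 1 / (1 - r^2).
vif_12 = 1.0 / (1.0 - r12 ** 2)
```

Both `x1` and `x2` would be listed as "independent variables" in a regression, yet `r12` is close to 1 and the inflation factor is large. "Input variables" describes them without promising anything about independence.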
When it comes to rows and columns in a database, those are tuples and features.
In my Intelligent Systems class, when we make a regression model, we have weights for our features. In my Stats classes, we have coefficients or betas for the terms in our regression model. There is also the idea of "standardizing" the features before using them in a regression model. Perhaps that is the difference between weights and coefficients.
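One way to look at the standardizing question is to fit the same one-feature regression twice, once on the raw feature and once on the standardized feature. The sketch below assumes simulated data and hypothetical helper names (`mean`, `sd`, `slope`); it is one possible reading of the weights-versus-coefficients distinction, not a definition either field has agreed on.

```python
import random

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(v)
    return (sum((a - m) ** 2 for a in v) / (len(v) - 1)) ** 0.5

def slope(x, y):
    """Least-squares slope for a one-feature regression of y on x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
x = [rng.uniform(0, 100) for _ in range(200)]
y = [3.0 + 0.5 * a + rng.gauss(0, 2) for a in x]  # made-up true slope of 0.5

beta_raw = slope(x, y)                 # the "coefficient" on the raw feature
x_mean, x_sd = mean(x), sd(x)
z = [(a - x_mean) / x_sd for a in x]   # standardized feature
beta_std = slope(z, y)                 # the "weight" on the standardized feature
```

Here `beta_std` equals `beta_raw * x_sd` up to rounding. The fitted line is the same; only the units of the slope change, from "per unit of x" to "per standard deviation of x".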
I don't think we can criticize others for "misusing" terminology when we don't have solid definitions that are universally understood and applied.
This discussion also reminds me of "constants" in science. Depending upon what area of science you are in, alpha and beta have different definitions. If you try to make every constant unique, you go through the European alphabets and we all start learning Chinese characters... then retrain every scientist to recognize those new definitions of constants... which leads to the question: who decides which constants stay as they were and which constants change?
Or, to the point of this discussion, suppose the ASA decides that in a regression model, you have coefficients and terms in your model and not weights for features. Why does anyone else have to listen to the ASA? Will the ASA send out goon squads to intimidate everyone who does not use their definitions? Does the ASA have a police force with international jurisdiction to arrest and/or fine those who do not conform?
Or, we can recognize that there are different terms for the same thing in different areas of science, the same way there are different terms for the same thing in English. (If you don't think so, go write me a letter on your Davenport, then put on your toque and grab a 2 4 from the LCB and then relax on your Chesterfield or go throw some stones.)
Right now, the Canadians are smiling.
------------------------------
Andrew Ekstrom
Statistician, Chemist, HPC Abuser;-)
------------------------------
Original Message:
Sent: 03-06-2019 10:03
From: Gerald Belton
Subject: Terminology - Statistics vs Data Science vs Database
Sometime in the past few months, I remember seeing a web page that showed how different areas of data analytics used different terms for the same objects. For example, a database person says "rows" for what a statistician calls "observations," and the terms "columns" and "variables" and "features" all refer to the same thing.
Alas, I've tried a number of ways to search for this in Google and have been unable to find it again. I'm hoping someone here has seen a similar resource and can tell me where to find it.
Thanks!
Gerald Belton
North Carolina
------------------------------
Gerald Belton
------------------------------