ASA Connect


Reproducibility in statistical analyses

  • 1.  Reproducibility in statistical analyses

    Posted 02-10-2023 10:39

    Hello Everyone,

    A few months ago I gave some assignments to my Data Science students. I gave them a data set and told them to partition it into training and testing sets, then run various analyses on it. Over the course of the class they built Logistic Regression, CART and Random Forest models, among others. The data had 20-30 variables to pick from and about 10,000 rows. These were in-class assignments and activities, and I had students write the results of their analyses on the board so we could all see them. 

    The end result was that most students did not agree on which variables in the data were "important". I had each student use a different random seed. Unlike a lot of the standard data sets out there, I made these data up myself. So I know, for a fact, which variables were used to create the response variables, and I know how much randomness I added to each data set. 

    No one ever got all the right variables from their analyses. However, by pooling all of their results, we were able to find 60% to 80% of the variables I chose for the models. We were also able to eliminate 90%+ of the spurious correlations. We were even able to "roughly" categorize the variables into "Most important", "Very important", "Most likely random" and "Not important" based on how often certain variables came up in models, and those groupings correlated well with the size of the coefficients I used to create the models. 

    For example, if X1, X3 and X4 come up 90% to 100% of the time, X9, X13 and X17 come up 70% to 90% of the time, and X7, X11, X12 and X28 never come up, we can classify those groups as "Most important", "Very important" and "Not important" respectively.   

    I've been looking through the research a bit on reproducibility in statistical analyses. There is a lot of talk about making sure software is able to accurately reproduce results. But I am not aware of any articles that discuss or "prove" that the nature of the data itself might not lend itself to good reproducibility... at least not under certain conditions. 

    Is anyone aware of studies showing that data analysis techniques are not that reproducible? 

     



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------


  • 2.  RE: Reproducibility in statistical analyses

    Posted 02-13-2023 07:04

    Dear Andrew,

    Your result is not surprising to me. The first time I came across such issues was more than 15 years ago. One paper from that time is from the Erlangen group: https://pubmed.ncbi.nlm.nih.gov/17019510/

    Frank Harrell has long been pointing out the problems with variable selection. The best reference for this is probably Section 4.3 of his book Regression Modeling Strategies (2nd ed.; DOI 10.1007/978-3-319-19425-7).

    In our own work, we use the following approach when we use feature selection: we run a cross-validation -- typically something like a 10-fold CV -- and repeat the model building within each fold. The more often a feature is selected, the more stable the final model is. To repeat Frank Harrell's observations and concerns in my own words: you are generally unable to identify the true underlying model, but you are generally able to come up with a reasonably good prediction model.
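
    A minimal sketch of that selection-frequency idea in Python/scikit-learn (simulated data; the L1-penalized logistic regression is only an illustrative stand-in for whatever selector one actually uses):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    # simulated data: 25 candidate features, 5 of which are truly informative
    X, y = make_classification(n_samples=10_000, n_features=25,
                               n_informative=5, random_state=0)

    counts = np.zeros(X.shape[1])
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    for train_idx, _ in cv.split(X, y):
        # repeat the model building (here: sparse logistic regression) within each fold
        model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
        model.fit(X[train_idx], y[train_idx])
        counts += (model.coef_.ravel() != 0)

    # features selected in most folds are the stable ones
    for j, c in enumerate(counts):
        print(f"X{j}: selected in {int(c)} of 10 folds")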

    I would like to hear the opinions of others.

    Andreas



    ------------------------------
    Andreas Ziegler
    Prof. Dr.
    ------------------------------



  • 3.  RE: Reproducibility in statistical analyses

    Posted 02-13-2023 15:15

    The first time I did this type of thing, I had 600 rows of data: about 40 "positive" results and 560 "negative" results, all from a survey a company did. No single result told us much about the "true" drivers of the bad (positive) results. Using 10 CART models and 10 MLogReg models, we narrowed it down to two of the company's offices, which accounted for 80% of the bad reviews. 

    The prof we were working with at the time (this was for a capstone course) thought we were stupid for doing it this way. He told us the results were meaningless because we should get nearly the same results every time. We presented them anyway to the company at the Capstone Presentations seminar. The company loved what we did, and the other companies at the presentations came up to us and asked how to improve their own analyses. Meanwhile, our prof was FUMING... because "You don't do it like that!" But after hearing all the buzz we created, and how others were having the same issues, he had to change his mind. 



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 4.  RE: Reproducibility in statistical analyses

    Posted 02-13-2023 07:51

    Hi, Andrew:

    I wonder if you can identify the reasons for the failure to obtain the correct model.  One possibility is high multicollinearity, so that slight differences in correlations change which variables enter.

    10,000 observations, even split, seems like enough to provide accurate estimates.  Were any of the analyses simulations that may have had too few runs?

    What else?

    Could you identify a "better" method that would more often be successful if you did it yourself as an expert?

    Ed



    ------------------------------
    Edward Gracely
    Drexel University
    ------------------------------



  • 5.  RE: Reproducibility in statistical analyses

    Posted 02-13-2023 15:27

    The data I used were randomly generated, so each variable should have a minimal VIF. 

    The results had varying degrees of variability. The lower the signal-to-noise ratio, the fewer variables we would find. 

    What I wanted to do, at least for research publication purposes, was write some code that would take the original data set, create the model, start over with the same data set using a new random seed, and run through the process over and over, say 20-50 times, then produce a histogram of how often each variable was chosen. 

    The goal was to then use the "chosen" variables and run one final model using only those that made it past our criteria -- something like the sketch below.  
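
    (An illustrative Python/scikit-learn version of that loop, on simulated data; the random-forest importance cutoff is just a stand-in for whichever selection rule is actually used.)

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # simulated stand-in for the original data: 25 candidate variables, 5 real ones
    X, y = make_classification(n_samples=10_000, n_features=25,
                               n_informative=5, random_state=0)

    n_repeats = 30
    tally = np.zeros(X.shape[1])
    for seed in range(n_repeats):
        # re-partition the SAME data with a new random seed each time
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                  random_state=seed)
        rf = RandomForestClassifier(n_estimators=100, random_state=seed)
        rf.fit(X_tr, y_tr)
        # flag variables whose importance exceeds the uniform 1/p baseline
        tally += (rf.feature_importances_ > 1.0 / X.shape[1])

    # the "histogram" of selection counts; a final model would use the top variables
    for j in np.argsort(-tally):
        print(f"X{j}: flagged in {int(tally[j])} of {n_repeats} partitions")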

    There were 2 different ideas behind this. 

    1) Don't bother with your "hyperparameter tuning" until AFTER you have a good idea about what is REALLY important. 

    2) Don't trust the first model you generate. 

     

    Each time we got a model, its "accuracy" was nearly the same, within +/- 1% to 2%. As far as Kaggle competitions go, that could be the difference between winning and last place. 

    All of that was going to be Part 1 of my PhD Thesis. Part 2 was going to deal with missing data. Part 3, optimizing the final model. But, I couldn't find anyone to work with. So, now I just show my students what to do and not to do. 



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 6.  RE: Reproducibility in statistical analyses

    Posted 02-13-2023 09:05

    Similar to what you describe, Silberzahn et al. (2018) demonstrated substantial variability across different researchers' models for the same dataset. They asked several teams to independently analyze the same dataset (real data, not simulated, IIRC) and the same overall research question: "whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players." Each team came up with a different final model, and their conclusions ranged from strongly positive, to no-evidence-of-an-effect, to moderately negative.

    "Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results"
    Silberzahn et al. (2018)
    https://doi.org/10.1177/2515245917747646

    There have been other similar studies. The most recent one I've seen, Breznau et al. (2022), cites other examples in the paragraph just before their Methods section. I personally agree strongly with their own conclusion that scientists "should exercise humility and strive to better account for the uncertainty in their work."

    "Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty"
    Breznau et al. (2022)
    https://doi.org/10.1073/pnas.2203150119



    ------------------------------
    Jerzy Wieczorek
    Assistant Professor
    Colby College
    ------------------------------



  • 7.  RE: Reproducibility in statistical analyses

    Posted 02-13-2023 09:19

    My other thought is that you say "I told them to partition the data into a testing and training set." Even with large sample sizes, there is a lot of variability across different random train/test splits of the same dataset. Further, if your goal is to choose the correct model (not just a model that makes good predictions) and you're working with parametric models, asymptotically your split ratio n_train/n_test must go to 0 to ensure model-selection consistency (Shao 1997 https://www.jstor.org/stable/24306073 ; Wieczorek and Lei 2021 https://doi.org/10.1002/cjs.11635 ), though not necessarily for selection of nonparametric models (Yang 2007 https://doi.org/10.1214/009053607000000514 ). In other words, if you ask teams to form random splits where n_train > n_test as usual, you should expect each team to choose a *different* model based on test-set performance -- even if all the teams considered the same set of models (which it seems they did not).
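
    As a quick toy illustration of that instability (my own sketch, not from the papers above): refit a few nested candidate models on repeated random splits of one simulated data set and watch the test-set winner change.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2_000, n_features=10,
                               n_informative=3, random_state=0)
    # three nested candidate models (by column index)
    candidates = [[0, 1, 2], [0, 1, 2, 3], [0, 1, 2, 3, 4]]

    winners = []
    for seed in range(20):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                                  random_state=seed)
        scores = [LogisticRegression(max_iter=1000)
                  .fit(X_tr[:, cols], y_tr)
                  .score(X_te[:, cols], y_te) for cols in candidates]
        winners.append(candidates[int(np.argmax(scores))])

    for cols in candidates:
        print(cols, "won on", winners.count(cols), "of 20 splits")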



    ------------------------------
    Jerzy Wieczorek
    Assistant Professor
    Colby College
    ------------------------------



  • 8.  RE: Reproducibility in statistical analyses

    Posted 02-14-2023 07:29

    I think the problem is that, in fact, there is no correct model. Assuming the variables were intercorrelated, there will be lots of models with trivially different "explanatory" power. One of your "important" variables could be irrelevant in the presence of some other set of variables. For data sets of this type, you can seek a parsimonious set of predictors for some outcome variable, but thinking there is a correct set assumes you can tease causality out of correlational data. I remember, decades ago, reading a statistics book in psychology that asserted that analysis of covariance can be used to detect true mean differences in non-experimental data if you include all important competing explanatory variables as covariates. One of the few times I've felt like burning a book (and I was still a grad student at the time). 



    ------------------------------
    Chauncey Dayton
    ------------------------------



  • 9.  RE: Reproducibility in statistical analyses

    Posted 02-14-2023 15:16

    With the data I used, there was "a model". But, I wanted to see if anyone could get all the "important variables". 

    There was no correlation between variables. Each was randomly generated. 



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 10.  RE: Reproducibility in statistical analyses

    Posted 02-15-2023 13:41

    This situation is an excellent opportunity to introduce the students to George Box: "Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful."



    ------------------------------
    George Rodriguez
    Computational Chief Scientist
    ------------------------------



  • 11.  RE: Reproducibility in statistical analyses

    Posted 02-15-2023 19:23

    That was part of my goal. But the department head didn't like the fact that I tried to prepare students for real data and the working world... 

    When someone asks me if I would ever get a tattoo, I tell them it would be "All models are wrong; some models are useful." 



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 12.  RE: Reproducibility in statistical analyses

    Posted 02-13-2023 12:08

    The closest article I can think of is "Many Analysts, One Data Set" from https://journals.sagepub.com/doi/10.1177/2515245917747646 but that's a little different - in that article, each team is given the whole data set - but it sounds like you gave each of your students a random sample, which would introduce some extra variability.

    I'd also mention Gelman's "garden of forking paths" paper, which is a bit more philosophical.

    By the way, you mentioned your aggregated results screened out 90% of spurious variables - if you were using a 5% sig level I would have hoped it was closer to 95% (but maybe it was rounded coarsely or my mental math is off).

    Best,

    Neal



    ------------------------------
    Neal Fultz
    ------------------------------



  • 13.  RE: Reproducibility in statistical analyses

    Posted 02-13-2023 15:05

    I gave my students the same data sets. Then in say week 3, we discussed Multiple Linear Regression. I had all my students run MLR and report their results. Week 4 was Multiple Logistic Regression. Everyone got the same data set and ran MLogR. Week 5 CART models, Week 6 Random Forests, etc.  

    For each analysis, they built their model using the same code, except for the random seed. 

    To remove spurious correlations, we looked at everyone's results. For example, if I use X1, X3, X7, X18 and X19, we KNOW which variables should be in the model. Looking at the results from each of the 16 students, we might find:

    Student A) X1, X3, X5, X18

    Student B) X1, X5, X18, X19

    Student C) X1, X3, X7, X17

    etc. 

    At the end suppose:

    14 students reported X1

    2 Students reported X2

    13 Students reported X3

    3 Students reported X4

    7 Students reported X5

    Etc. 

    If we had say 12 or more students report X1, X3 and X18, we would call those variables "Highly Important".

    If we had say 8 students report X7, 7 students report X5 and 6 students report X19, we would call those variables "Somewhat important" or "Worthy of further study".

    Since every other variable was reported only 1-2 times, we can call those variables "not important". In this case, we removed 14 of the 15 spurious correlations from further consideration. 

    In another data set with the same important variables, we might find that X3 is the only variable identified as important. In that case, we removed 15 of 15 spurious correlations AND 4 of the 5 important variables. 

    What made these data sets different was the number of "true" positives (in the case of dichotomous outcomes) and the number of false positives. The "true" positives were 1%, 5% or 10% of the data, and the false positives were 1%, 5% or 10% of the data, so there were 9 data sets in total (every combination of true and false positive rates) for dichotomous outcomes. (And one that was just totally random..... Cuz I am EVIL like that ;-) 
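
    In code, that tallying rule looks something like this (the counts and the 12/6 cutoffs are just the illustrative numbers from the example above, not fixed rules):

    # how many of the 16 students reported each variable (example numbers from above)
    reports = {"X1": 14, "X2": 2, "X3": 13, "X4": 3, "X5": 7,
               "X7": 8, "X18": 12, "X19": 6}

    def classify(count):
        if count >= 12:
            return "Highly Important"
        if count >= 6:
            return "Somewhat important / worthy of further study"
        return "Not important (likely spurious)"

    for var, count in sorted(reports.items(), key=lambda kv: -kv[1]):
        print(f"{var}: reported by {count} of 16 students -> {classify(count)}")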

      



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 14.  RE: Reproducibility in statistical analyses

    Posted 02-13-2023 19:24

    To me, Ekstrom's post illustrates the need to add a new paradigm to our statistical repertoire.  For many of the data sets I have worked with, there is no true model.  For those data sets, finding the true model is not a sensible goal.  A more sensible goal is to find many disparate models that all do a good job of describing the data.  When we find many good models then we can answer questions like "Do all of these models have a substantially positive beta_1?" or "Is it possible to construct good models that don't use X_1?"

    Presently most of our algorithms are geared toward producing one model.  It would be useful to create new algorithms that are geared toward finding many good models, and it would be even better to find good models that differ substantially from each other.  In my opinion, such model searches would be useful even in applied problems that have a true model.
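
    As a brute-force toy version of such a search (a sketch only; the papers I mention below do this far more cleverly), one can enumerate small variable subsets and keep every model whose cross-validated accuracy is within a tolerance of the best:

    from itertools import combinations
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1_000, n_features=8,
                               n_informative=3, random_state=0)

    # score every 2- and 3-variable logistic regression
    results = []
    for k in (2, 3):
        for subset in combinations(range(X.shape[1]), k):
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, list(subset)], y, cv=5).mean()
            results.append((score, subset))

    # keep all models within 1% of the best -- the set of "many good models"
    best = max(score for score, _ in results)
    for score, subset in sorted(results, reverse=True):
        if score >= best - 0.01:
            print(f"variables {subset}: CV accuracy {score:.3f}")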

    I know of only a very few papers that make a start on constructing such useful algorithms.  They are by Rohan Joseph, Cynthia Rudin, myself, and a few others.



    ------------------------------
    Michael Lavine
    ------------------------------



  • 15.  RE: Reproducibility in statistical analyses

    Posted 02-14-2023 15:32

    Among the many things I tell my students is that there is some type of model out there. That doesn't mean you have the right information to find it. So we do our best to get the best model we can. 

    Something I try to avoid when creating a model is chasing ghosts. If I had a dozen models, all with about equal abilities, I would look for which terms keep coming up. If all I had was a single model, I'd use some of the methods I teach my students for determining whether a variable is likely to be reproducible in future studies. On a TI-83/84: NormCDF(MOE, inf, abs(mean), Std Err). 
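
    The same calculation in Python, for anyone without a TI handy (the coefficient, standard error, and the 1.96-based margin of error below are placeholder assumptions, purely for illustration):

    from scipy.stats import norm

    coef = 0.42            # estimated coefficient (placeholder value)
    std_err = 0.15         # its standard error (placeholder value)
    moe = 1.96 * std_err   # margin of error at roughly the 95% level

    # P(a replicated estimate lands beyond the margin of error), assuming the
    # true effect equals |coef| with the same standard error -- the NormCDF call above
    p_repro = norm.sf(moe, loc=abs(coef), scale=std_err)
    print(f"approximate reproducibility probability: {p_repro:.2f}")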

    I'm not a big proponent of "making sure the variables/results make sense", because every breakthrough paper I've read, and every breakthrough data set I've worked on, had variables that didn't make sense, and that is how I/they overcame an issue with the system. 

    The proposal I had for dealing with this is to take the total data set, break it down into testing and training (and validation if you can) data partitions using a random seed. Run your model. See the results. Go back to the original data set. Partition it with a second random seed. Run your algorithm, see the results. Repeat these steps a dozen or more times. Then use Bayesian Analysis to determine if some variables are more likely to contribute to the outcome.  

    My thought here is to eliminate spurious correlations as much as possible. I'd rather have the "2-3" most important variables than all the "important" variables plus some that are figments of the algorithm's imagination... i.e., ghosts (in the machine) if you will. 

    I'll take a look at your paper. I might even use it as a journal assignment this term with my Biostats students. Thanks.



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 16.  RE: Reproducibility in statistical analyses

    Posted 02-14-2023 10:57

    Not published anywhere, but I once read a paper that used the Lasso for confounder selection. The results were nonsensical (e.g., lagged variables from N-2, N-3, N-8 were "important") and obviously methodologically unsound if you are trying to find an unbiased estimator for whatever effect they were interested in. I was going to write a response with a simulation study, but alas never got to it. Here are some simulation results (lasso selection on github). I looked at the Lasso for a fixed N and varied the number of predictors, the true value of one parameter, the overall measurement error, and the correlation. The results are as expected: with a low signal-to-noise ratio, you are less likely to recover the true parameters from the data-generating process. The code is four years old, so take it with a grain of salt...
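
    Something along these lines shows the pattern (a fresh sketch, not the original code on github):

    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(0)
    n, p = 500, 20
    true_idx = {0, 1, 2}               # the variables truly in the model
    beta = np.zeros(p)
    beta[list(true_idx)] = 1.0

    for noise_sd in (0.5, 2.0, 5.0):
        kept_true, kept_spurious = 0, 0
        for rep in range(10):
            X = rng.standard_normal((n, p))
            y = X @ beta + rng.normal(0, noise_sd, n)
            selected = set(np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_ != 0))
            kept_true += len(selected & true_idx)
            kept_spurious += len(selected - true_idx)
        print(f"noise sd {noise_sd}: avg true variables kept {kept_true / 10:.1f} of 3, "
              f"avg spurious variables kept {kept_spurious / 10:.1f}")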



    ------------------------------
    Michael DeWitt
    ------------------------------



  • 17.  RE: Reproducibility in statistical analyses

    Posted 02-14-2023 15:33

    Cool. I'll take a look at it. 



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 18.  RE: Reproducibility in statistical analyses

    Posted 02-14-2023 15:35

    Thank you all for your comments. Please keep them coming. I am thankful for all of you who have and will participate. 



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 19.  RE: Reproducibility in statistical analyses

    Posted 02-15-2023 07:29

    Hi Andrew,

    I do a series of similar exercises with my Design of Experiments students but my exercises are much more basic than yours. The data for these exercises are generated with macros so the underlying model is the same but there is superimposed random noise. When we use standard experiment designs everyone gets similar answers. Later in class, to emphasize the value of using a designed experiment, I ask them to analyze data sets with more complicated structure and of course their results diverge. We talk about some of the special methods available to analyze such data but those are beyond the scope of the class. If I am successful, the students learn to prefer designed experiments and distrust models built from happenstance data. I am lucky that most of my students come from the engineering and science community where they have some control over how their data are collected and they can choose to use designed experiments.

    Long ago Angela Dean from OSU or Susan Collins from Lubrizol (I think they were working together) told us about a data set from a supersaturated experiment design that was published with a request for people to analyze the data and submit their results. A prize was offered for the best analysis. The purpose of the exercise was to assess analysis reproducibility. My recollection was that a small subset of the analyses submitted were on target but most missed by a wide margin - similar results to yours. 

    Kaggle's competitions provide a similar opportunity to assess analysis reproducibility. I haven't invested any time there so I don't know what kinds of problems have analyses that are shared. Maybe someone has already written something up? It would be fun/malicious to assign Kaggle projects with shared analyses for students to do their own analysis reproducibility evaluation.

    I'm going to share this thread with my students in the future when we reach this topic.




    ------------------------------
    Paul Mathews
    President
    Mathews Malnar & Bailey, Inc.
    ------------------------------



  • 20.  RE: Reproducibility in statistical analyses

    Posted 02-15-2023 19:20

    I've done something like that too, but for personal interest. 

    It's kinda fun to create a data set that seems "genuine" and has all of the same issues you would, or could, find in real data. 

    For one fake data set I made, I had, say, Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X1*X2 + B5*X2^2, etc. Then add an error term with a std dev of, say, 1, 5, 10 or 25 and see how many of the coefficients you can find.  

    For another fake data set, I used nearly the same model but made a couple of "small" changes: Y = B0 + B1*(1.05*X1) + B2*X2, etc. That way it looks like the error associated with X1 increases as X1 increases. Or, if I had say a 16-run factorial design, the first 4 runs had a bias of +1.5, runs 5 to 8 had a bias of -4.5, runs 9 to 12 had a bias of +2.5, and runs 13 to 16 had a bias of +1.0. I analyzed that data assuming nothing was wrong, then went back and found what the model would be if we included everything. 

    As a chemist with about 10 years' experience working in an analytical lab, I know that some instruments (literally 100% of the instruments I used) had an increasing baseline, such that an internal standard (IS), a chemical of known concentration dissolved into the solution, would come up on my instruments following something like: reported conc of the IS = 70% + 2.5% * sample position + error. So the IS would report 72.5% on the first sample, 95% on the 10th sample, 120% on the 20th sample, etc. 
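
    In code, generating that kind of drifting internal standard is just arithmetic plus noise (the noise level here is a made-up placeholder):

    import numpy as np

    rng = np.random.default_rng(1)
    positions = np.arange(1, 21)   # sample positions 1..20 in the run
    # reported IS concentration = 70% + 2.5% * sample position + error
    reported = 70 + 2.5 * positions + rng.normal(0, 1.5, positions.size)

    for pos, conc in zip(positions, reported):
        print(f"sample {pos:2d}: reported IS concentration = {conc:5.1f}%")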

    Sadly, when I interview for statistics positions that use laboratory data like this, I keep getting into arguments with the hiring manager. I know this happens. They don't believe it happens.   



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 21.  RE: Reproducibility in statistical analyses

    Posted 03-05-2023 16:43

    I am hosting a symposium for the American Chemical Society this summer. I've been looking for a topic to discuss during my "Data Science in Chemical Research" talk. I hadn't been able to come up with anything new this year.... until about 20 minutes ago. 

    Do you mind if I borrow your idea about comparing the analysis of "designed experiment" data vs "observational" data? 

    I'm thinking of a title like "Is scientific reproducibility a fallacy, a misunderstanding, or a sign of poorly designed experiments?" Who knows, I might actually get more than 5 people watching.... 

    I teach my students about the reproducibility of results vs conclusions. Adding in my data analysis, as well as your idea of generating data with a macro, I should be able to show the power of designed experiments.... which brings up another challenge: how do you "prove" to scientists who have been doing research for decades that their OFAT (one-factor-at-a-time) methods are bad AND that you can and should change more than one thing at a time?



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 22.  RE: Reproducibility in statistical analyses

    Posted 03-06-2023 15:02
      |   view attached

    Andrew, 

    In response to your question:  How do you "prove" to scientists, that have been doing research for decades, that their OFAT methods are bad AND that you can and should change more than one thing at a time?

    My experience has been that when they see/experience an important interaction of multiple factors that gives insight into (or even solves) a problem they are interested in, they become open to learning about multi-factor designs.  I have found this situation much worse with health scientists (clinicians and researchers) than in other industries.  I put together a research team during Covid to try to understand why, and found some interesting history that made this happen. 

    We wrote about this in the attached paper: "It Is Time to Reconsider Factorial Designs: How Bradford Hill and R. A. Fisher Shaped the Standard of Clinical Evidence," Quality Management in Healthcare, Vol. 29, No. 2, April/June 2020.



    ------------------------------
    Lloyd Provost
    Associates in Process Improvement
    ------------------------------



  • 23.  RE: Reproducibility in statistical analyses

    Posted 03-07-2023 02:52

    At a couple of different talks, I tried a few different methods to "prove" you can change more than one thing at a time. First, I took a poll: I asked whether you can change more than one thing at a time during an experiment. 100% said no. I asked why they felt that way; they said you can't tell what caused the outcome. I bet them I could make them change their answer. 

    Admittedly, I don't know the best way to do this. But I know these don't work. 

    1) I discussed an experiment to make NaCl from NaOH and HCl (table salt from sodium hydroxide and hydrochloric acid). The concentration of NaCl made is [NaCl] = min([NaOH], [HCl]). Only by using both will you get any table salt; to increase the yield, you must increase both at the same time. 

    Scientist A performed the experiment with the factors and levels:

    Run   [HCl]   [NaOH]   [NaCl]
    1     0.0     0.0      0.0
    2     0.0     0.5      0.0
    3     0.0     1.0      0.0
    4     0.5     0.0      0.0
    5     1.0     0.0      0.0

    This scientist concludes you can't make table salt from HCl and NaOH. 

    Scientist B performed the experiment with the factors and levels:

    Run   [HCl]   [NaOH]   [NaCl]
    1     0.5     0.0      0.0
    2     0.5     0.5      0.5
    3     0.5     1.0      0.5
    4     0.0     0.5      0.0
    5     1.0     0.5      0.5

    This scientist shows you CAN make NaCl.... But you can't make more than 0.5 mole of it. 

    Scientist C performed the experiment with the factors and levels:

    Run   [HCl]   [NaOH]   [NaCl]
    1     0.0     0.0      0.0
    2     0.0     1.0      0.0
    3     1.0     0.0      0.0
    4     1.0     1.0      1.0
    5     0.5     0.5      0.5

    This scientist proved you can make as much NaCl as you want. You just need to increase both proportionally. 
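
    For what it's worth, the three designs are easy to check in a few lines of code under that toy model (this is just the min() rule above applied to the runs in the tables):

    # [NaCl] = min([NaOH], [HCl]); runs are (HCl, NaOH) pairs taken from the tables
    def nacl(hcl, naoh):
        return min(hcl, naoh)

    designs = {
        "Scientist A (OFAT)": [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 0.0), (1.0, 0.0)],
        "Scientist B (OFAT)": [(0.5, 0.0), (0.5, 0.5), (0.5, 1.0), (0.0, 0.5), (1.0, 0.5)],
        "Scientist C":        [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)],
    }

    for name, runs in designs.items():
        best = max(nacl(h, n) for h, n in runs)
        print(f"{name}: maximum [NaCl] observed = {best:.1f}")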

    Someone stood up and said, "This is a stupid example. Everyone knows you need to change both at the same time!"  (I won the bet! And lost the audience.) 

    I replied, "Scientists A and B used OFAT methods and got the wrong answer. When I asked whether you thought you should change more than one thing at a time, you all said no. And you all know that is wrong." 

    For another talk, I discussed a system of linear equations. Each coefficient was different. Yet, you can still figure out the values of X, Y and Z. This didn't work either. So, I wasn't even going to try to explain how to solve overdetermined systems of linear equations.  



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 24.  RE: Reproducibility in statistical analyses

    Posted 03-06-2023 16:23
      |   view attached

    Hi

    Regarding OFAT, which is also referred to as OVAT or one-variable-at-a-time sensitivity analysis of forecasting models: there is a big problem when multiplicative variables are involved (via log transforms). Back in 1994, I was using Latin Hypercube Sampling (LHS) to perform sensitivity/uncertainty analysis for computer models. During that time I spent some time trying to familiarize myself with classical experimental design. My finding was that the LHS design did not experience the confounding problems typical of classical experimental designs. The attached copy of the draft report I prepared back then summarizes these findings. Even though the date stamp on the pdf file is 2008, the draft was prepared back in 1994. I hope you find the summary findings informative in your work.
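
    For readers who want to try this today, scipy now ships an LHS sampler (a minimal sketch; my 1994 work predates this library, and the variable ranges below are arbitrary):

    from scipy.stats import qmc

    sampler = qmc.LatinHypercube(d=3, seed=0)   # three model inputs
    unit_sample = sampler.random(n=10)          # 10 runs on the unit cube

    # rescale each column to its own range; each input is stratified evenly across its range
    lower, upper = [0.0, 10.0, 1.0], [1.0, 50.0, 5.0]
    runs = qmc.scale(unit_sample, lower, upper)
    print(runs)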



    ------------------------------
    Ramesh Dandekar


    ------------------------------

    Attachment(s)

    experimental design.pdf (99 KB)