ASA Connect

  • 1.  Improving student outcomes in math/Is reproducibility even a possibility presentations

    Posted 03-11-2024 08:27

    In January, I gave a couple of talks that most of you should find interesting.

    The first one is a comparison study of math students at Oakland University, Eastern Michigan University, Oakland Community College, and Henry Ford College. The linked video is from the version of the talk I gave at Oakland Community College. I got student-level data and was able to track student progress from their first math class to their last. For anyone interested in getting MORE STEM students into their programs, the info in it is really useful. I can even share all the data I used. The title of the talk is VERY appropriate: Mathemassacre.

    YouTube link:  https://youtu.be/USFdGxNMjAU

    The second talk is from the Detroit ASA section meeting. The title of this talk: Is Reproducibility Even a Possibility? It ended up being a comparison study of R vs Python, logistic regression vs decision trees, and which important or "statistically significant" terms pop up in each type of model under several different levels of signal to noise. The big takeaway here is something I do with my students when I can: give all of your students the same data set. Have them partition the data using different random seeds, say the last four digits of their student IDs. Then have them run whatever algorithm you are discussing and report the "important features" or "statistically significant" terms on the board. Then discuss why no one found the same results. (A sketch of this exercise appears below.)

    Having done this before with "real" data, I found that the software might flag, say, 5-8 variables as important for a given random seed. But if I start at the beginning and redo the same analysis with a different random seed, I'll get another 5-8 variables, of which only 1-2 will be the same between the two analyses. Repeat the analysis with several different random seeds and you begin to see each model as a mere opinion, one for which a second, third, fourth, fifteenth opinion is needed. It also hints at why Random Forests are built as an ensemble in which the random seed effectively changes each time a new tree is grown, rather than using a new random starting point with the same partition.
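
    Here is a minimal sketch of the exercise in Python, using synthetic data as a stand-in for whatever data set you hand out; the column names, signal strengths, number of "students", and the 0.05 cutoff are all assumptions for illustration:

        from itertools import combinations
        import numpy as np
        import pandas as pd
        import statsmodels.api as sm
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        n, p = 400, 12
        X = pd.DataFrame(rng.normal(size=(n, p)), columns=[f"x{j}" for j in range(p)])
        eta = 0.4 * X["x0"] - 0.3 * X["x1"] + 0.2 * X["x2"]  # weak true signal
        y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

        def significant_terms(seed, alpha=0.05):
            """One student's analysis: partition on their seed, report p < alpha terms."""
            X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.3, random_state=seed)
            fit = sm.Logit(y_tr, sm.add_constant(X_tr)).fit(disp=0)
            return set(fit.pvalues[fit.pvalues < alpha].index) - {"const"}

        # Thirty "students", each with their own seed; then compare the reports.
        reports = {seed: significant_terms(seed) for seed in range(30)}
        sizes = [len(s) for s in reports.values()]
        overlaps = [len(a & b) for a, b in combinations(reports.values(), 2)]
        print(f"avg terms reported: {np.mean(sizes):.1f}; "
              f"avg overlap between two students: {np.mean(overlaps):.1f}")

    The printout is meant to make the disagreement between seeds concrete: the gap between the size of each report and the pairwise overlap is what drives the classroom discussion.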

    YouTube link: https://youtu.be/sYPvCE_au4Q



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------


  • 2.  RE: Improving student outcomes in math/Is reproducibility even a possibility presentations

    Posted 03-12-2024 09:00

    Andrew Ekstrom writes, "... the software might flag, say, 5-8 variables as important for a given random seed. But if I start at the beginning and redo the same analysis with a different random seed, I'll get another 5-8 variables, of which only 1-2 will be the same between the two analyses. Repeat the analysis with several different random seeds ..."

    This idea is very close to an idea I called "near-optimization" that I promoted and tried to fund when I was a program manager for the Army Research Office. The general idea is that in many statistics problems there are many models that do a good job of describing the data, and that we may be better served by finding lots of good models rather than seeking the single best model. To see why, suppose that all good models agree, say, that X1 is an important regressor. Then we have more confidence in X1 than when some good models do not include X1. I used to give a talk on this topic called "Suboptimal is Best, lots of it."
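
    A rough sketch of what near-optimization might look like on a toy regression problem: fit every subset of regressors, keep every model whose AIC is within a small tolerance of the best one, and count how often each variable appears among the good models. The data, the use of AIC, and the tolerance of 2 are arbitrary choices for illustration, not anything from the original program:

        from itertools import combinations
        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(1)
        n, p = 200, 8
        X = rng.normal(size=(n, p))
        y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # only x0, x1 matter

        # Fit all 2^p - 1 non-empty subsets and record each model's AIC.
        models = []
        for k in range(1, p + 1):
            for subset in combinations(range(p), k):
                fit = sm.OLS(y, sm.add_constant(X[:, list(subset)])).fit()
                models.append((fit.aic, subset))

        best = min(aic for aic, _ in models)
        good = [s for aic, s in models if aic <= best + 2.0]  # all "good enough" models
        print(f"{len(good)} good models out of {len(models)}")
        for j in range(p):
            share = sum(j in s for s in good) / len(good)
            print(f"x{j}: appears in {share:.0%} of good models")

    A regressor that shows up in nearly all of the good models earns the extra confidence described above; one that appears in only a few of them looks more like an artifact of which "good" model you happened to land on.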

    To my dismay, few researchers would submit grant proposals on near-optimization, even though I announced I wanted to fund research on that topic. NSF, take note: you should be funding this topic.



    ------------------------------
    Michael Lavine
    ------------------------------



  • 3.  RE: Improving student outcomes in math/Is reproducibility even a possibility presentations

    Posted 03-12-2024 12:20

    The first ASA presentation I gave was to the Ann Arbor chapter, on optimization. As we learn in an operations research class, the chapter on sensitivity analysis tells us that we can change, at most, one coefficient at a time. But when we look at, say, a linear regression model Y = B0 + B1*X1 + B2*X2 + B3*X1^2 + B4*X1*X2, all those betas are estimates, not fixed values. Something I found is that if, say, X2 is dichotomous, then under some sets of betas that could be part of the model (each beta drawn from a normal distribution with mean Bn and its standard deviation), the optimal value of X2 WILL oscillate between the two classes.
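
    A minimal numeric sketch of that oscillation, with entirely made-up coefficient means and standard deviations: treat each beta as Normal(Bn, SDn), draw plausible sets of betas, and check which level of the dichotomous X2 maximizes Y at a fixed X1.

        import numpy as np

        rng = np.random.default_rng(42)
        # Hypothetical estimates and standard errors for
        # Y = B0 + B1*X1 + B2*X2 + B3*X1^2 + B4*X1*X2.
        B_mean = np.array([10.0, 2.0, -0.5, -1.0, 0.8])
        B_sd = np.array([0.5, 0.4, 0.6, 0.3, 0.5])

        def yhat(b, x1, x2):
            return b[0] + b[1]*x1 + b[2]*x2 + b[3]*x1**2 + b[4]*x1*x2

        x1 = 1.5  # a fixed operating point for X1
        draws = rng.normal(B_mean, B_sd, size=(10_000, 5))  # plausible beta vectors
        share = np.mean([yhat(b, x1, 1) > yhat(b, x1, 0) for b in draws])
        print(f"X2 = 1 is optimal in {share:.0%} of plausible models")

    Anything strictly between 0% and 100% means the "optimal" class of X2 flips depending on which plausible betas you draw.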

    In the presentation I gave, I discussed an experiment I did with my car to get the maximum MPG. The results of the model and the "optimal solution" suggested that turning the AC up full blast was best. (In reality, the AC set to 1 or opening all the windows really is the best.) I looked at the corner points of my design space and found that if we drop the assumption that Y = 10.00 is meaningfully better than Y = 9.99999999999999999999, stop pretending that confidence intervals AND prediction intervals do not exist, and allow the betas to roam, then some sets of corner points were optimal, say, 18% of the time, others 24% of the time, others 5%....

    I know that is part of what is going on here. In this case, it has to do with the small changes to the model's coefficients induced by having slightly different data points.
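
    To make the corner-point tally concrete, here is a sketch with the same made-up coefficients as above: draw betas, evaluate Y at every corner of a [-1, 1] x {0, 1} design space, and count how often each corner wins. (The design space and the percentages that come out are illustrative, not the MPG experiment's actual numbers.)

        from itertools import product
        import numpy as np

        rng = np.random.default_rng(7)
        # Same hypothetical model: Y = B0 + B1*X1 + B2*X2 + B3*X1^2 + B4*X1*X2.
        B_mean = np.array([10.0, 2.0, -0.5, -1.0, 0.8])
        B_sd = np.array([0.5, 0.4, 0.6, 0.3, 0.5])

        corners = list(product([-1.0, 1.0], [0.0, 1.0]))  # (X1, X2) corner points
        wins = dict.fromkeys(corners, 0)
        for b in rng.normal(B_mean, B_sd, size=(10_000, 5)):
            vals = {(x1, x2): b[0] + b[1]*x1 + b[2]*x2 + b[3]*x1**2 + b[4]*x1*x2
                    for x1, x2 in corners}
            wins[max(vals, key=vals.get)] += 1

        for (x1, x2), w in wins.items():
            print(f"corner X1={x1:+.0f}, X2={x2:.0f}: optimal {w/10_000:.0%} of the time")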



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------