ASA Connect

  • 1.  Use of machine learning or artificial intelligence in classic statistical analysis

    Posted 01-23-2018 19:50

    About a week ago Sayan Datta started a thread on the use of machine learning or artificial intelligence in data cleaning.  Perhaps this can be broadened to include the use of ML or AI incorporated into classic statistical analysis.  Let me start the ball rolling.  In 1997 I presented some work done at Amgen to the Japanese Society of Computational Statistics describing ANCOVA using neural net techniques to model the covariate rather than assuming a linear relationship.  Since we were concerned that an opportunistic neural net might lead to underestimation of the error variance, we used repeated random group assignment for hypothesis testing.  Our work was primitive and computationally taxing.
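
    The re-randomization idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the 1997 Amgen analysis: a leave-one-out k-nearest-neighbour smoother stands in for the neural net, the data are simulated, and the names `knn_smooth`, `group_gap`, and `rerandomization_test` are hypothetical.

```python
import math
import random
import statistics


def knn_smooth(x, y, k=7):
    """Leave-one-out k-nearest-neighbour fit of y on x (a stand-in for a neural net)."""
    fitted = []
    for i, xi in enumerate(x):
        # Rank all other points by distance to xi in the covariate.
        others = sorted((j for j in range(len(x)) if j != i),
                        key=lambda j: abs(x[j] - xi))
        fitted.append(statistics.fmean([y[j] for j in others[:k]]))
    return fitted


def group_gap(resid, group):
    """Mean covariate-adjusted residual in group 1 minus that in group 0."""
    g1 = [r for r, g in zip(resid, group) if g == 1]
    g0 = [r for r, g in zip(resid, group) if g == 0]
    return statistics.fmean(g1) - statistics.fmean(g0)


def rerandomization_test(x, y, group, n_perm=2000, k=7, seed=0):
    """Two-sided p-value from repeated random reassignment of group labels."""
    rng = random.Random(seed)
    fitted = knn_smooth(x, y, k)
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    observed = group_gap(resid, group)
    labels = list(group)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # random reassignment, group sizes preserved
        if abs(group_gap(resid, labels)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Simulated data (an assumption): nonlinear covariate effect plus a group shift.
rng = random.Random(1)
n = 60
x = [rng.uniform(0, 6) for _ in range(n)]
group = [i % 2 for i in range(n)]
y = [math.sin(xi) + 2.0 * g + rng.gauss(0, 0.3) for xi, g in zip(x, group)]

observed, p_value = rerandomization_test(x, y, group)
print(f"adjusted group gap = {observed:.2f}, p = {p_value:.4f}")
```

    Here the smoother ignores the group labels, so it is fitted once. If the flexible fit used the group labels, it would presumably have to be refitted under every reassignment, which is where the computational burden comes from.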

    I call to others to offer examples of the melding of machine learning and artificial intelligence in statistical analysis.

    Alan Forsythe

  • 2.  RE: Use of machine learning or artificial intelligence in classic statistical analysis

    Posted 01-24-2018 11:06
    What do you consider to be Machine Learning or Artificial Intelligence?

    Usually, these are methods that use automated techniques to create a model. So, instead of looking at each iteration of a model, you let the software and some pre-determined "stopping" point determine the model. In such a case, anyone who uses a "stepwise" method is using some sort of ML or AI.
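
    To make the "software plus a pre-determined stopping point" idea concrete, here is a minimal sketch of forward stepwise selection for least squares. The stopping rule (`min_gain`, the smallest share of the remaining residual sum of squares a new variable must explain), the column names, and the simulated data are all assumptions made for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def forward_stepwise(columns, y, min_gain=0.05):
    """Greedy forward selection; columns maps a name to a list of values.

    At each step, add the variable that most reduces the residual sum of
    squares (RSS), and stop once the best candidate explains less than
    min_gain of the remaining RSS.
    """
    resid = center(y)  # centering handles the intercept
    cand = {name: center(col) for name, col in columns.items()}
    selected = []
    while cand:
        rss = dot(resid, resid)
        if rss <= 1e-12:
            break
        # RSS reduction each candidate would give, given the variables already in.
        gains = {}
        for name, z in cand.items():
            zz = dot(z, z)
            gains[name] = dot(resid, z) ** 2 / zz if zz > 1e-12 else 0.0
        best = max(gains, key=gains.get)
        if gains[best] / rss < min_gain:
            break  # the pre-determined stopping point
        z = cand.pop(best)
        zz = dot(z, z)
        b = dot(resid, z) / zz
        resid = [r - b * zi for r, zi in zip(resid, z)]
        # Remove the chosen direction from the remaining candidates.
        cand = {name: [ci - (dot(c, z) / zz) * zi for ci, zi in zip(c, z)]
                for name, c in cand.items()}
        selected.append(best)
    return selected


# Simulated data (an assumption): only x1 and x2 carry signal.
rng = random.Random(42)
n = 300
data = {name: [rng.gauss(0, 1) for _ in range(n)]
        for name in ["x1", "x2", "x3", "x4", "x5"]}
y = [3 * a - 2 * b + rng.gauss(0, 0.5) for a, b in zip(data["x1"], data["x2"])]

selected = forward_stepwise(data, y)
print(selected)
```

    No one inspects the intermediate models here; the threshold alone decides when to stop.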

    I've used CART models and Random Forests to analyze survey data. I've used CART and Random Forests to create predictive models for warranty data. I've used simulations on dozens of systems. For some of the systems, I had to develop whole new methods for modeling and simulating them.

    AI is a very broad area. In its broadest form, most if not all stat methods are used in AI.

    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)

  • 3.  RE: Use of machine learning or artificial intelligence in classic statistical analysis

    Posted 01-24-2018 11:43
    In my nearly 40 years of analysis I have seen a lot of bad models built by automated/AI tools and very few good ones, especially in areas where reason codes are required, like insurance or credit. There are definitely places where automated tools can be helpful, like in eliminating variables of concern or identifying non-linear relationships or outliers; however, there is no replacement for manually reviewing results at every step of analysis. Sure, automated tools can save time, but the downstream costs can be high.

    The use of automation without review must be determined by the importance and costs of the decisions made by the tools or based on the results of any analyses. Also, senior management must understand when decisions they make are based on some unreviewed analysis so they can evaluate the costs of potential errors.

    As we all know, automated decisions have made many industries more efficient and fair by allowing huge numbers of decisions to be made in marketing, underwriting, and credit with minimal or no human intervention. The flip side of this is that huge numbers of bad decisions are made when the tools have errors. Automated production of these tools increases the probability of bad decisions from tools made without manual review.

    Michael Mout

  • 4.  RE: Use of machine learning or artificial intelligence in classic statistical analysis

    Posted 01-25-2018 11:19

    Automated models can succeed or fail, and getting one to work can be labor-intensive.  But after getting one to work, less labor is expected if it needs to be run repeatedly.

  • 5.  RE: Use of machine learning or artificial intelligence in classic statistical analysis

    Posted 01-25-2018 13:06
    As George Box said, "All models are wrong. Some models are useful."

    With the latest data sets I've worked on, traditional logistic regression models were utter failures. They led to false paths. The only reason we got to look at the data was because the statisticians couldn't get a useful model from the data. In 20-30 minutes per data set and using some Data Mining techniques, I had models that pinpointed sub-contractors that made faulty parts and sets of conditions where items will become less reliable. That saved the company a few million dollars.

    With a "survey data" set, we had 600 responses, 150 variables and magically, 10 Data Mining models converged on 10-15 "significant" factors. A subsequent analysis of those factors found a single office was responsible for about 40% of the bad reviews.

    With all of those data sets, the logistic regression models we made were highly accurate overall. They could predict the outcomes 99.99% of the time. But they couldn't predict the events of interest. The data mining models were able to predict both events and non-events well.
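
    A small worked example shows how a model can be 99.99% accurate overall and still miss every event of interest. The counts below are hypothetical (10 events in 100,000 cases), not figures from the data sets described above.

```python
def classification_summary(actual, predicted):
    """Overall accuracy plus the rates for events (1) and non-events (0)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(actual),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),       # events caught
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),  # non-events cleared
    }


# Hypothetical data: 10 rare events followed by 99,990 non-events.
actual = [1] * 10 + [0] * 99_990

# Model A never flags an event.
model_a = [0] * 100_000

# Model B catches 8 of the 10 events at the cost of 200 false alarms.
model_b = [1] * 8 + [0] * 2 + [1] * 200 + [0] * 99_790

summary_a = classification_summary(actual, model_a)
summary_b = classification_summary(actual, model_b)
print("A:", summary_a)
print("B:", summary_b)
```

    Model A scores 99.99% accuracy with a recall of zero, while model B gives up a little accuracy and finds most of the events, which is why overall accuracy alone says little when events are rare.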

    I'd argue that we need to remove human biases from the analyses as much as possible. We, as humans, suffer from a range of psychological issues where our brains make connections that are not there. Just think about all the folks that believe in conspiracy theories (apophenia) or the folks that see faces in things (pareidolia), or how humans are highly suggestible. We crave finding patterns when there are none. We see what we want to see (think about all those bad relationships we've had). When confronted with data that does not support our predetermined ideas, we reject it. (Think about the hole in the ozone layer, the scientists that argued smoking was not that bad, or "flat earthers".)

    If we look into the AI models used in autonomous cars, I'll take a Tesla on auto-pilot over most other drivers on the roads. Tesla mistakes have killed how many people over the last 5 years? A dozen or so? The human drivers in my state have killed that many this past week. (We average about 2 a day statewide.) If all drivers used autopilot, deaths of vehicle passengers would drop to about 5%-10% of what they are now.

    We can all come up with times when traditional methods worked great and times when they utterly failed. We can all come up with times when AI and ML models worked great and times when they utterly failed. I think we all need to work on making these methods better. We need to use diagnostics beyond "looking" at plots. We need to go beyond "maximizing" values.

    We need to ask our models if they are doing what we need them to do. We need to adopt ideas from AI/ML and use them in traditional statistics. We need to adopt ideas from traditional statistics into AI/ML. 

    Most importantly, we need to listen to our data and remember, "All models are wrong: Some models are useful." That means there is room for improvement, and Jim Jefferies is correct: "We can all do better."


    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)

  • 6.  RE: Use of machine learning or artificial intelligence in classic statistical analysis

    Posted 01-24-2018 13:51
    I have proposed a strategy called "Local Control," LC, for Comparative Effectiveness Research (CER) using observational (health care) data, i.e., data where differences in experimental unit (patient) X-confounder characteristics are expected between treatment cohorts. This analysis strategy is "local" in the sense that it starts by using clustering methods to form (many) subgroups of units relatively "well-matched" in X-space.

    In the Confirm phase of LC analysis, I have proposed a nonparametric (permutation) test of the hypothesis that the available X-confounder characteristics are ignorable. When evidence against this hypothesis is strong, attempts at modeling "local" effect estimates to show that they are predictable "fixed" effects (i.e. heterogeneous treatment effects) are justified.

    As in Alan's post, LC Confirm testing uses repeated random re-assignment of units (patients) to subgroups of the same sizes as the observed X-based clusters to generate a NULL distribution. This is indeed computationally taxing, especially when N = number of units and K = number of subgroups are both large. But the researcher can then literally "see" the Observed and NULL distributions being compared as empirical CDFs.
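
    A rough sketch of that kind of comparison is given below. It is not the published LC implementation: the local effect (treated mean minus control mean within a subgroup), the maximum-gap summary between the two empirical CDFs, the simulated data, and the function names are all assumptions made for illustration.

```python
import random
from bisect import bisect_right


def local_effects(cluster, treated, outcome):
    """Treated-minus-control mean outcome within each subgroup that has both arms."""
    groups = {}
    for c, t, y in zip(cluster, treated, outcome):
        groups.setdefault(c, ([], []))[t].append(y)  # index 0 = control, 1 = treated
    return [sum(trt) / len(trt) - sum(ctl) / len(ctl)
            for ctl, trt in groups.values() if ctl and trt]


def null_effects(cluster, treated, outcome, n_rep=200, seed=0):
    """Pool local effects over random reassignments of units to subgroups."""
    rng = random.Random(seed)
    labels = list(cluster)
    pooled = []
    for _ in range(n_rep):
        rng.shuffle(labels)  # shuffling labels keeps every subgroup size fixed
        pooled.extend(local_effects(labels, treated, outcome))
    return pooled


def ecdf(sample):
    """Right-continuous empirical CDF of a sample, returned as a function."""
    s = sorted(sample)
    return lambda v: bisect_right(s, v) / len(s)


def max_cdf_gap(a, b):
    """Largest vertical distance between the empirical CDFs of a and b."""
    fa, fb = ecdf(a), ecdf(b)
    return max(abs(fa(v) - fb(v)) for v in set(a) | set(b))


# Simulated data (an assumption): 20 clusters of 30 units whose treatment
# effect grows with the cluster index, so X is not ignorable.
rng = random.Random(7)
cluster, treated, outcome = [], [], []
for c in range(20):
    for i in range(30):
        t = i % 2
        cluster.append(c)
        treated.append(t)
        outcome.append((c / 10) * t + rng.gauss(0, 0.3))

observed = local_effects(cluster, treated, outcome)
null = null_effects(cluster, treated, outcome)
gap = max_cdf_gap(observed, null)
print(f"{len(observed)} observed local effects, max CDF gap = {gap:.2f}")
```

    Plotting `ecdf(observed)` against `ecdf(null)` would give the visual comparison described above; the cost grows with both the number of units and the number of replications.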

    Readers interested in learning about LC should visit my website for case-studies and published papers:

    Bob Obenchain