Register for the Spring Workshop!
Wei-Yin Loh, University of Wisconsin, Madison
New Developments in Classification and Regression Trees and Forests
Where: 35 W Wacker Dr., Chicago, IL 60601 (Leo Burnett Building)
When: Thursday, April 23rd, 9am-4pm
What: Hands-on Workshop with Lunch
Abstract:
Classification and regression tree models are unmatched for their interpretability, a feature that is lacking in "black-box" models such as those constructed by deep learning, tree ensembles, boosting, and other methods. Yet tree models have been falling out of favor in recent years for two main reasons. First, the prediction accuracy of tree models tends to be lower than that of black-box models. In particular, tree models often have lower accuracy than random forest models, which are ensembles of trees. Consequently, the latter have largely supplanted trees for prediction tasks. The second reason is not as well known but more important. Tree algorithms built on the CART (Breiman et al., 1984) paradigm are overly greedy in their search for a variable to split each node. They have a propensity to select variables (e.g., categorical variables with large numbers of levels) that allow more splits. This renders interpretations drawn from the tree structures of dubious value.
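The source of this selection bias is easy to quantify: at a node, an ordered variable with v distinct values offers only v - 1 candidate binary splits, while an unordered categorical variable with k levels offers 2^(k-1) - 1 subset splits. The sketch below (plain Python, illustrative only, not part of the workshop materials) counts both; an exhaustive greedy search is far more likely to find a spuriously good split among the much larger categorical pool.

```python
def num_ordered_splits(v):
    """Candidate binary splits for an ordered variable with v distinct values."""
    return v - 1

def num_subset_splits(k):
    """Candidate binary splits for an unordered categorical with k levels.
    A nonempty proper subset and its complement define the same split,
    so the count is 2**(k - 1) - 1."""
    return 2 ** (k - 1) - 1

# A 10-valued ordered variable: 9 candidate splits.
# A 10-level categorical: 511 candidate splits, so a greedy exhaustive
# search is biased toward selecting the categorical variable.
for k in (2, 5, 10, 20):
    print(k, num_ordered_splits(k), num_subset_splits(k))
```

GUIDE avoids this bias by separating the choice of split variable (via statistical tests) from the search for the split point, so the number of candidate splits does not influence which variable is selected.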
This course focuses on the GUIDE algorithm (Loh, 2002), but other algorithms implemented as R packages, such as rpart, party, partykit, ranger, and xgboost, are used for comparison. GUIDE is designed to be free of the selection bias of CART, not only in the selection of variables for splitting nodes, but also in its importance scores (the importance scores of Breiman's (2001) random forest are biased, but for other reasons). Unbiasedness is not, however, the only desirable property of GUIDE. Another unique feature is how GUIDE deals with missing values in the predictor variables, a common problem in data from sample surveys. While most algorithms employ implicit imputation of missing values (such as CART's surrogate splits) or send them randomly to the left and right subnodes at each split (party and partykit), GUIDE does not do imputation at all. Therefore, it does not require "missing at random" assumptions that are often difficult, if not impossible, to justify. Missingness itself is treated as qualitative information in GUIDE, and the tree diagrams explicitly show where missing values go at every split.
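The contrast with imputation-based handling can be made concrete. In the minimal Python sketch below (illustrative only, not GUIDE's code; the `missing_goes_left` flag is a hypothetical name), each split stores an explicit branch for missing values, so missingness is routed like any other qualitative category and no "missing at random" assumption is invoked.

```python
class SplitNode:
    """A binary split that routes missing values explicitly,
    in the spirit of GUIDE's tree diagrams (sketch only)."""
    def __init__(self, var, threshold, missing_goes_left, left, right):
        self.var = var
        self.threshold = threshold
        # Direction for missing values, learned like any other split parameter.
        self.missing_goes_left = missing_goes_left
        self.left = left
        self.right = right

    def route(self, row):
        x = row.get(self.var)  # None (or an absent key) represents a missing value
        if x is None:
            return self.left if self.missing_goes_left else self.right
        return self.left if x <= self.threshold else self.right

class Leaf:
    def __init__(self, prediction):
        self.prediction = prediction

def predict(node, row):
    while isinstance(node, SplitNode):
        node = node.route(row)
    return node.prediction

# Missingness acts as its own category: rows with "age" missing all go right.
tree = SplitNode("age", 50, missing_goes_left=False,
                 left=Leaf("low risk"), right=Leaf("high risk"))
print(predict(tree, {"age": 30}))    # routed by the numeric comparison
print(predict(tree, {"age": None}))  # routed by the explicit missing branch
```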
For a long time, the typically lower prediction accuracy of tree models versus black-box models seemed inevitable: you get either interpretability or accuracy, but not both. The current emphasis on "explainable AI" has renewed interest in algorithms that produce single-tree models with predictive accuracy on par with black-box models. The primary reason traditional tree models have lower prediction accuracy is that they are restricted to splitting each node on a single variable and to predicting the response in each terminal node with a constant, either the node mean (regression) or mode (classification). These restrictions were meant to maximize interpretability. One way to relax them while retaining explainability and improving accuracy is to construct trees with linear splits and linear regression or linear discriminant models in the nodes. These ideas have been tried before, but never together. They are now implemented in the GUIDE algorithm and software. Empirical evidence based on real data will show that these new tree models have predictive accuracy comparable to or better than that of random forests, neural nets, and gradient-boosted trees. The new tree models can approximately "explain" which variables are utilized and how they are utilized in a black-box model, or they can serve as "explainable" replacements for them.
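The gain from replacing a constant terminal-node prediction with a node-level regression can be seen in a toy example. The sketch below (plain Python, one predictor, closed-form least squares; illustrative only, not GUIDE's fitting procedure) shows that when the response trends linearly within a node, a constant leaf predicts the mean everywhere while a regression leaf fits the trend exactly.

```python
def fit_constant(ys):
    """Traditional terminal-node model: predict the node mean."""
    return sum(ys) / len(ys)

def fit_simple_regression(xs, ys):
    """Least-squares line y = a + b*x fitted within a node."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Node data with a clear linear trend: the constant leaf predicts 5.0
# for every observation, while the regression leaf recovers y = 2x exactly.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(fit_constant(ys))                 # 5.0
a, b = fit_simple_regression(xs, ys)
print(a, b)                             # 0.0 2.0
```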
Real examples are used throughout to motivate key ideas in GUIDE as well as to demonstrate its range of application. Almost all examples contain missing data, which usually requires some pre-processing in other tree and non-tree algorithms, but not in GUIDE. The datasets include Covid-19 electronic health records, consumer expenditure survey data from the Bureau of Labor Statistics, vehicle crash test data from the National Highway Traffic Safety Administration, cancer clinical trial data for precision medicine (differential treatment effects), data on live births in the U.S. from the CDC, and others.
The GUIDE algorithm and software have been continually updated and enhanced with new features for over 30 years. The first version of the software was released in 1997. Current executable versions for macOS, Windows, and Linux are available as a single zip file at https://pages.stat.wisc.edu/~loh/guide.html. There are links to a user manual, datasets, and PDFs of publications documenting its development. A live demo of the software is planned for the last hour or two of the course. Attendees wishing to learn how to use the software are encouraged to download the manual, data, and executable onto their laptops and bring them along.
Register Now!
________________________________________________________________________