If the goal is prediction, at worst a random forest will do about the same as a logistic regression. What makes RF and neural nets so much better is that they will break up continuous variables and "create" interaction terms. Suppose that you have "age" and "gender" as variables in the model. A main-effects model will only look at the difference between male and female and at the effect of adding one extra year to the age value. What if there is an interaction between age and gender? The main-effects model misses it completely. What if there are interactions between age and gender AND the groupings are different depending on gender? Such as: males between 18 and 30 have a low risk, females between 18 and 40 have a low risk, males between 30 and 60 have a medium risk, females between 40 and 50 have a medium risk, and so on. You won't see that from a logistic regression (at least not very easily). You can see it very clearly with CART, random forests, neural nets, etc. I've gone back and re-examined several dozen "textbook" data sets using RF vs. logistic regression. RF wins every time. The worst case I ever saw was the two being close to each other.
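That age/gender pattern is easy to demonstrate. Below is a rough sketch (the risk bands and probabilities are made up purely for illustration, and scikit-learn is assumed) that simulates gender-specific risk bands that are non-monotone in age, then fits a main-effects logistic regression against a random forest:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(18, 70, n)
male = rng.integers(0, 2, n)

# Hypothetical risk bands that differ by gender AND are non-monotone in age:
# males are high risk between 30 and 60, females between 40 and 50.
high_risk = np.where(male == 1, (age > 30) & (age < 60), (age > 40) & (age < 50))
y = rng.binomial(1, np.where(high_risk, 0.9, 0.1))
X = np.column_stack([age, male])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
lr_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
rf_acc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(f"main-effects logistic: {lr_acc:.3f}")  # stuck with a monotone-in-age fit
print(f"random forest:         {rf_acc:.3f}")  # recovers the bands
```

The forest recovers the bands by splitting on age separately within each gender, while the main-effects logistic fit is forced to be monotone in age for both genders, so it cannot represent a "medium in the middle" group.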
From the point of view of an investigator, you put all your variables into the model and let the software pick the best partitions. With logistic regression, you need to worry about collinearity, among other things. With RF and NN, you just don't care.
With logistic regression, you tend to look for the one model that is the "best". The question for the statistician is, "What makes this model 'best'? How much worse are the other potential models?" With RF and NN, you generate hundreds of models and see what each model tells you.
With a logistic model, your software generates the model with the least "error". Suppose you have events occurring 2% of the time. Your logistic regression will likely predict that events never happen and be wrong 2% of the time. However, if those events are "avoidable death", then that logistic regression is a total failure. With RF and NN, you can "tune" your model to be really good at predicting "avoidable death". It will have a higher overall error rate than the logistic regression, because it will also put some patients who would have lived into the "avoidable death" category. But that means you can take extra precautions for them. So your RF or NN model will be right, say, 90% of the time overall, yet accurately catch, say, 50% to 80% of the "avoidable deaths". The logistic regression is far less useful here because it predicts everyone lives and is simply wrong 2% of the time.
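Here is a sketch of what that tuning looks like in practice, on synthetic data standing in for a 2% "avoidable death" outcome (the data set, class weighting and thresholds are all stand-ins, with scikit-learn assumed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 2% of cases are the rare event.
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.98],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(class_weight="balanced", random_state=0)
rf.fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]

# "Tuning": lower the decision threshold so the model flags more of the
# rare events, trading overall accuracy for recall on the deaths.
results = {}
for thresh in (0.5, 0.2, 0.05):
    pred = (proba >= thresh).astype(int)
    accuracy = (pred == y_te).mean()
    recall = pred[y_te == 1].mean()
    results[thresh] = (accuracy, recall)
    print(f"threshold {thresh}: accuracy {accuracy:.3f}, "
          f"recall on rare events {recall:.3f}")
```

Lowering the threshold can only add positive predictions, so recall on the rare class never drops; what you pay is more living patients flagged for "extra precautions".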
Something to try at home: take your textbook data and some software that runs logistic regression, CART, RF and NN models under the same criteria. Split your data into training, testing and validation data sets. Let the software generate the models. Tune each model for minimal false-positive and false-negative rates. Then do that with 30 other data sets. I can guarantee that the software will show the RF and NN models are equivalent to or better than logistic regression under the same tunings.
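For anyone who wants to actually run that experiment, one way the comparison might look on a single data set (a sketch only; the data set, split sizes and model settings are stand-ins, with scikit-learn assumed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 60/20/20 split into training, validation and test sets.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4,
                                              random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=0)

models = {
    "logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=5000)),
    "CART": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "NN": make_pipeline(StandardScaler(),
                        MLPClassifier(max_iter=2000, random_state=0)),
}

# Compare on the validation set; the held-out test set stays reserved for
# whichever model and tuning you finally pick.
val_acc = {name: m.fit(X_tr, y_tr).score(X_val, y_val)
           for name, m in models.items()}
for name, acc in sorted(val_acc.items(), key=lambda kv: -kv[1]):
    print(f"{name:8}: {acc:.3f}")
```

Repeat that over your 30 data sets and tabulate who wins; scaling inside a pipeline matters for the logistic and NN models, which is one reason naive comparisons can be unfair to them.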
------------------------------
Andrew Ekstrom
Statistician, Chemist, HPC Abuser;-)
Original Message:
Sent: 05-16-2016 13:53
From: James Baldwin
Subject: Use Lasso Logistic Regression to Analyze Binary Data with
How can a logistic regression always be “poor” or at best “really bad” when the underlying model is a logistic model? I think you need to be a bit more specific about when to condemn logistic regression.
Jim
------------------------------
Jim Baldwin
Station Statistician
USDA-Forest Service