ASA Connect

  • 1.  Use Lasso Logistic Regression to Analyze Binary Data with

    Posted 05-15-2016 00:27

    I am involved in a medical research project analyzing Coronary Artery Disease (CAD). The dataset has several predictors such as age, gender, race, certain symptoms, and the standard medical procedures used to diagnose CAD. Most of them are binary (like whether the patient smokes) and the rest are continuous (like blood pressure or certain hormone levels). The outcome variable is whether the patient has CAD or not (binary).

    The research goal is to build a model that identifies the variables of most interest and predicts well. My idea is to fit a Lasso logistic regression to select the variables and then look at the prediction. I did some research online and found a very useful tutorial by Trevor Hastie and Junyang Qian. Click the link here.

    However, the total number of valid observations is around 150, and at least 4/5 of the patients don't have CAD. In other words, the outcome is imbalanced: "yes" is the rare class. I am also not sure whether the number of observations is large enough for Lasso. Under these circumstances, in addition to the general procedure above, do I need to set up anything else (such as weight adjustment or a larger penalty for "Yes") for model construction? If so, what methods are there to handle this problem?
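
    A minimal sketch of the setup asked about above, written in Python with scikit-learn rather than the R tutorial mentioned in the post. The data are synthetic (150 rows, roughly 20% "yes"), and the use of class_weight="balanced", 5-fold stratified cross-validation, and AUC as the tuning score are assumptions made for illustration, not recommendations taken from the thread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CAD data: 150 rows, about 1 in 5 positive.
X, y = make_classification(
    n_samples=150, n_features=12, n_informative=4, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)

# Stratified folds keep the rare "yes" class present in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# L1 (lasso) penalty; the penalty strength is chosen by cross-validated AUC.
# class_weight="balanced" up-weights the rare class in the loss.
lasso = LogisticRegressionCV(
    Cs=20, penalty="l1", solver="liblinear", class_weight="balanced",
    scoring="roc_auc", cv=cv, max_iter=1000,
)

# Standardize so that the penalty treats all predictors on the same scale.
model = make_pipeline(StandardScaler(), lasso)
model.fit(X, y)

coef = model.named_steps["logisticregressioncv"].coef_.ravel()
selected = np.flatnonzero(coef)  # predictors with non-zero coefficients
print("share of positives:", y.mean())
print("selected predictor indices:", selected.tolist())
```

    Reweighting the classes shifts the predicted probabilities away from the observed prevalence, so they are better read as scores than as calibrated risks unless they are recalibrated afterwards.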

    Thanks in advance.

    ------------------------------
    Tianwen Ma
    Master of Biostatistics, Class of 2018
    University of Michigan Ann Arbor
    ------------------------------


  • 2.  RE: Use Lasso Logistic Regression to Analyze Binary Data with

    Posted 05-16-2016 13:47

    If the goal is prediction, use random forests or neural networks. Let your software decide what is important. Logistic regression is generally really poor for prediction. When it's not really poor, it's still pretty bad. The worst forest or NN will be quite a bit better than the best logistic regression.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 3.  RE: Use Lasso Logistic Regression to Analyze Binary Data with

    Posted 05-16-2016 13:53

    How can a logistic regression always be “poor” or at best “really bad” when the underlying model is a logistic model? I think you need to be a bit more specific about when to condemn logistic regression.

    Jim

    ------------------------------
    Jim Baldwin
    Station Statistician
    USDA-Forest Service



  • 4.  RE: Use Lasso Logistic Regression to Analyze Binary Data with

    Posted 05-16-2016 21:40

    If the goal is prediction, at worst a random forest will be exactly the same as a logistic regression. What makes RF and neural nets so much better is that they will break up continuous variables and "create" interaction terms. Suppose that you have "age" and "gender" as variables in the model. A main-effects-only model will look at the difference between male and female and at the effect of adding one extra year to the age value. What if there is an interaction between age and gender? The main effects model misses it completely. What if there are interactions between age and gender AND the groupings are different based upon gender? For example, males between 18 and 30 have a low risk, females between 18 and 40 have a low risk, males 30 to 60 have a medium risk, females between 40 and 50 have a medium risk, and so on. You won't see that from a logistic regression (at least not very easily). You can see that very clearly with CART, random forests, neural nets, etc. I've gone back and re-examined several dozen "textbook" data sets using RF vs. logistic regression. RF wins every time. The worst I ever saw was that they were close to each other.
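
    As a small illustration of the age-by-gender grouping described above, here is a sketch that writes the stated rule out directly. The half-open age intervals and the function name risk_group are assumptions; ages the post does not cover return None rather than a guessed label.

```python
def risk_group(age, is_male):
    """Risk label from the post's example rule; None where the post is silent."""
    # Cutpoints differ by gender, which is what makes this an interaction.
    low_upper, medium_upper = (30, 60) if is_male else (40, 50)
    if 18 <= age < low_upper:
        return "low"
    if low_upper <= age < medium_upper:
        return "medium"
    return None  # not specified in the post


for age in (25, 35, 45):
    print(age, "male:", risk_group(age, True), "female:", risk_group(age, False))
```

    At age 35 the two genders land in different groups, so no single age effect plus a single gender effect reproduces the rule. A logistic model can represent it only if indicators for these gender-specific age bands are built by hand, which requires knowing the cutpoints in advance; tree-based methods search for such splits themselves.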

    From the point of view of an investigator, you put all your variables into the model and let the software pick the best partitions. With logistic regression, you need to worry about collinearity, among other things. With RF and NN, you just don't care.

    With logistic regression, you tend to look for one model that is the "best". The question for the statistician is, "What makes this model 'best'? How much worse are the other potential models?" With RF and NN you generate hundreds of models and see what each model tells you.  

    With a logistic model, your software generates the model with the least "error". Suppose you have events occurring 2% of the time. Your logistic regression will likely claim events don't happen and be wrong 2% of the time. However, if those events are "avoidable death", then your logistic regression is a total failure. With RF and NN, you can "tune" your model to be really good at predicting "avoidable death". It will have a higher error rate than the logistic regression, but it flags living patients as being in the "avoidable death" category, so you can take extra precautions. Your RF or NN model will be right, say, 90% of the time overall, but among those with "avoidable deaths" it might correctly identify, say, 50% to 80%. The logistic regression will be significantly less useful for those patients because it predicts everyone as living and is wrong 2% of the time.
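
    The arithmetic behind this trade-off can be sketched with two confusion matrices. The cohort size of 10,000 and the counts for the "tuned" rule are hypothetical, chosen only to match the 2% event rate, the roughly 90% overall accuracy, and a detection rate inside the 50% to 80% range mentioned above.

```python
def summarize(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of true events that are flagged
        "specificity": tn / (tn + fp),   # share of non-events left unflagged
    }


# Hypothetical cohort: 10,000 patients, 200 events (2%).
always_negative = summarize(tp=0, fp=0, fn=200, tn=9800)   # "nobody dies" rule
tuned = summarize(tp=140, fp=940, fn=60, tn=8860)          # flags 1,080 patients

for name, stats in (("always negative", always_negative), ("tuned", tuned)):
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

    The always-negative rule is 98% accurate and detects no events; the tuned rule is 90% accurate and detects 70% of them, at the cost of 940 false alarms.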

    Something to try at home: use your textbook data and some software that runs logistic regression, CART, RF and NN models using the same criteria. Split your data into training, testing and validation sets. Let the software generate the models. Tune each model for minimal false positive and false negative rates. Do that with 30 other data sets. I can guarantee that the software will show RF and NN models are equivalent to or better than logistic regressions under the same tunings.
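
    A minimal sketch of one round of that experiment in Python with scikit-learn, comparing only logistic regression and a random forest. The synthetic data set, the 60/20/20 split, the default model settings, and AUC as the yardstick are all assumptions; a single synthetic data set says nothing about which method wins in general.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a "textbook" data set (not real data).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

# 60% training, 20% validation (where tuning choices would be compared),
# 20% test (looked at once, at the end).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

test_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    test_auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: validation AUC={val_auc:.3f}, test AUC={test_auc[name]:.3f}")
```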

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 5.  RE: Use Lasso Logistic Regression to Analyze Binary Data with

    Posted 05-17-2016 00:57

    We definitely have different definitions of logistic regression.  Your definition seems to include model formulation and variable selection aspects.

    Jim

    ------------------------------
    James Baldwin
    Station Statistician
    USDA-Forest Service



  • 6.  RE: Use Lasso Logistic Regression to Analyze Binary Data with

    Posted 05-16-2016 21:51

    Forgot to mention this earlier. Sorry;-)

    Suppose you have a data set that consists of thousands of tuples of data. Let's say it is on the small side, about 50,000 tuples. Each column of data is totally orthogonal to the other columns too. Because of the size of the data set, you will have a lot of terms in your logistic regression model that "look" statistically significant. However, they will contribute minimally to the predictive accuracy of the final model. All the typical diagnostics for logistic regression will scream, "HEY LOOK HERE!!! THIS IS REALLY IMPORTANT!!!!!" In the end, who really cares if something increases your risk by 0.01%?
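
    The gap between statistical significance and practical size can be sketched with a two-proportion z-test, used here as a simplified stand-in for the test on a single binary predictor in a logistic regression. The counts are hypothetical: 50,000 rows split evenly, with event rates of 10.0% and 10.8%.

```python
import math


def two_proportion_z(events_a, n_a, events_b, n_b):
    """Risk difference, pooled z statistic, and two-sided normal p-value."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return p_b - p_a, z, p_value


# Hypothetical counts: 2,500 events in 25,000 unexposed, 2,700 in 25,000 exposed.
risk_diff, z, p_value = two_proportion_z(2500, 25_000, 2700, 25_000)
print(f"risk difference={risk_diff:.3f}, z={z:.2f}, p={p_value:.4f}")
```

    With these counts the difference clears the usual 0.05 threshold comfortably even though it amounts to less than one percentage point of absolute risk.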

      

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 7.  RE: Use Lasso Logistic Regression to Analyze Binary Data with

    Posted 05-17-2016 02:30

    My recommendation is to use random forests as one approach because of the low total number of observations and the low number of events per variable (EPV). You might wish to have a look at the R package ranger, which could do this job for you.

    My recommendations are as follows:

    classical random forest in regression mode (probability forest)

    use subsampling instead of standard bootstrapping

    use 2/3 with the subsampling and fix the ratio of cases and controls in the subsampling

    estimate permutation importance to determine the importance of the independent variables

    to test the importance use the approach of Altmann et al. 2010 Bioinformatics or the novel approach of Janitza et al. 2015 Tech Rep, U Munich

    you can estimate confidence intervals for each observation later on using the approach of Wager and Walther (R package available)

    Use conditional inference forests for comparison (they are implemented with maximally selected rank statistics in ranger as well); Tech Rep is on arXiv
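
    The subsampling step in the list above (draw 2/3 of the rows without replacement while keeping the case-to-control ratio fixed) can be sketched in plain Python. This illustrates only the index selection, not a forest; the function name stratified_subsample and the 30/120 outcome vector are hypothetical.

```python
import random


def stratified_subsample(y, fraction=2 / 3, seed=0):
    """Row indices drawn without replacement, same fraction from each class."""
    rng = random.Random(seed)
    cases = [i for i, label in enumerate(y) if label == 1]
    controls = [i for i, label in enumerate(y) if label == 0]
    # Sampling each class separately keeps the case:control ratio fixed.
    n_cases = round(fraction * len(cases))
    n_controls = round(fraction * len(controls))
    picked = rng.sample(cases, n_cases) + rng.sample(controls, n_controls)
    return sorted(picked)


# Hypothetical outcome vector: 30 cases and 120 controls (150 rows in total).
y = [1] * 30 + [0] * 120
idx = stratified_subsample(y)
n_cases_in_sample = sum(y[i] for i in idx)
print(len(idx), "rows sampled,", n_cases_in_sample, "of them cases")
```

    Each tree would then be grown on one such subsample, drawn with a different seed.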

    If you want to use logistic regression as well, please, have a look at the excellent textbook of Frank Harrell!

    Andreas

    ------------------------------
    Andreas Ziegler
    Universitaet zu Luebeck



  • 8.  RE: Use Lasso Logistic Regression to Analyze Binary Data with

    Posted 05-17-2016 10:05

    Actually, the first method I tried was the random forest, but the predictions for "Yes" were terrible. My friends also said random forests work poorly when the number of observations is small. I am not sure whether they are right, or whether I should apply what you have suggested to improve on my original attempt. Thanks for your detailed explanation anyway.

    ------------------------------
    Tianwen Ma
    Student
    University of Michigan Ann Arbor



  • 9.  RE: Use Lasso Logistic Regression to Analyze Binary Data with

    Posted 05-17-2016 14:09

    How did you tune your data? 

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)