ASA Connect


Stepwise Regression

  • 1.  Stepwise Regression

    Posted 07-13-2016 12:59
    I'm often asked the question about the use of stepwise (backward or forward) regression method. These methods are very popular in building a model by successively adding or removing variables, especially when the researcher is not sure about which variables to keep or remove. It is not highly recommended method by statisticians. I would be very much interested to know your thoughts about this method and what method should be used alternate to this approach. 
    Thanks,
    Sunita
    ------------------------------
    Sunita Ghosh
    Research Scientist
    Alberta Health Services Cancer Care
    ------------------------------


  • 2.  RE: Stepwise Regression

    Posted 07-14-2016 06:08

    Please see the excellent article by Peter Flom (attached).

     

    Blaise

     

    Blaise F Egan, CStat
    Lead Data Scientist
    Modelling and Analysis Team
    BT Innovate & Design
    ____________________________
    Office:  0331 664 5220
    Mobile: 0748 330 7421
    Email:  Blaise.Egan@BT.com





    Attachment(s): stepwise.pdf (127 KB)


  • 3.  RE: Stepwise Regression

    Posted 07-14-2016 08:16

    If the data set is large, then recursive partitioning is a good method. There is R code for it; SAS JMP is better. We compared various implementations some years ago, and the best we found is a commercial product from Golden Helix.

    ------------------------------
    Sidney Young
    Retired



  • 4.  RE: Stepwise Regression

    Posted 07-15-2016 11:47

    JMP's stepwise platform is superb.  You can have it add or remove terms automatically, you can do it manually by clicking on boxes, or you can have it stop at each step to let you override its decisions.  If you have higher-order terms, you can tell it to constrain your models to be hierarchical.  You can lock certain terms in or out.  You can also do all possible regressions in the same platform.  You can save the history into a data table and plot each model against various metrics (AIC, BIC, R-squared, adjusted R-squared, Mallows' Cp, etc.).

    It has a creative default method for nominal variables that you can use or turn off.

    ------------------------------
    Emil M Friedman, PhD
    emilfriedman@gmail.com
    http://www.statisticalconsulting.org



  • 5.  RE: Stepwise Regression

    Posted 07-18-2016 05:27

    Recursive partitioning is useful for exploratory analysis, as it can pick out a small subset of the variables that have good predictive power and it usually produces a comprehensible set of rules that can be easily encoded in SQL in an operational database.

    There are downsides, though. First, it is very unstable: change the training set even slightly and you can get a very different tree. Second, its predictive power is modest. If you want predictive power, go for random forests, but those work as a black box, with no comprehensible model.

    I would use it for EDA only.
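    The instability point is easy to see in a small experiment. This is my own sketch, not code from the thread: it uses scikit-learn's DecisionTreeClassifier as a stand-in for recursive partitioning, fits a tree to a sample and to a bootstrap resample of the same data, and reports which variable each tree splits on at the root. With two predictors of comparable strength, the root split can flip between resamples.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    n, p = 200, 6
    X = rng.normal(size=(n, p))
    # Outcome depends on two predictors of similar strength, so a small
    # perturbation of the training set can change which one splits first.
    y = (0.6 * X[:, 0] + 0.55 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

    def first_split_var(rows):
        """Fit a shallow tree on the given rows; return the root-split variable."""
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(X[rows], y[rows])
        return tree.tree_.feature[0]  # index of the variable at the root node

    full = np.arange(n)
    perturbed = rng.choice(n, size=n, replace=True)  # a bootstrap resample
    print("root split on full sample:    x%d" % first_split_var(full))
    print("root split on resampled data: x%d" % first_split_var(perturbed))
    ```

    Re-running with different resamples shows how often the tree's structure changes even though the underlying data-generating process has not.
    
    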

    Blaise 

    ------------------------------
    Blaise Egan
    Principal Research Statistician
    British Telecommunications PLC



  • 6.  RE: Stepwise Regression

    Posted 07-19-2016 08:19
    I think that Blaise Egan has described recursive partitioning quite nicely. 

    I also think that people use it mainly because it is very easy, it always gives what appears to be a reasonable answer, it includes between-covariate interactions up to any order you like, and you can even draw a picture of the resulting decision tree.  But as a tool for obtaining a fitted regression model that explains what is going on in a data set, it is garbage.

    Peter Thall
    Dept. of Biostatistics
    M.D. Anderson Cancer Center







  • 7.  RE: Stepwise Regression

    Posted 07-19-2016 23:26
    I agree with the comments about using stepwise as a data exploration aid; however, that process requires (IMVHO, and in my experience) a lot of thought and inspection.

    When I do stepwise regression, I start with as many variables as reasonably make sense (usually fewer than several hundred). I then look at every step in the process (with as many reasonable stats as are available). At each step I look not only at the variables currently in the solution and the one chosen at that step, but also at the entry stats for the next step. In addition, I look at the change in coefficients from the previous steps, paying close attention to sign changes or dramatic changes in size and contribution.

    In combination with all this, I look at the intuitive relationships between the variables and the dependent variable (particularly directional relationships that may not fit with common sense to an SME, a subject matter expert). As a last step in this process, I look at the change in fit statistics. Quite often, after 10-15 variables the fit stats will change only in the 3rd or 4th significant digit, or will even decrease. This indicates that one is getting into an overfit situation (I ignore the actual significance levels). This last step is more of an Occam's razor effect: there is no need to make the model more complex than necessary.

    I will typically throw out variables that cause sign changes or don't satisfy SME intuition (unless the effects are strong and can be explained). When the final set of variables is determined, I look at multicollinearity stats, like the VIF. If these are too large, I try to identify the cause and throw out one of the variables.

    This may not result in the statistically optimal model, but usually the difference is trivial, and the result is a simpler, more intuitive model that is more likely to remain stable over time, fit within constrained environments (like credit or insurance models), and be more acceptable to the business decision makers (who should be paying reasonably close attention to models they will use to make crucial business decisions).

    In my opinion, if these strategies had been followed during the housing crisis, many of the bad decisions made back then might have been avoided.

    Granted, this is a process that seems to work well in marketing and risk models. I have had limited experience in Biomed, but I have no reason to think it would not be reasonable in most environments.
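    The step-by-step inspection described above can be sketched in code. This is my own illustration, not Michael's implementation: a greedy forward-selection loop (plain numpy least squares) that, at each step, reports the fit statistic and flags coefficient sign changes from the previous step, the two things the post says to watch.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 300, 8
    X = rng.normal(size=(n, p))
    # Two real signals; the other six variables are noise.
    y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

    def ols(cols):
        """Least-squares fit of y on an intercept plus X[:, cols]; returns (beta, R2)."""
        A = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        return beta, r2

    selected, remaining, prev_signs = [], list(range(p)), {}
    for step in range(p):
        # Greedy step: add the candidate that most improves R-squared.
        best = max(remaining, key=lambda j: ols(selected + [j])[1])
        selected.append(best)
        remaining.remove(best)
        beta, r2 = ols(selected)
        # Inspect, as in the post: flag sign flips relative to the previous step.
        signs = {v: np.sign(beta[i + 1]) for i, v in enumerate(selected)}
        flips = [v for v in prev_signs if signs[v] != prev_signs[v]]
        print(f"step {step + 1}: add x{best}, R2 = {r2:.4f}, sign flips: {flips or 'none'}")
        prev_signs = signs
    ```

    In a run like this, R-squared typically stops improving beyond the third or fourth digit once the real signals are in, which is the overfitting cue the post describes as the stopping point.
    
    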

    Michael L. Mout, MS, Cstat, Csci
    MIKS & Assoc. - Senior Consultant/Owner





  • 8.  RE: Stepwise Regression

    Posted 07-14-2016 10:14
    Model building has two aspects: statistical, and physical or biological.  Step-wise regression is a powerful statistical tool for model building, particularly for models that are linear in the parameters.  The classical F-statistic and its associated p-value for each of the variables in the model lets you decide whether to keep a particular parameter or not.  That said, sometimes a physically or biologically plausible parameter may be excluded by the statistic, or vice versa.  That is where the interaction between the modeler and the investigator comes in.  I would not dismiss step-wise (up or down) regression modeling as inappropriate.

    There are also multivariate techniques, e.g., principal components analysis, that can be extremely helpful.  The downside there is the computational complexity of computing eigenvalues from the observations, which may involve some numerical instabilities.

    In summary, I would not abandon step-wise regression techniques ("F for in and F for out," etc.) in my model building process, but I would keep discussing the validity and implications of some of the parameters of your model.  Standard statistical packages such as BMDP, SAS, SPSS, SYSTAT, etc. all have such programs.  BMDP actually advises you verbally at each step of model building whether a parameter should be kept or not.  There may also be free R code available for such processes.  Good luck.

    Ajit K. Thakur, Ph.D.
    Retired Statistician





  • 9.  RE: Stepwise Regression

    Posted 07-14-2016 10:29


    The Flom and Cassell paper mentions the importance of the bias-variance tradeoff in considering the number of regressors kept, and the importance of "...eliminating models that do not make substantive sense."  An important ingredient in such 'sense,' given the various ways variables may interact, and perhaps what they had in mind, is subject matter theory.  Otherwise the number of combinations of variables to consider could become immense, not to mention the consideration of nonlinearity.  In such a case, there could be numerous spurious results if all we look at are the data in a vacuum of subject matter theory.  Not that data exploration is unimportant, but here it needs help: principal components would muddle such considerations, and the other methods mentioned may also be too oriented toward pure exploration.  Perhaps it is generally better to just consider a range of reasonable models, and rely on the validation with independent data that would follow anyway.

    ------------------------------
    James Knaub
    Lead Mathematical Statistician
    Retired



  • 10.  RE: Stepwise Regression

    Posted 07-14-2016 11:53

    The Flom paper succinctly presents many of the major concerns about step-based methods, which have a pronounced tendency to be opportunistic by exploiting idiosyncratic features of your data set that may not generalize to the population. Hence, model validation is strongly recommended.

    Two other features of step methods sometimes overlooked are: (a) you have no guarantee that the best ensemble of independent variables will be selected; and (b) changing the enter/remove criteria can affect the efficacy of the procedure in finding a near-optimal ensemble, especially when there is multicollinearity and/or many IVs from which to build a model.

    ------------------------------
    David Morse



  • 11.  RE: Stepwise Regression

    Posted 07-14-2016 14:57
    Edited by C. Sterling Portwood 07-20-2016 16:01

    In my opinion stepwise regression should never be used.  If you have any thought that you might want to consider whether or not your independent variables might be causes of your dependent variable, then stepwise regression is virtually guaranteed to mislead you. 

    The only possible use of stepwise regression would be for building models to make nonmanipulative predictions.  Yet this is not good either, because the P values of, and confidence intervals around, your coefficients or R value will be far smaller than they should be.  This is because those values assume that you hypothesized one and only one model and that your results flow from a single regression on that model.  In fact, stepwise regression considers a large number of possible models and then selects the best fit, and therefore it understates the P values and the widths of the confidence intervals.

    I would suggest standard multiple regression.  It implicitly subtracts out the multicollinearity, i.e., the correlations between independent variables, and uses the uncorrelated remainders to calculate the regression coefficients.  If all other conditions for causal inference are met (a very difficult and highly unlikely state), then the coefficients could be considered causal, subject to the typically numerous assumptions required to arrive at your causal inferences.

    Concerning the P values and confidence intervals, if you posited only one model (and did not do any data snooping) and then ran a standard multiple regression to obtain your results, those P values and confidence intervals should be valid, given the assumptions that you made.  The last phrase, i.e., "given the assumptions that you made," makes the important point that P values and confidence intervals consider all assumptions made to be 100 percent accurate.  This is a potential problem throughout statistics.
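    The selection effect on P values is easy to demonstrate by simulation. The sketch below is mine, not from the post: with pure-noise data, it repeatedly picks the single best-fitting of many candidate predictors (a one-step stand-in for stepwise selection) and computes that predictor's naive p-value as if it had been the only one tested. The selected variable comes out "significant" far more often than the nominal 5%.

    ```python
    import math
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, reps = 50, 20, 500
    hits = 0
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = rng.normal(size=n)                    # pure noise: no real relationship
        r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
        j = int(np.argmax(np.abs(r)))             # "selection" step: keep the best fit
        t = r[j] * math.sqrt((n - 2) / (1 - r[j] ** 2))
        pval = math.erfc(abs(t) / math.sqrt(2))   # naive two-sided p (normal approx.)
        hits += pval < 0.05
    print(f"selected predictor 'significant' in {hits / reps:.0%} of runs; "
          "the nominal rate for a single prespecified test would be 5%")
    ```

    The p-value formula here assumes the model was prespecified; because the variable was chosen after looking at all 20 fits, that assumption fails, which is exactly the point made above.
    
    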

     

     ------------------------------
    C. Sterling Portwood, PhD
    Causal Statistician
    Center for Interdisciplinary Science
    ------------------------------





  • 12.  RE: Stepwise Regression

    Posted 07-14-2016 15:20
    It all depends on context.

    At one extreme, variable selection methods, including any kind of stepwise regression or all possible subsets regression, have no place in the primary analysis of the primary endpoint of any study.

    At the other extreme, when doing exploratory data analysis, considering various models can be very helpful in hypothesis generation for future studies.

    In the middle, all possible subsets might be helpful in confirming that a pre-specified set of predictor variables is necessary and sufficient although there are often other simpler ways to address this issue.

    I bemoan the fact that context is not emphasized enough in teaching statistical methods of all kinds.





  • 13.  RE: Stepwise Regression

    Posted 07-15-2016 03:32

    The elastic net is available in SPSS (and R), together with ridge regression and the lasso. The tuning parameters (e.g., lambda) may be set by cross-validation or the bootstrap. A review paper (Pavlou M, Ambler G, Seaman S, De Iorio M, Omar RZ: Review and evaluation of penalized regression methods for risk prediction in low-dimensional data with few events. Statistics in Medicine 2016; 35: 1159-1177) outlines when each is best. Their favourite: stochastic search variable selection, a Bayesian method using spike-and-slab priors.

    Boosting should be considered.

    If "10 events per variable" (Vittinghoff E, McCulloch CE: Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol 2007; 165: 710-718) mandates small models, all of these could be estimated and the best one chosen by Mallows' Cp, AIC, BIC, etc. It depends on which objective has priority: estimation, prediction, selection, or mechanistic insight. Harrell F: Regression Modeling Strategies (Springer, 2001) discusses this in detail.
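    For readers without SPSS, here is a minimal sketch of the cross-validated elastic net described above, using scikit-learn rather than the tools named in the post, and made-up data. Cross-validation picks both lambda (called `alpha` in scikit-learn) and the L1/L2 mixing parameter.

    ```python
    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.default_rng(0)
    n, p = 200, 30
    X = rng.normal(size=(n, p))
    # Only two of the thirty predictors carry real signal.
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

    # 5-fold cross-validation over a grid of L1/L2 mixes and a path of lambdas.
    model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, random_state=0)
    model.fit(X, y)
    kept = np.flatnonzero(model.coef_)  # variables with nonzero coefficients
    print(f"chosen alpha={model.alpha_:.4f}, l1_ratio={model.l1_ratio_}, "
          f"kept {kept.size} of {p} coefficients: {kept.tolist()}")
    ```

    Unlike stepwise selection, the penalty shrinks coefficients continuously, and the tuning is chosen by out-of-sample fit rather than by in-sample significance tests.
    
    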

    ------------------------------
    Reinhard Vonthein
    Universitaet zu Luebeck



  • 14.  RE: Stepwise Regression

    Posted 07-15-2016 08:32

    SAS PROC GLMSELECT can perform variable selection with the LASSO, adaptive LASSO, or elastic net.

    It can also do model averaging.  That's the procedure that I use as an alternative to outdated stepwise regression.

    ------------------------------
    Brandy Sinco, MS Statistics, MA Mathematics
    Research Associate



  • 15.  RE: Stepwise Regression

    Posted 07-15-2016 10:27

    As always, there are some very good comments here.

    Rather than recommending a specific technique in lieu of stepwise regression, I will merely observe that before choosing a technique, one must be sure one understands the desired purpose of the model to be built.  Many researchers plunge ahead with a "multivariable model" without a clear understanding of the model's purpose.

    For example, in some medical research papers, the primary objective is to determine whether a certain group of patients is at higher risk of some event (say, cardiovascular mortality) than those not in the group, in which case a multivariable model may be used to loosely "adjust" for some potential confounders.  In that case, we are less concerned about the model's overall fit and properties, and more concerned about whether the additional covariates in the model are those which would confound the primary relationship of interest - since the primary goal is to answer the question "Is (Group X) at higher risk of (Outcome Y)?"

    In other medical research papers, the primary objective is to develop a comprehensive "risk score" - in which case the principal concern is ensuring strong model fit and good predictive accuracy, and the individual factors' effects may matter less than the overall model fit.  With the primary goal being accurate prediction of outcome, we must structure our modeling in such a way to achieve that goal (although reasonable people will disagree on how best to achieve that!)

    ------------------------------
    Andrew D. Althouse, PhD
    Supervisor of Statistical Projects
    UPMC Heart & Vascular Institute
    Presbyterian Hospital, Office C701
    Phone: 412-802-6811
    Email: althousead@upmc.edu