Making further selection from the subset picked by stepwise regression

  • 1.  Making further selection from the subset picked by stepwise regression

    Posted 07-09-2012 11:23
    This message has been cross-posted to the following eGroups: Young Professionals Group and Statistical Consulting Section.
    -------------------------------------------
    Dear All,

    From the following pool of predictors: x1, x2, ..., x11, x1*x2, x1*x3, x1*x7, x1*x9, x1*x10, x2*x7, x2*x9, x2*x10, x9*x7, x9*x10, x7*x10


    With the significance level of the score chi-square for entering an effect into the model set to 0.35 and the significance level of the Wald chi-square for an effect to stay in the model set to 0.30, the stepwise regression retains:


    Analysis of Maximum Likelihood Estimates

    Parameter          DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
    Intercept           1    -0.9527           2.7777            0.1176       0.7316
    GENDER     F        1    -0.3039           0.2317            1.7196       0.1897
    x3         High          -1.3705           0.7531            3.3122       0.0688
    x3         Low      1     0.6218           0.2321            7.1749       0.0074
    x4         NO       1    -0.9698           0.2931           10.9478       0.0009
    x5         0        1    -0.4693           0.2825            2.7589       0.0967
    x7                  1     0.3377           0.1961            2.9646       0.0851
    x9                  1     1.1467           0.7351            2.4336       0.1188
    x10                 1    -1.1156           0.6400            3.0383       0.0813
    x9*x10              1     0.2520           0.1626            2.4012       0.1212
    x9*x7               1    -0.1231           0.0505            5.9379       0.0148

    Now, as you can see, x10, x9*x10, x1 and x5 are not significant, so I tried removing some of them to get a parsimonious model.
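As a rough check on the table above, each Wald chi-square is the squared ratio of the estimate to its standard error, and with 1 degree of freedom its p-value is the upper tail of a chi-square distribution. The sketch below recomputes both from the posted estimates; the function names are illustrative, and small differences from the posted values are expected because the inputs are rounded to four decimals.

```python
import math

# (parameter, level, estimate, standard error, posted Wald chi-square, posted p-value)
ROWS = [
    ("Intercept", "", -0.9527, 2.7777, 0.1176, 0.7316),
    ("GENDER", "F", -0.3039, 0.2317, 1.7196, 0.1897),
    ("x3", "High", -1.3705, 0.7531, 3.3122, 0.0688),
    ("x3", "Low", 0.6218, 0.2321, 7.1749, 0.0074),
    ("x4", "NO", -0.9698, 0.2931, 10.9478, 0.0009),
    ("x5", "0", -0.4693, 0.2825, 2.7589, 0.0967),
    ("x7", "", 0.3377, 0.1961, 2.9646, 0.0851),
    ("x9", "", 1.1467, 0.7351, 2.4336, 0.1188),
    ("x10", "", -1.1156, 0.6400, 3.0383, 0.0813),
    ("x9*x10", "", 0.2520, 0.1626, 2.4012, 0.1212),
    ("x9*x7", "", -0.1231, 0.0505, 5.9379, 0.0148),
]


def wald_chi_square(estimate, std_error):
    """Wald statistic for a single coefficient: (estimate / SE) squared."""
    return (estimate / std_error) ** 2


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def recompute(rows):
    """Return (name, level, Wald chi-square, p-value) recomputed from each row."""
    results = []
    for name, level, estimate, std_error, _, _ in rows:
        wald = wald_chi_square(estimate, std_error)
        results.append((name, level, wald, chi2_sf_1df(wald)))
    return results


for name, level, wald, p_value in recompute(ROWS):
    print(f"{name:10s} {level:5s} Wald = {wald:8.4f}  p = {p_value:.4f}")
```

Listing the terms against a single cutoff this way makes it easier to say which ones fail at 0.05, 0.10 or 0.15 before deciding what to drop.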

    The results of removing x10 and x10*x9 (HL = 5.6, c = 0.74) are similar to those of removing x5, x1 and x10*x9 (HL = 5.33, c = 0.73).

    On the other hand, if I remove all non-significant variables, namely x10, x9*x10, x1 and x5, the HL test statistic gets a little worse (HL = 9.23 (0.73)).

    What should I do?

    PS: HL stands for Hosmer and Lemeshow Test

    Looking forward to your suggestions and comments.

    Best Regards,
    Tasneem


    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------


  • 2.  RE:Making further selection from the subset picked by stepwise regression

    Posted 07-09-2012 11:48
    First of all, you shouldn't use stepwise.  The results it gives are incorrect; the p values are too small, the model is too complex, the standard errors are too small and the parameters are biased away from 0. David Cassell and I wrote a paper on this called "Stopping stepwise: Why stepwise and similar variable selection methods are bad and what you should use". It's available in several places, e.g. www.nesug.org/proceedings/nesug07/sa/sa07.pdf

    In general, automated variable selection methods are not great; it is better to use substantive knowledge (and, given the variable list you started with, you seem to have a bit of that). If you *must* use an automated method, try LASSO or LAR.

    But selecting variables simply because they are significant or not (by whatever method) isn't great. A variable may be non-significant but very important; for one thing, it might be an important covariate; for another, interactions involving it may be important; for a third, getting a small parameter estimate where theory suggests a big one can be just as important as getting a large one where theory predicts a small one.
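To give a feel for what the LASSO suggestion above does, here is a minimal sketch of coordinate descent with soft-thresholding for a linear model. It is not the GLMSELECT implementation and not a logistic LASSO; the data are made up, the predictors are assumed to be centered already, and the names `soft_threshold` and `lasso_cd` are illustrative only.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return exactly 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_cd(X, y, penalty, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + penalty * ||b||_1 by coordinate descent.

    X is a list of rows with centered columns; y is centered, so no intercept.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that ignores column j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale
    return beta


# Made-up, centered data: y depends on the first column only;
# the second column is constructed to be unrelated to y.
X = [[-3, 1], [-2, -2], [-1, 0], [0, 2], [1, 0], [2, -2], [3, 1]]
y = [-5.9, -4.1, -1.8, 0.0, 1.8, 4.1, 5.9]

print(lasso_cd(X, y, penalty=0.5))  # first coefficient shrunk, second exactly zero
print(lasso_cd(X, y, penalty=8.0))  # penalty large enough to zero everything
```

The point of the sketch is that the penalty shrinks coefficients continuously and sets some exactly to zero, rather than testing terms one at a time against a p-value cutoff.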


    -------------------------------------------
    Peter Flom
    -------------------------------------------








  • 3.  RE:Making further selection from the subset picked by stepwise regression

    Posted 07-09-2012 14:58
    Are the x's continuous or discrete?  Have they been individually "mean centered" or not?  Has a correlation analysis been performed on the "mean centered" version of the data or not?  If the IVs (x's) are continuous, please try the mean centering approach and repeat the stepwise regressions using (xi - xibar) in place of all the native factors, and only form the interactions after centering [i.e., (x3 - x3bar)*(x4 - x4bar)].

    Does this data arise from a statistically designed experiment or controlled trial or intervention?  What is the design structure of the independent variables (IV's)?  If the source of the IV's is observational (no interventions and no design structure), then mean centering all of them before regression is crucial and may aid in identifying other factors or covariates you have recorded that you might wish to include in your model to improve resolution.  Mean centering is especially helpful in dealing with possible multicollinearity, and is the only practical way of dealing with it when using automatic regression features such as stepwise or other variable selection schemes.
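The centering recipe described above can be sketched as follows. The two series are made-up stand-ins for a pair of continuous predictors (they are not the poster's x's), and `pearson` and `center` are illustrative helper names. The sketch compares the correlation between a main effect and its product term before and after centering.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def center(values):
    """Subtract the sample mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical continuous predictors.
x = [1, 2, 3, 4, 5, 6, 7, 8]
z = [5, 3, 8, 1, 9, 2, 7, 4]

# Interaction formed from the raw variables.
raw_product = [u * v for u, v in zip(x, z)]
r_raw = pearson(x, raw_product)

# Interaction formed only after centering, as recommended in the post.
xc, zc = center(x), center(z)
centered_product = [u * v for u, v in zip(xc, zc)]
r_centered = pearson(xc, centered_product)

print(f"corr(x, x*z) raw:      {r_raw:.3f}")
print(f"corr(x, x*z) centered: {r_centered:.3f}")
```

Centering changes only the overlap between a main effect and the product terms built from it; it does not change the correlation between x and z themselves.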

    Often this type of regression result (ME conflicts and interactions inconsistently significant) is a symptom of multicollinearity and implies that the "IV's" really aren't independent, but are reflecting the same underlying information content.  In other words you need to pick some, but not all IV's as most representative of the independent sources of information.  But the centering and correlation analysis advice has the potential to determine if there is really information in individual variables that is different enough to justify including even correlated factors in the same model.  The crucial point is that if the variables are just the same information content expressed in a slightly different mathematical form then the "independence" observed is deceptive.  From a cause and effect point of view, you really need to choose just one representative factor for each unique source of information and try to find the best mathematical transformation of it as well.

    Thomas D. Sandry, PhD
    Retired Industrial Consultant



    -------------------------------------------
    Thomas Sandry
    -------------------------------------------








  • 4.  RE:Making further selection from the subset picked by stepwise regression

    Posted 07-09-2012 15:20
    Dear Dr. Sandry: The x's are a mix of continuous and categorical. The categorical ones have a category shown next to them in the table I copied into my post. I have mean-centered the variables and run a correlation analysis as well.
    Thank you for your comments and suggestions, they are very helpful. I really appreciate your time.

    Dear Dr. Flom: Thank you for your comments and suggestions, I really appreciate your time.

    I do have a general idea of important predictors of my outcome. I am trying to run the GLMSELECT with LASSO now.
    The main reason for running stepwise or any variable selection method is to see if it returns the same variables as expected from various clinical studies, and the stepwise actually does that, albeit with some extra variables. However, not all of them are significant, and if I want to discard some of them, regardless of whether I use stepwise or LASSO, how should I come to that decision statistically?

    Thank You,
    Best Regards,
    Tasneem


    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------








  • 5.  RE:Making further selection from the subset picked by stepwise regression

    Posted 07-09-2012 15:47
    Tasneem,

    I would look at the residuals and the fit for the different models, to see which model actually fits the data better.

    Margot

    -------------------------------------------
    Margot Tollefson
    Owner
    Vanward Statistical Consulting
    -------------------------------------------








  • 6.  RE:Making further selection from the subset picked by stepwise regression

    Posted 07-09-2012 18:11
    Tasneem,

    Margot Tollefson's advice is very valuable.  Plotting the raw data, residuals and simple scatter plots of predicted vs observed and predicted vs individual IV's following regression is a great way to evaluate and ultimately defend models.  I would do as much plotting as you can stand but since it can get overwhelming, only on the few models I was getting seriously interested in keeping.  (i.e. not at every step of an automatic variable selection process).

    Regarding your question about variable selection from the original pool, I have tried many regression modelling strategies over a career, but I need some additional information from you first in order to answer.  What is the lowest significance threshold you are willing to tolerate for variable elimination by an automatic procedure such as stepwise? P-del = 0.05? 0.10? 0.15?  An obvious recommendation is to try a tighter criterion than 0.30 for a variable to remain in the regression and see what the procedure leaves you with.  Obviously the final significance criterion you choose is subjective, and there is no "right" answer, but your field of study may have practical standards that apply.  Whatever standard you choose, selected in advance of performing the regression, must be applied to all the factors uniformly.  You should also have in mind a significance standard for the overall model that indicates its strength in terms of the amount of variation in the response variable that is explained by the regression.

    In industry, I used P-delete = 0.05 religiously when analyzing data from statistically designed experiments, usually with physical variables (temperature, pressure, composition, time, flow,...).  When looking at observational data of the same kind, I would tolerate p-delete of up to 0.10, but these are more personal standards which were relevant to use in an industrial setting in which the consequences were usually practical and economic.  The numerical tolerance level you select for leaving a variable in a regression is entirely up to you.  Your audience and professional field will have its own standards, and in many ways the chosen level depends on what the regression will be used for and what the consequences of being misled by it are.

    It appears that all the factors so far are significant at the P-del = 0.15 level, yet you state that x9, x9*x10, x1 and x5 are not significant.  At what common level of significance?  I would have said, looking only at your output, that x1 (GENDER), x9 and x9*x10 are not significant at the 0.10 level, and tried to repeat the regression using p-del = 0.10 to see what dropped out.  In my experience, many models such as this one would begin to fall apart as the significance level for deletion is dropped to successively lower levels.  This is because of hidden multicollinearity, and the factors being collectively able to explain more variation in the response than they can individually.  That is, each variable has some information to contribute, but all must be present to complete the picture, even if not all are statistically significant at the same time.  The problem is, the picture is fuzzy because some of the factors are not highly significant.  Because the data are observational, if the deletion standard is reduced too far, nothing will be significant.  This leaves the analyst with a dilemma.  Is a weak model that indicates potential factors of value?  Or must the standard be held tight and any model that fails to pass be rejected?  The result depends on the application and the consequences of accepting a model and only you can ultimately decide where to stop this process.

    Finally, what is the significance of the overall model?  I couldn't find it in the output.  If the model is weak overall, and depends on the presence of many variables, some of which are not significant themselves, then the situation is not promising.  The model is indicative but not definitive.  I would be happy with a strong model (large F-ratio or Chi Sq value) which had very few terms, but each of which was highly significant.

    An alternative way to see which regression models withstand the ultimate test of significance is to start the regression with the full pool of all factors and all two factor interactions and use backward elimination with a low value of p-delete, such as 0.05, and see what you get.  Occasionally fairly strong models will appear at or near the very end of such a screening.  But if they don't, at least you will have seen a large selection of models with potential and know you gave many alternatives a chance to reveal themselves.  Comparing models which arise from forward selection procedures with models from backward elimination can lead to discovery of the best models a data set can support, even if they are not as good as we might wish.  The only other possibility is a form of exhaustive enumeration in which you try all possible regression models.  I reserve this approach only for really high-value situations in which it must be known with high certainty that no model was overlooked.

    Good luck, and I'd like to know what you find.

    PS  "All models are wrong, but some are useful" -- GEP Box, after Snee 

    Thomas D. Sandry, PhD
    Retired Industrial Consultant


    -------------------------------------------
    Thomas Sandry
    -------------------------------------------








  • 7.  RE:Making further selection from the subset picked by stepwise regression

    Posted 07-10-2012 08:56

    Dear Tasneem,

    To expand on the comments in paragraph 4 of Thomas Sandry's post and an underlying theme of the thread, the relationships among subsets of predictors are important in model selection as well as in the interpretability of the models. Under certain circumstances, you may be more interested in the predictive value of the model, perhaps to be confirmed in a new data set, than in the estimates for individual predictors. Therefore, you may want to move a subset of related predictors in and out of the model together. These predictors may represent an underlying construct and there may be interesting partial relationships among the subset but, if your primary interest is the predictive value of an interpretable model, then working with the subset as a single "predictor" will reduce complexity and sidestep the impact of multicollinearity on model selection. All the other sage advice would still apply.

    Best regards,
    David



    -------------------------------------------
    David Reasner
    Albemarle Scientific Consulting LLC
    -------------------------------------------








  • 8.  RE:Making further selection from the subset picked by stepwise regression

    Posted 07-10-2012 10:47

    Dear All,

    Thank you for all your comments and suggestions. I will keep them in mind while trying to come up with an efficient model. My goal is predictive modeling, and therefore, after reading all the suggestions by the group members, I have come to the following conclusions:

    1) It's not always ideal to have a parsimonious model if its predictive abilities aren't better than a fuller model.

    2) I agree with the idea of plotting residuals for model diagnostics, but residuals for logistic regression are not easy to interpret because of their non-constant variance. So far I have been looking at the Hosmer-Lemeshow test for assessing the fit. I will definitely try to plot fitted versus observed values, but isn't the Hosmer-Lemeshow test doing the same thing, just quantifying the discrepancies in the form of a statistic? I am not sure how much more information I can grab visually. Below I have tabulated the information:

    Partition for the Hosmer and Lemeshow Test

                          Response                Response
    Group   Total   Observed   Expected     Observed   Expected
        1      56          1       2.45           55      53.55
        2      56          7       4.77           49      51.23
        3      56          7       7.01           49      48.99
        4      56         10       8.49           46      47.51
        5      56         10      10.22           46      45.78
        6      56         10      12.35           46      43.65
        7      56         16      15.27           40      40.73
        8      56         17      18.80           39      37.20
        9      56         23      24.20           33      31.80
       10      59         40      37.44           19      21.56

    Hosmer and Lemeshow Goodness-of-Fit Test

    Chi-Square   DF   Pr > ChiSq
        3.8226    8       0.8728

     
    3) Also, AIC = 564.93, -2 log(L) = 542.93, and R-square and rescaled R-square are 0.1490 and 0.2205 respectively, but I am not sure whether R-square and rescaled R-square make much sense as goodness-of-fit criteria for logistic regression.

    4) I did look at influence plots for the model in my post and I didn't catch any outliers.
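The fit statistics in item 3 can be tied together with a little arithmetic. The sketch below assumes that they and the Hosmer-Lemeshow partition come from the same fitted model on the same observations, and that the first Observed column counts events; neither is stated outright in the post. Under those assumptions, AIC minus -2 log(L) gives twice the number of estimated parameters, and the two R-square values follow from the usual likelihood-based definitions (R-square = 1 - exp(-LR/n), rescaled by its maximum attainable value).

```python
import math

# Values quoted in the post.
NEG2_LOG_L = 542.93
AIC = 564.93

# Event counts and group sizes read from the Hosmer-Lemeshow partition
# (assumption: the first Observed column is the event count).
EVENTS_BY_GROUP = [1, 7, 7, 10, 10, 10, 16, 17, 23, 40]
TOTALS_BY_GROUP = [56] * 9 + [59]

# AIC = -2 log(L) + 2k, so k is half the difference.
n_params = round((AIC - NEG2_LOG_L) / 2)

n_obs = sum(TOTALS_BY_GROUP)
n_events = sum(EVENTS_BY_GROUP)
event_rate = n_events / n_obs

# -2 log-likelihood of the intercept-only model.
neg2_log_l_null = -2.0 * (
    n_events * math.log(event_rate)
    + (n_obs - n_events) * math.log(1.0 - event_rate)
)

# Likelihood-ratio chi-square of the fitted model against the null model.
lr_chi_square = neg2_log_l_null - NEG2_LOG_L

# Generalised R-square and its max-rescaled version.
r_square = 1.0 - math.exp(-lr_chi_square / n_obs)
r_square_max = 1.0 - math.exp(-neg2_log_l_null / n_obs)
r_square_rescaled = r_square / r_square_max

print(f"parameters implied by AIC: {n_params}")
print(f"n = {n_obs}, events = {n_events}")
print(f"R-square = {r_square:.4f}, rescaled = {r_square_rescaled:.4f}")
```

Because the maximum attainable R-square is well below 1 for a binary outcome, modest values are expected even for a model that fits well.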

    Thank you for the guidance.
    Best Regards,
    Tasneem

    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------








  • 9.  RE:Making further selection from the subset picked by stepwise regression

    Posted 07-10-2012 15:13

    Tasneem,

    Thank you for listening to all this commentary on your professional work.  I hope it helps.  Everyone has contributed excellent advice, IMHO.  Please permit a few final parting comments before I leave you to your real work.

    Judging by the HL test results you posted (ChiSq = 3.82, DF = 8, Pr > ChiSq = 0.87), you have indeed got a good fit, and the observed and expected frequencies show good agreement for all deciles of risk, with the possible exception of the first, where obsd = 1 and expd = 2.45.  The only possible minor criticism there is that an obsd frequency of 1 is less than the minimum value of 5 associated with the distribution of the chi-squared statistic with 20 cells.  However, Hosmer and Lemeshow (HL) themselves say that this is an overly conservative requirement (p. 150), and most modelers would happily accept results as good as you obtained.
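For readers who want to see where the HL statistic comes from, the sketch below rebuilds it from the partition table Tasneem posted: it sums (observed - expected)^2 / expected over the 20 cells and refers the total to a chi-square distribution with (groups - 2) degrees of freedom. The function names are illustrative, and the closed-form tail probability used here is valid only for an even number of degrees of freedom. The expected counts are rounded to two decimals, so the result matches the posted value only approximately.

```python
import math

# (observed, expected) for the first response column, then for the second.
PARTITION = [
    ((1, 2.45), (55, 53.55)),
    ((7, 4.77), (49, 51.23)),
    ((7, 7.01), (49, 48.99)),
    ((10, 8.49), (46, 47.51)),
    ((10, 10.22), (46, 45.78)),
    ((10, 12.35), (46, 43.65)),
    ((16, 15.27), (40, 40.73)),
    ((17, 18.80), (39, 37.20)),
    ((23, 24.20), (33, 31.80)),
    ((40, 37.44), (19, 21.56)),
]


def hosmer_lemeshow(partition):
    """Sum (observed - expected)^2 / expected over every cell of the partition."""
    total = 0.0
    for group in partition:
        for observed, expected in group:
            total += (observed - expected) ** 2 / expected
    return total


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half = x / 2.0
    term = 1.0
    series = 1.0
    for k in range(1, df // 2):
        term *= half / k
        series += term
    return math.exp(-half) * series


hl_statistic = hosmer_lemeshow(PARTITION)
hl_df = len(PARTITION) - 2
hl_p_value = chi2_sf_even_df(hl_statistic, hl_df)

print(f"HL chi-square = {hl_statistic:.4f}, df = {hl_df}, p = {hl_p_value:.4f}")
```

Most of the total comes from the first two groups, where the expected event counts are smallest.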

    HL also would agree with your take on the values of Rsquared when applied to logistic regression. "All the various values of Rsquared are low when compared to those typically encountered with good linear regression models.  Unfortunately low Rsquared values in logistic regression are the norm and this presents a problem when reporting their values to an audience accustomed to seeing linear regression values." (p. 167) So don't be shy about reminding the audience to focus on the HL test results and not Rsquared.

    The HL text "Applied Logistic Regression", 2e, Wiley, NY, 2000, is an outstanding subject reference and Chapter 5, Logistic Regression Diagnostics, is a treasure trove of methods for analyzing the fit of a logistic regression to its data.  Oddly, they don't emphasize graphics, except for testing the area under the ROC curve.  They do emphasize the use of Classification Tables as the ultimate summary of the results of a fitted logistic model.  You may wish to try your various models out in the Classification Table format to help assess the validity of including the weakest variables in the final model you select.
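A classification table and the c statistic (area under the ROC curve) can both be computed directly from predicted probabilities and observed outcomes. The sketch below uses eight made-up cases and a 0.5 cutoff purely for illustration; none of the numbers come from the thread, and the function names are hypothetical.

```python
def classification_table(probs, labels, cutoff=0.5):
    """Count true/false positives and negatives at the given probability cutoff."""
    table = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for prob, label in zip(probs, labels):
        predicted = 1 if prob >= cutoff else 0
        if predicted == 1 and label == 1:
            table["tp"] += 1
        elif predicted == 1 and label == 0:
            table["fp"] += 1
        elif predicted == 0 and label == 0:
            table["tn"] += 1
        else:
            table["fn"] += 1
    return table


def sensitivity_specificity(table):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = table["tp"] / (table["tp"] + table["fn"])
    specificity = table["tn"] / (table["tn"] + table["fp"])
    return sensitivity, specificity


def c_statistic(probs, labels):
    """Share of event/non-event pairs ranked correctly; ties count one half."""
    events = [p for p, y in zip(probs, labels) if y == 1]
    nonevents = [p for p, y in zip(probs, labels) if y == 0]
    score = 0.0
    for p_event in events:
        for p_nonevent in nonevents:
            if p_event > p_nonevent:
                score += 1.0
            elif p_event == p_nonevent:
                score += 0.5
    return score / (len(events) * len(nonevents))


# Made-up predicted probabilities and outcomes (1 = event).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

table = classification_table(probs, labels, cutoff=0.5)
sensitivity, specificity = sensitivity_specificity(table)
print(table, sensitivity, specificity, c_statistic(probs, labels))
```

Running the same functions on each candidate model gives a like-for-like comparison of what the weakest terms contribute. The classification table depends on the chosen cutoff, whereas c does not.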

    Finally, I would always prefer a strong and parsimonious model if I could have one, but if I must make a prediction and the best model I can discover is weak and not very parsimonious, I will still use the best found model to make predictions, albeit with caveats to the audience.  Because of the multicollinearity problem commonly found in observational data it is not unusual to have terms which are marginally significant but which help hold the entire model together.  As David Reasner pointed out, it might make sense to group variables in order to form a single factor with more explanatory power.  Personally, I like to pick a representative variable from a family of related factors and accept a little less explanatory power in exchange for more parsimony and strength.  In the end, every modeler must be able to justify and live with their own choices. 

    Just for computational fun if you have the time, you might want to try refitting the model using a cross-validation approach, holding some data back from fitting for use later in checking the model's explanatory power.  This can also be a powerful tool for convincing audiences that your model actually predicts data it hasn't seen.
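One simple way to hold data back is k-fold cross-validation: split the row indices into k folds, fit on k - 1 of them, and score the fold left out. The sketch below only builds the index splits; the model fitting is left to whatever procedure is being evaluated. The sample size of 563 is the total of the group sizes in the HL partition posted earlier, while the choice of 5 folds and the helper name `k_fold_indices` are illustrative.

```python
import random


def k_fold_indices(n_rows, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs covering every row exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [
            idx
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for idx in fold
        ]
        yield train, test


splits = list(k_fold_indices(n_rows=563, n_folds=5))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train = {len(train)}, test = {len(test)}")
```

If variable selection is part of the modelling, it should be repeated inside each training fold; otherwise the held-out rows have already influenced which terms were kept.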

    Good luck and best wishes,

    Tom

    Thomas D. Sandry, PhD

    Retired Industrial Consultant

    -------------------------------------------
    Thomas Sandry
    -------------------------------------------








  • 10.  RE:Making further selection from the subset picked by stepwise regression

    Posted 07-10-2012 15:30
    Hi Tom,

    Thank you for your valuable comments and suggestions, I really appreciate your time. I will for sure keep them in mind.

    I do have the classification tables, especially the model sensitivity and specificity, along with the ROC curves.
    I have tried some cross-validation techniques as well. I will try to include those results in my discussions too.

    Once again, I would like to thank everyone in this group for their insightful comments and suggestions.

    Best Regards,
    Tasneem

    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------








  • 11.  RE:Making further selection from the subset picked by stepwise regression

    Posted 07-23-2012 16:41
    The following applies if these data are from an observational study:

    If you choose to use stepwise you need to remove variables one at a time and check the p-values at each step.  For example, imagine you have a situation where two of your variables are very highly correlated, eg the output of two temperature sensors that are placed very close together.  Further imagine that temperature is extremely important.  In such a situation each sensor is likely to have a non-significant p-value because neither is useful in the presence of the other.  But if you then remove either of them from the model, the remaining sensor will become highly significant.
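The two-sensor situation can be put in numbers with the variance inflation factor (VIF), a diagnostic the post does not name but which measures the same effect. With two predictors, VIF = 1 / (1 - r^2), where r is their correlation, and the standard error of each coefficient is inflated by roughly sqrt(VIF) relative to the uncorrelated case. The readings below are made up for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(a, b):
    """Variance inflation factor when a model contains exactly these two predictors."""
    r = pearson(a, b)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical readings from two sensors mounted side by side.
sensor_a = [20.10, 21.40, 23.15, 24.75, 26.10, 27.60, 29.30, 30.95]
sensor_b = [19.95, 21.60, 23.00, 24.85, 26.00, 27.80, 29.20, 30.95]

r = pearson(sensor_a, sensor_b)
vif = vif_two_predictors(sensor_a, sensor_b)

print(f"correlation between sensors: {r:.5f}")
print(f"VIF: {vif:.0f}, standard errors inflated about {math.sqrt(vif):.0f}-fold")
```

Dropping either sensor removes the inflation, which is why the remaining one can look highly significant.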

    Which sensor gives best model?  It probably doesn't matter and the question may even be silly.

    When there are more than two collinear inputs things can get very confusing and stepwise may completely miss certain models.  In the regression world, all subsets regression is one of several alternatives.  It ought to generalize to your application.

    Don't be surprised if you find more than one adequate model and no way to tell which of them is best.  Observational data often gives ambiguous answers.
    --------
    If you have data from an orthogonal design, you can remove the collinearity between cross terms and main effects by centering the data.
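The orthogonal-design statement can be checked on the smallest possible case, a 2 x 2 factorial. The factor levels below are made up. With raw levels the main effect and the cross term are correlated; after centering, the correlation is exactly zero.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def center(values):
    """Subtract the sample mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical 2 x 2 factorial design: every combination of two levels per factor.
x = [10.0, 10.0, 20.0, 20.0]
z = [1.0, 3.0, 1.0, 3.0]

cross_raw = [u * v for u, v in zip(x, z)]
r_design_raw = pearson(x, cross_raw)

xc, zc = center(x), center(z)
cross_centered = [u * v for u, v in zip(xc, zc)]
r_design_centered = pearson(xc, cross_centered)

print(f"raw levels:      corr(x, x*z) = {r_design_raw:.4f}")
print(f"centered levels: corr(x, x*z) = {r_design_centered:.4f}")
```

With observational data the centered correlation is usually reduced rather than eliminated, because the factor combinations are not balanced.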

    -------------------------------------------
    Emil M Friedman, PhD
    emil.friedman@alum.mit.edu (forwards to day job)
    emilfrie@alumni.princeton.edu (home)
    http://www.statisticalconsulting.org
    -------------------------------------------