Tasneem,
Margot Tollefson's advice is very valuable. Plotting the raw data, the residuals, and simple scatter plots of predicted vs. observed and predicted vs. individual IVs after a regression is a great way to evaluate and ultimately defend models. I would do as much plotting as you can stand, but since it can get overwhelming, only for the few models I was seriously interested in keeping (i.e., not at every step of an automatic variable selection process).
Regarding your question about variable selection from the original pool, I have tried many regression modelling strategies over a career, but I need some additional information from you first in order to answer. What is the lowest significance threshold you are willing to tolerate for variable elimination by an automatic procedure such as stepwise? P-del = 0.05? 0.10? 0.15? An obvious recommendation is to try a tighter criterion than 0.30 for a variable to remain in the regression and see what the procedure leaves you with. The final significance criterion you choose is subjective, and there is no "right" answer, but your field of study may have practical standards that apply. Whatever standard you choose must be selected before performing the regression and applied uniformly to all the factors. You should also have in mind a significance standard for the overall model that indicates its strength in terms of the amount of variation in the response variable that is explained by the regression.
In industry, I used P-delete = 0.05 religiously when analyzing data from statistically designed experiments, usually with physical variables (temperature, pressure, composition, time, flow, ...). When looking at observational data of the same kind, I would tolerate a P-delete of up to 0.10, but these are personal standards that were relevant in an industrial setting in which the consequences were usually practical and economic. The numerical tolerance level you select for leaving a variable in a regression is entirely up to you. Your audience and professional field will have their own standards, and in many ways the chosen level depends on what the regression will be used for and what the consequences of being misled by it are.
It appears that all the factors so far are significant at the P-del = 0.15 level, yet you state that x9, x9*x10, x1 and x5 are not significant. At what common level of significance? I would have said, looking only at your output, that x1 (GENDER), x9 and x9*x10 are not significant at the 0.10 level, and I would have tried repeating the regression using P-del = 0.10 to see what dropped out. In my experience, many models such as this one begin to fall apart as the significance level for deletion is dropped to successively lower levels. This is because of hidden multicollinearity: the factors are collectively able to explain more variation in the response than they can individually. That is, each variable has some information to contribute, but all must be present to complete the picture, even if not all are statistically significant at the same time. The problem is that the picture is fuzzy because some of the factors are not highly significant. Because the data are observational, if the deletion standard is reduced too far, nothing will be significant. This leaves the analyst with a dilemma. Is a weak model that indicates potential factors of value? Or must the standard be held tight and any model that fails to pass be rejected? The answer depends on the application and the consequences of accepting a model, and only you can ultimately decide where to stop this process.
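One common way to look for hidden multicollinearity is the variance inflation factor (VIF) of each predictor. The sketch below is only an illustration, not something taken from the output discussed in this thread: the function names and the toy data are hypothetical, and it uses the fact that the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix. It assumes numerically coded predictors (for example, dummy-coded categories) that are not perfectly collinear.

```python
import math

def correlation(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    # Augment with the identity: [M | I] becomes [I | M^-1].
    A = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
         for i, row in enumerate(matrix)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [row[k:] for row in A]

def vifs(columns):
    """columns: dict mapping predictor name -> list of values.
    Returns a dict mapping each name to its variance inflation factor."""
    names = list(columns)
    R = [[correlation(columns[a], columns[b]) for b in names] for a in names]
    R_inv = invert(R)
    return {name: R_inv[i][i] for i, name in enumerate(names)}

# Hypothetical toy predictors; the names are placeholders only.
toy = {"x1": [1.0, 2.0, 3.0, 4.0], "x2": [1.0, 3.0, 2.0, 4.0]}
print(vifs(toy))
```

A VIF near 1 means a predictor is nearly uncorrelated with the others. Larger values mean its coefficient is estimated less precisely because the other predictors carry overlapping information, which is one reason terms drift in and out of significance as the deletion level changes.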
Finally, what is the significance of the overall model? I couldn't find it in the output. If the model is weak overall, and depends on the presence of many variables, some of which are not significant themselves, then the situation is not promising. The model is indicative but not definitive. I would be happy with a strong model (large F-ratio or Chi Sq value) which had very few terms, but each of which was highly significant.
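For a least-squares model with an intercept, the overall F-ratio can be computed directly from the observed and fitted values. The sketch below is a minimal illustration with hypothetical names and made-up numbers; it does not apply to a logistic model, where the analogous overall test is the likelihood-ratio chi-square.

```python
def overall_f(y, fitted, n_terms):
    """Overall F-ratio and R-squared for a least-squares fit with an intercept.
    n_terms counts the model terms, excluding the intercept."""
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((v - mean_y) ** 2 for v in y)               # total sum of squares
    sse = sum((v - f) ** 2 for v, f in zip(y, fitted))    # residual sum of squares
    df_model = n_terms
    df_resid = n - n_terms - 1
    f_ratio = ((sst - sse) / df_model) / (sse / df_resid)
    r_squared = 1.0 - sse / sst
    return f_ratio, r_squared

# Made-up example: y regressed on a single predictor x = 1, 2, 3, 4,
# whose least-squares line is 0.5 + 0.8 * x.
y = [1.0, 3.0, 2.0, 4.0]
fitted = [1.3, 2.1, 2.9, 3.7]
f_ratio, r_squared = overall_f(y, fitted, n_terms=1)
print(f_ratio, r_squared)
```

The F-ratio is then compared with an F distribution on (n_terms, n - n_terms - 1) degrees of freedom to get the overall significance.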
An alternative way to see which regression models withstand the ultimate test of significance is to start the regression with the full pool of all factors and all two-factor interactions and use backward elimination with a low value of P-delete, such as 0.05, and see what you get. Occasionally, fairly strong models will appear at or near the very end of such a screening. But if they don't, at least you will have seen a large selection of models with potential and will know you gave many alternatives a chance to reveal themselves. Comparing models that arise from forward selection procedures with models from backward elimination can lead to the discovery of the best models a data set can support, even if they are not as good as we might wish. The only other possibility is a form of exhaustive enumeration in which you try all possible regression models. I reserve this approach for really high-value situations in which it must be known with high certainty that no model was overlooked.
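The backward-elimination loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure any particular package runs: it assumes an ordinary least-squares model, uses a large-sample normal approximation for the p-values (a package would use the t distribution), and all function names, variable names and data are hypothetical.

```python
import math
import random

def ols(X, y):
    """Least-squares fit. X is a list of rows that already include a leading 1.0
    for the intercept. Returns (coefficients, approximate two-sided p-values)."""
    n, k = len(X), len(X[0])
    # Augmented normal-equations matrix [X'X | I] for Gauss-Jordan inversion.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    inv = [row[k:] for row in A]
    xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    sse = sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2
              for r in range(n))
    s2 = sse / (n - k)  # residual variance estimate
    pvals = []
    for i in range(k):
        t = beta[i] / math.sqrt(s2 * inv[i][i])
        # ASSUMPTION: normal approximation to the t distribution (large n).
        pvals.append(math.erfc(abs(t) / math.sqrt(2)))
    return beta, pvals

def backward_eliminate(columns, y, p_delete=0.05):
    """columns: dict mapping term name -> list of values.
    Repeatedly drops the least significant term until every remaining
    term has a p-value at or below p_delete. Returns the surviving names."""
    kept = list(columns)
    while kept:
        X = [[1.0] + [columns[name][r] for name in kept] for r in range(len(y))]
        _, pvals = ols(X, y)
        term_p = dict(zip(kept, pvals[1:]))  # skip the intercept
        worst = max(term_p, key=term_p.get)
        if term_p[worst] <= p_delete:
            break
        kept.remove(worst)
    return kept

# Synthetic data: x1 and x2 drive the response, x3 is pure noise.
random.seed(1)
n = 200
data = {name: [random.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2 * data["x1"][i] - 3 * data["x2"][i] + random.gauss(0, 1) for i in range(n)]
print(backward_eliminate(data, y, p_delete=0.05))
```

Rerunning the same call with p_delete set to 0.15, 0.10 and 0.05 shows which terms drop out as the standard tightens. Interaction terms can be included by adding product columns to the dictionary.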
Good luck, and I'd like to know what you find.
PS "All models are wrong, but some are useful" -- GEP Box, after Snee
Thomas D. Sandry, PhD
Retired Industrial Consultant
-------------------------------------------
Thomas Sandry
-------------------------------------------
Original Message:
Sent: 07-09-2012 15:47
From: Margot Tollefson
Subject: Making further selection from the subset picked by stepwise regression
Tasneem,
I would look at the residuals and the fit for the different models, to see which model actually fits the data better.
Margot
-------------------------------------------
Margot Tollefson
Owner
Vanward Statistical Consulting
-------------------------------------------
Original Message:
Sent: 07-09-2012 15:19
From: Tasneem Zaihra
Subject: Making further selection from the subset picked by stepwise regression
Dear Dr. Sandry: The X's are a mix of continuous and categorical variables. The categorical ones have a category shown in front of them in the table I copied with my post. I have mean-centered the variables and done a correlation analysis as well.
Thank you for your comments and suggestions, they are very helpful. I really appreciate your time.
Dear Dr. Flom: Thank you for your comments and suggestions, I really appreciate your time.
I do have a general idea of important predictors of my outcome. I am trying to run the GLMSELECT with LASSO now.
The main reason for running stepwise or any other variable selection method is to see whether it returns the same variables as expected from various clinical studies, and stepwise actually does that, albeit with some extra variables. However, not all of them are significant, and if I want to discard some of them, regardless of whether I use stepwise or LASSO, how should I come to that decision statistically?
Thank You,
Best Regards,
Tasneem
-------------------------------------------
Tasneem Zaihra
Post Doctoral Fellow
McGill University
-------------------------------------------