  • 1.  MultiColinearity, Interactions, Quasi continuous data

    Posted 03-30-2012 16:37
    This message has been cross-posted to the following eGroups: Young Professionals Group and Statistical Consulting Section.
    -------------------------------------------

    Dear All,

    I have a couple of queries, and I would really appreciate it if someone could help me with them:

    • When using stepwise logistic regression to find the best subset of predictors, is it better to start with all main effects and all possible two-way interactions (say x1, x2, x3, x1*x2, x1*x3, x2*x3), or to first fit a stepwise logistic regression with main effects only, find out which ones are significant, and then try to add the interactions that correspond to the significant main effects?

    • When I include the interaction between x1 and x2, x2 turns out to be significant and the interaction (x1*x2) is significant, but the main effect of x1 becomes nonsignificant. I understand that this is a classic symptom of multicollinearity. Surely x1 and x2 are highly correlated. However, the Pearson correlation between x1 and x2 is only 0.395, with a reported p-value of 0.000, so the correlation is significant, although that significance could simply reflect the large sample size (n=566). I should also mention that we are calculating the correlation for quasi-continuous data: x1 is a score obtained from a questionnaire with 5 questions, each answered on a scale from 1 to 5, and the score is the total (so x1 ranges from 5 to 25). Similarly, x2 is the sum of responses to 20 questions (a different set of questions from the ones used for x1, but each again answered on a scale of 1 to 5). Neither score is on a truly continuous scale.
    Are there diagnostics other than correlation that can be used to evaluate whether the significance of the interaction makes sense, and can we test for multicollinearity with quasi-continuous scores?
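
    One diagnostic that is commonly used alongside the correlation is the variance inflation factor (VIF). The sketch below is only an illustration on simulated data: the latent-trait simulation, the item loading of 0.3, and the helper names `vif` and `likert_sum` are assumptions, not part of the original study. For two predictors the VIF equals 1/(1 - r^2), which is about 1.18 at r = 0.395, so a correlation of that size on its own indicates little collinearity. Adding the raw product x1*x2 is what inflates the VIFs.

```python
import numpy as np

def vif(columns):
    """Variance inflation factors, read off the inverse correlation matrix."""
    corr = np.corrcoef(columns, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
n = 566                       # sample size quoted in the post
latent = rng.normal(size=n)   # hypothetical shared trait behind both scores

def likert_sum(n_items, loading=0.3):
    """Sum of n_items simulated answers, each rounded and clipped to 1..5."""
    raw = 3 + loading * latent[:, None] + rng.normal(size=(n, n_items))
    return np.clip(np.rint(raw), 1, 5).sum(axis=1)

x1 = likert_sum(5)    # 5 items, total between 5 and 25
x2 = likert_sum(20)   # 20 items, total between 20 and 100

r = np.corrcoef(x1, x2)[0, 1]
vif_main = vif(np.column_stack([x1, x2]))
vif_with_product = vif(np.column_stack([x1, x2, x1 * x2]))

print(f"r = {r:.3f}")
print("VIF, main effects only:  ", np.round(vif_main, 2))
print("VIF, with raw x1*x2 term:", np.round(vif_with_product, 1))
```

    Nothing in this calculation requires the scores to be continuous: the VIF depends only on the correlations among the columns of the design matrix, so summed questionnaire scores can be checked in the same way.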

    Thank you. I look forward to comments and suggestions from the section members.

    Have a nice weekend.
    Tasneem

    -------------------------------------------
    Tasneem Zaihra
    Assistant Professor
    Concordia University
    Montreal, QC, Canada
    -------------------------------------------


  • 2.  RE:MultiColinearity, Interactions, Quasi continuous data

    Posted 03-30-2012 17:32
    How about none of the above? You should let us know your research goals before deciding what approach is best, but there are very few instances when any variation of stepwise logistic regression will be helpful.

    If you have a large enough sample size, it is often best to fit all variables (and all interactions) and then interpret the resulting equation, rather than try to prune the model down. You will find, for example, that a model that is pruned back by stepwise regression is likely to have residual confounding.

    By the way, is there a reason why you would expect to see interactions in your data? I normally don't encourage people to go looking for interactions unless there is a strong a priori reason for believing that they may exist. Interactions are somewhat akin to subgroup analysis, and they have many of the same problems.

    As far as the second question goes, you might try centering your variables and then computing the interaction term. It is much, much easier to understand and interpret interaction terms when your variables have been centered. Never, ever leave out a main effect when an interaction is present. It leads to all sorts of problems.
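
    A minimal sketch of what centering does, under stated assumptions: the scores are simulated independently on the two questionnaire ranges, the outcome is a made-up continuous variable, and ordinary least squares stands in for logistic regression to keep the arithmetic transparent (the same reparameterization applies to the linear predictor of a logistic model). The helper `fit` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 566
# Hypothetical, independently drawn scores on the two questionnaire ranges.
x1 = rng.integers(5, 26, size=n).astype(float)
x2 = rng.integers(20, 101, size=n).astype(float)
# Made-up continuous outcome with a true interaction.
y = 1.0 + 0.2 * x1 - 0.05 * x2 + 0.01 * x1 * x2 + rng.normal(size=n)

def fit(a, b):
    """OLS coefficients for intercept, a, b and the product a*b."""
    design = np.column_stack([np.ones(n), a, b, a * b])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

x1c, x2c = x1 - x1.mean(), x2 - x2.mean()
raw, centered = fit(x1, x2), fit(x1c, x2c)

# Correlation between a main effect and the product term, before and after.
corr_raw = np.corrcoef(x1, x1 * x2)[0, 1]
corr_centered = np.corrcoef(x1c, x1c * x2c)[0, 1]

print("interaction coefficient (raw, centered):", raw[3], centered[3])
print("x1 coefficient (raw, centered):         ", raw[1], centered[1])
print("corr(x1, x1*x2) raw vs centered:        ", corr_raw, corr_centered)
```

    The interaction coefficient is unchanged by centering. What changes is the meaning of the x1 coefficient: in the raw fit it is the slope of x1 at x2 = 0, a value outside the 20 to 100 range of x2, whereas in the centered fit it is the slope of x1 at the mean of x2 and equals the raw x1 coefficient plus the interaction coefficient times mean(x2).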

    -------------------------------------------
    Stephen Simon
    Independent Statistical Consultant
    P. Mean Consulting
    -------------------------------------------


  • 3.  RE:MultiColinearity, Interactions, Quasi continuous data

    Posted 03-30-2012 18:50
    I personally think stepwise procedures are fine.  Any procedure that does a sequence of hypothesis tests has the multiple testing problem.  My own research with Lacey Gunter does stepwise selection based on variables that qualitatively interact with treatment.  We use stepwise procedures and adjust to control the FWER.
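
    The post does not say which adjustment is used to control the familywise error rate (FWER). Purely as an illustration, Holm's step-down procedure is one standard choice; the function name `holm_reject` and the p-values below are made up for this sketch.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's procedure rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with
        # alpha/(m-1), and so on.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject

p = [0.001, 0.04, 0.03, 0.012]   # hypothetical p-values from four tests
print(holm_reject(p))            # [True, False, False, True]
```

    All four of these p-values are below 0.05, so unadjusted testing would reject every hypothesis; the adjusted procedure keeps only two.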

    Regarding the first-order interactions: if you have 20 variables, there are 190 possible pairwise interactions, and if you test all of them you are bound to get some significant ones by chance.  Also, it is common to test for main effects first, because a significant interaction without a main effect is difficult to interpret and may not be real.
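
    The size of the multiplicity problem can be worked out directly. This sketch assumes that every interaction is truly null and that the tests are independent, which is a simplification. Counting each pair once gives 190 interactions among 20 variables; counting x1*x2 and x2*x1 separately gives 380.

```python
from math import comb

n_vars = 20
alpha = 0.05

unordered_pairs = comb(n_vars, 2)        # x1*x2 and x2*x1 counted once
ordered_pairs = n_vars * (n_vars - 1)    # each pair counted twice

# Under the stated assumptions: expected number of spurious "significant"
# interactions, and the chance of seeing at least one.
expected_false_positives = unordered_pairs * alpha
p_at_least_one = 1 - (1 - alpha) ** unordered_pairs

print(unordered_pairs, ordered_pairs)    # 190 380
print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one): {p_at_least_one:.4f}")
```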

    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------


  • 4.  RE:MultiColinearity, Interactions, Quasi continuous data

    Posted 03-30-2012 19:48

    One situation where I think stepwise is fine is as follows. You have a dependent variable Y and an independent variable X which is of special interest. The question is, is Y related to X after you control for other relevant variables? In this case, it feels legitimate to me to build a model for Y with all of the other variables (not including X). You could use stepwise or some other method. This initial step is not intended to get "the model" (capital letters) or even to test anything about the variables that are thrown in the pot. X is then introduced to the model. If, in this new model, X is meaningfully related to Y (e.g., a non-trivial coefficient for X) and it is statistically significant, then that is evidence in favor of the hypothesis that the relationship of Y to X is real and not random. On the other hand, if the coefficient of X is small and/or it is not statistically significant, that does not show that Y is not related to X in reality. The stepwise procedure may have introduced variables that are correlated with both X and Y and elbowed out X.

    Another case where I have used stepwise is to throw in all the independent variables, let them fight it out, and see how many variables survive. If none enter (forward selection) or none remain (backward selection), then I know that a more painstaking analysis of main effects is not likely to pay off.  

    In more than one of the comments offered during this thread, it was noted that some wisdom, intelligence and experience should be used in model-building. True!  There are often choices that need to be made based on our collaborator's experience and not based on a particular statistical procedure. 

    Finally, I was surprised in looking through several good textbooks on regression that there was not enough space given to model-building (in my opinion.) It is such a critical area of statistics, and we are building models all the time.  Perhaps one of the reasons that it was not covered so thoroughly is that there is an important subjective component to model-building. Yes, subjective! That is hard to put into a text. 

    May I say that I have not perused Harrell's text, oft-cited here, and I look forward to doing that, given the high marks for it offered on this forum.

    Your thoughts, colleagues?

    Best wishes,

    Nayak

    -------------------------------------------
    Nayak Polissar
    Consultant
    The Mountain Whisper Light
    -------------------------------------------


  • 5.  RE:MultiColinearity, Interactions, Quasi continuous data

    Posted 03-30-2012 19:59

    I took a short course from Frank Harrell using the book.  I thought he was rather unconventional and did not agree with everything that he espoused.  I do think he has a lot of good applied experience though.
    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------


  • 6.  RE:MultiColinearity, Interactions, Quasi continuous data

    Posted 03-31-2012 12:41
    A while back I heard a researcher present a model that had a significant interaction term, a significant main effect, and a nonsignificant main effect. I asked the presenter why he had left the insignificant main effect in the model, and at least three people in the room told me immediately that you must have the main effects in the model if you include the interaction term, as stated in one of the posts in this thread. I asked why and there was no immediate explanation other than to repeat the directive. I didn't pursue the questioning so as not to distract from the presenter's work.

    I thought about it later and asked myself the following. Suppose I were sampling from a population of rectangles. The dependent variable is A, the area of the rectangle, and the independent variables are B, the base of the rectangle, and H, the height of the rectangle. Clearly, the correct model is A=B*H, that is, a regression model with an interaction term and no main effects. Why would I be forced to include the main effects in this model?
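
    The rectangle example can be checked numerically. The sketch below uses simulated bases and heights (the ranges and the helper `fit` are assumptions) and fits the full model with an intercept, both main effects and the product, first on the raw measurements and then on centered ones.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
base = rng.uniform(1, 10, size=n)      # hypothetical rectangle bases
height = rng.uniform(1, 10, size=n)    # hypothetical rectangle heights
area = base * height                   # exact relationship, no noise

def fit(b, h):
    """OLS coefficients for intercept, b, h and b*h with area as response."""
    design = np.column_stack([np.ones(n), b, h, b * h])
    coef, *_ = np.linalg.lstsq(design, area, rcond=None)
    return coef

raw = fit(base, height)
centered = fit(base - base.mean(), height - height.mean())

print("raw:     ", np.round(raw, 6))       # intercept, B, H, B*H
print("centered:", np.round(centered, 6))
```

    On the raw scale the fit returns zero for the intercept and both main effects and one for the product, so the interaction-only model is indeed correct. After centering, the same data give main-effect coefficients equal to mean(H) and mean(B), because (b + mB)(h + mH) expands into linear terms. One possible reading of the usual rule, then, is that dropping main effects is safe only when the zero points of the predictors are meaningful, as they are for lengths.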

    I ask this question in expectation of learning something that I should have learned a long time ago but somehow did not. This apparent gap in my education has been bothering me for quite some time. I will greatly appreciate an instructive response.

    Thanks,

    -- Tom

    -------------------------------------------
    Thomas Sexton
    Professor and Associate Dean
    Stony Brook University
    -------------------------------------------


  • 7.  RE:MultiColinearity, Interactions, Quasi continuous data

    Posted 03-31-2012 13:25

    The rule people use is statistical folklore.  That is why you didn't get a proper response to your question: many just accept the folklore.  It is of course possible to include interaction effects without main effects.  The problem is that in practice people find such models difficult to interpret, because interactions are often thought of as secondary effects.

    As an analogy from chemistry, you can have two chemicals that each react when placed in an acid, but the reaction when both are placed in it together is greater than the sum of the two (assuming I can quantify the result of the chemical reaction numerically).  Then there is an interaction on top of the simple main effects that explains the increase.  On the other hand, we could have two chemicals that each have no reaction with the acid, but if you put them together with the acid you get a reaction.  Then there is an interaction without any main effect.  It is possible, and the statistical model can be valid.

    But the folklore is around because in many practical situations we have an a priori belief that any interactions we find would be secondary to the main effects.  If we then identify an interaction that is statistically significant while neither main effect is, we tend to think the interaction is spurious.  So in the absence of a theoretical justification for the interaction, statisticians are going to dismiss it as a chance occurrence.  Remember that there is a multiplicity issue here.  If you are exploring all main effects and all pairwise interactions, there are many more pairwise interactions to test and hence a greater chance of a spurious significant one.

    Think of the folklore as a rule of thumb rather than a statistical law.  It is probably a bad idea to look for interactions when there is no a priori reason to suspect that they exist.  Including them in the model leads to non-parsimonious models and possible overfitting (parsimony and avoiding overfitting are also statistical principles rather than absolute laws).
    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------


  • 8.  RE:MultiColinearity, Interactions, Quasi continuous data

    Posted 03-31-2012 13:37
    Tom: very, very nice example showing that main effects are not needed for some models. Indeed, it is possible to construct examples where the main effects are not needed and the interaction is the only term needed (plus possibly the intercept). One example is Y (dependent variable), X (independent #1) and Z (independent #2). Z, say, is dichotomous (0/1), X is continuous, and Y vs. X has a positive slope when Z = 0 and a negative slope when Z = 1. Plotted (Y vs. X), you are staring at a big "X" (the plotted lines, not the variable!) with lines that cross in the middle, like a printed character "X." Some more conditions need to be specified, but the model Y = constant + (coefficient)*Z*X + (noise) is the correct model here (without main effects). However, I see no harm (others please comment) in including main effects (perhaps everything centered) just to show that the main effects are not important for prediction.
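
    One way to fill in the unstated conditions, offered as an assumption rather than as the poster's intent: let the two lines have equal and opposite slopes and cross at X = 0. The sketch simulates such data and fits the full model twice, once with Z coded 0/1 and once with Z recoded to +1/-1. The slopes, noise level and helper `fit` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(-3, 3, size=n)            # continuous predictor
z01 = rng.integers(0, 2, size=n)          # dichotomous predictor coded 0/1
slope = np.where(z01 == 0, 1.0, -1.0)     # +1 when Z = 0, -1 when Z = 1
y = 2.0 + slope * x + rng.normal(scale=0.1, size=n)   # lines cross at x = 0

def fit(z):
    """OLS coefficients for intercept, x, z and x*z."""
    design = np.column_stack([np.ones(n), x, z, x * z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

coef_01 = fit(z01.astype(float))          # Z coded 0/1
coef_pm = fit(1.0 - 2.0 * z01)            # Z recoded: 0 -> +1, 1 -> -1

print("0/1 coding:  ", np.round(coef_01, 2))   # intercept, X, Z, X*Z
print("+1/-1 coding:", np.round(coef_pm, 2))
```

    With the +1/-1 coding the estimates are close to (2, 0, 0, 1), so the interaction-only model holds. With the 0/1 coding the same data give roughly (2, 1, 0, -2): the X main effect is needed, because it is the slope in the Z = 0 group. Whether the main effects vanish therefore depends on how Z is coded and where the lines cross.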

    Best wishes,

    Nayak

    -------------------------------------------
    Nayak Polissar
    Consultant
    The Mountain Whisper Light
    -------------------------------------------


  • 9.  RE:MultiColinearity, Interactions, Quasi continuous data

    Posted 03-31-2012 14:21

    Tom asked a great modeling question to which I will respond.
    When you include an interaction, should you include all main effects (heritability) or at least one (heredity), and is it ever reasonable to include interactions without main effects?  The answer to the last question is yes. (Though Tom's description of the audience response shows that there is not agreement on this answer.)
    Tom gives an example, calculating the area of a rectangle where an interaction (only) model is appropriate. 
    This topic has received little attention in the literature.  Peter Goos and I wrote a letter to Technometrics that included a discussion of "spike interactions."  The reference is listed below.
    A spike interaction is an adverse effect that rarely happens.  Modeling a spike interaction can use interactions without main effects.
    (The spike interaction effect is always adverse because if an exceptionally desirable result occurs, then a discovery has been made and a new process involving only the particular setting of all factors will be developed.) 
    Its occurrence requires specific levels of many factors.  In a production process it is a problem that is hard to diagnose.  The Shainin consulting group has a method of dealing with spike interactions that has not been published.  Therefore the reference below is one of the few references.

    Goos, P. and Lucas, J. M. (2009). Letter to the Editor: Comments on "A Critical Assessment of Two-Stage Group Screening Through Industrial Experimentation" (Technometrics 2008, 50, 15-25). Technometrics, 51, 96-97.
    Jim 
    -------------------------------------------
    James Lucas
    J M Lucas & Associates
    -------------------------------------------


  • 10.  RE:MultiColinearity, Interactions, Quasi continuous data

    Posted 03-30-2012 18:29

    Step-(anything) regression often performs very poorly, may select "noise" variables, and should be avoided.
    Several published papers have examined the performance of stepwise procedures.
    I'd suggest you look at those.
    One citation, and citations therein: http://avesbiodiv.mncn.csic.es/estadistica/whittingham.pdf

    Frank Harrell's textbook also discusses limitations of stepwise regression:
    Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis (2001, Springer-Verlag)

    See also: http://www.stata.com/support/faqs/stat/stepwise.html
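
    The point about selecting "noise" variables can be illustrated with a small simulation. This is a sketch under stated assumptions, not the procedure of any particular package: every candidate predictor is pure noise, ordinary least squares stands in for logistic regression, and a variable enters when its |t| statistic exceeds 1.96 (roughly p < 0.05 at this sample size). The helper `last_t_stat` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_candidates = 500, 200
X = rng.normal(size=(n, n_candidates))   # candidate predictors: pure noise
y = rng.normal(size=n)                   # outcome unrelated to every column

def last_t_stat(design, response):
    """t statistic of the last column's coefficient in an OLS fit."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    dof = design.shape[0] - design.shape[1]
    cov = resid @ resid / dof * np.linalg.inv(design.T @ design)
    return coef[-1] / np.sqrt(cov[-1, -1])

selected = []
while len(selected) < n_candidates:
    # Current model: intercept plus everything selected so far.
    current = np.column_stack([np.ones(n)] + [X[:, j] for j in selected])
    remaining = [j for j in range(n_candidates) if j not in selected]
    t_abs = {j: abs(last_t_stat(np.column_stack([current, X[:, j]]), y))
             for j in remaining}
    best = max(t_abs, key=t_abs.get)
    if t_abs[best] < 1.96:     # no remaining candidate passes the entry rule
        break
    selected.append(best)

print(f"{len(selected)} of {n_candidates} noise variables entered the model")
```

    Every variable that enters is a false discovery by construction, and the p-values reported for the final model take no account of the search that produced it.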

    -------------------------------------------
    Chris Barker, Ph.D.
    President - San Francisco Bay Area Chapter of the American Statistical Association
    www.barkerstats.com

    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy
    -------------------------------------------