Testing for correlation in continuous and categorical variables with missing values

  • 1.  Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 11:46
    This message has been cross-posted to the following eGroups: Statistical Computing Section and Statistical Consulting Section.
    -------------------------------------------

    Dear All,

    I am trying to identify whether some sort of relationship (linear or otherwise) exists among a group of variables that I have. Some of them are continuous while others are categorical, and the sample size is 530; however, many of them have missing observations (up to 110 missing observations in some of them).

    I intend to do a preliminary analysis, preferably something graphical.

    When I calculate the correlation to test for a linear relationship between continuous variables, and test the significance of the correlation coefficient, the calculation simply ignores the observations with missing values. The same happens when I run a chi-square test on the categorical variables.



    I would really appreciate your comments and suggestions.


    Thanks
    Tasneem
    -------------------------------------------
    Tasneem Zaihra
    Assistant Professor
    Concordia University
    Montreal, QC, Canada
    -------------------------------------------


  • 2.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 12:00
    Since you are dealing with paired data when you compute correlations, you have to throw out the pair whenever one of its variables is missing.  If the data are not missing at random, the estimate of the correlation could be biased.  If you are looking at counts in an RxC table and the data are paired, the missing member of the pair is left out but the other is not.  This is the case whether you use the Fisher exact test or the chi-square test.  Again, with a large number of missing values, bias is very possible if the data are not missing at random.
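
    The pair-dropping behaviour described above can be sketched in a few lines of plain Python. The scores below are made up for illustration and `pearson_pairwise` is a hypothetical helper, not a function from any statistics package; the point is only that the coefficient is computed from fewer rows than the dataset contains.

```python
import math

def pearson_pairwise(x, y):
    """Pearson r using only the pairs where both values are present (None = missing)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy), n

# Made-up scores; None marks a missing observation.
score_a = [1.0, 2.0, None, 4.0, 5.0, 6.0]
score_b = [2.1, None, 3.2, 3.9, 5.1, 6.2]

r, n_used = pearson_pairwise(score_a, score_b)
print(f"r = {r:.3f} computed from {n_used} of {len(score_a)} rows")
```

    Reporting the number of rows actually used next to each coefficient makes it easier to see how much of the sample a given estimate rests on.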

    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 3.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 12:02
    It sounds like you just need to set up a basic GLM, while making sure you treat the continuous variables as continuous, and the categorical variables as categorical.

    If you want to know whether the values of a continuous variable differ between the levels of a categorical variable, I think you'd probably want something like the least-squares means of the categorical variable.


    -------------------------------------------
    Gabriel Farkas
    -------------------------------------------








  • 4.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 12:12
    In addition to the comments you've already received ....

    It appears you are only checking for linear correlation. You could have a nonlinear relationship and no linear correlation. That is why the model building suggestion is important.

    Also, you can learn a great deal by applying graphical analysis first.

    Finally, as Michael suggests, the missing values may not be missing at random. It is important to understand why they are missing.
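
    As a small illustration of the point about nonlinear relationships, here is a sketch with made-up data in which y is completely determined by x and yet the linear correlation is essentially zero. The `pearson` helper is written out by hand so the example has no dependencies.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# x is symmetric around zero and y is an exact function of x.
x = [i / 10 for i in range(-20, 21)]
y = [v ** 2 for v in x]

r_linear = pearson(x, y)                        # linear term only
r_quadratic = pearson([v ** 2 for v in x], y)   # after adding the right term

print(f"correlation of y with x:   {r_linear:.3f}")
print(f"correlation of y with x^2: {r_quadratic:.3f}")
```

    A scatter plot of y against x would show the curve immediately, which is the argument for looking at graphs before trusting a correlation coefficient.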

    -------------------------------------------
    Patrick Spagon
    -------------------------------------------








  • 5.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 12:18

    Tasneem,

    Unless you do something about the missing data, you will have analytical issues no matter how you go about this.  Moreover, your problem will only multiply should you venture towards multivariate techniques, because most procedures will kick out an entire record even when only one of its variables has a missing value.  You may consider some form of hot-deck imputation, such as the weighted sequential procedure of SUDAAN.  This way, you can maintain the correlational structure of the observed values while avoiding database attrition due to missing values.
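
    To make the idea of hot-deck imputation concrete, here is a minimal sketch of a simple random hot deck within imputation classes. This is not SUDAAN's weighted sequential procedure, which also uses sampling weights and the order of the file; the function name, the records, and the choice of age group as the imputation class are all assumptions made for illustration.

```python
import random

def hot_deck_impute(records, target, class_var, seed=0):
    """Fill each missing `target` value with a value drawn at random from an
    observed donor in the same `class_var` group (simple random hot deck)."""
    rng = random.Random(seed)
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(rec[class_var], []).append(rec[target])
    completed = []
    for rec in records:
        new_rec = dict(rec)  # leave the original record untouched
        pool = donors.get(new_rec[class_var])
        if new_rec[target] is None and pool:
            new_rec[target] = rng.choice(pool)
        completed.append(new_rec)
    return completed

# Made-up records; None marks a missing answer.
records = [
    {"age_group": "18-39", "mobility": "Without problems"},
    {"age_group": "18-39", "mobility": None},
    {"age_group": "40-59", "mobility": "With problems"},
    {"age_group": "40-59", "mobility": "Unable"},
    {"age_group": "40-59", "mobility": None},
]

completed = hot_deck_impute(records, target="mobility", class_var="age_group")
for rec in completed:
    print(rec)
```

    Because every imputed value is a value that was actually observed in a similar record, the filled-in data stay within the range of the real answers. A class with no observed donors is left missing in this sketch.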

    Good luck,

    -------------------------------------------
    Mansour Fahimi, Ph.D.
    VP, Statistical Research Services
    Marketing Systems Group
    240-477-8277
    -------------------------------------------








  • 6.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 12:27

    Imputation will allow you to use all the data, but I don't think it buys you much because it adds uncertainty. Also, I see a circular argument here: you are doing the analysis to discover the relationships, but in order to do a useful imputation you need to know something about how the variables are related.

    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 7.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 12:21

    If you have a response variable that you want to relate to a set of continuous and categorical variables, then you could use GLM.  But if you are just looking to see whether the set of variables are related, then you want the estimated correlation matrix for the continuous variables, and you can compare one categorical variable to another with an RxC contingency table.  But I think your main concern is the high degree of missing data.  So regardless of the statistical technique you use, the validity of the result depends on the mechanism for the missingness.
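
    For the contingency-table comparison, the sketch below builds an RxC table from two categorical variables, keeping only the rows where both are observed, and computes the Pearson chi-square statistic by hand. The data and the helper name `chi_square_rxc` are made up, and the toy sample is far too small for the chi-square approximation to be trusted; it only shows the mechanics.

```python
from collections import Counter

def chi_square_rxc(a, b):
    """Pearson chi-square statistic for the two-way table of a and b,
    using only rows where both values are present (None = missing)."""
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    n = len(pairs)
    cell = Counter(pairs)
    row = Counter(u for u, _ in pairs)
    col = Counter(v for _, v in pairs)
    stat = 0.0
    for u in row:
        for v in col:
            expected = row[u] * col[v] / n
            stat += (cell[(u, v)] - expected) ** 2 / expected
    df = (len(row) - 1) * (len(col) - 1)
    return stat, df, n

# Made-up data; None marks a missing value.
gender     = ["M", "M", "F", "F", "M", "F", None, "M", "F", "F"]
compliance = [1,   1,   0,   0,   1,   None, 1,   1,   0,   1]

stat, df, n_used = chi_square_rxc(gender, compliance)
print(f"chi-square = {stat:.2f} on {df} df, from {n_used} of {len(gender)} rows")
```

    With expected counts this small, an exact test would be the usual choice; the row count printed at the end is the part that matters for the missing-data question.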
    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 8.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 12:48
    When I read the problem I thought two things: (1) Tasneem has asked to identify relationships among multiple variables and (2) this is a multivariate problem.  Therefore, I would recommend principal components analysis (PCA).  PCA decomposes the correlation or covariance matrix into eigenvectors and eigenvalues and makes no assumption about the distribution of the data.  Thus, it doesn't matter whether some are categorical and others continuous.  Variables that load highly on a given eigenvector are related.  Yes, some information can be gained by looking at plots of the data, but two- or three-variable plots are the best he can do, and thus they only examine bivariate or trivariate relationships graphically.  This is not very efficient because it does not take into account all of the information at one time.  Using PCA, Tasneem can examine all variables simultaneously and hopefully reduce his multivariate problem down to a few principal components that contain groups of interrelated variables.

    Once the groups of interrelated variables are identified (i.e. those that load highly on a given PC) he can do further data exploration among the variables by performing analysis on the principal component scores or by looking at relationships among variables within a PC.  However, with PCA, observations with missing values are excluded from the analysis.  I do not know of a multivariate technique that does not exclude observations with missing values.  Because Tasneem is only interested in identifying relationships among multiple variables I don't see why it matters whether the observations are missing at random.  Please explain that to me.  From his description he is not comparing two or more treatments where the data is missing because of some treatment effect or other aspect of the study design.  Finally, I don't see how setting up a GLM will help.  That is a univariate approach to a multivariate problem. 

    Because some variables are categorical and others continuous, this suggests the variables have different scales of measurement and probably different variances.  Therefore, Tasneem should perform PCA on the correlation matrix, not the covariance matrix.
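
    A minimal sketch of PCA on the correlation matrix, assuming NumPy is available, might look like the following. The numbers are made up, the third column is a 0/1 variable treated as numeric purely for illustration, and rows with any missing value are dropped first, as described above.

```python
import numpy as np

# Made-up data: rows are respondents, columns are variables on very
# different scales; np.nan marks a missing value.
data = np.array([
    [1.0, 10.0, 0.0],
    [2.0, 22.0, 1.0],
    [3.0, 29.0, 0.0],
    [4.0, np.nan, 1.0],
    [5.0, 52.0, 1.0],
    [6.0, 58.0, 0.0],
    [7.0, 71.0, 1.0],
])

# Listwise deletion: keep only rows with no missing values.
complete = data[~np.isnan(data).any(axis=1)]

# Eigendecomposition of the correlation matrix (not the covariance matrix),
# so that each variable contributes on a standardized scale.
corr = np.corrcoef(complete, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# eigh returns eigenvalues in ascending order; sort them descending.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()
print("rows used:", complete.shape[0], "of", data.shape[0])
print("proportion of variance per component:", np.round(explained, 3))
print("first eigenvector:", np.round(eigenvectors[:, 0], 3))
```

    The eigenvalues of a correlation matrix sum to the number of variables, so the proportions above are directly comparable across variables regardless of their original units.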


    -------------------------------------------
    John Brejda
    -------------------------------------------








  • 9.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 13:28
    If the missingness is non-ignorable the pattern of missingness could disguise or exaggerate relationships between the variables.
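
    A deliberately artificial sketch of how this can happen: on the full grid below, x and y are exactly uncorrelated, but if a pair is only recorded when the two values are close to each other (so that missingness depends on the values themselves), the observed data show a strong correlation. The grid and the selection rule are assumptions chosen to make the effect obvious, not a model of any real study.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Every combination of x and y occurs once, so there is no association.
full = [(x, y) for x in range(10) for y in range(10)]

# Non-ignorable missingness: a pair is observed only if |x - y| <= 2.
observed = [(x, y) for x, y in full if abs(x - y) <= 2]

r_full = pearson(*zip(*full))
r_observed = pearson(*zip(*observed))

print(f"full data:     n = {len(full)}, r = {r_full:.3f}")
print(f"observed data: n = {len(observed)}, r = {r_observed:.3f}")
```

    A different selection rule could just as easily hide a relationship that is really there, which is why the reason for the missingness matters even in a purely exploratory analysis.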

    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 10.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 16:03
    Dear All,

    Thank you for your response(s) to my post. I consider myself really privileged to have access to such an amazingly helpful group.

    To answer the questions of a few members, I just wanted to provide some further details and background about what I am doing. I have a dataset to analyse from a study done by a research group.

    The purpose of their study is to evaluate the independent predictive validity of patient-reported chronic disease control (for a specific chronic disease that the patients are suffering from) while adjusting for other potential explanatory variables such as patient characteristics.

    The response is binary. In the sample of 530 we have 39 zeros and 491 ones.

    There are 8 continuous predictor variables. They are scores obtained by converting the responses from a questionnaire in which patients were asked to rate their response to a certain condition related to the chronic disease under study; for instance, to rate their health state on a scale of 1-10 or 1-100.

    Then there are some categorical variables, which are based on counting the individuals falling into one of three categories that measure their health state. They are as follows:

    Variable            Categories
    Mobility            With problems, without problems, unable
    Self Care           With problems, without problems, unable
    Usual Activities    With problems, without problems, unable
    Pain                None, moderate, extreme
    Anxiety             None, moderate, extreme

    The other class variables are as follows:

    Variable            Categories
    Group Continuity    0, <50, 50-99, 100
    Group Compliance    0, 1
    Gender              M, F
    Age Group           18-39, 40-59, >=60
    Income Group        Low, Middle, High



    I initially fitted a logistic regression model with all these variables and tried to use a stepwise variable selection procedure. The probability modeled was response = 1.

    The model ended up throwing out every variable except group continuity. It also warned about quasi-complete separation of data points and deleted 283 of the 530 observations due to missing values.
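
    One common cause of a quasi-complete separation warning is a level of a categorical predictor in which every observed response takes the same value, which is easy to run into when one outcome is rare. The sketch below checks for that in a single predictor; the data and the helper name are made up, and separation can also arise from combinations of predictors that a one-variable check will not catch.

```python
from collections import Counter

def levels_with_single_outcome(predictor, response):
    """Return the predictor levels in which only one response value occurs,
    ignoring rows where either value is missing (None)."""
    counts = {}
    for level, y in zip(predictor, response):
        if level is None or y is None:
            continue
        counts.setdefault(level, Counter())[y] += 1
    return {level: dict(c) for level, c in counts.items() if len(c) == 1}

# Made-up data using the group-continuity categories from the post.
continuity = ["0", "<50", "50-99", "100", "100", "<50", "50-99", "100", "0", None]
response   = [0,   1,     1,       1,     1,     0,     1,       1,     1,   1]

flagged = levels_with_single_outcome(continuity, response)
for level, outcome_counts in flagged.items():
    print(f"level {level!r} has only one outcome: {outcome_counts}")
```

    Running a check like this for each categorical predictor, on the rows that survive the missing-value deletion, can help locate where the warning comes from.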

    Three things that I noticed are:

    1) Missing data

    2) The relatively small frequency of response in zero category (only 39 out of 530 zeros) as compared to ones.

    3) Multicollinearity, or some sort of relationship (linear or otherwise), among the predictor variables. I believe that if I can figure that out, it will also help me make the model more parsimonious.


    Therefore, I decided to take a closer look at the relationships among these variables before modelling. Missing values are an issue, and multiple imputation using some specific method is perhaps an alternative. That is what I initially thought of, but as some of you pointed out, if I don't know whether the values are MAR or MCAR then multiple imputation will produce unwanted patterns, and when I am exploring patterns of association among the predictor variables this might be just like beating around the bush.

    I suspect multicollinearity, but before going ahead with any model-fitting technique, as suggested by some of you, I think my best bet is to do some exploratory analysis to get a better understanding of the relationships (linear or otherwise) between these variables.

    I hope this background answers some of the questions asked by a few members of the group.

    Again, thank you very much for all your suggestions, and I look forward to more of them.

    Best Regards,
    Tasneem Zaihra













    -------------------------------------------
    Tasneem Zaihra
    Assistant Professor
    Concordia University
    Montreal, QC, Canada
    -------------------------------------------








  • 11.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 16:56
    Please give us more details about the missing data.  Why do you think it is missing?

    What is the meaning of "group continuity"?

    Which variables have the extreme amounts of missing data?
    Are there cases that have mostly missing data?

    Do you have access to the respondents to fill in the missing data?

    Your categorical variables could be considered ordinal. Reorder to unable, with problems, without problems.

    To do preliminary checks, pretend that the ordinal variables are not too discrepant from interval level and do ordinary Pearson correlations among all of the continuous, ordinal, and dichotomous variables. Do this 2 ways. Once with pairwise deletion.  That means that each coefficient is based on all of the cases with valid values on the 2 variables.  Then run the correlations again with listwise deletion. That means the correlations are based only on cases that have valid values on all of the variables.

    Do the coefficients look wildly different?

    For each variable create a dichotomy of valid vs missing. Do another set of correlations on those variables with themselves.  Is the missingness "correlated"?

    Do correlations with pairwise deletion between each dichotomous variable and the original variables.

    What can you say about the missingness now?
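
    The checks listed above can be sketched in plain Python as follows. The data are made up, the ordinal codes (1 = unable, 2 = with problems, 3 = without problems; 1 = none, 2 = moderate, 3 = extreme) are an assumed coding, and the helper names are hypothetical.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_r(a, b):
    """Pearson r over the rows where both a and b are present."""
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys), len(pairs)

# Made-up data; None marks a missing value.
data = {
    "score":    [3.0, 5.0, None, 8.0, 6.0, None, 9.0, 4.0],
    "mobility": [1, 2, 2, 3, None, 1, 3, None],
    "pain":     [3, 2, None, 1, 2, None, 1, 3],
}

# Listwise deletion: keep only rows that are complete on every variable.
n_rows = len(data["score"])
complete_rows = [i for i in range(n_rows)
                 if all(col[i] is not None for col in data.values())]
listwise = {name: [col[i] for i in complete_rows] for name, col in data.items()}

r_pairwise, n_pairwise = pairwise_r(data["score"], data["pain"])
r_listwise, n_listwise = pairwise_r(listwise["score"], listwise["pain"])

# Dichotomy of valid (0) vs missing (1) for each variable.
missing = {name: [1 if v is None else 0 for v in col] for name, col in data.items()}
r_miss_score_pain = pearson(missing["score"], missing["pain"])
r_miss_score_mobility = pearson(missing["score"], missing["mobility"])

# Missingness indicator against an original variable, with pairwise deletion.
r_missmob_score, n_missmob_score = pairwise_r(missing["mobility"], data["score"])

print(f"score vs pain, pairwise: r = {r_pairwise:.3f} (n = {n_pairwise})")
print(f"score vs pain, listwise: r = {r_listwise:.3f} (n = {n_listwise})")
print(f"missing(score) vs missing(pain):     r = {r_miss_score_pain:.3f}")
print(f"missing(score) vs missing(mobility): r = {r_miss_score_mobility:.3f}")
print(f"missing(mobility) vs score:          r = {r_missmob_score:.3f}")
```

    In this toy data the score and pain items are always missing together, which is the kind of pattern the indicator correlations are meant to expose.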

    What to do after that will depend on the previous efforts.
    --You may want to drop some variables and/or some cases.
    --You might take a look at CATPCA (Categorical Principal Components) and/or CATREG (Categorical Regression). You can then see results for models with different assumptions of ordinal vs interval level of measurement, and of treating missingness explicitly in models.
    --You may want to compare results with listwise deletion, pairwise deletion, and imputed values.

    hth




    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------








  • 12.  RE:Testing for correlation in continuous and categorical variables with missing values

    Posted 10-24-2011 12:19
    Tasneem,
    Something that's even more noteworthy than the missing predictor problem is the fact that you are fitting a multivariate logistic-regression model, and your 530 binary responses break down as 39 (7.4%) zeroes and 491 (92.6%) ones.  Based on the fact that you have only 39 zeroes, I'm going to suggest the following: even if none of your predictors had missing values, you would nonetheless have an overfitting problem if you try to incorporate all your predictors (I count 18) into the same multivariate logistic regression.  The overfitting problem will of course be exacerbated when 283 (53.4%) of the observations are deleted due to a covariate with a missing value.  One remedy to the overfitting might be to include no more than five predictors at a time into your model.  (A straightforward modification for rare events of Rule 6.6 of Van Belle's Statistical Rules Of Thumb suggests that one could get away with a maximum of eight predictors, but I think that is pushing one's luck given the missingness.)  Putting a restriction like this on the maximum number of predictors in your model should have the side effect of reducing the average percent deleted because of missingness.   
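
    The arithmetic behind these figures can be laid out as a short sketch. The counts come from the thread; the parameter count assumes ordinary dummy coding with one reference level per categorical variable, which is an assumption about how the model was set up, and the thread does not say how many of the 39 zeros remain among the complete cases.

```python
n_total = 530
n_zeros = 39        # the rarer outcome, which limits how many predictors are supportable
n_deleted = 283     # observations dropped for missing covariates

pct_zeros = 100 * n_zeros / n_total
pct_deleted = 100 * n_deleted / n_total
n_complete = n_total - n_deleted

# 8 continuous scores, plus categorical variables with these numbers of levels:
# five 3-level health-state items, then continuity (4), compliance (2),
# gender (2), age group (3), income group (3).
n_continuous = 8
levels = [3, 3, 3, 3, 3, 4, 2, 2, 3, 3]
n_predictors = n_continuous + len(levels)
n_parameters = n_continuous + sum(k - 1 for k in levels)  # assumed dummy coding

print(f"zeros: {pct_zeros:.1f}% of {n_total}")
print(f"deleted for missingness: {pct_deleted:.1f}%, leaving {n_complete} complete cases")
print(f"{n_predictors} predictors, {n_parameters} parameters under dummy coding")
for k in (n_predictors, 8, 5):
    print(f"{k:2d} predictors -> {n_zeros / k:.1f} zeros per predictor")
```

    Even before any deletion, spreading 39 zeros over all 18 predictors leaves barely two per predictor, which is the overfitting concern in numbers.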

    -------------------------------------------
    Eric Siegel
    Biostatistician
    Univ of Arkansas for Medical Sciences
    -------------------------------------------