Testing for correlation in continuous and categorical variables with missing values

1. Testing for correlation in continuous and categorical variables with missing values

Tasneem Zaihra Rizvi
Posted 10-18-2011 11:46
This message has been cross-posted to the following eGroups: Statistical Computing Section and Statistical Consulting Section.
-------------------------------------------

Dear All,

I am trying to identify whether some sort of relationship (linear or otherwise) exists among a group of variables that I have. Some of them are continuous while others are categorical, and the sample size is 530; however, many of them have missing observations (up to 110 missing observations in some of them).

I intend to do a preliminary analysis, preferably something graphical.

When I calculate correlations to test for a linear relationship between continuous variables and test the significance of the correlation coefficient, the software does the calculations by ignoring the observations with missing values. The same is the case when I do a chi-square test for categorical variables.

I would really appreciate your comments and suggestions.

Thanks
Tasneem
-------------------------------------------
[Tasneem] [Zaihra]
[Assistant Professor]
[Concordia University]
[Montreal]
[QC]
[Canada]
-------------------------------------------
2. RE:Testing for correlation in continuous and categorical variables with missing values

Michael Chernick
Posted 10-18-2011 12:00
Since you are dealing with paired data when you are doing correlations, you have to throw out the pair when one variable is missing. If the data are not missing at random, there could be bias in the estimate of the correlation. If you are looking at counts in an RxC table and the data are paired, the missing member of the pair is left out but the other is not. This would be the case whether you are doing the Fisher exact test or the chi-square test. Again, with a large number missing, bias is very possible if the data are not missing at random.
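As a minimal sketch of what throwing out incomplete pairs amounts to, the Python below computes a Pearson correlation from the complete pairs only and reports how many pairs were actually used. The function name `pairwise_pearson`, the use of `None` to mark a missing value, and the toy data are assumptions made for illustration; none of them come from the poster's dataset.

```python
import math


def pairwise_pearson(x, y):
    """Pearson r from pairs where both values are present (None = missing).

    Returns (r, number_of_pairs_used).
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("a variable is constant among the complete pairs")
    return sxy / math.sqrt(sxx * syy), n


# Hypothetical data: two of the six pairs have a missing member.
x = [1.0, 2.0, None, 4.0, 5.0, 6.0]
y = [2.0, None, 3.0, 8.0, 10.0, 12.0]

r, n_used = pairwise_pearson(x, y)
print(f"r = {r:.3f} from {n_used} of {len(x)} pairs")
```

Reporting the number of pairs next to each coefficient makes it visible when two coefficients in the same matrix rest on different subsets of the cases.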

-------------------------------------------
Michael Chernick
Director of Biostatistical Services
Lankenau Institute for Medical Research
-------------------------------------------
3. RE:Testing for correlation in continuous and categorical variables with missing values

Gabriel Farkas
Posted 10-18-2011 12:02
It sounds like you just need to set up a basic GLM, while making sure you treat the continuous variables as continuous, and the categorical variables as categorical.

If you want to know if there's a difference in the values of the continuous variable between different levels of a categorical, I think you'd probably want something like the Least Squares means of the categorical variable.

-------------------------------------------
Gabriel Farkas
-------------------------------------------
4. RE:Testing for correlation in continuous and categorical variables with missing values

Patrick Spagon
Posted 10-18-2011 12:12
In addition to the comments you've already received ....

It appears you are only checking for linear correlation. You could have a nonlinear relationship and no linear correlation. That is why the model building suggestion is important.
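A small worked example may make this concrete. The sketch below uses made-up data in which y is an exact function of x (y = x squared), yet the Pearson correlation is zero because x is symmetric about zero. The helper `pearson` and the data are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation for two equal-length lists without missing values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# y is completely determined by x, but the relationship is not linear.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson(x, y)
print(f"Pearson r between x and x^2: {r:.3f}")
```

A scatter plot of these seven points shows the parabola immediately, which is the argument for plotting before testing.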

Also, you can learn a great deal by applying graphical analysis first.

Finally, as Michael suggests, the missing values may not be missing at random. It is important to understand why.

-------------------------------------------
Patrick Spagon
-------------------------------------------

5. RE:Testing for correlation in continuous and categorical variables with missing values

Mansour Fahimi
Posted 10-18-2011 12:18
Tasneem,
Unless you do something about the missing data you will have analytical issues no matter how you go about this. Moreover, your problem will only multiply should you venture towards multivariate techniques, because most procedures will kick out an entire record even when only one of the variables has a missing value. You may consider some form of hot-deck imputation, such as the weighted sequential procedure of SUDAAN. This way, you can maintain the correlational structure of the observed values while avoiding database attrition due to missing values.
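To illustrate the general idea of a hot deck, here is a rough sketch of an unweighted random hot deck within imputation classes. It is not SUDAAN's weighted sequential procedure; it is a much simpler stand-in, and the function name `hot_deck_impute`, the choice of age group as the imputation class, and the data are all hypothetical.

```python
import random


def hot_deck_impute(values, classes, seed=0):
    """Fill each missing value (None) with an observed value drawn at random
    from a donor in the same class. Observed values are left untouched."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    donors = {}
    for value, cls in zip(values, classes):
        if value is not None:
            donors.setdefault(cls, []).append(value)
    imputed = []
    for value, cls in zip(values, classes):
        if value is None:
            pool = donors.get(cls)
            if not pool:
                raise ValueError(f"no donors available in class {cls!r}")
            imputed.append(rng.choice(pool))
        else:
            imputed.append(value)
    return imputed


# Hypothetical scores with two missing values, classed by age group.
scores = [7.0, None, 5.0, 6.0, None, 9.0]
age_group = ["18-39", "18-39", "40-59", "40-59", "40-59", ">=60"]

completed = hot_deck_impute(scores, age_group)
print(completed)
```

Because donors come from the same class as the recipient, the filled-in values respect the observed distribution within that class, which is the sense in which a hot deck tries to preserve structure.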

Good luck,

-------------------------------------------
Mansour Fahimi, Ph.D.
VP, Statistical Research Services
Marketing Systems Group
240-477-8277
-------------------------------------------

6. RE:Testing for correlation in continuous and categorical variables with missing values

Michael Chernick
Posted 10-18-2011 12:27
Imputation will allow you to use all the data but I don't think it buys you much because it adds uncertainty. Also I see a circular argument here. You are doing the analysis to discover the relationship but in order to do a useful imputation you need to know something about how the variables are related.

-------------------------------------------
Michael Chernick
Director of Biostatistical Services
Lankenau Institute for Medical Research
-------------------------------------------

7. RE:Testing for correlation in continuous and categorical variables with missing values

Michael Chernick
Posted 10-18-2011 12:21
If you have a response variable that you want to relate to a set of continuous and categorical variables, then you could use GLM. But if you are just looking to see whether the set of variables are related, then you want the estimated correlation matrix for the continuous variables, and you can compare one categorical variable to another with an RxC contingency table. But I think your main concern is regarding the high degree of missing data. So regardless of the statistical technique you use, the validity of the result depends on the mechanism for the missingness.
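For the categorical side, a hedged sketch of the RxC comparison follows. It builds the table from complete pairs only (an assumption about how missing values are handled, matching what most software does by default) and computes the Pearson chi-square statistic and its degrees of freedom by hand. The function name and the Pain/Anxiety values are invented for illustration; a p-value would additionally need the chi-square distribution, which the standard library does not provide.

```python
from collections import Counter


def chi_square_complete_pairs(a, b):
    """Pearson chi-square for two categorical variables, using only the
    pairs in which neither value is missing (None).

    Returns (statistic, degrees_of_freedom, pairs_used).
    """
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    n = len(pairs)
    row_totals = Counter(u for u, _ in pairs)
    col_totals = Counter(v for _, v in pairs)
    observed = Counter(pairs)
    statistic = 0.0
    for row in row_totals:
        for col in col_totals:
            expected = row_totals[row] * col_totals[col] / n
            statistic += (observed[(row, col)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, n


# Hypothetical responses; None marks a missing answer.
pain = ["None", "None", "Moderate", "Moderate", "Extreme", None, "None", "Moderate"]
anxiety = ["None", "None", "Moderate", None, "Extreme", "Moderate", "Moderate", "Moderate"]

stat, dof, n_used = chi_square_complete_pairs(pain, anxiety)
print(f"chi-square = {stat:.2f} on {dof} df, using {n_used} of {len(pain)} pairs")
```

With counts this small the chi-square approximation is unreliable; the toy table only shows the mechanics of building the table from complete pairs.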
-------------------------------------------
Michael Chernick
Director of Biostatistical Services
Lankenau Institute for Medical Research
-------------------------------------------

8. RE:Testing for correlation in continuous and categorical variables with missing values

John Brejda
Posted 10-18-2011 12:48
When I read the problem I thought two things: (1) Tasneem has asked to identify relationships among multiple variables and (2) this is a multivariate problem. Therefore, I would recommend principal components analysis (PCA). PCA decomposes the correlation or covariance matrix into eigenvectors and eigenvalues and makes no assumption about the distribution of the data. Thus, it doesn't matter whether some are categorical and others continuous. Variables that load highly on a given eigenvector are related. Yes, much information can be gained by looking at plots of the data, but two- or three-variable plots are the best he can do, and thus they only examine bivariate or trivariate relationships graphically. This is not very efficient because it does not take into account all of the information at one time. Using PCA, Tasneem can examine all variables simultaneously and hopefully reduce his multivariate problem down to a few principal components that contain groups of interrelated variables.

Once the groups of interrelated variables are identified (i.e. those that load highly on a given PC) he can do further data exploration among the variables by performing analysis on the principal component scores or by looking at relationships among variables within a PC. However, with PCA, observations with missing values are excluded from the analysis. I do not know of a multivariate technique that does not exclude observations with missing values. Because Tasneem is only interested in identifying relationships among multiple variables I don't see why it matters whether the observations are missing at random. Please explain that to me. From his description he is not comparing two or more treatments where the data is missing because of some treatment effect or other aspect of the study design. Finally, I don't see how setting up a GLM will help. That is a univariate approach to a multivariate problem.

Because some variables are categorical and others continuous, this suggests the variables have different scales of measurement and probably different variances. Therefore, Tasneem should perform PCA on the correlation matrix, not the covariance matrix.
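A rough sketch of PCA on the correlation matrix, using only the standard library, might look like the following. It assumes the input has already been reduced to complete cases (as noted above, observations with missing values are excluded) and extracts only the first principal component by power iteration. The function names and the three toy variables are assumptions for illustration.

```python
import math


def correlation_matrix(columns):
    """Correlation matrix for a list of equal-length, complete columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    p = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
            for j in range(p)
        ]
        for i in range(p)
    ]


def first_principal_component(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    p = len(matrix)
    vector = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v * w for v, w in zip(vector, product))
    return eigenvalue, vector


# Hypothetical complete-case data: the first two variables move together.
score_a = [1, 2, 3, 4, 5, 6]
score_b = [2, 1, 4, 3, 6, 5]
score_c = [3, 1, 2, 3, 1, 2]

R = correlation_matrix([score_a, score_b, score_c])
eigenvalue, loadings = first_principal_component(R)
print(f"first eigenvalue = {eigenvalue:.3f}")
print("eigenvector:", [round(v, 3) for v in loadings])
```

Working from the correlation matrix means each variable is standardized first, so a score on a 1-100 scale does not dominate a score on a 1-10 scale simply because of its units.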

-------------------------------------------
John Brejda
-------------------------------------------
9. RE:Testing for correlation in continuous and categorical variables with missing values

Michael Chernick
Posted 10-18-2011 13:28
If the missingness is non-ignorable the pattern of missingness could disguise or exaggerate relationships between the variables.
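A small simulation can show one direction of this effect. In the sketch below, two variables are generated with a true correlation of 0.7, and then y is made missing whenever its own value is above zero, so the missingness depends on the unobserved value. Everything here (the seed, the sample size, the cutoff, the correlation) is an assumption chosen for illustration, and it shows only the "disguise" case, where the observed correlation is weaker than the true one.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2011)  # arbitrary fixed seed for reproducibility
rho = 0.7                  # assumed true correlation
n = 5000

x = [rng.gauss(0, 1) for _ in range(n)]
y = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x]

r_full = pearson(x, y)

# Non-ignorable missingness: y is unobserved whenever y itself exceeds 0.
observed = [(a, b) for a, b in zip(x, y) if b <= 0]
x_obs, y_obs = zip(*observed)
r_observed = pearson(x_obs, y_obs)

print(f"all {n} cases:        r = {r_full:.2f}")
print(f"{len(observed)} observed cases: r = {r_observed:.2f}")
```

Other missingness mechanisms can push the estimate in the opposite direction, which is why the reason for the missingness matters more than its amount.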

-------------------------------------------
Michael Chernick
Director of Biostatistical Services
Lankenau Institute for Medical Research
-------------------------------------------

10. RE:Testing for correlation in continuous and categorical variables with missing values

Tasneem Zaihra Rizvi
Posted 10-18-2011 16:03
Dear All,

Thank you for your response(s) to my post. I consider myself really privileged to have access to such an amazingly helpful group.

To answer questions from a few members, I just wanted to provide some further details/background about what I am doing. I have a dataset to analyse from a study done by a research group.

The purpose of their study is to evaluate the independent predictive validity of patient-reported chronic disease control (for a specific chronic disease that the patients are suffering from) while adjusting for other potential explanatory variables such as patient characteristics.

The response is binary. In the sample of 530 we have 39 zeros and 491 ones.

There are 8 continuous predictor variables. They are scores obtained by converting the responses from a questionnaire in which patients were asked to rate their response to a certain condition related to the chronic disease under study; for instance, to rate their health state on a scale of 1-10 or 1-100.

Then there are some categorical variables, each of which places an individual in one of three categories to measure their health state. They are as below:

Variable: Categories
Mobility: With Problems, without problems, unable
Self Care: With Problems, without problems, unable
Usual Activities: With Problems, without problems, unable
Pain: None, Moderate, Extreme
Anxiety: None, Moderate, Extreme

The other class variables are as below:

Group Continuity: 0, <50, 50-99, 100
Group Compliance: 0, 1
Gender: M, F
Age Group: 18-39, 40-59, >=60
Income Group: Low, Middle, High

I initially fitted a logistic regression model with all these variables and tried a stepwise variable selection procedure. The probability modeled was response==1.

The model ends up throwing every variable out except group continuity. Also, it warned about quasi-complete separation of data points and the deletion of 283 observations out of 530 due to missing values.

Three things that I noticed are:

1) Missing data

2) The relatively small frequency of response in zero category (only 39 out of 530 zeros) as compared to ones.

3) The issue of multicollinearity or some sort of relationship (linear or otherwise) amongst the predictor variables. I believe if I can figure that out it will help me make the model more parsimonious as well.

Therefore, I decided to have a closer look at the relationship between each of these variables before modelling. Missing values are an issue, and perhaps multiple imputation using some specific method is an alternative. That is what I initially thought of, but as pointed out by some of you, if I don't know whether the values are MAR or MCAR then multiple imputation may produce unwanted patterns, and when I am exploring patterns of association amongst the predictor variables this might be just like beating around the bush.

I am suspecting multicollinearity, but before going ahead with any model fitting technique, as suggested by some of you, I think my best bet is to do some exploratory analysis to get a better understanding of the relationship (linear or otherwise) between these variables.

I hope this background will answer some questions asked by a few members of the group.

Again, thank you very much for all your suggestions and I look forward to more of them.

Best Regards,
Tasneem Zaihra

-------------------------------------------
[Tasneem] [Zaihra]
[Assistant Professor]
[Concordia University]
[Montreal]
[QC]
[Canada]
-------------------------------------------
11. RE:Testing for correlation in continuous and categorical variables with missing values

Arthur Kendall
Posted 10-18-2011 16:56
Please give us more details about the missing data. Why do you think it is missing?

What is the meaning of "group continuity"?

Which variables have the extreme amounts of missing data?
Are there cases that have mostly missing data?

Do you have access to the respondents to fill in the missing data?

Your categorical variables could be considered ordinal. Reorder to unable, with problems, without problems.

To do preliminary checks, pretend that the ordinal variables are not too discrepant from interval level and do ordinary Pearson correlations among all of the continuous, ordinal, and dichotomous variables. Do this 2 ways. Once with pairwise deletion. That means that each coefficient is based on all of the cases with valid values on the 2 variables. Then run the correlations again with listwise deletion. That means the correlations are based only on cases that have valid values on all of the variables.

Do the coefficients look wildly different?

For each variable, create a dichotomy of valid vs missing. Do another set of correlations of those dichotomous variables with themselves. Is the missingness "correlated"?

Do correlations with pairwise deletion between each dichotomous variable and the original variables.

What can you say about the missingness now?
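The checks described above could be sketched roughly as follows. The variable names, the numeric coding of the ordinal Pain variable (0, 1, 2), and the data are hypothetical; `None` marks a missing value. The code compares pairwise and listwise correlations, then correlates the valid-vs-missing dichotomies with each other and with an original variable.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def pairwise(data, a, b):
    """Correlation of a and b using every case valid on those two variables."""
    keep = [i for i in range(len(data[a]))
            if data[a][i] is not None and data[b][i] is not None]
    return pearson([data[a][i] for i in keep], [data[b][i] for i in keep]), len(keep)


# Hypothetical data; pain is an ordinal variable coded 0, 1, 2.
data = {
    "score1": [1, 2, 3, 4, 5, 6, 7, 8],
    "score2": [2, None, 4, 3, None, 7, 6, 9],
    "pain": [0, 1, None, 1, 2, None, 2, 2],
}

# Listwise deletion: keep only cases valid on every variable.
n_rows = len(data["score1"])
complete_rows = [i for i in range(n_rows)
                 if all(col[i] is not None for col in data.values())]
listwise = {name: [col[i] for i in complete_rows] for name, col in data.items()}

for a, b in combinations(data, 2):
    r_pair, n_pair = pairwise(data, a, b)
    r_list = pearson(listwise[a], listwise[b])
    print(f"{a} vs {b}: pairwise r={r_pair:.2f} (n={n_pair}), "
          f"listwise r={r_list:.2f} (n={len(complete_rows)})")

# Dichotomies of missing (1) vs valid (0), correlated with each other
# and with an original variable.
missing = {name: [1 if v is None else 0 for v in col] for name, col in data.items()}
r_missing = pearson(missing["score2"], missing["pain"])
r_missing_vs_score1 = pearson(missing["score2"], data["score1"])
print(f"missing(score2) vs missing(pain): r={r_missing:.2f}")
print(f"missing(score2) vs score1: r={r_missing_vs_score1:.2f}")
```

Large gaps between the pairwise and listwise columns, or sizeable correlations involving the missingness dichotomies, are the warning signs this procedure is meant to surface.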

What to do after that will depend on the previous efforts.
--You may want to drop some variables and/or some cases.
--You might take a look at CATPCA (Categorical Principal Components) and/or CATREG (Categorical Regression). You can then see results for models with different assumptions of ordinal vs interval level of measurement, and of treating missingness explicitly in models.
--You may want to compare results with listwise deletion, pairwise deletion, and imputed values.

hth

-------------------------------------------
Arthur Kendall
Social Research Consultants
-------------------------------------------

12. RE:Testing for correlation in continuous and categorical variables with missing values

Eric Siegel
Posted 10-24-2011 12:19
Tasneem,
Something that's even more noteworthy than the missing predictor problem is the fact that you are fitting a multivariate logistic-regression model, and your 530 binary responses break down as 39 (7.4%) zeroes and 491 (92.6%) ones. Based on the fact that you have only 39 zeroes, I'm going to suggest the following: even if none of your predictors had missing values, you would nonetheless have an overfitting problem if you try to incorporate all your predictors (I count 18) into the same multivariate logistic regression. The overfitting problem will of course be exacerbated when 283 (53.4%) of the observations are deleted due to a covariate with a missing value. One remedy to the overfitting might be to include no more than five predictors at a time into your model. (A straightforward modification for rare events of Rule 6.6 of Van Belle's Statistical Rules Of Thumb suggests that one could get away with a maximum of eight predictors, but I think that is pushing one's luck given the missingness.) Putting a restriction like this on the maximum number of predictors in your model should have the side effect of reducing the average percent deleted because of missingness.
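The arithmetic behind these figures can be laid out in a few lines. The counts are taken from the thread; the "events per predictor" ratio is a commonly used heuristic for logistic regression, and treating the 39 zeros (the rarer outcome) as the events is an assumption of this sketch rather than a statement of the rule cited above.

```python
n_total = 530      # sample size reported in the thread
n_zeros = 39       # rarer outcome, treated here as the "events"
n_deleted = 283    # observations dropped for missing covariates
n_predictors = 18  # predictor count given in the post above


def events_per_predictor(events, predictors):
    """Ratio of rare-outcome cases to candidate predictors."""
    return events / predictors


pct_zeros = 100 * n_zeros / n_total
pct_deleted = 100 * n_deleted / n_total
print(f"zeros: {pct_zeros:.1f}% of responses; deleted: {pct_deleted:.1f}% of cases")

# Compare the full model with the smaller models suggested in the post.
for k in (n_predictors, 8, 5):
    ratio = events_per_predictor(n_zeros, k)
    print(f"{k:2d} predictors -> {ratio:.1f} zeros per predictor")
```

These ratios use all 39 zeros; after listwise deletion fewer may remain, which would lower every ratio further.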

-------------------------------------------
Eric Siegel
Biostatistician
Univ of Arkansas for Medical Sciences
-------------------------------------------
