Please give us more details about the missing data.
Why do you think it is missing?
What is the meaning of "group continuity"?
Which variables have the extreme amounts of missing data?
Are there cases that have mostly missing data?
Do you have access to the respondents to fill in the missing data?
Your categorical variables could be considered ordinal. Reorder to unable, with problems, without problems.
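A minimal sketch of that reordering in Python with pandas, assuming the answers are stored as text labels. The column name `mobility` and the sample values are made up for illustration; only the three category labels come from the original post.

```python
import pandas as pd

# Hypothetical column; the labels follow the categories listed in the original post.
mobility = pd.Series(
    ["without problems", "with problems", None, "unable", "with problems"],
    name="mobility",
)

# Order the levels from worst to best: unable < with problems < without problems.
levels = ["unable", "with problems", "without problems"]
mobility_ord = pd.Categorical(mobility, categories=levels, ordered=True)

# Integer codes 0, 1, 2 follow the level order. pandas codes a missing value as -1,
# so turn that back into NaN before the codes are used in any correlation.
codes = pd.Series(mobility_ord.codes, name="mobility_code").astype(float)
mobility_code = codes.where(codes >= 0)

print(mobility_code.tolist())
```

Treating the codes 0, 1, 2 as roughly interval level is the same working assumption as in the preliminary checks that follow.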
To do preliminary checks, pretend that the ordinal variables are not too discrepant from interval level and run ordinary Pearson correlations among all of the continuous, ordinal, and dichotomous variables. Do this 2 ways. First with pairwise deletion: each coefficient is based on all of the cases with valid values on the 2 variables involved. Then run the correlations again with listwise deletion: the correlations are based only on cases that have valid values on all of the variables.
Do the coefficients look wildly different?
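The two runs can be sketched in Python with pandas as follows. This is only an illustration on a tiny made-up data set; the column names are placeholders, not the study's variables. `DataFrame.corr()` excludes missing values pair by pair, which gives the pairwise version, and calling `dropna()` first gives the listwise version.

```python
import numpy as np
import pandas as pd

# Small made-up data set; column names are placeholders.
df = pd.DataFrame({
    "score_a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "score_b": [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    "pain":    [0.0, 1.0, 1.0, np.nan, 2.0, 2.0],
})

# Pairwise deletion: each coefficient uses every row valid on that pair of variables.
r_pairwise = df.corr(method="pearson")

# Listwise deletion: keep only rows that are valid on all variables.
complete = df.dropna()
r_listwise = complete.corr(method="pearson")

# Number of cases behind each pairwise coefficient.
valid = df.notna().astype(int)
n_pairwise = valid.T @ valid

# Absolute differences show which coefficients depend on the deletion rule.
diff = (r_pairwise - r_listwise).abs()

print("cases per pair:\n", n_pairwise)
print("listwise n:", len(complete))
print("absolute differences:\n", diff.round(2))
```

Looking at the case counts next to the differences helps: a coefficient that moves a lot while resting on few cases is less alarming than one that moves a lot on many cases.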
For each variable create a dichotomy of valid vs missing. Do another set of correlations of those dichotomous variables with themselves. Is the missingness "correlated"?
Do correlations with pairwise deletion between each dichotomous variable and the original variables.
What can you say about the missingness now?
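A sketch of the missingness checks in Python with pandas, again on made-up data with placeholder column names. The `_miss` suffix is just a naming choice for this example.

```python
import numpy as np
import pandas as pd

# Made-up data; "pain" is complete, the two scores have missing values.
df = pd.DataFrame({
    "score_a": [1.0, 2.0, np.nan, 4.0, np.nan, 6.0, 7.0, 8.0],
    "score_b": [2.0, np.nan, np.nan, 5.0, 4.0, 7.0, 6.0, 9.0],
    "pain":    [0.0, 1.0, 1.0, 2.0, 2.0, 0.0, 1.0, 2.0],
})

# One dichotomy per variable: 1 = missing, 0 = valid.
miss = df.isna().astype(int).add_suffix("_miss")

# A dichotomy with no variation (never or always missing) has no defined
# correlation, so drop it.
miss = miss.loc[:, miss.nunique() > 1]

# Correlations of the dichotomies with themselves: is the missingness "correlated"?
r_miss = miss.corr()

# Correlations between each dichotomy and the original variables, with pairwise
# deletion. A variable against its own dichotomy comes out as NaN, because the
# dichotomy is constant wherever the variable is valid.
r_cross = pd.concat([df, miss], axis=1).corr().loc[miss.columns, df.columns]

print(r_miss.round(2))
print(r_cross.round(2))
```

Sizeable coefficients in the second table would suggest that missingness on one variable is related to the observed values of another.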
What to do after that will depend on the previous efforts.
--You may want to drop some variables and/or some cases.
--You might take a look at CATPCA (Categorical Principal Components) and/or CATREG (Categorical Regression). You can then see results for models with different assumptions of ordinal vs interval level of measurement, and of treating missingness explicitly in models.
--You may want to compare results with listwise deletion, pairwise deletion, and imputed values.
hth
-------------------------------------------
Arthur Kendall
Social Research Consultants
-------------------------------------------
Original Message:
Sent: 10-18-2011 16:02
From: Tasneem Zaihra
Subject: Testing for correlation in continuous and categorical variables with missing values
Dear All,
Thank you for your response(s) to my post. I consider myself really privileged
to have access to such an amazingly helpful group.
To answer the questions of a few members, I just wanted to provide some further details/background about what I am doing. I have a dataset to analyse from a study done by a research group.
The purpose of their study is to evaluate the independent predictive validity of patient-reported chronic disease control (for a specific chronic disease that the patients are suffering from), while adjusting for other potential explanatory variables such as patient characteristics.
The response is binary. In the sample of 530 we have 39 zeros and 491 ones.
There are 8 continuous predictor variables. They are scores obtained by converting the responses from a questionnaire in which patients were asked to rate their response to a certain condition related to the chronic disease under study, for instance to rate their health state on a scale of 1-10 or 1-100.
Then there are some categorical variables that classify each individual into one of three categories to measure their health state. They are as below:
Variable: Categories
Mobility: With problems, without problems, unable
Self Care: With problems, without problems, unable
Usual Activities: With problems, without problems, unable
Pain: None, Moderate, Extreme
Anxiety: None, Moderate, Extreme
The other class variables are as below:
Group Continuity: 0, <50, 50-99, 100
Group Compliance: 0, 1
Gender: M, F
Age Group: 18-39, 40-59, >=60
Income Group: Low, Middle, High
I initially fitted a logistic regression model with all these variables and tried to use a stepwise variable selection procedure. The probability modeled was response==1.
The model ends up throwing every variable out except group continuity. It also warned about quasi-complete separation of data points and about the deletion of 283 observations out of 530 due to missing values.
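One simple thing to look at after a quasi-complete separation warning is a cross-tabulation of the response against each class variable: a level in which only one outcome occurs is a common source of that warning. The sketch below, in Python with pandas, uses made-up values; the column names `response` and `continuity` are placeholders, and the counts say nothing about the actual data.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "response":   [1, 1, 1, 1, 0, 1, 0, 1, 1, 1],
    "continuity": ["0", "0", "<50", "<50", "<50", "50-99", "50-99",
                   "100", "100", "100"],
})

# Rows are levels of the class variable, columns are the two outcomes.
table = pd.crosstab(df["continuity"], df["response"])

# Levels with an empty cell, i.e. where only one outcome was observed.
one_sided = table.index[(table == 0).any(axis=1)].tolist()

print(table)
print("levels with only one outcome:", one_sided)
```

With only 39 zeros in the sample, empty cells of this kind are easy to get, more so once rows have been dropped for missing values.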
Three things that I noticed are:
1) Missing data
2) The relatively small frequency of response in zero category (only 39 out of 530 zeros) as compared to ones.
3) Issue of multicollinearity or some sort of relationship (linear or otherwise) amongst predictor variables. I believe if I can figure that out it will help me make the model more parsimonious as well.
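To put rough numbers on points 1) and 2), here is a small arithmetic sketch in Python. The counts are the ones given in this message; the parameter count assumes ordinary dummy coding (k-1 indicators for a k-level class variable), which is an assumption about the model and not something stated above. The variable names are made up.

```python
# Counts taken from this message.
n_total = 530
n_zeros = 39
n_deleted = 283

share_zeros = n_zeros / n_total        # share of the rarer outcome
share_deleted = n_deleted / n_total    # share of cases lost to listwise deletion
n_complete = n_total - n_deleted       # cases left for the full model

# Assumed dummy coding: k-1 parameters per k-level class variable.
n_continuous = 8
factor_levels = {
    "mobility": 3, "self_care": 3, "usual_activities": 3, "pain": 3,
    "anxiety": 3, "continuity": 4, "compliance": 2, "gender": 2,
    "age_group": 3, "income_group": 3,
}
n_params = n_continuous + sum(k - 1 for k in factor_levels.values())

# Zeros per candidate parameter, using all 39 zeros (an upper bound, since
# some zeros may be among the deleted cases).
zeros_per_param = n_zeros / n_params

print(f"zeros: {share_zeros:.1%} of the sample")
print(f"deleted: {share_deleted:.1%}, leaving {n_complete} complete cases")
print(f"{n_params} candidate parameters, {zeros_per_param:.2f} zeros per parameter")
```

Under these assumptions there are fewer than two zeros per candidate parameter, which is far below the commonly quoted rule of thumb of about ten, so unstable stepwise results would not be surprising.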
Therefore, I decided to take a closer look at the relationships among these variables before modelling. Missing values are an issue, and multiple imputation using some specific method is perhaps an alternative. That is what I initially thought of, but as some of you pointed out, if I don't know whether the values are MAR or MCAR then multiple imputation may produce unwanted patterns, and since I am exploring patterns of association amongst the predictor variables, that might just be beating around the bush.
I am suspecting multicollinearity, but before going ahead with any model fitting technique, as suggested by some of you, I think my best bet is to do some exploratory analysis to get a better understanding of the relationships (linear or otherwise) between these variables.
I hope this background will answer some of the questions asked by a few members of the group.
Again, thank you very much for all your suggestions and I look forward to more of them.
Best Regards,
Tasneem Zaihra
-------------------------------------------
Tasneem Zaihra
Assistant Professor
Concordia University
Montreal
QC
Canada
-------------------------------------------