
  • 1.  Imputing missing data

    Posted 05-02-2013 16:21
    This message has been cross-posted to the following eGroups: Statistical Computing Section and Statistical Consulting Section.
    -------------------------------------------
    Dear All,

    I have longitudinal data collected on patients at "t" time points, and I am trying to find clusters of individuals based on a few variables (time-variant as well as time-invariant). Several values are missing and I would like to impute them, in order to see whether a patient's cluster designation remains the same over time.

    Can anyone suggest a package in R or a PROC in SAS that can do this?

    PS: In my data there is no distinction between response and predictor variables, so any sort of regression-based imputation wouldn't work. I simply want to put patients into similar groups using clustering.

    Thanks
    Tasneem

    -------------------------------------------
    Tasneem Zaihra
    Post Doctoral Fellow
    McGill University
    -------------------------------------------


  • 2.  RE: Imputing missing data

    Posted 05-03-2013 01:32
    Hello Tasneem,

    You are probably going to have to clarify things for us, but I am guessing that each person will have "t" observations, each  consisting of realized values on, say, k variables. So, if there are n people, you will have nt observations. 

    Then you are going to form clusters, and I am guessing that you might do that for these nt observations. (I am assuming no missing values for the moment.) Then, and again this is my wild guess, you might calculate some percentages, such as the percentage of people who have all t of their observations in the same cluster, the percentage whose observations fall in 2 clusters, the percentage whose observations fall in 3 clusters, etc.
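
    To make that tabulation concrete, here is a minimal Python sketch. It assumes the clustering has already been run and that each of the nt observations carries a label; the function name `cluster_spread`, the person IDs, and the labels are all made up for illustration.

```python
from collections import Counter, defaultdict


def cluster_spread(assignments):
    """Map 'number of distinct clusters a person visits' to the
    fraction of people with that count.

    assignments: iterable of (person_id, time_point, cluster_label).
    """
    clusters_by_person = defaultdict(set)
    for person_id, _time_point, cluster_label in assignments:
        clusters_by_person[person_id].add(cluster_label)
    counts = Counter(len(labels) for labels in clusters_by_person.values())
    n_people = len(clusters_by_person)
    return {k: counts[k] / n_people for k in sorted(counts)}


# Hypothetical labels for n = 4 people observed at t = 3 time points.
example = [
    ("p1", 1, "A"), ("p1", 2, "A"), ("p1", 3, "A"),
    ("p2", 1, "A"), ("p2", 2, "B"), ("p2", 3, "B"),
    ("p3", 1, "C"), ("p3", 2, "C"), ("p3", 3, "C"),
    ("p4", 1, "A"), ("p4", 2, "B"), ("p4", 3, "C"),
]
spread = cluster_spread(example)
print(spread)  # fraction of people seen in 1, 2, 3 distinct clusters
```

    The entry for 1 is the share of people whose cluster designation never changes, which is the quantity the original question seems to be after.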

    Now about the missing values. You can use an imputation method (I like multiple imputation). You say that we can't use a regression-based method for imputation because there are, essentially, no dependent or independent variables; there are just variables. That is not a barrier. Imputation by a regression method is still fine, because the imputation process just fills in blanks. Once the imputation is completed, you can forget any dependent/independent designations used along the way and carry on with your clustering on the now-complete dataset. During imputation, each variable with missing values is temporarily designated as the dependent variable and regressed on the other variables, and the fitted model is then used to impute the missing values of that temporary dependent variable.
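
    A toy Python sketch of the "temporary dependent variable" idea, assuming just two variables and a straight-line fit on the complete cases. The helper names `fit_line` and `impute_pair` and the data are invented for illustration. This is a single deterministic fill; a real multiple imputation would add random draws to the predictions and produce several completed datasets.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_pair(rows):
    """rows: list of [u, v], with None marking a missing value.

    Each variable takes a turn as the temporary dependent variable.
    Rows with both values missing are left unchanged.
    """
    complete = [r for r in rows if r[0] is not None and r[1] is not None]
    us = [r[0] for r in complete]
    vs = [r[1] for r in complete]
    a_v, b_v = fit_line(us, vs)  # v regressed on u
    a_u, b_u = fit_line(vs, us)  # u regressed on v
    filled = []
    for u, v in rows:
        if u is None and v is not None:
            u = a_u + b_u * v
        elif v is None and u is not None:
            v = a_v + b_v * u
        filled.append([u, v])
    return filled


# Hypothetical data: four complete rows, one missing v, one missing u.
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, None], [None, 3]]
filled = impute_pair(data)
print(filled)
```

    After the fill, the roles of "dependent" and "independent" are discarded and the completed rows go straight into the clustering.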

    The reason that I like the multiple imputation method is that it allows you to estimate standard errors for your final descriptive statistics, such as the percent of people falling in only one cluster. If that value turned out to be 67%, for example, you would probably want to know whether it is 67% +/- 3% or (worse) 67% +/- 15%, or whatever. 
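
    For the record, the pooling step behind that standard error (Rubin's rules) is short enough to sketch in Python. The proportions, the sample size of 200 people, and the function name `pool_rubin` below are hypothetical, and the 1.96 multiplier is a normal approximation used only to keep the example small.

```python
import math


def pool_rubin(estimates, variances):
    """Combine results from m imputed datasets using Rubin's rules.

    estimates: the statistic computed on each imputed dataset.
    variances: its squared standard error within each dataset.
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Hypothetical: share of people staying in one cluster, computed on
# m = 5 imputed datasets of 200 people each.
props = [0.65, 0.68, 0.66, 0.69, 0.67]
n_people = 200
variances = [p * (1 - p) / n_people for p in props]
estimate, std_error = pool_rubin(props, variances)
print(f"{estimate:.3f} +/- {1.96 * std_error:.3f}")
```

    The between-imputation term is what a single imputation cannot give you: it measures how much the answer moves when the blanks are filled in differently.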

    This is all pretty hand-wavy, and it is also a complete guess as to what you are really doing. I suggest that you spell out your project and goals a little more to this audience. For multiple imputation and other methods of handling missing data, have a look at the text by Little and Rubin, "Statistical Analysis with Missing Data". There is free software for the multiple imputation method (in SAS) at this site: http://www.isr.umich.edu/src/smp/ive/

    I have also not thought through and addressed the issue of standard errors in the face of multiple observations per person. Hey, it's 10:30 at night. I'm going to knock off. 

    Good luck! Tell us more of your story. And, for those reading this, weigh in with your thoughts.

    Best wishes,

    Nayak



    -------------------------------------------
    Nayak Polissar
    Principal Statistician
    The Mountain-Whisper-Light Statistics
    -------------------------------------------