
  • 1.  missing values

    Posted 05-15-2015 06:54
    This message has been cross-posted to the following eGroups: Health Policy Statistics Section and Statistical Consulting Section.
    -------------------------------------------


    Hello everybody,

    My question is about handling missing values in a household budget survey. I have divided my observations (20,000) into five groups by quintile, and within each quintile I calculate the mean of several variables. The problem is that some variables have very few valid observations, whereas others are better answered, with few missing values.

    If I use the mean function in Stata, with . for missing values, it computes the mean using the number of valid observations as the sample size. If I replace . with 0, it counts the whole dataset (20,000 observations), and the mean is computed over this sample.
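
    A minimal sketch of the two calculations, written in Python rather than Stata and using made-up numbers (the list `spending` is hypothetical, not data from the survey):

```python
from statistics import mean

# Hypothetical spending values for one quintile; None marks a missing answer.
spending = [120.0, None, 80.0, None, None, 100.0]

# Missing values dropped: the denominator is the number of valid observations.
valid = [x for x in spending if x is not None]
mean_valid_only = mean(valid)

# Missing values replaced by 0: every household stays in the denominator.
zero_filled = [0.0 if x is None else x for x in spending]
mean_zero_filled = mean(zero_filled)

print(len(valid), mean_valid_only)          # 3 observations in the denominator
print(len(zero_filled), mean_zero_filled)   # 6 observations in the denominator
```

    The two results answer different questions: the first is the average among those who reported a value, the second is the average over everyone under the assumption that a blank means zero.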

    If I use only the number of valid observations, is it OK to compare the means of different variables later?

    Using the second way, is it OK to treat the result as the mean value, given that the true values may be much higher and the result is lower only because of the large denominator?

    Thank you in advance for your comments/suggestions,


    ------------------------------
    Efthalia Massou
    PhD candidate - Researcher
    Panteion University of Social and Political Sciences
    ------------------------------



  • 2.  RE: missing values

    Posted 05-15-2015 10:01

    It depends.  If you are asking how much the respondent spent on fertilizer, a missing value could be construed as a zero (i.e. they did not buy fertilizer).

    But if you ask for how much they paid for a gallon of milk, you should only count valid observations.

    -------------
    Pedro Saavedra
    Retired
    ------------------------------




  • 3.  RE: missing values

    Posted 05-18-2015 14:50

    Missing values are very difficult. I can't answer your question honestly without understanding a lot more about the individual variables and why they might be missing. Here are some approaches that are worth considering.

    1. No news is good news. If you are looking at certain types of variables, it may be a reasonable assumption that if that value is missing, then that is because only bad values are reported. So a missing value for side effects may be considered the same as no side effects.

    2. No news is bad news. For other types of variables, it may be a reasonable assumption that if that value is missing, then that is because only good values are reported. If someone drops out of a smoking cessation clinic, it is often because they are embarrassed to admit that they have "fallen off the wagon" and are smoking like a chimney again.

    Both the no news is good news and the no news is bad news approaches are likely to be questioned by a peer reviewer. One thing you might consider is a sensitivity analysis, where you apply the liberal no news is good news approach and the conservative no news is bad news approach in sequence. If the results of both analyses are comparable, then you can safely conclude that your analysis is unaffected by missingness. You'd be safe to report any reasonable approach.
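
    One way such a sensitivity analysis might look for a yes/no outcome, as a rough Python sketch. The data and the helper `success_rate` are hypothetical and only illustrate the idea of bracketing the result between the two extreme assumptions:

```python
# Hypothetical outcomes: True = quit smoking, False = still smoking,
# None = dropped out (missing).
outcomes = [True, False, None, True, None, False, True, None]

def success_rate(values, missing_as):
    """Share of successes after replacing every missing value with `missing_as`."""
    filled = [missing_as if v is None else v for v in values]
    return sum(filled) / len(filled)

optimistic = success_rate(outcomes, missing_as=True)    # no news is good news
pessimistic = success_rate(outcomes, missing_as=False)  # no news is bad news

print(pessimistic, optimistic)
```

    If the two rates lead to the same conclusion, missingness is unlikely to drive the result; if they are far apart, as in this made-up example, the choice of approach matters.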

    3. No news is average news. This is sometimes called mean imputation. If you ask ten Likert scale items and the person responded with an average of 2 on nine of the ten items, you might replace the missing item with 2 as well. It gets a bit tricky when the average is a non-integer, but you might be able to get away with this.

    There's more than one way to average, of course. If you have 100 patients and the average of the 99 with data on question 7 is 3, then you could use a value of 3 for the 100th patient.

    Mean imputation is pretty easy to do, and for many of the analyses that you do (particularly univariate analyses), mean imputation gives the same estimate as the average of only the non-missing data values. Mean imputation, however, is frequently bad, because the imputed values add no variability and the data look more precise than they are. Still, it is used a lot, and you might get lucky with the peer reviewers of your paper.
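
    A small Python sketch of the second kind of averaging (filling in the mean of the other respondents), with made-up scores. It illustrates one well-known drawback: the mean is unchanged, but the spread shrinks.

```python
from statistics import mean, stdev

# Hypothetical answers to one question; None marks a missing answer.
scores = [3.0, None, 2.0, 4.0, None, 3.0]

observed = [x for x in scores if x is not None]
fill = mean(observed)                      # average of the non-missing values
imputed = [fill if x is None else x for x in scores]

# The mean is the same with or without the imputed values...
print(mean(observed), mean(imputed))
# ...but the standard deviation is smaller after imputation.
print(stdev(observed), stdev(imputed))
```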

    4. No news is old news. A very similar approach is called last observation carried forward (LOCF). If you have measurements at five time points and a patient has values only for the first four time points, just pretend that the measurement at the fifth time point is unchanged from the fourth time point.

    There is a lot of criticism of LOCF in the literature and you probably won't get past peer review on this one. This is a bit unfair perhaps because mean imputation, which is often far worse than LOCF, is often overlooked by peer reviewers.
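
    For reference, LOCF itself is only a few lines. This is an illustrative Python sketch (the function name `locf` and the measurements are hypothetical):

```python
def locf(series):
    """Fill each missing value (None) with the most recent observed value.

    Values missing before the first observation stay None.
    """
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Five time points, the last one missing.
visits = [140, 135, 131, 128, None]
print(locf(visits))  # [140, 135, 131, 128, 128]
```

    The made-up series above is trending downward, and the carried-forward value ignores that trend, which is one reason the method draws criticism.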

    5. No news is zero news. Sometimes a missing value represents "nothing" in a way that makes it safe for you to replace that missing value by zero. Suppose you ask for income in four different categories: wages, interest, dividends, and royalties. If someone lists income for wages, interest, and royalties, perhaps they left dividends blank because they had no dividend income. This is somewhat akin to the no news is bad news option.

    All of the approaches suggested so far are ad hoc, require untestable assumptions about your data, and represent poor statistical practice. But if you are able to get one of these past peer review, I'm not going to complain.

    6. Certain statistical models can tolerate missing values and provide results that are fairly reasonable. For example, in the longitudinal example, you might try fitting a random effects regression model. This fits a trend line for each patient and pools the results across all patients. The random effects model allows you to extrapolate the individual trend line to missing values, but you need to do it carefully. As I understand it, these types of models work well for the missing completely at random and missing at random cases, but not so well for the missing not at random case. Certainly a random effects regression model is preferred to LOCF.

    7. Multiple imputation. This is surprisingly easy and if you've never done this before, you should try it. I can't summarize multiple imputation well in this email, but if done properly, it can handle a wide range of situations with statistical rigor. There's a pretty good book by Stef van Buuren, but that's not the only good book out there.
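
    Multiple imputation creates several completed datasets, analyzes each one, and then combines the results. The sketch below covers only that last combining step, using the standard pooling formulas usually called Rubin's rules; the imputation step itself is not shown, and the numbers and the function name `pool` are hypothetical.

```python
from statistics import mean, variance

def pool(estimates, variances):
    """Combine results from m imputed datasets with Rubin's rules.

    estimates: the estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled = mean(estimates)              # average of the m estimates
    within = mean(variances)              # average within-imputation variance
    between = variance(estimates)         # variance between the m estimates
    total = within + (1 + 1 / m) * between
    return pooled, total

# Made-up results from m = 3 imputed datasets.
pooled, total_var = pool([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
print(pooled, total_var)
```

    The between-imputation term is what carries the extra uncertainty due to the missing values, which single imputation methods leave out.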

    8. Bayesian models. You can build a Bayesian model which not only addresses your research hypothesis, but also allows you to predict the missing values. The model can be quite sophisticated. As I understand it, you can even build Bayesian models to handle the Missing Not at Random (abbreviated MNAR) case. The Bayesian model allows you to explicitly define your assumptions about missingness.

    There's a pretty close relationship between multiple imputation and Bayesian models, though the latter has greater flexibility, as I understand it.

    One more comment. No matter what approach you use, it is almost always a good idea to compare the demographics of those who provided data for a particular question and those for whom the data is missing. If the demographics of the two match up reasonably well, you have some level of assurance that any reasonable approach to missingness will work well. If you have a disparity, such as a question that is left blank mostly by older patients, then you have trouble and need to consider a rigorous approach like multiple imputation.
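
    A rough Python sketch of that comparison for a single demographic variable. The respondents are made up, and age is only an example of what could be compared:

```python
from statistics import mean

# Hypothetical respondents: (age, answer to the question); None = left blank.
respondents = [
    (34, 200.0), (71, None), (45, 150.0), (68, None),
    (29, 90.0), (75, None), (52, 120.0),
]

# Split the ages by whether the question was answered.
age_answered = [age for age, answer in respondents if answer is not None]
age_missing = [age for age, answer in respondents if answer is None]

print(len(age_answered), mean(age_answered))
print(len(age_missing), mean(age_missing))
```

    In this made-up example the people who left the question blank are much older on average, which is the kind of disparity described above.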
    ------------------------------
    Stephen Simon
    Independent Statistical Consultant
    P. Mean Consulting
    ------------------------------




  • 4.  RE: missing values

    Posted 05-18-2015 18:10

    Stephen has given you a very comprehensive response.  My only modest addition is to suggest that you look closely at the structure of your questions and determine if any form of substitution is valid within the context of the question(s) asked.  

    • Some of the questions may be embedded inside of skip patterns that truly make the question inapplicable to some respondents.  
    • Some might be true structural zeros (e.g., How many children have you carried to term? asked of Males.  Missing for males is in all probability the only appropriate response.)
    • Others might be highly infrequent and built into the questionnaire (e.g., Have you ever declared bankruptcy? ___Yes ___No  If yes: How many times have you declared bankruptcy?  Substitution of 0 for the No response is valid here, while what to substitute for the missing Yes response is another matter.)
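
    The bankruptcy example could be recoded along these lines. This is a hypothetical Python sketch; the function name and the "yes"/"no" coding are assumptions made for illustration:

```python
def bankruptcy_count(ever_declared, times):
    """Recode the follow-up count using the screening answer.

    ever_declared: "yes", "no", or None (screening question not answered)
    times: reported number of bankruptcies, or None if left blank
    """
    if ever_declared == "no":
        return 0        # skip pattern: follow-up not asked, so 0 is a valid value
    if ever_declared == "yes" and times is not None:
        return times    # reported count is used as-is
    return None         # "yes" without a count, or no screening answer: still missing

print(bankruptcy_count("no", None))   # 0
print(bankruptcy_count("yes", 2))     # 2
print(bankruptcy_count("yes", None))  # None
```

    Only the first case is filled in by substitution; the genuinely missing cases are left for whatever missing-data method is chosen.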



    ------------------------------
    David Mangen
    ------------------------------




  • 5.  RE: missing values

    Posted 05-19-2015 19:31

    Dear David,

    Many thanks for your suggestions. I see what you mean, and in my previous email I gave some description of my variables. I think you can see my reply to Stephen (I'm not sure I have understood how these emails work, hahaha), so if you can, please let me know your comments now that you know the meaning of the variables.

    Best,

    Lina

    ------------------------------
    Efthalia Massou
    PhD candidate - Researcher
    Panteion University of Social and Political Sciences
    ------------------------------




  • 6.  RE: missing values

    Posted 05-19-2015 19:26

    Dear Stephen, many many thanks for your response and your helpful comments/suggestions.

    Maybe I have to say more about the details, you're right.

    Well, I have a household budget survey, and the variables with missing values are about expenditure in certain categories of consumption (e.g., food, drinks, etc.). I think that replacing these with 0 is OK, since 0 represents no consumption.

    I also have one variable about private insurance, and this is the biggest problem, since it has the most missing values. For this variable I think that 0 is not correct: those who didn't answer left it blank not because they paid nothing for private insurance, but because they didn't use the service and don't have a contract either. If we replace the missing values with 0s, the median or mean calculated over the total sample is very low (about 12 euros/year), whereas the true picture is quite different: only about 1%, for example, used private insurance, with payments of approximately 500 euros/year. Isn't that different?
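
    The contrast can be made concrete with a small Python sketch. The payments below are invented for illustration and are not the survey figures:

```python
from statistics import mean

# Hypothetical yearly private-insurance payments; None = no answer.
payments = [None] * 97 + [400.0, 500.0, 600.0]

payers = [p for p in payments if p is not None]
share_paying = len(payers) / len(payments)      # proportion who reported a payment
mean_among_payers = mean(payers)                # average among those who paid

# Treating every blank as 0 spreads the same total over the whole sample.
mean_zero_filled = mean([0.0 if p is None else p for p in payments])

print(share_paying, mean_among_payers, mean_zero_filled)
```

    Under the zero-fill assumption the overall mean equals the share paying times the mean among payers, so reporting those two numbers separately conveys both pieces of the picture.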

    I think so. My suggestion is to use Little's test and then multiple imputation, but I don't know which multiple imputation method to use. Is the EM algorithm one of my choices? I have to read more on this topic.

    Please do let me know if you have any further suggestions.

    Many thanks once again,

    Lina

    ------------------------------
    Efthalia Massou
    PhD candidate - Researcher
    Panteion University of Social and Political Sciences
    ------------------------------