
  • 1.  Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-23-2011 09:59

    Dear All,
    Good day!

    I am to work on outlier detection and its remedies in "establishment surveys" (or "business surveys") in official statistics. I am also looking for computational methods for compensating for nonresponse and sample size reduction in these surveys.

    I would be very pleased if you would share your knowledge with me and suggest some good references. As you can tell, I am looking for the right place to start.

    Looking forward to hearing from you.

    Kindest regards,
    Amir Kasaeian


    -------------------------------------------
    Amir Kasaeian
    PhD Student in Biostatistics
    Tehran University of Medical Sciences (TUMS)
    -------------------------------------------


  • 2.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-23-2011 13:45
    In my experience, YMMV, the vast majority of suspicious values are data entry problems.

    I am leery of casually modifying or removing values simply because they are extreme.

    Cohen (2003) shows some ways to explore the impact of missing or extreme values by using flag (dummy) variables.

    One quick way to explore multivariate outliers in SPSS is to use the "identify unusual cases" procedure to detect values of variables that should be investigated.

    Boxplots, crosstabs, skewness, z-scores, and 3D scatterplots with marker colors and marker shapes can help you eyeball odd values to investigate further.


    If you do trim or remove values, you should check whether the results without those modifications differ from the results with them.
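
    A minimal sketch of that with/without comparison, using only the Python standard library. The revenue figures, the 5000 threshold, and the function names are made up for illustration; in practice the flag would come from your own editing rules.

```python
import statistics

def summarize(values):
    """Return the mean and median of a list of numbers."""
    return {"mean": statistics.mean(values), "median": statistics.median(values)}

def sensitivity_check(values, is_suspect):
    """Compare summaries computed with and without the flagged cases."""
    kept = [v for v in values if not is_suspect(v)]
    return {
        "all_cases": summarize(values),
        "without_flagged": summarize(kept),
        "n_removed": len(values) - len(kept),
    }

# Hypothetical establishment revenues (thousands); 9800 is the suspicious value.
revenues = [120, 135, 150, 160, 175, 190, 210, 9800]
report = sensitivity_check(revenues, is_suspect=lambda v: v > 5000)
print(report)
```

    Here the mean moves a great deal and the median hardly at all, which is the kind of difference worth reporting alongside the main results.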

    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------








  • 3.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-23-2011 13:53
    Boxplots, z-scores, and similar methods do not work well if the expected distribution is highly skewed. What should the nature of the distribution be? If you were looking at particle data, for example, boxplots would not be appropriate.
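
    To illustrate the point, here is a small simulation sketch. It assumes the usual 1.5 x IQR boxplot fences and uses simulated normal and lognormal samples (both are assumptions for illustration, not data from any survey). The boxplot rule flags a far larger share of perfectly ordinary values when the distribution is right-skewed.

```python
import random
import statistics

def tukey_fence_flags(values, k=1.5):
    """Return the values outside the boxplot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

rng = random.Random(2011)  # fixed seed so the comparison is repeatable
symmetric = [rng.gauss(0.0, 1.0) for _ in range(5000)]
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

share_symmetric = len(tukey_fence_flags(symmetric)) / len(symmetric)
share_skewed = len(tukey_fence_flags(skewed)) / len(skewed)
print(f"flagged, symmetric data: {share_symmetric:.1%}")
print(f"flagged, skewed data:    {share_skewed:.1%}")
```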

    And I agree with Arthur: you should NOT remove an outlier unless you have an underlying assignable cause.

    -------------------------------------------
    Patrick Spagon
    -------------------------------------------








  • 4.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-23-2011 14:25

    For univariate outliers, a single very extreme outlier will show up nicely on a box plot.  However, the problem with these simple approaches is the masking effect.  Outliers can inflate the mean and the variance, and thus their extreme effect is at least partially masked.  Also, in the case of multiple outliers, the most extreme outlier can be masked by the others.
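
    A small sketch of masking with classical z-scores. The data and the |z| > 3 cutoff are assumptions chosen for illustration: one gross value is flagged, but two gross values inflate the mean and standard deviation enough that neither is flagged.

```python
import statistics

def z_scores(values):
    """Classical z-scores using the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def flagged(values, cutoff=3.0):
    """Values whose absolute z-score exceeds the cutoff."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > cutoff]

# Hypothetical data: 15 typical values, then one or two gross values.
typical = [9, 10, 10, 11, 9, 10, 11, 10, 9, 11, 10, 10, 9, 11, 10]
one_outlier = typical + [100]
two_outliers = typical + [100, 100]

print(flagged(one_outlier))   # the single extreme value is caught
print(flagged(two_outliers))  # both extremes escape the cutoff
```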

    The appropriate tests for univariate outliers take account of the fact that you are looking at the most extreme observations (see Dixon's test or Grubbs' test).

    For multivariate outliers there is the added problem of which direction to call extreme.  Like Gnanadesikan, I advocated the influence function approach.  The influence function measures how much a single extreme observation affects a specific parameter estimate.  When I worked on data validation I felt that for some data sets the correlation between two variables might be important to estimate. Then the contours of constant influence are hyperbolas.  These contours point the way to the direction in the plane that has the greatest effect on the estimate of correlation.  This approach can be applied to a whole host of parameters that may be of interest.
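
    As a rough illustration of the idea (not the theoretical influence function itself), one can compute an empirical, case-deletion version for the correlation: (n - 1) times the change in r when case i is removed. The data and function names below are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def deletion_influence(xs, ys):
    """Empirical influence of each case on r: (n - 1) * (r_all - r_without_i)."""
    n = len(xs)
    r_all = pearson_r(xs, ys)
    influences = []
    for i in range(n):
        r_without_i = pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        influences.append((n - 1) * (r_all - r_without_i))
    return influences

# Hypothetical paired measurements; the last case breaks the linear pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 1.0]

influences = deletion_influence(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: abs(influences[i]))
print(most_influential, round(influences[most_influential], 2))
```

    A large negative value means the case pulls the correlation down; whether that case is an error or a finding still has to be decided from outside the data.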
    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 5.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-23-2011 20:28
    These are all good points. These data may be skewed and a log or square transformation may be useful before applying an outlier test. My only specific contribution is the observation, made in Barnett and Lewis's book, that the kurtosis test is robust to masking. Good luck with your research.
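
    A sketch of that workflow under stated assumptions: simulated lognormal "sales" figures, a log transformation, and the moment kurtosis b2 = m4 / m2^2 as the outlier statistic. No critical values are computed here; those would have to come from published tables such as the ones in Barnett and Lewis.

```python
import math
import random

def sample_kurtosis(values):
    """Moment kurtosis b2 = m4 / m2**2 (close to 3 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2

rng = random.Random(98)  # fixed seed so the run is repeatable
sales = [rng.lognormvariate(5.0, 0.8) for _ in range(500)]  # simulated, right-skewed
contaminated = sales + [max(sales) * 50]  # one gross error, e.g. a units mistake

k_raw = sample_kurtosis(sales)
k_log = sample_kurtosis([math.log(v) for v in sales])
k_log_contaminated = sample_kurtosis([math.log(v) for v in contaminated])

print(f"kurtosis, raw scale:              {k_raw:.2f}")
print(f"kurtosis, log scale:              {k_log:.2f}")
print(f"kurtosis, log scale + gross error: {k_log_contaminated:.2f}")
```

    On the raw scale the kurtosis is large simply because of skewness; on the log scale it sits near 3 until the gross error is added.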

    -------------------------------------------
    Richard Bittman
    Stat Consulting for Pharm & Device Development
    -------------------------------------------











  • 6.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-24-2011 14:11
    Allow me to branch off from Michael Chernick's comment "The influence function measures how much a single extreme observation affects a specific parameter estimate."

    In the (vast) majority of cases, a well-specified research question (and there may be many in a major analysis) can be addressed by focusing on a single parameter in some appropriate statistical analysis/model. I call this the focal parameter, beta.focal, and in frequentist-land, it will have an estimate and a CI--and, if you really must, a p-value for comparing the estimate to some (often extremely unlikely) common point null value. Bayesians will obtain posterior distributions.

    Let's assume the data have been cleaned and all cases in the analysis conform to the study's inclusion/exclusion criteria.

    In frequentist statistical modeling, the key influence measures for me are simply the common DFBETAs for beta.focal, which assess how much each case influences the estimate tied to this particular research question. A "DFBETA" kind of diagnostic can be developed for any statistic.
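
    A minimal sketch of a DFBETA-style diagnostic, assuming the focal parameter is the slope of a simple linear regression. It computes the unscaled version (full-data slope minus the slope with case i deleted); the DFBETAS reported by most packages additionally divide by a standard error. The data and function names are hypothetical.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x for a simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def dfbeta_slope(xs, ys):
    """Unscaled DFBETA per case: slope from all cases minus slope without case i."""
    full = ols_slope(xs, ys)
    return [
        full - ols_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]

# Hypothetical data; the last case has high leverage and pulls the slope down.
xs = [1, 2, 3, 4, 5, 6, 7, 12]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 12.0]

dfbetas = dfbeta_slope(xs, ys)
focal_case = max(range(len(xs)), key=lambda i: abs(dfbetas[i]))
print(focal_case, round(dfbetas[focal_case], 3))
```

    The same leave-one-out pattern works for any statistic: replace ols_slope with whatever function returns the focal estimate.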

    The question is not usually about whether or not to discard the most influential case(s) and reanalyze, although you may find cases that are influential because something is just plain wrong. If so, just correct them.

    Rather, the point is to find those cases that are the most important ones for understanding the answer to this particular question. These cases may not be "outliers" in any negative "extreme" sense. In fact, they may be among the most valuable cases in the whole data set, the ones that lead to the greatest jump forward in learning what Mother Nature is trying to tell us through these data and this particular analysis.

    Such is the way of science.



    -------------------------------------------
    Ralph O'Brien
    Case Western Reserve University
    -------------------------------------------








  • 7.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-24-2011 14:36

    I agree with Ralph that influential outlying observations are sometimes the most interesting.  The term outlier has the connotation of bad or incorrect, but it really should have a neutral status.  The fact that an observation is extreme or influential may, as Ralph said, point you to a surprise in your data that leads to a discovery or scientific advance.  My point about influence functions is that if a single observation has a very high influence on a parameter like a correlation or a regression coefficient, I want to know about it.  But statistical analysis of the data does not tell you what to do about the outlier.  That must come from information external to the data, as several of us have been saying in this discussion.
    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 8.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-24-2011 15:16
    As Michael Chernick pointed out: "But statistical analysis of the data does not tell you what to do about the outlier.  That must come from information external to the data as several of us have been saying in this discussion." In fact, that statement brought back to me the outliers in a longitudinal nutritional survey on which I consulted a number of years ago. The interviewer reported that all the data gathered from a particular household were zeroes. We started to question the validity of that data and consider how to adjust for that data pattern. Then the interviewer revealed to us that it was a period of fasting for that household, and so on the data collection day they had not ingested anything!  All those zeroes were true zeroes! The data were valid! That household had been randomly drawn into the sample with some probability, and its data deserved to be included.

    Since then I have always "looked further": an outlier may be valid and we therefore need to know more about that data.

    -------------------------------------------
    Milton Goldsamt
    Consulting Research Psychologist and Survey Statistician
    -------------------------------------------








  • 9.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-24-2011 19:40
    Let me offer a technique that's useful for panel data when the distributions are unknown. Normally, I use it to find outliers in a cross-sectional dataset by comparing to a reference cross-sectional dataset. It's a standard method in my part of the U.S. Census Bureau and is spreading. More accurately, it's two closely related methods: one for nonnegative data and the other for data that can take any sign.

    The reference is  "Loss Functions for Detecting Outliers in Panel Data: An Introduction," in The 13th Federal Forecasters Conference - 2003: Papers and Proceedings, Gerald, Debra E. and Norman Saunders [eds.], U.S. Department of Education, Office of Research and Improvement, 265-273. Available at http://www4.va.gov/HEALTHPOLICYPLANNING/ffc/PandP/FFC2003.pdf.

    This paper uses the cutoff view of outliers: they either represent problems with the data generation process or true, but unusual, statements about reality. The analyst has to examine the outliers to determine which is the case.

    -------------------------------------------
    Charles Coleman
    -------------------------------------------



  • 10.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-23-2011 14:00

    What to do about outliers is always a sticky problem, and to do the right thing you need to know a lot about your data.  But there is a vast literature about detecting outliers, and that was what I think the initial question was about.  I have published a lot on this topic and sent Amir a list of publications and books.  Also, the Energy Information Administration was very interested in this problem with respect to their databases back in the late 1970s and the 1980s.  Barnett and Lewis' book is a great comprehensive source, and Gnanadesikan discusses the use of influence functions to detect multivariate outliers in his book.  Douglas Hawkins also has a nice monograph on outliers.
    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------