RE:Outlier detection, its remedies and computations methods in "establishment surveys"

  • 1.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-24-2011 06:03
    This message has been cross-posted to the following eGroups: Survey Research Methods Section and Statistical Consulting Section.
    -------------------------------------------

    Dear All,

    Thank you very much for your guidance. It was very useful. Thank you all.
    But the important point concerns "establishment surveys" that include very large data sets, by which I mean more than one million records.
    Are the methods applied to such large data sets the same as those applied to samples of ordinary size? This is very important to me!

    The next topics are the best "imputation methods" for compensating for total and partial non-response in these questionnaires, and ways to reduce "measurement error" in these types of surveys. What are your recommendations and comments concerning these issues?

    Your knowledge and experience are very useful to me.

    Again, many thanks for your kind attention.

    Best regards,

    Amir


    -------------------------------------------
    Amir Kasaeian
    PhD Student in Biostatistics
    Tehran University of Medical Sciences (TUMS)
    -------------------------------------------


  • 2.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-24-2011 07:55
    I should have mentioned that looking for suspicious values is part of data cleaning/prep.  Obviously a value of 6 is outside the range of legitimate values for a 5-point Likert response scale.  With a medium-sized survey of 1 million cases, it is often useful to use the Data > Validation menu in SPSS and continually build up the set of rules.  It has options to reuse existing rules and to add new ones for out-of-range values, skip-pattern violations, inconsistencies, etc.  If you use another package, you can continuously develop syntax to do these kinds of validation.
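
    For readers who script their checks instead of using the SPSS dialog, the idea of a rule set that grows over time could be sketched roughly as below. This is only an illustration: the field names (satisfaction, employed, hours_worked) and the three rules are hypothetical and not taken from any real survey.

```python
# Each rule is a (name, predicate) pair; the predicate returns True when a
# record VIOLATES the rule. New rules are appended as they are discovered.
# All field names below are hypothetical.
RULES = [
    # Out-of-range value on a 5-point Likert item.
    ("likert_range", lambda r: r["satisfaction"] not in (1, 2, 3, 4, 5)),
    # Skip pattern: hours should be missing (None) for non-employed respondents.
    ("skip_hours", lambda r: not r["employed"] and r["hours_worked"] is not None),
    # Inconsistency: more hours reported than exist in a week.
    ("hours_max", lambda r: r["hours_worked"] is not None and r["hours_worked"] > 168),
]


def validate(records, rules=RULES):
    """Return {record index: [names of violated rules]} for flagged records."""
    flagged = {}
    for i, rec in enumerate(records):
        broken = [name for name, violates in rules if violates(rec)]
        if broken:
            flagged[i] = broken
    return flagged


records = [
    {"satisfaction": 4, "employed": True, "hours_worked": 40},
    {"satisfaction": 6, "employed": True, "hours_worked": 38},
    {"satisfaction": 2, "employed": False, "hours_worked": 20},
    {"satisfaction": 5, "employed": True, "hours_worked": 200},
]
flags = validate(records)
for index, broken in flags.items():
    print(index, broken)
```

    Flagged records are only reported, not changed, so the same rule list can be rerun after each round of corrections.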

    Also, substantive knowledge is important in examining distributions.  Thorough completion and understanding of meta-data is a critical part of data prep.  Comments and documentation, variable labels, value labels for valid and missing values, level of measurement, and readable output display formats are all needed.

    The search for artifacts/anomalies should continue throughout the analysis.  For example, finding a five-way interaction is very possibly due to data entry error.

    -----
    If you do imputation, it is often advisable to try different approaches: listwise deletion, pairwise deletion, and value substitution.
    Value substitution can be done by linear interpolation; the mean or median of all other cases with valid values; the mean or median of a fixed number of cases before and after the case that has a missing value; a trend, if there is other information; or the mean or median of the other items in a scale.  Sometimes the cases used to find a substitute value come from the same cell (the intersection of strata and clusters), and sometimes from all cases without regard to stratum or cluster.  Hot-deck methods are sometimes used.
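
    A rough sketch of how a few of these substitution rules can be compared side by side is given below. The data, the cell labels and the function names are all made up for illustration; missing values are represented by None, and the sketch assumes every cell has at least one valid value.

```python
from statistics import mean, median


def impute_overall(values, stat=mean):
    """Replace None with a statistic of all valid values."""
    fill = stat([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def impute_within_cell(values, cells, stat=mean):
    """Replace None with a statistic of the valid values in the same cell."""
    by_cell = {}
    for v, c in zip(values, cells):
        if v is not None:
            by_cell.setdefault(c, []).append(v)
    return [stat(by_cell[c]) if v is None else v for v, c in zip(values, cells)]


def impute_neighbors(values, k=1, stat=mean):
    """Replace None with a statistic of up to k valid cases before and k after."""
    out = []
    for i, v in enumerate(values):
        if v is None:
            before = [x for x in values[:i] if x is not None][-k:]
            after = [x for x in values[i + 1:] if x is not None][:k]
            v = stat(before + after)
        out.append(v)
    return out


# Hypothetical data: two cells with very different levels.
values = [10, 12, None, 14, 100, None, 110]
cells = ["A", "A", "A", "A", "B", "B", "B"]

results = {
    "overall mean": impute_overall(values, mean),
    "overall median": impute_overall(values, median),
    "cell mean": impute_within_cell(values, cells, mean),
    "neighbor mean": impute_neighbors(values, k=1),
}
for name, filled in results.items():
    print(f"{name:15s} estimated mean = {mean(filled):.2f}")
```

    Printing the estimate under each rule is a simple way to see how much the conclusion depends on the substitution method.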

    The time to reduce measurement error is mostly when you are developing the data-gathering instrument.  Cognitive testing is vital.  Look intensively for differences in nuances, denotations, and connotations within and across cultures/languages/disciplines.  Use as fine-grained a response scale as is practical for the respondents in your rounds of pre-testing.  You can always coarsen measurement post hoc, but you cannot refine it.  Despite common usage, if one takes the term literally it is impossible to disaggregate data.  If you want results by department, product type, shift, etc., you must gather data by department, product type, shift, etc.

    Total uncertainty is made up of sampling error AND measurement error.  In my experience, YMMV, the uncertainty due to measurement considerations is often much larger than that due to sampling.  Sampling error is very much a lower bound on how much uncertainty there is. However, we usually have more certainty about the amount of sampling uncertainty.


    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------








  • 3.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-24-2011 08:39
    Arthur makes some very good comments about the practical aspects of data cleaning, screening and validation as well as imputation.  I think imputation should be done only when necessary and when there is a model to help with the imputation such as a relationship of the response with some covariates or smoothness or monotonicity with respect to measurements taken over time.  Whenever imputation is used it is wise to look at the sensitivity of results to the choice of method.

    I have to say that when Amir posed the question I did not see it as necessarily a univariate outlier detection problem when the data set is large.  Assuming that it is one, I would like to address the appropriateness of univariate outlier detection methods in this setting.

    First let's be very specific about what a test like Dixon's ratio or Grubbs' test is about.  In the setting of a univariate sample from a population of a known shape or form, an outlier detection test is just a test of whether the most extreme observation or observations are likely to have come from a population of the hypothesized form.  If the hypothesis is rejected, we conclude that the extreme observation came from a different population.  The test controls the type I error rate, which in this case means that the probability of a false detection (declaring an outlier when the most extreme observation really does come from the assumed distribution) is kept small, perhaps at 0.05.  These tests depend on the sample size (so the threshold is a function of n) and are based on the distribution of the extreme values, as in Grubbs' case.  Grubbs' test is a test for a single outlier, but tests for multiple outliers are also possible.

    These procedures are valid in large samples as well as small samples.  On the other hand, people who foolishly use a k-sigma rule will detect outliers falsely a lot more often with large samples, simply because observations from the tails of the distribution will arise more frequently in large samples (for any distribution) and the k-sigma rule ignores the sample size.  Dixon's ratio test depends on the distribution of the spacings between ordered observations.  In the case of the ratio test for a single large outlier, it is the distribution of the ratio of the separation between the largest and second-largest observations to the overall range of the data that is used.  This distribution under the null hypothesis will also depend on the sample size n.
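
    To put rough numbers on the sample-size point, here is a small sketch that assumes the clean data are exactly normal. The function names are made up for illustration. The size-adjusted cutoff uses a simple Bonferroni bound and is not Grubbs' exact critical value, and the Dixon-type ratio is only the test statistic; it would still have to be compared with a critical value appropriate for n.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_false_flags(n, k):
    """Expected number of clean normal observations beyond k sigma (two-sided)."""
    return n * 2 * (1 - STD_NORMAL.cdf(k))


def size_adjusted_k(n, alpha=0.05):
    """Cutoff k for which the chance that ANY of n normal values lies beyond
    k sigma is at most alpha (Bonferroni bound, known mean and sd assumed)."""
    return -STD_NORMAL.inv_cdf(alpha / (2 * n))


def dixon_ratio_largest(data):
    """Gap between the two largest values divided by the overall range."""
    x = sorted(data)
    return (x[-1] - x[-2]) / (x[-1] - x[0])


for n in (100, 10_000, 1_000_000):
    print(n, round(expected_false_flags(n, 3), 1), round(size_adjusted_k(n), 2))

# Hypothetical small sample with one suspiciously large value.
print(round(dixon_ratio_largest([2.1, 2.4, 2.5, 2.7, 2.9, 6.0]), 3))
```

    Under these assumptions a fixed 3-sigma rule flags a few thousand perfectly legitimate values in a million records, while a cutoff that accounts for n moves further out as n grows.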

    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 4.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-24-2011 09:41
    These considerations are definitely important when a variable does not have a pre-defined legitimate range (unlike a Likert item, a school grade, a percentage correct, an IQ score, etc.).  This is especially true when some arbitrary trimming or deleting is to be done after detecting a suspected outlier.  As I said before, I am very leery of such treatment of suspected outliers.  I believe that suspected outliers should be modified only after they are "proven guilty" of being an outlier.

    However, if detecting a suspected univariate outlier means double checking data entry and keeping an extra eye on it (e.g., by flagging that case) while doing bivariate and higher explorations during the data prep, then detecting too many suspects is not as great a problem.

    WRT "Whenever imputation is used it is wise to look at the sensitivity of results to the choice of method."  Amen.
    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------








  • 5.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-26-2011 16:26
    If you have an idea of what the data distribution actually is,
    you could also use a Q-Q plot instead of a statistical test
    to identify possible outliers.
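
    As a rough illustration, the coordinates of a normal Q-Q plot can be computed without any plotting library, as in the sketch below. The reference line through the quartiles and the residuals from it are one common convention, used here as an assumption, and the data are made up.

```python
from statistics import NormalDist, quantiles


def qq_with_reference_line(data):
    """Return (theoretical z, observed value, residual) triples, where the
    residual is measured from a line through the first and third quartiles."""
    x = sorted(data)
    n = len(x)
    std = NormalDist()
    # (i - 0.5) / n is one common choice of plotting position.
    z = [std.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    q1, _, q3 = quantiles(x, n=4)
    z1, z3 = std.inv_cdf(0.25), std.inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    intercept = q1 - slope * z1
    return [(zi, xi, xi - (intercept + slope * zi)) for zi, xi in zip(z, x)]


# Hypothetical measurements with one value far from the rest.
data = [9.8, 10.1, 10.4, 9.6, 10.0, 10.2, 9.9, 10.3, 9.7, 15.0]
triples = qq_with_reference_line(data)
for theo, obs, resid in triples:
    print(f"{theo:6.2f} {obs:6.2f} {resid:+6.2f}")
```

    Points with large residuals are the ones that would visibly leave the line on the plot; they are candidates for a closer look, not automatic deletions.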

    The points about outliers as anomalous (and thus perhaps
    illuminating, as noted in Ralph O'Brien's post) values,
    as opposed to impossible values or likely recording errors that need
    to be excised, are all well taken.

    As for what to do about the legitimate "outliers" that remain
    after scrubbing the data:  If the distribution is revealed to be
    something that's clearly not what's assumed by the proposed
    analysis methods, then perhaps transformation, or else switching
    to a different method that does not rely on such assumptions is in order.
    The correct approach may depend as much on what the
    potential consumers of the data need (and how statistically
    savvy they are), and what they plan on using the results for,
    as on the actual state of the data.

    >>Kathy

    -------------------------------------------
    Katherine Godfrey
    -------------------------------------------




  • 6.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-27-2011 09:27
    The main point I tried to make was that each case needs to be assessed WRT its impact on the analysis at hand. A case may indeed be an "outlier" on one or more variables, yet have a tiny impact on the analysis. If so, it is hardly worth concern. On the other hand, a case may not be an "outlier" by some definition, yet it has substantial impact. If so, it is worth serious attention. Deletion diagnostics (e.g., DFBETA) address this.
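
    A toy sketch of the deletion idea for a simple regression slope is given below. The data are invented: in the first data set the last case is extreme on both variables but lies on the line, and in the second the last case is within the range of both variables but lies off the line. DFBETA is taken here as the unscaled change in the slope when a case is deleted.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def dfbeta_slope(x, y):
    """Change in the fitted slope when each case is deleted in turn."""
    full = ols_slope(x, y)
    return [
        full - ols_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]


# Hypothetical base data lying close to the line y = x.
base_x = [1, 2, 3, 4, 5, 6, 7]
base_y = [1.1, 1.9, 3.0, 4.1, 4.9, 6.0, 7.0]

# Extreme on both x and y, but consistent with the line.
extreme_on_line = dfbeta_slope(base_x + [30], base_y + [30.0])[-1]
# Unremarkable on x and on y separately, but far from the line.
inlier_off_line = dfbeta_slope(base_x + [8], base_y + [2.0])[-1]

print(f"extreme case on the line: {extreme_on_line:+.4f}")
print(f"in-range case off the line: {inlier_off_line:+.4f}")
```

    In this toy example the univariate "outlier" barely moves the slope, while the case that no univariate screen would flag changes it substantially.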

    Regarding Q-Q plots... They show nicely the "outlierness" of observations relative to perfect Normality (or some other "perfect" reference distribution). But that's all they do. They do not assess how each case's SET of relevant values influences the specific statistic being used to answer the given research question.


    -------------------------------------------
    Ralph O'Brien
    Case Western Reserve University
    -------------------------------------------








  • 7.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-27-2011 10:06
    I would agree with Ralph's remarks but add the caveat that even though an outlier may have little impact on a particular analysis based on certain influence measures, it may still be of some concern if the data come from a database with multiple users.  Then the provider of the data, who may not be aware of all the potential uses of the data, should be concerned, because the outlier could have an effect on a different, perhaps unforeseen, use.  When supporting the EIA's Office of Information Validation in the late 1970s, our job was to "validate" energy databases.  My philosophy then was to identify outliers in the database, focusing on cases that were highly influential with respect to parameters being estimated in a particular use of the data.  I thought it was important to look at all reasonable potential uses of the data and at as many influence measures as there were parameters that could be estimated from the data.  Of course this is a lot easier said than done, and some valid use may be unforeseen.  So I think in the validation setting all outliers should be investigated and removed if necessary.  If they are not removed, there should be a warning to the user that the observation is an outlier that could potentially affect their analysis.

    In other settings, such as an individual research project with a particular data set, the observation may be shown not to affect the analysis, and hence its inclusion in the results may not be a concern.

    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 8.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-27-2011 17:13

    I've worked with many "outcomes" databases, such as insurance claims databases, where the data was very, very messy.  When data fails the "sniff test" (say, a hospital stay costing $10^85) and there is no way, due to HIPAA, to return to the source data to determine whether it's an "outlier", it's necessary either to drop such strange data from the analysis or to try to adapt the statistical analysis.  Wherever and whenever possible, I try to include the data in the analysis, for example by treating such "extreme" data as a type of censored data and "filling in" a more reasonable censoring value as part of a "sensitivity" analysis.  I've also encountered outcomes/insurance claims data where several data values were logically inconsistent.  Without a way to return to the source data, deleting the logically inconsistent data is the best option.
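
    One simple way to organize such a sensitivity analysis is sketched below. It is only an illustration: the costs, the plausibility limit and the candidate caps are invented, and plain substitution of a cap is used here in place of a proper censored-data model.

```python
from statistics import mean, median


def sensitivity_to_cap(costs, plausible_max, caps):
    """Compare (mean, median) when implausible values are dropped vs. capped."""
    kept = [c for c in costs if c <= plausible_max]
    results = {"dropped": (mean(kept), median(kept))}
    for cap in caps:
        # Values above the plausibility limit are replaced by the candidate cap.
        capped = [cap if c > plausible_max else c for c in costs]
        results[f"capped at {cap}"] = (mean(capped), median(capped))
    return results


# Hypothetical hospital costs in dollars; the last value fails the sniff test.
costs = [4200, 5100, 3900, 7600, 6100, 1e85]
out = sensitivity_to_cap(
    costs, plausible_max=1_000_000, caps=[50_000, 250_000, 1_000_000]
)
for label, (avg, med) in out.items():
    print(f"{label:20s} mean = {avg:12.2f}  median = {med:10.2f}")
```

    If the conclusions change materially across the candidate caps, that is a sign the analysis depends on the unverifiable value and should be reported as such.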

    I also work with much more pristine clinical trial data (from randomized double-blind trials, etc.), and except for some legacy clinical trials, it is always possible to return to the source data and determine whether a value is unusual or in error.  When data is in error, it's routine to correct the data.  I haven't yet encountered clinical trial "outliers" that were deleted from the analysis.

    -------------------------------------------
    Christopher Barker, Ph.D.
    Statistical Planning and Analysis Services, Inc.
    www.barkerstats.com
    -------------------------------------------







  • 9.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-27-2011 17:38

    My background is similar.  I worked with workers' comp insurance claims, with energy data and with clinical trials data.  I agree that what Christopher says is generally true.  However, in my experience, although with insurance claims it was not an option to go back to the source data, the insurance companies did a better job of keeping the records clean.  These data were mostly cost data, and it is important to the insurance company to get that right.  I am not so sure I could say the same about the ICD-9 codes, and the covariates were as important to modelling as the response.  In the workers' comp data, extreme values were not uncommon.  Both cost and duration data are skewed, with heavy right tails.  This is because, although most claims are for routine on-the-job injuries whose costs would typically be in the range of several thousand dollars, some workers would have catastrophic injuries that make them permanently disabled, such as the loss of an arm or leg or a blinding incident.  Such catastrophic cases, though not common, were not very rare either, and would cost hundreds of thousands to millions of dollars.  These data are real and accurate but could show up as outliers using statistical tests that assume normality, for example.
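
    The effect can be illustrated with made-up numbers. The sketch below builds a deterministic lognormal-shaped set of "costs" (the parameters 8.0 and 1.5 are arbitrary assumptions, not estimates from any claims data) and counts how many values a normal-theory 3-sigma rule flags on the raw scale and on the log scale.

```python
from math import exp, log
from statistics import NormalDist, mean, stdev


def three_sigma_flags(values):
    """Count values more than 3 sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return sum(1 for v in values if abs(v - m) > 3 * s)


n = 10_000
std = NormalDist()
# Evenly spaced normal quantiles give a reproducible "sample" without
# random numbers; exponentiating them yields a right-skewed cost variable.
z = [std.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
costs = [exp(8.0 + 1.5 * zi) for zi in z]  # hypothetical lognormal costs

raw_flags = three_sigma_flags(costs)
log_flags = three_sigma_flags([log(c) for c in costs])
print("flagged on raw scale:", raw_flags)
print("flagged on log scale:", log_flags)
```

    Every value here is legitimate by construction, yet the rule flags noticeably more of them on the raw scale than on the log scale, which is the pattern to expect when a normality-based screen is applied to skewed cost data.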
    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 10.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-27-2011 17:57
    You don't have to drop flaky data. You can use edit rules to impute more reasonable data. This is done all the time in surveys and censuses. See, for example, http://www.census.gov/srd/www/abstract/ssc2007-01.html.

    -------------------------------------------
    Charles Coleman
    -------------------------------------------








  • 11.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-27-2011 18:04

    Yes, you can do imputation.  I've worked on claims data where perhaps 1/2 the data was logically inconsistent.
    By logically inconsistent I mean, say, that the patient died after 5 days in the hospital and medications were started 10 days after discharge (death).
    You could do imputation, but when large amounts of data are inconsistent, it's of questionable value.

    -------------------------------------------
    Chris Barker
    Statistical Planning and Analysis Services, Inc.
    -------------------------------------------








  • 12.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-27-2011 18:05

    There is no doubt that there are methods to impute when outliers are known errors but the question remains whether or not one should impute.  Fellegi and Holt did it for the Canadian census and I am sure that the US census has used hot deck, cold deck and many other types of logical edits which may be better than dropping data. But this really depends on the application and knowledge of the data under study.  I would never advise a client to impute until I know a great deal about the data, the application and requirements associated with the data base. That goes for multiple imputation as well.  No method works magic with bad data.
    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 13.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 06-27-2011 18:19
    I agree with Charles about the importance of retaining all observations/subjects in the data. It could be necessary to employ an imputation method in extreme outlier cases. But the rest of the data for the specific subject might not be 'damaged'. Eliminating rows could bias the remaining data, besides reducing the available degrees of freedom.

    Anamaria

    -------------------------------------------
    Anamaria Kazanis, PStat
    ASKSTATS Consulting
    -------------------------------------------


  • 14.  RE:Outlier detection, its remedies and computations methods in "establishment surveys"

    Posted 07-04-2011 18:49


    Dear all,

    Good day!

    My apologies for the late response, and thank you very much for all the advice, the comments, and the resources you pointed me to.
    They are all very useful and informative.


    Thanks again.

    Yours,
    Amir


    -------------------------------------------
    Amir Kasaeian
    PhD Student in Biostatistics
    Tehran University of Medical Sciences (TUMS)
    -------------------------------------------