Allow me to branch off from Michael Chernick's comment: "The influence function measures how much a single extreme observation affects a specific parameter estimate."
In the (vast) majority of cases, a well-specified research question (and there may be many in a major analysis) can be addressed by focusing on a single parameter in some appropriate statistical analysis/model. I call this the focal parameter, beta.focal, and in frequentist-land, it will have an estimate and a CI--and, if you really must, a p-value for comparing the estimate to some (often extremely unlikely) common point null value. Bayesians will obtain posterior distributions.
Let's assume the data have been cleaned and all cases in the analysis conform to the study's inclusion/exclusion criteria.
In frequentist statistical modeling, the key influence measures for me are simply the common DFBETAs for beta.focal, which assess how much each case influences the estimate tied to this particular research question. A "DFBETA" kind of diagnostic can be developed for any statistic.
The question is not usually about whether or not to discard the most influential case(s) and reanalyze, although you may find cases that are influential because something is just plain wrong. If so, just correct them.
Rather, the point is to find those cases that are the most important ones for understanding the answer to this particular question. These cases may not be "outliers" in any negative "extreme" sense. In fact, they may be among the most valuable cases in the whole data set, the ones that lead to the greatest jump forward in learning what Mother Nature is trying to tell us through these data and this particular analysis.
Such is the way of science.
-------------------------------------------
Ralph O'Brien
Case Western Reserve University
-------------------------------------------
Original Message:
Sent: 06-23-2011 14:24
From: Michael Chernick
Subject: Outlier detection, its remedies and computations methods in "establishment surveys"
For univariate outliers, a single outlier that is very extreme will show up nicely on a box plot. However, the problem with these simple approaches is the masking effect. Outliers can inflate the mean and the variance, and thus their extreme effect is at least partially masked. Also, in the case of multiple outliers, the most extreme outlier gets masked by the others.
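The masking effect can be illustrated with a small sketch, assuming a simple z-score screen as the "simple approach". The helper max_abs_z and the sample values are made up for illustration only.

```python
from statistics import mean, stdev

def max_abs_z(values):
    """Largest absolute z-score, using the sample mean and sample SD."""
    m, s = mean(values), stdev(values)
    return max(abs(v - m) / s for v in values)

# Made-up illustrative sample clustered around 10.
base = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
one_outlier = base + [30]
two_outliers = base + [30, 31]

# The second outlier inflates the mean and SD, so the largest z-score shrinks.
print(max_abs_z(one_outlier), max_abs_z(two_outliers))
```

In this sample the largest z-score is smaller with two outliers than with one, even though the second data set is the more contaminated of the two.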
The appropriate tests for univariate outliers take account of the fact that you are looking at the most extreme observations (see Dixon's test or Grubbs' test).
For multivariate outliers, there is the added problem of which direction to call extreme. Like Gnanadesikan, I advocated the influence function approach. The influence function measures how much a single extreme observation affects a specific parameter estimate. When I worked on data validation, I felt that for some data sets the correlation between two variables might be important to estimate. Then the contours of constant influence are elliptical. These contours point the way to the direction in the plane that has the greatest effect on the estimate of correlation. This approach can be applied to a whole host of parameters that may be of interest.
-------------------------------------------
Michael Chernick
Director of Biostatistical Services
Lankenau Institute for Medical Research
-------------------------------------------