Discussion: View Thread

Turning outliers into indicator variables

  • 1.  Turning outliers into indicator variables

    Posted 06-23-2023 12:25

    I have a stats question and appreciate the skill level of this group. I have a regression analysis with, let's say, 1 continuous response and 3 continuous predictors from 30 observations (a mix of pre-set DOE levels and some other lab data at central values). This final result is obtained after a popular selection method is used. Using commonly accepted criteria, 3 observations are assessed as outliers and influencers. Removing the outliers and fitting a regression on the remaining 27 rows yields a very different model.

    In good classical fashion, the rationale for the outliers is assessed. The engineer provides a lab operations reason, not in the original file, for these 3 observations. Is it good statistical practice to create an indicator variable that identifies the 3 outliers, add it to the 'x' columns, and refit the model?



    ------------------------------
    Georgette Asherman
    ------------------------------


  • 2.  RE: Turning outliers into indicator variables

    Posted 06-23-2023 12:50

    I'll defer to others in the section about how best to fit the model with or without outliers. There is a set of regression diagnostic statistics in SAS and R; in particular, Frank Harrell's Hmisc library has tools for estimating "leverage," lots of graphing options, and so forth. Possibly a transformation of some variables may be in order.



    ------------------------------
    Chris Barker, Ph.D.
    2023 Chair Statistical Consulting Section
    Consultant and
    Adjunct Associate Professor of Biostatistics
    www.barkerstats.com


    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy
    ------------------------------



  • 3.  RE: Turning outliers into indicator variables

    Posted 06-23-2023 13:30

    Hey Georgette,

    It sounds like the scientists or lab tech knows *why* these observations are outliers.

    Are they all outliers for the same reason? What is this reason? E.g., is the reason more like a treatment/factor? Or is the reason on a continuum?

    Which "popular selection method" was used?



    ------------------------------
    Glen Wright Colopy
    DPhil Oxon
    Host | The Data & Science Podcast
    Head of Data Science | Alesca Life Tech Ltd
    ------------------------------



  • 4.  RE: Turning outliers into indicator variables

    Posted 06-23-2023 14:23

    As someone who has worked in industry and seen the damage that removing "outliers" does to people, I'd keep the data in the model.

    At one production facility I worked at, the engineers used a 10% truncated mean and standard deviation for their data. The mean was never really severely impacted by the truncation. The standard deviation was ALWAYS impacted by this AND would cause a lot of issues once we sold product to customers. For example, we would have a truncated mean and standard deviation of, say, mean = 32 units, std dev = 4 units. The customer would reject our supply if over 5% of their samples failed to meet a 25-unit threshold. According to our results, this occurs 4% of the time.

    The customer DOES NOT use truncated data. They would find the mean to be, say, 32 units and the std dev to be 7 units. Based upon the full data, our product fails about 16% of the time. So they reject our batch.

    If you look at the water quality data from Flint, MI from a few years ago, the reason the water crisis went on for so long was that the scientists and engineers decided a couple of the "high" samples were invalid because they were "too much" of an outlier.... based upon their (ignorant) assumptions. Had someone competent done the analysis, they would have used an appropriate method to test the results and found that there was NO OUTLIER!

    I've also worked with chemical analysis systems. We had what is called a rising baseline. Meaning, if we measured the same thing 20 times in a row, we could plot the reported concentration of the analyte as a function of where it sat in the run sequence. We might get something like: % recovery = 68% + 3% * run position. So the 40th sample run will have a reported % recovery of 188%. Clearly an outlier! By using the model I created for the data, we can create a correction factor and report back a more "true" concentration of the analytes no matter what position they are in the run sequence.
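Using that hypothetical drift model, the correction factor is just a division by the modeled recovery (a minimal sketch; the numbers follow the example above, not any real instrument):

```python
# Hypothetical drift model from above: % recovery = 68 + 3 * run_position
def corrected_concentration(reported, run_position):
    """Back out a more 'true' concentration from a drifting instrument reading."""
    recovery = (68 + 3 * run_position) / 100.0  # fractional recovery at this position
    return reported / recovery

# A sample at run position 40 is measured at 188% recovery,
# so a reading of 94 units corresponds to 50 units of actual analyte
print(corrected_concentration(94.0, 40))  # 50.0
```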



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 5.  RE: Turning outliers into indicator variables

    Posted 06-23-2023 19:09

    Hello Amigos, my apologies that I can't go into this deeply today, but I appreciate seeing this discussion. My contribution is that having looked at various literature items in the past to identify techniques for naming and handling outliers, my conclusion is that the field of statistics has not come to a definitive conclusion about what an outlier is and what to do about it.

    My own inclination is to do the analysis with and without the so-called outliers, but again, being an "outlier" is a tricky thing. If you have a population that is a mixture of a main population and a small fraction of a second population that generates "outliers," then an outlier might pop up every hundred observations, or every 30, or every thousand, or whatever, but they do represent the population, since the population is a mixture. At this point, my only suggestion is to handle this with common sense and good judgment, and be prepared for some arrows.

    Nayak



    ------------------------------
    Nayak Polissar
    Principal Statistician
    The Mountain-Whisper-Light: Statistics & Data Science
    ------------------------------



  • 6.  RE: Turning outliers into indicator variables

    Posted 06-23-2023 16:27

    Essentially the interpretation would be that there is an underlying lab operations effect for these 3 cases that is not true for the other cases, and you're trying to capture it as a covariate, thus "factoring out" (or sucking out, if you will) the effect and reducing the contamination of the rest of the model ;-) We could debate the statistical correctness of such an approach, but in practice this is not uncommon, especially in "observational data" scenarios, and it works reasonably well depending on the ultimate goal of the modeling (basically explanatory vs. predictive). That said, I would be a little concerned with only 3 cases taking on the value "true" in this scenario: you can get some pretty wacky-looking estimates. You might just give it a shot to see what happens, but watch out for the size of the estimate and the size of its standard error in the context of the rest of the model. While it's hard to pinpoint what threshold you should look for, if it looks weird, it probably is!
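To make that concrete, here is a minimal sketch (simulated numbers of my own, assuming plain OLS, not the original data) of adding the indicator and watching its estimate against its standard error:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)
flagged = [4, 11, 23]   # the three "outlier" rows (made up for illustration)
y[flagged] -= 3.0       # a hypothetical lab-operations shift

# Add a 0/1 indicator for the flagged rows and refit
ind = np.zeros(n); ind[flagged] = 1.0
Xd = np.column_stack([X, ind])
beta = np.linalg.lstsq(Xd, y, rcond=None)[0]

# Estimate vs. standard error for the indicator: the thing to watch
resid = y - Xd @ beta
s2 = resid @ resid / (n - Xd.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
print(beta[-1], se[-1])  # should recover roughly the planted -3 shift
```

With only three 1s in the column, the indicator's standard error is driven almost entirely by those three rows, which is why a wacky estimate-to-SE ratio is the warning sign.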

    Another thing to consider when doing this: you assume there are no interactions with this "dummy," or you might want to test for interactions. Whether the latter can be done comfortably with the number of observations you have is, again, a bit of a judgment call.

    Given that you have a total of 30 observations, calling 3 of them outliers worthy of removing (i.e., 10% of the total) makes me nervous, and it may even give a false sense of precision, for example if the said factor is part of the normal occurrence and/or if the model is to be used as a predictive algorithm. But I'm probably biased from having lived such a long time in the predictive modeling world. Again, depends on what the goal of the modeling is.



    ------------------------------
    Michiko Wolcott
    Principal Consultant
    Msight Analytics
    ------------------------------



  • 7.  RE: Turning outliers into indicator variables

    Posted 06-24-2023 13:55

    It depends partly on the nature of the outliers, but as a general proposition, including an outlier indicator as an explanatory variable in a regression is problematic, especially with a relatively small dataset.

    At the very least, creating new variables in a model based on an outlier assessment derived from the very same data will result in suspect p-values.  No standard p-value calculation accommodates this complicated interaction between variable creation and stochastic modeling of the responses.

    When the outliers are outlying residuals, the basic issue is that the conditional distribution of the outlying data is unlikely to be anything like the conditional distribution of the remaining data.  Often it would have a greater variance, for instance, especially when the outliers are in both directions: outliers, by their nature, vary a lot!  Thus, in addition to incorporating a new explanatory variable, you will have to fit a model with heteroscedastic responses: not great when the observation : explanatory variable ratio is already down to just 10:1.

    A better approach is to fit a robust regression (essentially a least squares fit via IWLS). But it sounds like it won't tell you anything you don't already know: these data are indeed outliers and they do affect the fit. One advantage of the robust regression, though, is that you can tell your client that the outlier identification is principled and not entirely up to your judgment. Another is that it is amenable to easy bootstrapping, simulation, and cross-validation, because the outlier identification and model fitting are performed in a single operation. This won't help you here, with such a relatively small dataset, but in other circumstances it's well worth considering.
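For what it's worth, the IWLS idea sketches out in a few lines (my own simulated data; in practice one would reach for a packaged implementation such as `rlm` in R's MASS or `RLM` in statsmodels rather than hand-rolling it):

```python
import numpy as np

def huber_irls(X, y, c=1.345, iters=50):
    """Robust regression via iteratively reweighted least squares (Huber weights)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from the OLS fit
    for _ in range(iters):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale estimate
        u = np.abs(r) / (scale + 1e-12)
        w = np.where(u <= c, 1.0, c / u)  # weight 1 inside c, downweighted outside
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta, w

# Simulated example: 30 observations, 3 of them contaminated
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=30)
y[:3] += 8.0  # the three "outliers"

beta, w = huber_irls(X, y)
print(np.round(w[:5], 2))  # the contaminated rows get small weights
```

The final weights are the principled outlier identification: the fit downweights the contaminated rows on its own, with no judgment call about which rows to drop.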

    When the outliers are geometric outliers or (almost equivalently) outlying values of the explanatory variables, the situation is more delicate, because they likely are highly influential points.  Since you get a "very different model" depending on their inclusion, some of them must also be high-leverage points.  This is clear evidence that your conclusions are sensitive to this small subset of the data.  That might be the most important message (and even the most definite message) you can convey.  Your engineers will need to think hard about how their judgment concerning a small subset of the data, if incorporated in your modeling, determines the outcome.

    (In my experience with engineers, which now spans almost 40 years, they are very good at coming up with plausible post hoc explanations for anything.  I respect that, because frequently those explanations provide insight, but I have been around long enough to know that our seemingly nit-picking "theoretical" concerns about data analysis or statistical modeling truly are valid in the real world.  Repeated opportunities to explain away unusual or unexpected data all too easily turn into a self-serving feedback loop that prevents anyone from learning anything about the system or phenomenon being studied.)



    ------------------------------
    William Huber
    Analysis & Inference / Quantitative Decisions
    ------------------------------



  • 8.  RE: Turning outliers into indicator variables

    Posted 06-26-2023 07:14

    This has been an interesting discussion, and it has some historical connection. The lab notebooks of Robert Millikan, who won a Nobel Prize about 100 years ago for determining the charge of the electron, are available online from Caltech. In those notebooks, Millikan records runs of data with notations (I'm paraphrasing) like "great data set - publish this" and "don't use these." Lest we think that a Nobel Prize was awarded for cherry-picking data, in each case he also had explanations for why he included or excluded the dataset. Examples of excluded data were due to faulty ground wires, contamination of the oil, etc. So it appears he did things much like what has been discussed in this thread, thinking about both the statistical aspects (although his use of statistics was much less sophisticated than this thread's) and the experimental aspects.

    Barney



    ------------------------------
    Bernard Ricca
    Lyda Hill Institute for Human Resilience
    ------------------------------



  • 9.  RE: Turning outliers into indicator variables

    Posted 06-26-2023 02:48

    The DOE may have lost its balance with the removal of the 3 observations, so the new variable may lead to less biased estimates than removal would.

    Were all 30 observations assessed when assigning values of exactly zero or one in the new variable? Thinking of the O-ring disaster, one may fear that the new variable could be very important (see Andrew's post).



    ------------------------------
    Reinhard Vonthein
    Universitaet zu Luebeck
    ------------------------------



  • 10.  RE: Turning outliers into indicator variables

    Posted 06-27-2023 11:01

    The paper below is one which I find useful in considering how to handle outliers. 

    Aguinis, H., Gottfredson, R. K., & Joo, H. (2013). Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods, 16(2), 270-301. https://doi.org/10.1177/1094428112470848 



    ------------------------------
    Steven Pierce
    Associate Director
    Center for Statistical Training and Consulting, Michigan State University
    ------------------------------



  • 11.  RE: Turning outliers into indicator variables

    Posted 06-27-2023 12:59

    Hi Georgette,

    I won't offer any judgement or moral guidance on dealing with outliers, but your question reminded me of a problem set from David Freedman's models course.

    In terms of the linear algebra, though, I can try to remember some of the intuition, for OLS at least: 

    • adding a dummy variable for a single point has a strong/direct relationship with the (single-point) jackknife estimate of the slopes, and with the jackknife residual
      • some intuition here is also found in counting degrees of freedom: (n - 1) - p vs n - (p + 1)
    • also: aggregating two columns together is like constraining two slopes to be equal

    Putting those together, adding one dummy for three observations is going to be like fitting a model where the three jackknife residuals are constrained to be equal. This will be a little different from the jackknife-3 "unconstrained" estimate. Which sounds a little odd, but if it solves your problem and that's what your client wants, fine; every data set is a little different.

    Also, to think it through: if the three jackknife residuals cancel each other out (equal magnitude, opposite signs), that dummy's coefficient should be near zero, and you get something like the original fit. If the three jackknife residuals are large and same-signed, the dummy coefficient will be large also.

    This is an argument by analogy, but maybe it helps to think and reason about the problem. I did do this as homework at some point, though.
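To make the single-point case concrete, here is a tiny simulation (my own hypothetical numbers, plain OLS) showing that a one-point dummy reproduces the leave-one-out slopes exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)
y[7] += 10.0  # plant one outlier at observation 7

# Fit 1: OLS with a one-point dummy for observation 7
d = np.zeros(n); d[7] = 1.0
beta_dummy = np.linalg.lstsq(np.column_stack([X, d]), y, rcond=None)[0]

# Fit 2: OLS with observation 7 deleted entirely
keep = np.arange(n) != 7
beta_loo = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]

# The slope estimates agree exactly (the "mean-shift outlier" model),
# and the dummy absorbs observation 7's residual completely
print(np.allclose(beta_dummy[:p + 1], beta_loo))  # True
```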

    A quick and dirty worked example is here - https://gist.github.com/nfultz/9dd68300671780ffdbaed60da96b4d22 - although my memory seems to be weak for the final point, it's off at the third decimal :(

    ---

    A second thought: if you are also using some method to choose variables, it may interact with this setup. A lasso may zero out the dummy coefficient (e.g., equivalent to constraining the jackknife residuals to be equal and opposite) and otherwise shrink your main coefficients along a different path; there's no guarantee that a decision tree splits out the outliers in a fashion that makes sense; etc. AICs should be comparable, which is generally not true for comparing models fit on different arbitrary subsets, for whatever that's worth.

    And if you are using some method for choosing variables anyway, you can offer it both the combined three-in-one column and the three one-point dummies, and let it figure out which structure it prefers. It will probably be overfit, but at least you can blame the machinery then.

    Best,

    Neal



    ------------------------------
    Neal Fultz
    ------------------------------



  • 12.  RE: Turning outliers into indicator variables

    Posted 06-27-2023 15:11

    Hello,

    I agree with others who suggested fitting the model with and without the outliers to show how the results change. The big question for me would be about the nature of the "lab operations reason" for the differing results. I'm not sure I like the idea of simply adding another covariate. You have to assume that the lab operation effect would only be on the mean and not on any of the other effects (e.g., the DOE variables). I would want to try to determine which lab operation approach to favor. What do the engineers/subject matter experts think about how the lab operations effect would be expected to bias results relative to "true" results? Overall, I would be doing more to better understand the lab operation reason and its hypothetical effect on results.

    I hope that helps.

    Regards,



    ------------------------------
    Robert Podolsky
    Biostatistician
    Children's National Hospital
    ------------------------------



  • 13.  RE: Turning outliers into indicator variables

    Posted 06-27-2023 19:12

    Sorry I didn't mention this earlier. I had to be reminded of it by someone else....

    Suppose there are 2 covariates in your data that were tracked and can be used. On one of my instruments, I performed a DOE to see what would happen to the reported concentration of a CCV. I set the gas flows to the lowest setting that was "acceptable" and the highest setting that was "acceptable" and looked at combinations of those.

    It turns out that all 7 gases had a statistically significant impact on the reported concentration. However, only 2 gas flows really mattered as far as changing the reported concentration of the CCV. As long as I kept those gas flows "in the middle," which meant constantly adjusting them, my results would be good. If I ran some samples and left the room for more than, say, 45 mins, the gas flows would change so dramatically that my results would be considered "outliers" by the standards I had set. However, the only reason they looked like outliers is that I didn't take into account those 2 gas flows.

    If your case is like mine, instead of using a dummy variable, see what other covariates could be used in your model.

    Since this is a DOE you are doing, have you added or removed all the non-significant terms from the model? 

    I've had data where, if I took a bottom-up approach (meaning I add terms instead of removing them) and there was an important interaction, it looked like I had significant outliers. Once I put those interactions into the model, the outliers went away.
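A quick simulated illustration of that effect (made-up numbers, ordinary least squares): omit a real interaction and the residual spread balloons, making ordinary points look like outliers.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# True model contains an x1*x2 interaction
y = 1 + 2 * x1 - x2 + 3 * x1 * x2 + rng.normal(scale=0.3, size=n)

def residuals(*cols):
    """OLS residuals for an intercept plus the given columns."""
    X = np.column_stack([np.ones(n), *cols])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ beta

r_main = residuals(x1, x2)            # interaction omitted: apparent "outliers"
r_full = residuals(x1, x2, x1 * x2)   # interaction included: they vanish
print(np.round(r_main.std(), 2), np.round(r_full.std(), 2))
```

The main-effects-only fit dumps the entire interaction signal into the residuals, so the points where both x1 and x2 are large look like outliers even though nothing is wrong with them.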

    I've also seen data where I or someone else designed an experiment with "Software A" but the person I'm helping with the data used multiple linear regression in "Software B". Even though Software A and B are both doing multiple linear regression on the data, Software A will show low VIF on interaction terms and quadratic terms while Software B will show high VIF on those same terms. That leads to the removal of terms that should be in the model, while some terms stay in the model that should be removed. That can lead to "outliers" that are only there because of the model you selected, based upon the data provided and the software you use, and have nothing to do with the reality of the system.

    Just out of curiosity:

    1) How many of your 30 observations come from DOE vs "Lab Data"?

    2) What design did you use?

    3) How was the lab data collected, vs the DOE data? (I have found the day-to-day drift on instruments can sometimes be profound. But the scientists won't do anything because 'they' don't see it until it bites them in the rear.... and just so we are clear, even if you are 100% accurate in predicting when things will go bad, and you complain to your boss that these issues WILL come up, and they do nothing, it's still YOUR fault.... even though you did what they said to do.)



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 14.  RE: Turning outliers into indicator variables

    Posted 06-28-2023 21:12
    Georgette, permit me: this is not an answer to the question you asked. Please indulge me and permit me to offer my as-humorous-as-possible comment on the possible over-use (a term I'll explain below) of indicator variables, and my fundamental personal objection to using indicator variables except in very specific situations. A long time ago I worked in a Health Economics department at a very big pharma (Roche). My VP is and was a Stanford-trained health economist with a legendary-to-health-economics dissertation advisor, Victor Fuchs, author of the original and groundbreaking health economics text "Who Shall Live?". That line of thinking helped to elevate health economics as a research field.
    I shall call him Lou (coincidentally, his name is and was Lou). Lou regularly insisted I prepare statistical analyses with "dummy variables." So much insistence that on one project I was required to use the NAMCS (National Ambulatory Medical Care Survey), a crown-jewel federal database built on a multi-stage cluster survey. Seeing an opportunity to demonstrate my biostatistical prowess to Lou and to the top economist at Roche at the time, I had the folks who write and maintain SUDAAN prepare a package of materials about the software, and I may even have had a team from the developers visit our offices in Palo Alto to demonstrate it. I presented the package to Lou, explaining that I required SUDAAN to analyze NAMCS. Lou may have (or probably did not) read the materials and replied, "Thanks, but we don't need this $xxxx.xx package. Too expensive. Just use dummy variables." (Where $xxxx.xx was a very small number relative to the departmental budget and even smaller relative to the Roche corporate budget.)
    Lou refused to budge, and out the window went my hopes for the fancy and elegant correct analysis (cluster sandwich estimation, etc.) of a multistage cluster survey.
    I mulled this over and was quite disappointed. However, I also had a generous budget for ordering articles (several gigs of scans of those articles reside comfortably on my hard drive). One day I was reading an article, my best recollection by Richard Royall (in my estimation, a Mick Jagger-level rock star of survey statistics), and (again, my best recollection) found a sentence in a JRSS article by Richard stating something like "and of course one clearly should never attempt to use indicator variables in a statistical analysis to indicate the clusters in a multistage survey." I dashed into Lou's office, interrupting his phone call, and pointed to the sentence by Royall. For a moment I thought I would soon be able to order SUDAAN.
    Lou remained skeptical and replied, "Just use dummy variables."
    Noting my once-in-a-lifetime career opportunity, I replied with a very big smile (and Lou and I are still friends):
    "NO? But, but, but, Lou, it's time for you to know that dummy variables are for .... <fill in the blank>" 🙂
    Lou was very amused but unpersuaded.
    I didn't get budget for SUDAAN, but the project didn't proceed. And fortunately there are now R libraries for multistage cluster survey analysis.
    I regret that I don't recall the specific additional methodological reasons why the dummy variable approach was not a good one; my recollection is that the article was reasonably clear from a technical perspective.
    Hence arises my general caution about using "indicator" or "dummy" variables for nearly anything other than the general linear model and the generalized inverse.


    ------------------------------
    Chris Barker, Ph.D.
    2023 Chair Statistical Consulting Section
    Consultant and
    Adjunct Associate Professor of Biostatistics
    www.barkerstats.com


    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy
    ------------------------------



  • 15.  RE: Turning outliers into indicator variables

    Posted 06-28-2023 22:58

    Ha! Richard Royall's "Statistical evidence: a likelihood paradigm" was the first *physical/paper* stats book that I read for fun (not for obligation).

    My dad had a hard copy on his shelf of stats references.



    ------------------------------
    Glen Wright Colopy
    DPhil Oxon
    Host | The Data & Science Podcast
    Head of Data Science | Alesca Life Tech Ltd
    ------------------------------



  • 16.  RE: Turning outliers into indicator variables

    Posted 06-28-2023 23:30

    I appreciate all the interesting responses. I will not write an essay addressing every point, but I have some general comments. Yes, engineers are very good at finding post-hoc explanations. In this case it is a blocking variable related to a grouping of lab assays that will be used in future work. The engineer explained that the outlier group will have consistently lower values.

    As Michiko said, we could debate the statistical correctness, but it is a common way to factor out the covariate while getting reasonable estimates of the slopes. We often don't realize what is left out of the model until we run the model. Still, it would be a bad idea to create a new variable solely on regression residual outlier criteria in either direction.

    Regarding the last post: Victor Fuchs is 99, so he is definitely in the group "Who shall live?".



    ------------------------------
    Georgette Asherman
    ------------------------------