Sorry I didn't mention this earlier. I had to be reminded of it by someone else....
Suppose there are 2 covariates in your data that were tracked and can be used. On one of my instruments, I ran a DOE to see what would happen to the reported concentration of a CCV (continuing calibration verification standard). I set the gas flows to the lowest setting that was "Acceptable" and the highest setting that was "Acceptable" and looked at combinations of those.
It turns out that all 7 gases had a statistically significant impact on the reported concentration. However, only 2 gas flows really mattered for changing the reported concentration of the CCV. As long as I kept those 2 gas flows "in the middle", which meant constantly adjusting them, my results would be good. If I ran some samples and left the room for more than, say, 45 minutes, the gas flows would drift so dramatically that my results would be flagged as "Outliers" by the criteria I had set. But they only looked like outliers because I hadn't accounted for those 2 gas flows in the model.
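Here's a tiny numeric sketch of that effect (made-up numbers, not my actual instrument data): if the reported concentration depends linearly on a gas flow and the flow drifts for a few runs, those runs look like outliers only when the flow is left out of the model.

```python
# Hypothetical illustration: reported CCV concentration depends on a gas flow.
# The sensitivity (slope) and flow values below are invented for the sketch.
flows = [1.00, 1.01, 0.99, 1.00, 1.02, 1.45, 1.50]  # last two runs drifted
true_conc = 50.0
slope = 20.0  # assumed sensitivity of reported conc. to flow (made up)
reported = [true_conc + slope * (f - 1.0) for f in flows]

# Ignoring the covariate: residuals from the overall mean
mean_rep = sum(reported) / len(reported)
resid_naive = [r - mean_rep for r in reported]

# Accounting for the covariate: residuals after the flow effect is modeled
resid_adj = [r - (true_conc + slope * (f - 1.0)) for r, f in zip(flows and reported, flows)] if False else \
            [r - (true_conc + slope * (f - 1.0)) for r, f in zip(reported, flows)]

print(max(abs(r) for r in resid_naive))  # drifted runs look extreme
print(max(abs(r) for r in resid_adj))    # essentially zero once flow is in the model
```

The drifted runs are perfectly ordinary once the covariate is in the model; they only look extreme relative to a model that ignores it.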
If your case is like mine, then instead of using a dummy variable, see what other covariates could be added to your model.
Since this is a DOE you are doing, have you checked which terms are in the model, i.e., added the important ones and removed the non-significant ones?
I've had data where, if I took a bottom-up approach (meaning I add terms instead of removing them) and left out an important interaction, it looked like I had significant outliers. Once I put those interactions into the model, the outliers went away.
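Here's a toy simulation of that point (invented coefficients, not real data): the response has a genuine x1*x2 interaction, and the factorial corner runs look like outliers only under the main-effects-only fit.

```python
# Toy sketch: a replicated 2^2 factorial with center points, where the true
# model contains an x1*x2 interaction. All numbers below are assumptions.
import numpy as np

rng = np.random.default_rng(1)
x1 = np.array([-1, -1, 1, 1] * 2 + [0, 0], dtype=float)
x2 = np.array([-1, 1, -1, 1] * 2 + [0, 0], dtype=float)
y = 5 + 2 * x1 + 3 * x2 + 6 * x1 * x2 + rng.normal(0, 0.2, x1.size)

X_main = np.column_stack([np.ones_like(x1), x1, x2])  # main effects only
X_full = np.column_stack([X_main, x1 * x2])           # + interaction term

max_resid = []
for X in (X_main, X_full):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    max_resid.append(float(np.max(np.abs(y - X @ beta))))

print(max_resid)  # large residuals at the corners vanish once the interaction is in
```

The "outliers" in the first fit are just the interaction effect the model wasn't allowed to express.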
I've also seen cases where I or someone else designed an experiment in "Software A", but the person I'm helping with the data used multiple linear regression in "Software B". Even though Software A and B are both doing multiple linear regression on the same data, Software A will show low VIF on interaction terms and quadratic terms while Software B will show high VIF on those same terms. That leads to the removal of terms that should be in the model, while some terms stay in the model that should be removed. The result can be "outliers" that are only there because of the model you selected, which depends on the data provided and the software you use, and have nothing to do with the reality of the system.
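One plausible mechanism for that VIF discrepancy, and this is my guess rather than anything documented for either package, is whether the software centers the factors before forming the quadratic and interaction columns. A quick sketch of the two-column case:

```python
# Sketch (my assumption about the cause, not verified against any package):
# a raw factor x and its square are nearly collinear, so x^2 shows a huge VIF;
# center x first and the collinearity largely disappears.
import numpy as np

x = np.linspace(5.0, 10.0, 21)   # raw factor settings (made-up range)
xc = x - x.mean()                # centered version of the same factor

def vif_quadratic(v):
    """VIF of v**2 in a model that also contains v (two-predictor case)."""
    r = np.corrcoef(v, v ** 2)[0, 1]
    return 1.0 / (1.0 - r ** 2)

print(round(vif_quadratic(x), 1))   # huge: x and x^2 move together
print(round(vif_quadratic(xc), 1))  # near 1: centering removes the collinearity
```

Same data, same regression, wildly different VIFs, purely from how the columns were constructed.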
Just out of curiosity:
1) How many of your 30 observations come from DOE vs "Lab Data"?
2) What design did you use?
3) How was the lab data collected vs. the DOE data? (I have found the day-to-day drift on instruments can sometimes be profound. But the scientists won't do anything because 'they' don't see it until it bites them in the rear.... And just so we are clear: even if you are 100% accurate in predicting when things go bad, and you warn your boss that these issues WILL come up, and they do nothing, it's still YOUR fault.... even though you did what they said to do.)
------------------------------
Andrew Ekstrom
Statistician, Chemist, HPC Abuser;-)
------------------------------
Original Message:
Sent: 06-23-2023 12:25
From: Georgette Asherman
Subject: Turning outliers into indicator variables
I have a stats question and appreciate the skill level of this group. I have a regression analysis with, let's say, 1 continuous response and 3 continuous predictors from 30 observations: a mix of pre-set DOE levels and some other lab data at central values. This final result is obtained after a popular selection method is used. Using commonly accepted criteria, 3 observations are assessed as outliers and influencers. Removing the outliers and fitting a regression on the 27 remaining rows yields a very different model.
In good classical fashion, the rationale for the outliers is assessed. The engineer provides a lab-operations reason, not in the original file, for these 3 observations. Is it good statistical practice to create an indicator variable that identifies the 3 outliers, add it to the 'x' columns, and refit the model?
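[For reference, here is a minimal sketch of the refit being asked about, on simulated data, since the real data set isn't in the thread: flag the 3 suspect rows with a 0/1 column and refit, rather than deleting them.]

```python
# Simulated stand-in for the situation described: 30 rows, 3 predictors,
# 3 rows shifted by an unrecorded lab event. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 3))                         # 3 continuous predictors
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, size=n)
bad = [4, 17, 26]                                   # rows hit by the lab event
y[bad] += 5.0                                       # shift them

flag = np.zeros(n)
flag[bad] = 1.0

X_plain = np.column_stack([np.ones(n), X])          # model as originally fit
X_flagged = np.column_stack([X_plain, flag])        # + indicator column

max_resid = []
for Xm in (X_plain, X_flagged):
    beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
    max_resid.append(float(np.max(np.abs(y - Xm @ beta))))

print(max_resid)  # the 3 rows stop looking like outliers once flagged
event_size = float(np.linalg.lstsq(X_flagged, y, rcond=None)[0][-1])
print(round(event_size, 2))  # the indicator's coefficient estimates the shift
```

Unlike deletion, the indicator keeps all 30 rows and gives a direct estimate of the size of the lab event, though whether that is good practice is exactly the question posed.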
------------------------------
Georgette Asherman
------------------------------