A further thought about what I mentioned about external forces being entangled with treatment effects. If one is using p-value arguments, perhaps one should put both upper and lower bounds on credible p-values. A p-value that is too small for the expected treatment effect size (and sample size) is just as implausible as one that is too large. My first statistics book mentioned that R. A. Fisher proposed Mendel had fudged his pea genetic trait counts because the fit to his theory was too good. The same possibility needs to be considered in my discipline.

In medicine, if one has experience, one has a general idea of the effect of a particular treatment, especially a medical treatment, and can tell whether or not the calculated effect size or significance is appropriate. But there are two sides to the effect: it can be too small or too large, and both need to be considered. An effect that is too large generally means there were external forces acting with (confounding) the treatment effect; usually the source of the problem is outside the data that have actually been collected. These are things like changes in the way data were originally coded, changes in testing procedures, changes in what data are collected, misclassification of conditions, policy changes and changes in clinical guidelines, decisions of local department heads, changes in suppliers, seasonal or environmental changes, changes in physician and support staffing, and myriad other possibilities. These things can also work against an effect that one is confident is present. And these problems are probably magnified by at least an order of magnitude in retrospective studies.
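To make the Fisher/Mendel point concrete, here is a minimal sketch of a "fit too good" check using the left tail of a chi-square goodness-of-fit statistic. The counts are invented for illustration and are not Mendel's actual data.

```python
from scipy import stats

# Minimal sketch of the "fit too good" check: for a goodness-of-fit
# test, an implausibly SMALL chi-square (left tail) is as suspect as a
# large one. The counts below are invented, not Mendel's actual data.
observed = [224, 76]          # hypothetical dominant/recessive counts
expected = [225, 75]          # 3:1 Mendelian ratio for n = 300

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
left_tail = stats.chi2.cdf(chi2, df)  # P(fit this good or better)
print(f"chi2 = {chi2:.4f}, P(fit this good or better) = {left_tail:.3f}")
# A very small left-tail probability means the agreement with theory
# is better than sampling variation should plausibly allow.
```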
Perhaps one step toward a better procedure would be to estimate the apparent effect size and a likely confidence or credibility interval, and then a measure of how sure we are that it is different from zero, or whatever is appropriate. The p-value is currently the commonly used (and misused) measure of this surety. Given the sample size, one may calculate an expected interval of p-values [as the measure of surety], bounded on the low side and on the high side, that would be in keeping with credibility. P-values falling outside this interval would mark the results as inconclusive.
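As a rough illustration of that calibration, here is a minimal sketch assuming a balanced two-group comparison, a standardized effect size (Cohen's d), and a two-sided t-test. The function name, the example numbers, and the 95% plausibility band are all illustrative assumptions, not a fixed recipe.

```python
import numpy as np
from scipy import stats

def plausible_p_interval(effect_size, n_per_group, coverage=0.95):
    """Interval of two-sided t-test p-values consistent with an assumed
    standardized effect size (Cohen's d) in a balanced two-group study."""
    df = 2 * n_per_group - 2
    # Noncentrality parameter of the two-sample t statistic under the
    # assumed true effect.
    ncp = effect_size * np.sqrt(n_per_group / 2)
    alpha = 1 - coverage
    # Central band of t statistics we would plausibly observe.
    t_lo, t_hi = stats.nct.ppf([alpha / 2, 1 - alpha / 2], df, ncp)
    # Map each t quantile to its two-sided p-value and order the pair.
    p_pair = sorted(2 * stats.t.sf(abs(t), df) for t in (t_lo, t_hi))
    return tuple(p_pair)

lo, hi = plausible_p_interval(effect_size=0.4, n_per_group=50)
print(f"plausible p-values: [{lo:.2e}, {hi:.2e}]")
# An observed p-value far below `lo` suggests the effect is implausibly
# large for this design (external forces at work); one far above `hi`
# suggests the expected effect did not materialize.
```

The band is wide by construction; the point is only that p-values well outside it, in either direction, deserve suspicion rather than celebration.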
Raoul Burchette, KP Research and Evaluation
------------------------------
Original Message:
Sent: 10-16-2015 17:36
From: Raoul Burchette
Subject: Banning p-values, continued
In the clinical medicine arena, the most common problem I see with p-values is that people treat them as evidence for something when they should really be treated as evidence against something. We are measuring, or fancy we are measuring, how well chance alone (the null hypothesis) explains the data. But just because the difference is so great that we get a significant p-value (at whatever level we choose) doesn't mean it is evidence for the effect to which we would like to attribute it. On deeper inquiry it can often be found that the difference is driven by confounding factors or a systematic bias that has nothing to do with the treatment at all, by unmeasured factors, or by other types of bias (selection bias, misclassification bias, etc.). This is a particular problem in retrospective reviews, where there are often too many other things that could explain the observed differences, but people's memories have grown foggy and the connections are difficult to trace. This may be a training deficiency, but perhaps the concept itself needs to be reworked altogether. I would be happy to see what else someone can come up with.
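As a toy illustration of a "significant" p-value that is evidence against chance alone but not evidence for a treatment effect, here is a small simulation with invented numbers, in which a systematic bias of 0.3 standard deviations, with a true treatment effect of zero, routinely produces p < 0.05.

```python
import numpy as np
from scipy import stats

# Toy simulation: a true treatment effect of ZERO plus a systematic
# bias (e.g., a coding change affecting only the treated group) still
# yields a "significant" p-value most of the time. Numbers are invented.
rng = np.random.default_rng(2015)
n = 200
bias = 0.3  # systematic shift in standard-deviation units
treated = rng.normal(loc=bias, scale=1.0, size=n)  # effect = 0, bias > 0
control = rng.normal(loc=0.0, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# The small p-value is genuine evidence against "chance alone", but it
# is not, by itself, evidence for the treatment.
```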
Raoul Burchette, MA, MS
Senior Biostatistician III
SCPMG Research and Evaluation