How about we make P-values reflect the probability of someone getting different results?
Suppose that we have a t-test where T-value = Difference/Std Err. If we have a difference of 6.00 and a Std Err of 2.00, we get a T-value of 3.00. If the T-critical value is 2.00, we claim, "We reject the null hypothesis." We can create a confidence interval of 6.00 +/- T-crit*Std Err => 6.00 +/- 2.00*2.00 => (2.00, 10.00). The P-value of the test is < 0.05.
If someone else runs a replicate experiment and finds a Difference < T-crit*Std Err, they will fail to reject the null hypothesis. Based upon the data we have, any time someone finds a difference less than 4.00, they get a different result. Differences from 2.00 to 4.00 are all within our original confidence interval. There is about a 15% chance someone will fail to replicate our results! How does a 15% chance of failing to replicate the results lead to a P-value of less than 0.05?
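For anyone who wants to check that figure, here is a rough sketch of the arithmetic. It rests on assumptions that are not stated above: that the true difference really is the observed 6.00, that a replicate has the same Std Err of 2.00, and that the replicate's estimate is approximately Normal. Under those assumptions the chance of landing below 4.00 comes out near 16%, in line with the "about 15%" quoted above.

```python
import math

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Numbers taken from the example in the post.
difference = 6.00
std_err = 2.00
t_crit = 2.00

# A replicate fails to reject the null when its difference is below this.
threshold = t_crit * std_err  # 4.00

# ASSUMPTION: replicate estimate ~ Normal(difference, std_err**2).
p_fail = normal_cdf((threshold - difference) / std_err)

print(f"Rejection threshold: {threshold:.2f}")
print(f"Chance a replicate fails to reject: {p_fail:.3f}")
```

Changing `difference` or `std_err` shows how quickly that chance moves when the true effect is smaller than the one we happened to observe.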
Furthermore, if we look at a lot of the assumptions we use in our tests, there is a lot of evidence telling us our assumptions are WRONG! But we don't listen. Consider for a moment our assumption that we can use a sample that is "representative" of the true population. How often is this assumption true? As best I can tell, not often enough! Think about all the psychology experiments that are not repeatable. Think of all the clinical trials that "look promising" in Phase 2 and fail miserably in Phase 3. If the sample used in Phase 2 was representative of the population, then Phase 3 should bolster our claims, not diminish them.
If we look at something like bootstrapping, we assume the data we already have is representative of the population. Then we resample from this small sample. If the sample is not representative, then bootstrapping is a pointless exercise. A better idea would be to take the data we have and generate hundreds to thousands of random values using the mean and standard deviation of our sample data. That way, we keep the same variability our original sample had.
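A minimal sketch of that idea follows. The post does not say which distribution to draw from, so the Normal distribution here is an assumption, as are the function name `simulate_from_sample` and the sample values, which are made up purely for illustration.

```python
import random
import statistics

def simulate_from_sample(sample, n_draws=1000, seed=0):
    """Draw n_draws random values using the sample's mean and SD.

    ASSUMPTION: values are drawn from a Normal distribution; the
    original suggestion does not name a distribution.
    """
    rng = random.Random(seed)          # seeded so the run is repeatable
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)      # sample SD (n - 1 denominator)
    return [rng.gauss(mean, sd) for _ in range(n_draws)]

# Hypothetical measurements, not real data.
sample = [4.1, 5.6, 6.3, 7.2, 5.9, 6.8]
draws = simulate_from_sample(sample)

print("sample mean/SD:   ", statistics.mean(sample), statistics.stdev(sample))
print("simulated mean/SD:", statistics.mean(draws), statistics.stdev(draws))
```

The simulated values keep the center and spread of the original sample, but unlike resampling they are not restricted to the handful of values that were actually observed.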
For those of us who use regression models, how about we start using R^2(predicted) and confusion matrices as part of our diagnostics? I've seen several "textbook" data sets where the model "looks good" but fails miserably in predicting the event outcome. The first data set I noticed this on was David Kleinbaum's "Evans County Data". About 10% of the people in the study data had CHD. The model David created accurately predicts only 6-12 of the 60+ cases. A confusion matrix lets you know right away that the model isn't very good for predicting who has CHD. But since that isn't a criterion for the model, and software for logistic regression doesn't offer the option, it's up to the statistician to go forth and check things out. What is sad is that David did everything right, based on what we learn in stats classes. He's not to blame. It's the idea that we, as statisticians, can do no wrong with data analysis, and that we keep teaching ourselves and reinforcing bad habits, that is to blame.
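Checking this by hand takes only a few lines. The sketch below builds a confusion matrix from a model's predicted probabilities. The function names, the 0.5 cutoff, and the ten data points are all hypothetical; they are not the Evans County data or Kleinbaum's model.

```python
def confusion_matrix(actual, predicted_prob, cutoff=0.5):
    """Count TP/FP/TN/FN for 0/1 outcomes at a given probability cutoff."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            counts["TP"] += 1
        elif predicted == 1 and y == 0:
            counts["FP"] += 1
        elif predicted == 0 and y == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts

def sensitivity(cm):
    """Share of true cases the model actually catches."""
    return cm["TP"] / (cm["TP"] + cm["FN"])

# Hypothetical outcomes (1 = event) and fitted probabilities.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.62, 0.30, 0.18, 0.05, 0.10, 0.08, 0.55, 0.12, 0.07, 0.09]

cm = confusion_matrix(actual, probs)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
print(cm)
print(f"accuracy = {accuracy:.2f}, sensitivity = {sensitivity(cm):.2f}")
```

In this made-up example the overall accuracy looks respectable because most people do not have the event, while the model still misses two of the three true cases. That is the kind of gap a confusion matrix exposes right away.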
We have the technology to test our assumptions. We have the ability to use the proper regression model for the data. We don't need to make "Normal approximations" anymore. But we still do. We allow "It looks good" to substitute for actually being good. We assume simplicity in our models. If the systems we were modeling were simple, then why don't we already know everything? It's because we are using simple models on complex systems. Then, fearing "overfitting" a model, we tend to severely underfit it. Then the model doesn't do a good job describing the system, and we wonder why so many scientists loathe statistics and statisticians.
Maybe we need to lead by example. Perhaps all statisticians should be required to get a minor in an area outside of mathematics. Perhaps statisticians need to take classes from industrial engineering departments so they can understand where a lot of their data comes from. Perhaps statisticians need to take classes in chemistry and biology to understand that most scientists refer to "repeated measures" as a "replicate". Perhaps statisticians should work with the devices that generate the data they use, to get an understanding of the QC issues that crop up and tend to go unnoticed because a lot of scientists believe "Consistency => Quality" and QC samples "Between the lines => Consistency".
We can all do a lot better. We need to do a lot better. And we, as statisticians, need to be the ones to change first!
------------------------------
Andrew Ekstrom
Statistician, Chemist, HPC Abuser;-)