ASA Connect


Banning p-values, continued

  • 1.  Banning p-values, continued

    Posted 10-14-2015 15:28

    A healthy discussion is still ongoing within ASA.

    Here's one thought about why p-values deserve to be re-evaluated:

    The idea of a p-value as one possible summary of evidence

        morphed into a rule for authors: reject the null hypothesis if p < .05,

        which morphed into a rule for editors: reject the submitted article if p > .05,

        which morphed into a rule for journals: reject all articles that report p-values.

    Bottom line: Reject rules. Ideas matter.

    George Cobb

     




  • 2.  RE: Banning p-values, continued

    Posted 10-15-2015 14:11

    Nice summary, George.

    To me, the biggest problem with a p-value is that it is not a stand-alone statistic: when you can change your conclusion simply by changing your sample size, that's a big problem.

    I've seen contractors claim that equipment met specifications for a government purchase when, with the small sample sizes common in expensive destructive testing of hardware, a test at the 0.05 level would have been hard to impossible to fail, even if the equipment's performance was ridiculously below specification. And now, with "big data," comes the opposite problem. So dropping a one-size-fits-all "acceptance" level is the very least that can be done. The fact that 0.05 has become so popular is itself evidence that the p-value is commonly misinterpreted.
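    As a concrete illustration of the sample-size point (a minimal sketch with invented numbers, not from the examples above): the identical observed effect moves from clearly non-significant to highly significant purely because n grows.

    ```python
    import numpy as np
    from scipy import stats

    # Two group means differing by 0.2 (sigma = 1): the identical observed
    # effect yields completely different two-sided z-test p-values as n grows.
    for n in [10, 100, 1000, 10_000]:
        se = np.sqrt(2.0 / n)        # standard error of the difference in means
        p = 2 * stats.norm.sf(0.2 / se)
        print(f"n per group = {n:>6}: p = {p:.4f}")
    ```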

    People on ResearchGate seem to like the letter found at the following URL:

    https://www.researchgate.net/publication/262971440_Practical_Interpretation_of_Hypothesis_Tests_-_letter_to_the_editor_-_TAS

    Hypothesis testing is rather frequently discussed on ResearchGate.

    Cheers - Jim






  • 3.  RE: Banning p-values, continued

    Posted 10-16-2015 16:19


    I guess, from this sudden realization about what p-values mean, we'd better change how courtrooms work. Up until now, in order to convict, there has to have been enough evidence presented against the accused to convince a jury that the evidence deviates substantially (this word requires discussion) from what would be expected if the accused were innocent. Also in the past, no expert witness has been able to testify that "the probability that the accused did it is (fill in a number, à la Mr. Spock)."

    Different p-values have different consequences in different settings. A scientist who has an idea, who does an experiment to investigate it, and who gets a p-value less than .05 might get a publication out of it, some citations in the literature, and other scientists repeating the experiment to see if they get the same results. In court, however, if the only evidence were a p-value less than .05, there would never be a conviction. On the other hand, I have seen an age discrimination suit settled simply because of a p-value of 0.000000000088 and the defendant company's not wanting it revealed in court.

    I wonder whether some of the sudden opposition to the use of p-values in making judgments is that society is becoming increasingly reluctant to make judgments. I hope not.

    ------------------------------
    Robert Norton



  • 4.  RE: Banning p-values, continued

    Posted 10-17-2015 14:38


    Which is worse: to convict an innocent man, or to let a guilty person go free? Is it better to say that a blood supply is tainted when it is not, or to accept a blood supply when it is tainted? In my opinion, one should use p-values only in conjunction with beta values or power. Even if we fail to reject, keep in mind that what we're actually saying is that the data did not provide enough evidence to reject.

    ------------------------------
    Aubrey Magoun
    Consultant
    Applied Research & Analysis, Inc.



  • 5.  RE: Banning p-values, continued

    Posted 10-19-2015 11:00


    Aubrey Magoun asked, "Is it better to convict an innocent man or let a guilty person go free?"

    A complication: Convicting an innocent man may also result in letting a guilty person go free.

    ------------------------------
    Martha Smith
    University of Texas



  • 6.  RE: Banning p-values, continued

    Posted 10-20-2015 17:20


    It is not solely the data that reveal the evidence; the test one uses, or the analysis technique, also makes a difference. The power calculation has to be based on the analysis technique actually used. So again, the effect of the data is confounded with the technique: the data contain the evidence, but it is filtered, if you please, by the technique used to recover it.


    ------------------------------
    Raoul Burchette
    Statistical Programmer
    Kaiser Permanente



  • 7.  RE: Banning p-values, continued

    Posted 10-16-2015 16:19


    For more lively discussion about p-values, see the following posts from Andrew Gelman's blog, Statistical Modeling, Causal Inference, and Social Science:

    • "What do you learn from p=.05? This example from Carl Morris will blow your mind": shows that even considering both the p-value and the effect size isn't enough.

    • "Misunderstanding the p-value": discusses the oft-repeated mis-statement that "the p-value tells you if the result was due to chance."

    • "Psych journal bans significance tests; stat blogger inundated with emails"

    • "The p-value is not . . ."

    ------------------------------
    Martha Smith
    University of Texas



  • 8.  RE: Banning p-values, continued

    Posted 10-15-2015 14:12

    As is well known, under some conditions a p-value (p) can come close to what researchers usually want to know (and, in a decision-analytic situation, what could be used to maximize expected utility): the post-experimental probability of the tested hypothesis.

    Under those conditions, reporting p might be informative.

    The problem is, however, that not many people (even among statisticians) know the conditions needed to establish that correspondence. A flat prior distribution for the population parameter(s) might not suffice.
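    A minimal numerical sketch of the best-known case where the correspondence is exact: a one-sided test of a normal mean with known sigma under a flat prior (the specific numbers are invented for illustration, not from the post).

    ```python
    import numpy as np
    from scipy import stats

    # Observed sample mean from n draws of N(theta, sigma^2); test H0: theta <= 0.
    xbar, sigma, n = 0.5, 1.0, 25
    se = sigma / np.sqrt(n)

    # Frequentist one-sided p-value for H0: theta <= 0
    p_value = stats.norm.sf(xbar / se)

    # Posterior P(theta <= 0 | data) under a flat prior: posterior is N(xbar, se^2)
    posterior_p_H0 = stats.norm.cdf(0.0, loc=xbar, scale=se)

    print(p_value, posterior_p_H0)   # identical: about 0.0062 in both cases
    ```

    For a two-sided test of a point null, the correspondence breaks down badly (the Lindley paradox), which is one reason a flat prior alone "might not suffice."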


    ------------------------------
    Andrew Hartley
    Associate Statistical Science Director
    PPD, Inc.
    ------------------------------




  • 9.  RE: Banning p-values, continued

    Posted 10-15-2015 14:13

    Banning the use of p-values because they can be misinterpreted is akin to banning the prescription of morphine because it is addictive.

    When used properly, both are powerful tools: one for treating pain, the other for avoiding conclusions based on evidence that might really be attributable to chance.

    ------------------------------
    Phillip Kott
    RTI International
    ------------------------------




  • 10.  RE: Banning p-values, continued

    Posted 10-15-2015 14:13

    Interesting topic. I agree that the use of p-values shouldn't necessarily be banned, but we should reconsider the way they are currently used and consider alternative "evidence." As it stands, it seems that most people using p-values do not fully understand them! A very useful reference is "A Dirty Dozen: Twelve P-Value Misconceptions" by Steven Goodman. (One example of the misunderstood p-value from that article: a test meant to assess medical residents' understanding of statistics itself wrongly interprets a p-value.)

    Fisher himself wrote:
    "Personally, the writer prefers to set a low standard of significance at the 5 percent point … A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance." [emphasis added]

    This implies that a significant p-value suggests repeating the experiment, not necessarily accepting a hypothesis.

    ------------------------------
    Ruth Cassidy
    ------------------------------



  • 11.  RE: Banning p-values, continued

    Posted 10-15-2015 14:13

    Hmmm. Is "Reject rules" to be a rule?

    Peter F. Thall, PhD

    Department of Biostatistics

    M.D. Anderson Cancer Center

    Houston, Texas






  • 12.  RE: Banning p-values, continued

    Posted 10-15-2015 14:14

    "Reject rules. Ideas matter."

    I thought that the p-value carried an idea behind it: the idea of measuring the strength of evidence. If we don't agree that it is a proper way to measure the strength of evidence, then we reject it. But a more fundamental question, to me, is this: do we want to reject the effort to measure the strength of evidence altogether? Is this a meaningful endeavor? Can the strength of evidence be adequately captured by a single numerical summary, be it a p-value, a confidence interval, or a Bayesian posterior credible interval?

    ------------------------------
    Ravi Varadhan
    Johns Hopkins University
    ------------------------------




  • 13.  RE: Banning p-values, continued

    Posted 10-15-2015 20:51


    Pay more attention to effect sizes; also, lack of effect is often an important finding, since it tells future researchers what not to waste time on.

    ------------------------------
    Peter Wludyka
    University of North Florida



  • 14.  RE: Banning p-values, continued

    Posted 10-16-2015 11:08


    Ruth's quotation from Fisher is important.
    Yes, Ravi, when you can use a confidence interval, that is far more meaningful. A single p-value is incomplete, and thus not a good stand-alone measure; there needs to be a power analysis, or some other sensitivity analysis. A sequential hypothesis test makes more sense, as it compares two simple hypotheses (a sketch of such a test follows below), which connects to what Peter W. said about effect size, which is also important.
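    Since sequential testing comes up here, a minimal sketch of Wald's classic sequential probability ratio test for a normal mean with known sigma (an illustration with invented parameter values, not code from this thread):

    ```python
    import numpy as np
    from scipy import stats

    def sprt_normal(data, mu0, mu1, sigma, alpha=0.05, beta=0.20):
        """Wald's sequential probability ratio test: H0: mu = mu0 vs H1: mu = mu1,
        known sigma. Stops as soon as the log-likelihood ratio crosses a boundary."""
        lower = np.log(beta / (1 - alpha))    # cross below: accept H0
        upper = np.log((1 - beta) / alpha)    # cross above: accept H1
        llr = 0.0
        for i, x in enumerate(data, start=1):
            llr += stats.norm.logpdf(x, mu1, sigma) - stats.norm.logpdf(x, mu0, sigma)
            if llr <= lower:
                return "accept H0", i
            if llr >= upper:
                return "accept H1", i
        return "no decision yet", len(data)

    rng = np.random.default_rng(1)
    print(sprt_normal(rng.normal(1.0, 1.0, 100), mu0=0.0, mu1=1.0, sigma=1.0))
    ```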

    ------------------------------
    James Knaub
    Lead Mathematical Statistician
    Retired



  • 15.  RE: Banning p-values, continued

    Posted 10-19-2015 10:59


    A (good) power analysis is better than no power analysis, but three other points:

    1. Often power analyses are not good. One common problem is "retrospective power." (See item 7 at http://www.ma.utexas.edu/users/mks/statmistakes/PowerMistakes.html)

    2. For discussion of other problems in calculating and using the more appropriate "prospective power," see

    Beyond The Buzz Part VI: Better ways of calculating power and sample size

    3. Still, power (or other considerations based on just Type I and Type II errors) may not be adequate. So I recommend also considering what Gelman and others call Type S and Type M errors. Gelman and Carlin introduce what they call "design analysis," based on these two types of errors, which can be done either prospectively or retrospectively; see their paper at http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf for more information. (A sketch of this design analysis appears below.)
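    A rough sketch of this kind of design analysis under a normal approximation (a paraphrase of the retrodesign idea, not Gelman and Carlin's own code):

    ```python
    import numpy as np
    from scipy import stats

    def design_analysis(true_effect, se, alpha=0.05, n_sims=100_000, seed=0):
        """Power, Type S error rate, and exaggeration ratio (Type M) for an
        unbiased estimate distributed N(true_effect, se^2)."""
        z = stats.norm.ppf(1 - alpha / 2)
        lam = true_effect / se
        power = stats.norm.sf(z - lam) + stats.norm.cdf(-z - lam)
        type_s = stats.norm.cdf(-z - lam) / power     # wrong sign, given significance
        est = np.random.default_rng(seed).normal(true_effect, se, n_sims)
        sig = np.abs(est) > z * se                    # estimates reaching significance
        type_m = np.abs(est[sig]).mean() / abs(true_effect)
        return power, type_s, type_m

    # A small true effect measured noisily: power is barely above alpha, and the
    # significant estimates often have the wrong sign and exaggerate the magnitude.
    print(design_analysis(true_effect=0.1, se=1.0))
    ```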

    ------------------------------
    Martha Smith
    University of Texas



  • 16.  RE: Banning p-values, continued

    Posted 10-16-2015 11:08


    The word "Banning" should be banned!

    Statistics is about probability. Can anyone say that the probability of the p-value being used correctly is exactly zero? Can anyone say that none of the papers using p-values published in the past 100 years has any reference value?

    True, the p-value has its limitations. But show me a statistic that has no limitations on its use in statistical inference.

    Even if we held a vote on whether to ban the use of the p-value, and the banning side won by a 60% to 40% margin, the expected probability of the ban being "correct" would be only 0.6, plus or minus error.

    I do not like the behavior of forcing others to accept one's own rule, especially in the field of statistics, where the general principle is that there is no absolute correctness. Instead of banning, it is sufficient to point out the limitations of the p-value from time to time. Let the journals, the institutions, and the statisticians determine whether to use p.

    ------------------------------
    Qiyuan Pan
    Mathematic Statistician
    National Center for Health Statistics



  • 17.  RE: Banning p-values, continued

    Posted 10-16-2015 16:19

    Excellent point made by George. An ex-colleague of mine (just retired) wrote a book on this exact topic. The book just came out and is titled "Corrupt Research" by Raymond Hubbard (published by Springer). If you are interested in this topic, I recommend it highly. He spent most of his career working in this area, and he makes very compelling arguments against p-values because they are so misused (as George stated very eloquently).

     

    Rahul

    Iowa State University






  • 19.  RE: Banning p-values, continued

    Posted 10-16-2015 16:18


    This is a very nice discussion among like-minded people refining their arguments. But what would you suggest to the journal's editors?

    It seems to me that the journal is making an honest effort to create better practices, which may help with the huge reproducibility problem of research in psychology and sociology. BASP's editors think that the use of p-values is associated with bad research, in that researchers prioritize obtaining a specific p-value over other important aspects of the research. They provided a set of recommendations to investigators who prepare manuscripts for publication in their journal. I do not work in this field, nor have I ever prepared a paper for publication in this journal, but their recommendations seem reasonable.

    They believe that by forcing researchers to focus more on other aspects of the data and results, researchers will do a better job overall of reporting their studies. Is this reasoning completely flawed? Which other feasible strategies would you suggest?

    ------------------------------
    Tamar Sofer
    University of Washington



  • 20.  RE: Banning p-values, continued

    Posted 10-16-2015 17:42


    In the clinical medicine arena, the most common problem I see with p-values is that people take them as evidence for something, when really they should be considered more as evidence against something. We are measuring, or fancy we are measuring, the evidence bearing on explanation by chance alone (the null hypothesis). But just because the difference is so great that we get a significant p-value (at whatever level we choose) doesn't mean it is evidence for the effect to which we would like to attribute it.

    On deeper inquiry, it can often be found that other confounding factors are driving the difference: a systematic bias that has nothing to do with the treatment at all, other unmeasured factors, or other types of bias (selection bias, misclassification bias, etc.). This is a particular problem in retrospective reviews, where there are often too many other things that could have contributed to the results and could also explain the differences observed, but where people's memories have gotten foggy and the connections are difficult to trace.

    This may be a training deficiency, but perhaps the concept itself needs to be reworked altogether. I would be happy to see what someone else can come up with.


    ------------------------------
    Raoul Burchette
    Statistical Programmer
    Kaiser Permanente



  • 21.  RE: Banning p-values, continued

    Posted 10-19-2015 09:32


    Over the last few years I have been simulating results from experiments. I think we would be better served by banning weak statistical results (those from experiments where the power is low).

    For example, suppose you have two groups, X1 and X2. The standard deviation of each group is S1 = S2 = 1.00, there are N1 = N2 = 4 samples per group, and the observed means are X1 = 3.00 and X2 = 1.00. When you perform your t-test, you find p = 0.030. If you now generate 10,000 random values R1 and R2, with R1 ~ N(X1, S1/sqrt(N1)) and R2 ~ N(X2, S2/sqrt(N2)), and pretend these new results come from 10,000 other researchers replicating your experiment, you find that about 60%-66% come to the same conclusion, that the groups are "significantly different." Oddly, the power of this test is also in the range of 60%-66%. (Having performed dozens of other simulations like this one, I found that the probability of getting similar results is about the same as the power of the test.)

    So, what is the p-value my software gave me in the first place supposed to be? It's not the probability of getting the same results if the experiment is replicated. By itself, it seems to be a completely useless thing. The closer a p-value is to 0.05, the more likely it is that others will not come to the same conclusion. Are we fooling ourselves about what a p-value really is?
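    Andrew's experiment is easy to reproduce. Here is a minimal sketch in Python (his pseudocode is R-like) that draws full new samples for each replication rather than simulating the group means directly, which amounts to the same comparison:

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    mu1, mu2, sigma, n = 3.0, 1.0, 1.0, 4     # the setup described above
    reps = 10_000

    significant = 0
    for _ in range(reps):
        x1 = rng.normal(mu1, sigma, n)        # one replicator's group 1 data
        x2 = rng.normal(mu2, sigma, n)        # one replicator's group 2 data
        p = stats.ttest_ind(x1, x2).pvalue    # two-sample t-test, equal variances
        significant += (p < 0.05)

    print(f"Fraction of replications with p < 0.05: {significant / reps:.3f}")
    # comes out near 0.65, i.e., the power of the test for this effect size and n
    ```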

    ------------------------------
    Andrew Ekstrom



  • 22.  RE: Banning p-values, continued

    Posted 10-20-2015 15:02


    Andrew, it appears that your simulations have confirmed that the power of the test is the power of the test. But the p-value is supposed to represent the chance of observing results as divergent from the null value as, or more divergent than, those currently observed; it is the integral of the tail probabilities under the null hypothesis. In simulation, this corresponds to using the same underlying distribution to generate membership in both groups and calculating the probability that your test statistic is greater than or equal to the test statistic you observed (the one you are trying to replicate). Most good simulation studies comparing statistical techniques make this calculation as well: they check how often values at or beyond the proverbial cutoffs (0.05, 0.01, etc.) actually occurred in the simulation under the null hypothesis.
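    In code, the calibration check described here looks like the following minimal sketch (assuming the same two-sample t-test setup as above): simulate both groups from one distribution and verify that the test rejects at about its nominal rate.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    sigma, n = 1.0, 4        # same scale and sample size as above
    reps = 10_000

    rejections = 0
    for _ in range(reps):
        x1 = rng.normal(0.0, sigma, n)   # both groups drawn from the SAME
        x2 = rng.normal(0.0, sigma, n)   # distribution, i.e., the null is true
        rejections += (stats.ttest_ind(x1, x2).pvalue < 0.05)

    print(f"Rejection rate under the null: {rejections / reps:.3f}")
    # close to the nominal 0.05 when the t-test's assumptions hold
    ```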


    ------------------------------
    Raoul Burchette
    Statistical Programmer
    Kaiser Permanente



  • 23.  RE: Banning p-values, continued

    Posted 10-19-2015 11:00


    Raoul's concern about other possible alternate hypotheses is addressed by the concept of "strong inference," which is discussed at length by Fred L. Bookstein in his book Measuring and Reasoning (Cambridge, 2014).

    ------------------------------
    Martha Smith
    University of Texas



  • 24.  RE: Banning p-values, continued

    Posted 10-20-2015 15:02


    A further thought about what I mentioned earlier about external forces being entangled with treatment effects. If one is using p-value arguments, perhaps one should put both upper and lower bounds on credible p-values: a p-value that is too small for the expected treatment effect size (and sample size) is just as implausible as one that is too large. My first statistics book mentioned R. A. Fisher's suggestion that Mendel had fudged his pea genetic trait counts because the fit to his theory was too good. That needs to be thought of in my discipline as well. In medicine, if one has experience, one has a general idea of the effect of a particular treatment, especially a medical treatment, and can tell whether or not the calculated effect size or significance is plausible. There are two sides to this: the effect can be too small or too large, and both need to be considered. Too large generally means there were external forces acting with (confounding) the treatment effect; usually the source of the problem lies outside the data that have actually been collected. These are things like changes in the way data were originally coded, changes in testing procedures, changes in what data are collected, misclassification of conditions, policy changes and changes in clinical guidelines, local decisions of department heads, changes in suppliers, seasonal or environmental changes, changes in physician and support staffing, and myriad other possibilities. These things can also work against an effect that one is confident is present. And these problems are probably magnified by at least an order of magnitude in retrospective studies.

    Perhaps one step toward a better procedure would be to estimate the apparent effect size and a likely confidence or credibility interval, and then a measure of how sure we are that it differs from zero (or whatever is appropriate). The p-value is currently the commonly used (and misused) measure of this surety. Given the sample size, one could calculate an expected interval of credible p-values, with bounds on both the low and the high side; p-values outside this interval would be declared inconclusive.
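    As a toy version of such a two-sided credibility screen (the counts below are invented, and Fisher's actual analysis of Mendel's data was more involved):

    ```python
    import numpy as np
    from scipy import stats

    # Hypothetical 3:1 Mendelian cross with n = 1000 offspring (invented counts)
    expected = np.array([750.0, 250.0])
    observed = np.array([750, 250])          # agreement suspiciously exact

    chi2 = ((observed - expected) ** 2 / expected).sum()
    p = stats.chi2.sf(chi2, df=1)

    # Two-sided credibility screen on the p-value itself
    if p < 0.025:
        print(f"p = {p:.3f}: fit worse than plausible under the hypothesis")
    elif p > 0.975:
        print(f"p = {p:.3f}: fit too good to be credible; examine the data")
    else:
        print(f"p = {p:.3f}: within the credible range")
    ```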


    ------------------------------
    Raoul Burchette
    Statistical Programmer
    Kaiser Permanente



  • 25.  RE: Banning p-values, continued

    Posted 10-19-2015 09:33


    George,

    To restate your conclusion a little less catchily, but perhaps as where we might hope to end up: rules should not be applied absolutely; they should be kept in their place, particularly in as rich a domain as the selection of scientific articles.

    So perhaps the same thought applies to p-values? Their technical shortcomings have been well described by those with a better technical grasp than I (e.g., Stephen Goodman, http://www.ncbi.nlm.nih.gov/pubmed/18582619 and http://www.ncbi.nlm.nih.gov/pubmed/0010383371, and Andrew Gelman; see the links in another reply in this group), but can I suggest that p-values also have a detrimental effect on the way statistics is taught and done?

    Firstly, p-values are a somewhat curious concept with an awkward definition; secondly, they are embedded in a very rigid and limited experimental-analysis framework of two hypotheses and rejecting the null.

    So much early statistical teaching time is spent re-educating the potential statistician in the associated unnatural thought processes. The result is that statisticians end up with a language that cuts them off from any non-statistical audience, while those for whom statistics is only a tool and not their main profession are left with a cramped and limited process that they can apply only rigidly, because they can't quite recall what it all really meant, so all they trust are the statistical recipes they were taught.

    A further problem arises because, as a source of information, a p-value is a very paltry thing, and yet it has attained a peculiar primacy in the reporting of results. Part of this primacy is that it can be calculated, and "analytical control of Type I error" has become a much-valued property. While control of Type I error has its uses, it often ought to be a secondary issue, yet it has become a totem of statistical analysis.

    This "analytical control of Type I error" has an odd effect; indeed, it is an odd and very slippery concept. Try to develop a new statistical test and you will start to appreciate how tricky it is to be sure you have considered all the confounding factors, and how tricky the concept of "the probability of seeing the same results *or worse*" can be once we stray from single measures of a single value. Indeed, the love of analytical control of Type I error is so strong that people believe they have it when they do not (simply consider the possible effects of missing data, or of non-normality such as range truncation). The harmful effect is a reluctance to embrace analytical methods that might be more powerful but whose Type I error control can only be estimated through simulation.

    ------------------------------
    Tom Parke, Tessella