hypothesis formulation

8. RE: hypothesis formulation

Recommend

Eugene Komaroff

Posted 07-21-2023 07:59

I see you switched the topic from p-values to confidence limits on a Rho parameter. OK, no problem. Observed r's and their corresponding p-values are very strongly correlated or essentially are two different views of the same latent variable (Komaroff, 2020).

Komaroff, E. (2020) Relationships between p-values and Pearson correlation coefficients, type 1 errors and effect size errors, under a true null hypothesis. Journal of Statistical Theory and Practice, 14, 49 - 59. https://doi.org/10.1007/s42519-020-00115-6

In reply to the question from James Hawkes: Wellek (2010) may be one possible source for the relatively recent cosmetic addition of the nonequivalence symbol to the null hypothesis statement. I say "cosmetic" because Welleck (2010) wrote: "Accordingly, in this book, our main objects of study are statistical decision procedures which define a valid statistical test at some prespecified level alpha in the 0,1 domain of the null hypothesis." Next, I will paraphrase Wellek's null hypothesis statement because to avoid any misunderstanding of the notation. Wellek now proceeds to make two null hypothesis statements: (1) Theta < = Theta (lower) and (2) Theta => Theta (upper). The lower and upper thetas are the same point estimate in the null hypotheses statements of nonequivalence, but with different math signs ( + & -). If these two null hypotheses are rejected, and they both must be rejected, that leaves the alternative hypothesis of equivalence. This is an example of brilliant statistical reasoning that is firmly rooted in the classical Fisher paradigm of null hypothesis significance testing. Equivalence testing does not replace statistical significance but adds a very clever and theoretically legitimate extension of Fisher's null hypothesis testing paradigm that can be easily applied in practice.

Wellek (Testing statistical hypotheses of equivalence and noninferiority. Chapman and Hall/CRC, 2010

Dear Sander Greenland, I am astonished. You banned the beautiful tool known as statistical significance for decision making with small sample sizes and replaced it with the equivalence test? You were thinking too fast. The word "test" should have given you pause for thought. If you doubt me, ask the brilliant statisticians at SAS Institute who wrote a TOST algorithm with a default alpha = .05 as the cutoff for statistical significance. Their output contains both the confidence limits of equivalence and the p - value. One resolves the same dilemma: "equivalent" or "not equivalent" by either evaluating the 90% upper and lower limits of equivalence, or simply pondering the statistical significance of the random p-value. Nonetheless, when you reject the null hypothesis, you are left with only "one" provisional sample estimate of the true margin of equivalence.

Fisher (1973) was clear on this point: "The statistical examination of a body of data is thus logically similar to the general alternation of inductive and deductive methods throughout the sciences. A hypothesis is conceived and defined with all necessary exactitude; its logical consequences are ascertained by a deductive argument; these consequences are compared with the available observations; if these are completely in accord with the deductions, the hypothesis is justified at least until fresh and more stringent observations are available" (pg. 8).

I will add to this only that in my opinion a small, narrow, or tight alternative hypothesis, which is the margin of equivalence, requires many replications to ensure validity (accuracy), regardless of sample size. How many? That can be investigated empirically with computer simulation. A good topic for a graduate student looking for a dissertation research project.

Fisher R.A. (1973). Statistical Methods for Research Workers (14th Ed.). Hafner Publishing. Reproduced in Statistical Methods, Experimental Design and Scientific Inference (1995). Oxford University Press.

------------------------------
Eugene Komaroff
Professor of Education
Keiser University Graduate School
------------------------------

Original Message

9. RE: hypothesis formulation

Recommend

Sander Greenland

Posted 07-21-2023 13:57

Dear Eugene,

I believe that you have misunderstood my responses to you. I thus would beg you to please read what I am writing more carefully and respond more carefully. I am continuing to assume that you would not be averse to agreeing with at least some of my responses, and might allow that you may have misunderstood some of those.

Toward those ends, I will attempt another clarification:

1) As a small initial aside: My use of r had nothing to do with a "rho parameter", I was using it simply as a generic known constant, standing for "radius" in the case of an equivalence interval. I see now that I should have used instead c or some letter that would not invite immediate identification with some specific parameter.

2) My use of tests in the Neyman-Pearson (NP) context translates immediately to P-values in the straightforward manner given by Lehmann in Testing Statistical Hypotheses (p. 70 of the 1986 ed.): The NP-Lehmann P-value for H is the smallest α (cutoff) for which rejection will occur (see also Cox, Scand J Stat 1977). Thus the NP P-value for an interval hypothesis H is the smallest α for which H would be rejected. This fact falsifies the notion that it is "impossible to run an interval test in practice". It also provides a positive answer to your request that "If you know of a practical way to test a null hypothesis parameter that is defined by an interval, please share": As I documented, there is a rich literature going back nearly 70 years on how to do interval tests in the NP decision-theoretic framework; those tests in turn immediately yield NP P-values, as per Lehmann.

3) For point hypotheses, the smallest-α definition of a P-value coincides with the older divergence definition of "the value of P" found in Pearson (1900, p. 160), in which the P-value is the tail area cut off by the observed test statistic in a hypothesized distribution (or more generally, in a reference distribution obtained after some sort of factorization or other method for dealing with nuisance parameters). This has led nearly all writings (including Cox's as well as some of my own) to treat the two definitions as if they are equivalent in all cases. But they aren't equivalent for an interval hypothesis H: different concepts of "P-value" can lead to different definitions which yield observed P-values that may differ as much as a factor of two. Such a difference can arise when the smallest-α P-value is derived from the UMPU Hodges-Lehmann test but the divergence P-value is derived by maximizing the likelihood over H relative to the unconstrained maximum (this ML ratio is 1 when the data equal their expectation under some model in H). This P-value difference translates into differences in the 1-α interval estimates obtained by defining the 1-α interval as all points with p>α.

The difference can be seen in older literature, albeit not described in the above fashion. For example, Berger & Hsu 1996 ('Bioequivalence trials, intersection–union tests and equivalence confidence sets', with discussion. Statistical Science 11, 283–319) distinguished equivalence tests derived directly from NP test-optimization principles from equivalence tests derived by examining whether a 1-2α confidence interval fell inside the equivalence interval; in their basic examples the latter interval equals the one derived as all points with a divergence P-value greater than 2α, and the resulting test is the TOST procedure for testing nonequivalence.

4) Nowhere have I called for banning the algorithmic decision procedures traditionally labeled "significance tests" or NP hypothesis tests, or the summaries labeled "confidence intervals". I have instead emphasized that in reading of the actual research literature in biomedical and social sciences and popular articles, my colleagues and I have found that the labels "significance" and "confidence" continue to be taken incorrectly by everyday researchers and reporters in the ways we listed in our 2016 TAS article (cited on the last round here). These statistical terms remain widely confused with practical significance or with posterior probabilities or intervals derived from contextually sensible prior distributions. This confusion often leads to grossly incorrect scientific claims such as that a study "showed there is no difference" because p > 0.05 or that there is an important difference because p < 0.05. Statistics training programs do not seem to have had much impact on such problems; thus we have pushed for more accurate ordinary-language labels for the procedures and their outputs, to avoid confusion with practical and Bayesian concepts. Some of us also call for introduction of simple tests and P-values for interval hypotheses in basic training, such as the TOST procedure for nonequivalence.

We are hardly the first to complain about now-traditional terminology. For example, the old-school (inverse-probability) Bayesian Arthur Bowley objected to the term "confidence interval" back in the 1930s when Neyman introduced the term and concept, reportedly calling it a "confidence trick" (see Greenland, S. 2019. Are "confidence intervals" better termed "uncertainty intervals"? No: Call them compatibility intervals. British Medical Journal, 366:15381, https://www.bmj.com/content/366/bmj.I5381). It is worth noting that the relatively neutral term "P-value" for what Fisher called "significance level" had already begun appearing in research literature in the 1920s, and even Fisher sometimes used Pearson's term "value of P" for this tail area; he could have chosen to use that or "P-value" in place of "significance level" in his books. Had he done that and had Neyman used the more accurate term "coverage interval" for his "confidence interval", we could have been arguing about something else today.

Alas, we instead face the seemingly impossible task of altering misleading statistical terms that became popular not even a century ago, yet are rooted in some minds as if religious traditions. That may only reflect how statisticians are as human as the faithful. Still, my colleagues and I have found it shocking that the modest and easy reforms we propose have been met with so much misrepresentation, derision, and even sarcasm, rather than with acknowledgement of the serious problems we are addressing. We would hope for constructive responses from those in the field of education, which is supposed to seek improvement in teaching and understanding.

Examples of destructive responses include dismissive remarks that "it's just semantics". In reality, semantics (word meanings) are pivotal to understanding and communicating with nonmathematical colleagues and the general public. True, mathematical theory doesn't care how we label its objects. But every politicians knows labels can make all the difference in describing the world. If you think such common sense about semantics deserves dismissal from applied statistics, I invite you to conduct this experiment: The next time you discuss results of studies that examine the relation of ethnicity to some outcome, substitute the n-word for "black" or "African-American", then please report back on the ensuing change in understanding of your intent.

Sincerely,

------------------------------
Sander Greenland
Department of Epidemiology and Department of Statistics
University of California, Los Angeles
------------------------------

Original Message

Original Message:
Sent: 07-21-2023 07:58
From: Eugene Komaroff
Subject: hypothesis formulation