ASA Connect


hypothesis formulation

  • 1.  hypothesis formulation

    Posted 07-14-2023 11:49

    Sometime a little after 2000, introductory stat books started changing the null hypothesis to a strict equality, with the alternative always strictly >, <, or not equal.  Before 2000, most intro stat books used the opposite inequality in the null when the alternative was expressed as > or <.  Aside from the fact that the strict equality in the null is the "worst case" for the null, are there any other reasons underlying this change?  I would appreciate it if someone could point me to any published discussion on this topic.  I would also appreciate hearing any thoughts on the subject.

    Thanks

    Jim Hawkes



    ------------------------------
    James Hawkes
    Retired
    ------------------------------


  • 2.  RE: hypothesis formulation

    Posted 07-17-2023 07:45

    I have always taught and thought of the null hypothesis as a point hypothesis. The idea is that we are using it as the basis for calculating a p-value.  "What is the probability of getting an observed t value of 2.1 or greater with 8 df if the true difference is 0?"
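
    A minimal sketch of that calculation (my own illustration in Python, not anything from the original post; the 2.1 and 8 df are just the numbers quoted above):

        from scipy import stats

        # Upper-tail area of the central t distribution with 8 df:
        # P(T >= 2.1) when the true difference is 0.
        p_one_sided = stats.t.sf(2.1, df=8)
        print(p_one_sided)   # a little above 0.03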

    If we change "true difference is 0" to "true difference is less than or equal to 0" I don't see a non-Bayesian way to precisely answer the question. Obviously people have always done the math as though the null was "no difference" and argued that p would be even smaller if the true difference was negative, but that isn't an exact result.

    Also, I teach that a one-tailed test means that you have ruled out the opposite direction, so the options are a true difference = 0 or > 0 (in this example).  Why would I have to include < 0 in the null if I am declaring that range to be out of consideration in the first place?

    Ed



    ------------------------------
    Edward Gracely
    Associate Professor
    Drexel University
    ------------------------------



  • 3.  RE: hypothesis formulation

    Posted 07-18-2023 15:00

    The usual theoretical justification for using a point null hypothesis is indeed that it represents a worst case -- a maximum type 1 error among all null values given by an inequality.  This is based on assumptions about the probability model -- for example, that the probability densities have monotone likelihood ratio.  Things quickly get more complicated when the family fails to have such a property.  So there may not be a universal argument giving theoretical support to this.  But I'm not sure one is forced to adopt a Bayesian approach here.

    If the issue is how to present this material in an introductory course, one can level with the students.  Point null hypotheses are often not realistic, but students can understand that they initially make it easier to understand the basic theory of testing.  The substitution of a point null hypothesis for a more general one can be explained by simply saying that "in many cases" it gives the worst case -- the maximum probability of a type 1 error.  The lecturer is not lying, and the "many cases" probably include all the examples that students will meet in such a course.  If there's time, students can even see this by plotting the power function.
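
    For instructors who want that picture, here is a minimal sketch (my own illustration, not part of the original post), assuming a one-sided Z test of H: mu <= 0 with known sigma; the plotted rejection probability is increasing in mu, so its maximum over the null region is attained at the boundary point mu = 0, which is exactly the "worst case" role of the point null:

        import numpy as np
        from scipy import stats
        import matplotlib.pyplot as plt

        alpha, n, sigma = 0.05, 25, 1.0          # illustrative values
        z_crit = stats.norm.ppf(1 - alpha)       # reject H: mu <= 0 when Z >= z_crit
        mu = np.linspace(-0.5, 1.0, 200)         # range of true means

        # Power function: P(reject | mu), with Z ~ Normal(mu*sqrt(n)/sigma, 1)
        power = stats.norm.sf(z_crit - mu * np.sqrt(n) / sigma)

        plt.plot(mu, power)
        plt.axvline(0.0, linestyle="--")         # boundary of the null region mu <= 0
        plt.axhline(alpha, linestyle=":")        # rejection probability equals alpha only at mu = 0
        plt.xlabel("true mean mu")
        plt.ylabel("P(reject)")
        plt.show()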



    ------------------------------
    Jay Beder
    Professor Emeritus
    University of Wisconsin-Milwaukee
    ------------------------------



  • 4.  RE: hypothesis formulation

    Posted 07-17-2023 10:08

    Inequality for a null hypothesis parameter is conceptually understandable, but it is mathematically impossible if one believes in the theory of sampling distributions.  An exact or specific value is required to evaluate the distance (deviation) between the statistic and the parameter, with a p-value from the relevant sampling distribution.  Statistical significance is a probabilistic rejection of an exact or specific null parameter that is the center of the sampling distribution; so what remains after rejecting the exact null?  An infinite collection of reasonable parameters, except for the specific one under the null.  For me, an inequality in the null hypothesis statement is someone planning to run an infinite number of significance tests, and that will surely exhaust the most powerful quantum computer, not to mention the researcher who has to interpret the results.

    Yes, confidence intervals serve as reasonable bounds for an alternative parameter; however, in my experience students conflate precision with accuracy.  In the theory of sampling distributions, increasing sample size sharpens precision (reliability), but that says nothing about the accuracy (validity) of the "many parameter estimates" within the interval.



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------



  • 5.  RE: hypothesis formulation

    Posted 07-19-2023 12:48

    You might find of interest this recent article with discussion and rejoinder (unfortunately, all printed piecemeal; sorry I don't have the DOIs for the discussant contributions but they are cited in the rejoinder): 

    Greenland S (2023). Divergence vs. decision P-values: A distinction worth making in theory and keeping in practice. Scandinavian Journal of Statistics, https://onlinelibrary.wiley.com/doi/10.1111/sjos.12625
    Rejoinder to discussants: Greenland S (2023). Connecting simple and precise p-values to complex and ambiguous realities. Scandinavian Journal of Statistics, https://doi.org/10.1111/sjos.12645

    While all that may be much too involved for the present discussion, I think the point in the main title is relevant here. That point (with which all the journal discussants agreed) was that there are two logically and mathematically distinct ways of conceptualizing, defining, deriving and interpreting P-values. For simple point hypotheses the two coincide numerically, and hence are usually not distinguished and are even thought to be identical concepts. The two conceptualizations can nonetheless lead to different P-values when the tested hypothesis specifies that the distribution generating the data is in a model subspace defined in part by inequalities, as with interval hypotheses (as one-sided hypotheses are often formulated).

    In the present discussion, I have the sense that some of the consternation reflects a clash between intuitions arising from the separate conceptualizations. Thus, while the following formulation is far removed from elementary statistics, I think that it could explain the differences among views of point hypotheses and one-sided hypotheses.

    The first type of P-value corresponds to a geometric treatment of chi-squared tests of model families as introduced by Karl Pearson (1900), and later adopted for point hypotheses by R.A. Fisher. This P-value is simply the ordinal location in a reference distribution of a measure of divergence between the data and the hypothesized model subspace. In this conceptualization there is no mention or use of error types; the P-value simply serves as part of a description of the sample discrepancy from what would be expected under the nearest distribution in the hypothesized model subspace. 

    The second type of P-value arises from "optimal" Neyman-Egon Pearson (NP) decision (hypothesis testing) rules; it is the minimum alpha level at which rejection of the model subspace can be declared. Error control over repeated sampling (rather than description of a sample discrepancy from an expectation) is the paramount consideration. A consequence of this focus can be a type of incoherent single-sample property of UMPU (Hodges-Lehmann, HL) P-values for interval hypotheses, as described by Schervish (TAS 1996) - a problem not shared by divergence P-values. For interval hypotheses, the summary divergence P-value can be as much as twice the UMPU P-value; this difference can appear dramatic when the two P-values straddle a sharp cutoff (e.g., if the divergence p = 0.06 but the decision p = 0.03, and alpha = 0.05), but is quite small in information-theoretic terms (representing at most one bit of information difference). 



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 6.  RE: hypothesis formulation

    Posted 07-20-2023 07:45

    P-values are continuous random variables; it is therefore perfectly sensible to talk about the probability density function of a p-value distribution. In Fisher's time, the probability of a p-value was derived by integration over an infinitesimally small interval as an area under the standard normal curve. Those limits of integration are the only interval that makes sense in a discussion about the meaning of a p-value.

    In the T-test procedure in SAS, there is an option H0 = m, where m can be any specific parameter value – it does not have to be zero. However, H0 <= m or H0 >= m is not an option, so it is impossible to run such a test in practice, although it is apparently fun to think about in theory with words and statistical notation. If you know of a practical way to test a null hypothesis parameter that is defined by an interval, please share.
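
    (For readers without SAS, a minimal Python sketch of the same kind of point-null test against a nonzero value m; the data here are made up purely for illustration:)

        from scipy import stats

        x = [5.1, 6.3, 4.8, 5.9, 6.1, 5.4]   # hypothetical sample
        m = 5.0                              # H0: mu = m, where m need not be zero
        t_stat, p_two_sided = stats.ttest_1samp(x, popmean=m)
        print(t_stat, p_two_sided)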



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------



  • 7.  RE: hypothesis formulation

    Posted 07-20-2023 12:56

    Dear Eugene Komaroff:

    There is a huge literature on testing interval hypotheses in practice, including textbooks; some key references are given in the articles I cited earlier, in Wellek (Testing statistical hypotheses of equivalence and noninferiority. Chapman and Hall/CRC, 2010, which provides real examples of application), and in the Wikipedia entry on equivalence tests.

    Any common one-sided P-value for the constraint θ ≤ r (i.e., θ is in the half interval bounded by r) will provide a valid (size ≤ α) NP decision rule or test of H: θ ≤ r by comparing the P-value to α. There are several straightforward adaptations of familiar tests that arise from conjunctions or disjunctions of such one-sided hypotheses. Among them are noninferiority and superiority tests, which test special one-sided hypotheses; minimum-important difference (MID) tests, which test whether θ is inside an interval of radius r around 0, H: -r ≤ θ ≤ r (the intersection of the half interval above -r and the half interval below r); and equivalence tests, which are actually tests of nonequivalence in that their test hypothesis is that θ is outside the interval, H: θ ≤ -r or r ≤ θ (the union of the half interval below -r and the half interval above r). 
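
    As one concrete illustration of the conjunction idea (my own sketch, not a quotation from the literature cited above), assume an estimate of theta that is approximately normal with standard error se and a chosen margin r. The familiar two one-sided tests (TOST) procedure then rejects the nonequivalence hypothesis H: theta <= -r or r <= theta only when both one-sided P-values are at or below alpha; equivalently, when the 1 - 2*alpha confidence interval for theta lies entirely inside (-r, r):

        from scipy import stats

        def tost_normal(est, se, r, alpha=0.05):
            """Two one-sided tests of nonequivalence H: theta <= -r or theta >= r,
            treating est as approximately normal with standard error se."""
            p_lower = stats.norm.sf((est + r) / se)    # one-sided P for H1: theta <= -r
            p_upper = stats.norm.cdf((est - r) / se)   # one-sided P for H2: theta >= r
            p_tost = max(p_lower, p_upper)             # reject nonequivalence if p_tost <= alpha
            return p_lower, p_upper, p_tost

        print(tost_normal(est=0.1, se=0.2, r=0.5))     # hypothetical numbers; equivalence declared at alpha = 0.05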

    A view shared by many familiar with this literature is that such tests are long overdue for incorporation into basic training. One reason is that they help prevent common misinterpretations of conventional point-hypothesis tests of the sort described in

    Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., Altman, D.G. (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 at https://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5368.pdf

    As the Wiki entry states, equivalence tests may "prevent common misinterpretations of p-values larger than the alpha level as support for the absence of a true effect. Furthermore, equivalence tests can identify effects that are statistically significant but practically insignificant, whenever effects are statistically different from zero, but also statistically smaller than any effect size deemed worthwhile."

    Interval tests also have many important applications. Although they go back at least to 1954 when Hodges & Lehmann introduced a general method for NP testing of interval hypotheses, they began to get close attention from applied statisticians in the 1970s when the aforementioned interval hypotheses arose in the biopharmaceutical literature. As the Wiki entry states: "Equivalence tests were originally used in areas such as pharmaceutics, frequently in bioequivalence trials. However, these tests can be applied to any instance where the research question asks whether the means of two sets of scores are practically or theoretically equivalent. As such, equivalence analyses have seen increased usage in almost all medical research fields. Additionally, the field of psychology has been adopting the use of equivalence testing...equivalence tests have recently been introduced in evaluation of measurement devices,[7][8] artificial intelligence[9] as well as exercise physiology and sports science.[10] Several tests exist for equivalence analyses; however, more recently the two-one-sided t-tests (TOST) procedure has been garnering considerable attention. As outlined below, this approach is an adaptation of the widely known t-test." 

    My applied area is health and medical research where these interval-hypothesis methods are sorely needed. I am unfamiliar with the field of education but I would guess it is akin to psychology in having good use for interval methods; if so I shall hope you can get them incorporated into basic educational statistics if they are not already there.

    As for the distinction between P-values from NP tests of intervals and Fisherian P-values for divergences from intervals, that is taken up at length in the Greenland 2023 SJS article. Briefly, divergence P-values are a type of summary description of how data diverge from the region of expectations that conform perfectly to a hypothesis H. These summary divergences take on familiar forms such as squared Z-statistics, squared t-statistics, chi-squared statistics, and likelihood-ratio statistics. Suppose for example that μ is a normal (Gaussian) mean, and H defines a simple closed interval around 0, H: -r ≤ μ ≤ r, as in an MID problem. The divergence statistic d for H is then the squared distance of the sample mean m from the interval divided by the standard error of m; thus d=0 when -r ≤ m ≤ r (i.e., when the sample mean conforms perfectly to H). The divergence P-value for H is the largest two-sided P-value for all means μ that are in the interval (i.e., it is the maximum two-sided P-value over all H: μ = c where -r ≤ c ≤ r). Thus if m is in the hypothesized interval (-r ≤ m ≤ r) the divergence P-value will be 1, because the two-sided P-value for H: μ = m is 1. In contrast, if the interval is many standard errors wide and m falls on an interval boundary (m=-r or m=r), the UMPU (Hodges-Lehmann) P-value from NP testing of the interval will approach 0.5. As an extreme case, the P-value from the NP-test of H: μ ≤ r is the ordinary one-sided P-value, which is always strictly less than 1 and is 0.5 when m=r; whereas the divergence P-value for the same H: μ ≤ r equals 1 whenever m ≤ r.
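
    A minimal numeric sketch of the contrast in the last paragraph (my own code, assuming a normal mean with known standard error): for H: mu <= r, the divergence P-value is the maximum two-sided P-value over the half-line (1 whenever m <= r, and otherwise the two-sided P at the boundary), while the NP P-value is the ordinary one-sided P-value, which is 0.5 when m sits exactly on the boundary:

        from scipy import stats

        def p_values_for_half_line(m, se, r):
            """P-values for H: mu <= r, given sample mean m with standard error se."""
            z = (m - r) / se
            p_np = stats.norm.sf(z)                           # ordinary one-sided (NP decision) P-value
            p_div = 1.0 if m <= r else 2 * stats.norm.sf(z)   # max two-sided P over all mu <= r
            return p_np, p_div

        print(p_values_for_half_line(m=1.0, se=0.5, r=1.0))   # on the boundary: (0.5, 1.0)
        print(p_values_for_half_line(m=2.0, se=0.5, r=1.0))   # above it: (~0.023, ~0.046), a factor of two apart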

    Best,



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 8.  RE: hypothesis formulation

    Posted 07-21-2023 07:59

    I see you switched the topic from p-values to confidence limits on a Rho parameter. OK, no problem. Observed r's and their corresponding p-values are very strongly correlated; essentially they are two different views of the same latent variable (Komaroff, 2020).

    Komaroff, E. (2020) Relationships between p-values and Pearson correlation coefficients, type 1 errors and effect size errors, under a true null hypothesis. Journal of Statistical Theory and Practice, 14, 49 - 59. https://doi.org/10.1007/s42519-020-00115-6

    In reply to the question from James Hawkes: Wellek (2010) may be one possible source for the relatively recent cosmetic addition of the nonequivalence symbol to the null hypothesis statement. I say "cosmetic" because Wellek (2010) wrote: "Accordingly, in this book, our main objects of study are statistical decision procedures which define a valid statistical test at some prespecified level alpha in the 0,1 domain of the null hypothesis." Next, I will paraphrase Wellek's null hypothesis statement to avoid any misunderstanding of the notation. Wellek proceeds to make two null hypothesis statements: (1) Theta <= Theta(lower) and (2) Theta >= Theta(upper). The lower and upper thetas are the same point estimate in the null hypothesis statements of nonequivalence, but with different signs (+ and -).  If these two null hypotheses are rejected, and they both must be rejected, that leaves the alternative hypothesis of equivalence. This is an example of brilliant statistical reasoning that is firmly rooted in the classical Fisher paradigm of null hypothesis significance testing. Equivalence testing does not replace statistical significance but adds a very clever and theoretically legitimate extension of Fisher's null hypothesis testing paradigm that can be easily applied in practice.

    Wellek, S. (2010). Testing Statistical Hypotheses of Equivalence and Noninferiority. Chapman and Hall/CRC.

    Dear Sander Greenland, I am astonished. You banned the beautiful tool known as statistical significance for decision making with small sample sizes and replaced it with the equivalence test? You were thinking too fast. The word "test" should have given you pause for thought. If you doubt me, ask the brilliant statisticians at SAS Institute who wrote a TOST algorithm with a default alpha = .05 as the cutoff for statistical significance. Their output contains both the confidence limits of equivalence and the p-value. One resolves the same dilemma, "equivalent" or "not equivalent," either by evaluating the 90% upper and lower limits of equivalence or simply by pondering the statistical significance of the random p-value. Nonetheless, when you reject the null hypothesis, you are left with only one provisional sample estimate of the true margin of equivalence.

    Fisher (1973) was clear on this point: "The statistical examination of a body of data is thus logically similar to the general alternation of inductive and deductive methods throughout the sciences. A hypothesis is conceived and defined with all necessary exactitude; its logical consequences are ascertained by a deductive argument; these consequences are compared with the available observations; if these are completely in accord with the deductions, the hypothesis is justified at least until fresh and more stringent observations are available" (pg. 8).

    I will add to this only that in my opinion a small, narrow, or tight alternative hypothesis, which is the margin of equivalence, requires many replications to ensure validity (accuracy), regardless of sample size. How many? That can be investigated empirically with computer simulation. A good topic for a graduate student looking for a dissertation research project. 

    Fisher R.A. (1973). Statistical Methods for Research Workers (14th Ed.). Hafner Publishing. Reproduced in Statistical Methods, Experimental Design and Scientific Inference (1995). Oxford University Press.



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------



  • 9.  RE: hypothesis formulation

    Posted 07-21-2023 13:57

    Dear Eugene,

    I believe that you have misunderstood my responses to you. I thus would beg you to please read what I am writing more carefully and respond more carefully. I am continuing to assume that you would not be averse to agreeing with at least some of my responses, and might allow that you may have misunderstood some of those.

    Toward those ends, I will attempt another clarification:

    1) As a small initial aside: My use of r had nothing to do with a "rho parameter"; I was using it simply as a generic known constant, standing for "radius" in the case of an equivalence interval. I see now that I should instead have used c or some other letter that would not invite immediate identification with a specific parameter.

    2) My use of tests in the Neyman-Pearson (NP) context translates immediately to P-values in the straightforward manner given by Lehmann in Testing Statistical Hypotheses (p. 70 of the 1986 ed.): The NP-Lehmann P-value for H is the smallest α (cutoff) for which rejection will occur (see also Cox, Scand J Stat 1977). Thus the NP P-value for an interval hypothesis H is the smallest α for which H would be rejected. This fact falsifies the notion that it is "impossible to run an interval test in practice". It also provides a positive answer to your request that "If you know of a practical way to test a null hypothesis parameter that is defined by an interval, please share": As I documented, there is a rich literature going back nearly 70 years on how to do interval tests in the NP decision-theoretic framework; those tests in turn immediately yield NP P-values, as per Lehmann. 

    3) For point hypotheses, the smallest-α definition of a P-value coincides with the older divergence definition of "the value of P" found in Pearson (1900, p. 160), in which the P-value is the tail area cut off by the observed test statistic in a hypothesized distribution (or more generally, in a reference distribution obtained after some sort of factorization or other method for dealing with nuisance parameters). This has led nearly all writings (including Cox's as well as some of my own) to treat the two definitions as if they are equivalent in all cases. But they aren't equivalent for an interval hypothesis H: different concepts of "P-value" can lead to different definitions which yield observed P-values that may differ by as much as a factor of two. Such a difference can arise when the smallest-α P-value is derived from the UMPU Hodges-Lehmann test but the divergence P-value is derived by maximizing the likelihood over H relative to the unconstrained maximum (this ML ratio is 1 when the data equal their expectation under some model in H). This P-value difference translates into differences in the 1-α interval estimates obtained by defining the 1-α interval as all points with p>α.

    The difference can be seen in older literature, albeit not described in the above fashion. For example, Berger & Hsu 1996 ('Bioequivalence trials, intersection–union tests and equivalence confidence sets', with discussion. Statistical Science 11, 283–319) distinguished equivalence tests derived directly from NP test-optimization principles from equivalence tests derived by examining whether a 1-2α confidence interval fell inside the equivalence interval; in their basic examples the latter interval equals the one derived as all points with a divergence P-value greater than 2α, and the resulting test is the TOST procedure for testing nonequivalence.

    4) Nowhere have I called for banning the algorithmic decision procedures traditionally labeled "significance tests" or NP hypothesis tests, or the summaries labeled "confidence intervals". I have instead emphasized that in reading the actual research literature in the biomedical and social sciences and popular articles, my colleagues and I have found that the labels "significance" and "confidence" continue to be taken incorrectly by everyday researchers and reporters in the ways we listed in our 2016 TAS article (cited on the last round here). These statistical terms remain widely confused with practical significance or with posterior probabilities or intervals derived from contextually sensible prior distributions. This confusion often leads to grossly incorrect scientific claims such as that a study "showed there is no difference" because p > 0.05 or that there is an important difference because p < 0.05. Statistics training programs do not seem to have had much impact on such problems; thus we have pushed for more accurate ordinary-language labels for the procedures and their outputs, to avoid confusion with practical and Bayesian concepts. Some of us also call for introduction of simple tests and P-values for interval hypotheses in basic training, such as the TOST procedure for nonequivalence. 

    We are hardly the first to complain about now-traditional terminology. For example, the old-school (inverse-probability) Bayesian Arthur Bowley objected to the term "confidence interval" back in the 1930s when Neyman introduced the term and concept, reportedly calling it a "confidence trick" (see Greenland, S. 2019. Are "confidence intervals" better termed "uncertainty intervals"? No: Call them compatibility intervals. British Medical Journal, 366:15381, https://www.bmj.com/content/366/bmj.I5381). It is worth noting that the relatively neutral term "P-value" for what Fisher called "significance level" had already begun appearing in research literature in the 1920s, and even Fisher sometimes used Pearson's term "value of P" for this tail area; he could have chosen to use that or "P-value" in place of "significance level" in his books. Had he done that and had Neyman used the more accurate term "coverage interval" for his "confidence interval", we could have been arguing about something else today.

    Alas, we instead face the seemingly impossible task of altering misleading statistical terms that became popular not even a century ago, yet are rooted in some minds as if they were religious traditions. That may only reflect how statisticians are as human as the faithful. Still, my colleagues and I have found it shocking that the modest and easy reforms we propose have been met with so much misrepresentation, derision, and even sarcasm, rather than with acknowledgement of the serious problems we are addressing. We would hope for constructive responses from those in the field of education, which is supposed to seek improvement in teaching and understanding.

    Examples of destructive responses include dismissive remarks that "it's just semantics". In reality, semantics (word meanings) are pivotal to understanding and communicating with nonmathematical colleagues and the general public. True, mathematical theory doesn't care how we label its objects. But every politician knows labels can make all the difference in describing the world. If you think such common sense about semantics deserves dismissal from applied statistics, I invite you to conduct this experiment: The next time you discuss results of studies that examine the relation of ethnicity to some outcome, substitute the n-word for "black" or "African-American", then please report back on the ensuing change in understanding of your intent.

    Sincerely,



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 10.  RE: hypothesis formulation

    Posted 07-23-2023 04:22

    Dr. Greenland.  

    I accused you of banning statistical significance without any evidence as proof.  You said you are "not guilty" so please forgive my presumption.

    Re: "Thus the NP P-value for an interval hypothesis H is the smallest α for which H would be rejected. This fact falsifies the notion that it is "impossible to run an interval test in practice". It appears your "H" is not a null hypothesis but a research hypotheses. I also understand that H could be vector, but is it possible to use your method, without loss of generality, if there is only one H. In Fisher's method alpha is a constant, but in your method apparently alpha is a random variable.

    Imagine the data in the table below came from a pilot RCT to prove that an experimental or novel treatment (Group A) is more effective than the standard of care (Group B) for treating some medical condition. The endpoint is measured on a continuous scale where a higher number indicates a better result. A mathematician might say that obviously treatment A is better because 10 ≠ 4.  For a statistician this fact is necessary but not sufficient. A statistician has a choice of analyses, but would most likely run an independent-samples two-tailed t-test with Ho: mu(A) = mu(B), alpha = .05. Please show me how you would analyze these data with your "interval test" method. 

              Group A   Group B
              11        2
              14        9
              7         0
              8         5
    Mean      10        4
    St. Dev.  3.2       3.9
    n         4         4
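
    A minimal sketch of the conventional analysis named above (an independent-samples, two-tailed t-test of Ho: mu(A) = mu(B) at alpha = .05, assuming equal variances), run on these data:

        from scipy import stats

        group_a = [11, 14, 7, 8]   # mean 10, SD about 3.2
        group_b = [2, 9, 0, 5]     # mean 4, SD about 3.9
        t_stat, p_two_sided = stats.ttest_ind(group_a, group_b, equal_var=True)
        print(t_stat, p_two_sided)  # t is about 2.4; the two-sided p is slightly above 0.05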



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------



  • 11.  RE: hypothesis formulation

    Posted 07-24-2023 14:06

    Prof. Komaroff,

    There seem to be some problems in the different ways we are using terms, which I will try to resolve in items 1-3 below (allowing that I may be misunderstanding your responses).
    I will then address your example in item 4 and address the more general issues we have raised in items 5-6:

    1) I'm not sure what you mean by "your H is not a null hypothesis but a research hypotheses". There is inconsistency in the use of "null hypothesis"; it appears that Fisher and most statisticians since have used it to mean any set H of constraints on a distribution that will be subject to evaluation. This is at odds with its ordinary English meaning of "null" for zero or nothing. Neyman eventually broke from this Fisherian usage and instead used "tested hypothesis" for H (see Neyman, Synthese 1977); I encourage that usage or "test hypothesis", unless H really does correspond to a constraint that there is no association or no effect. Similarly, Lehmann called arbitrary H including interval hypotheses "statistical hypotheses"; see for example Hodges & Lehmann, AMS 1954.

    2) You mentioned "Fisher's alpha". My impression on reading Fisher is that he would have objected, as he did extensively to NP theory. In the latter, alpha is a known pre-specified design constant to be used along with its complementary design constant of power at pre-specified alternatives. Fisher used cutoffs only as rough guides in combination with other considerations. Hence I have seen no place where he called his cutoffs alpha or implied a cutoff was a decision mandate (in tandem, he wrote of sensitivity rather than power, again without mandates or functional relations to cutoffs). Neyman and Fisher realized the depth of their split by the end of the 1930s, and it became quite acrimonious.

    Many authors since have lamented the confusion created by mixing their two formulations of statistical inference as if equivalent, when they are separated by several distinctions that neither dismissed as nuances. I found this out the hard way when in grad school at UC Berkeley I was an RA on one of Neyman's projects and took his class (in which I was harshly schooled in the distinctions between NP and Fisherian formulations). A few decades later Goodman wrote an exposition on these often-missed distinctions (Goodman, S.N. 1993. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137, 485–496).

    3) You wrote "in your method apparently alpha is a random variable". While I would love to have the credit, I'm afraid it is not "my" method, and in it alpha is not random. In this thread, for simplicity and to avoid confusion with Fisherian notions, I have stuck with the orthodox Neyman-Pearson formalization as expounded in detail in by Lehmann in his Testing Statistical Hypotheses, in which alpha is a fixed, explicit constant. There is also a random variable defined by Lehmann that in each sample yields the minimum alpha (not the prespecified design alpha) required for rejection, a realization which he defines as the P-value on p. 70 of the 1986 edition of TSH. Again, for interval H that P-value is not identical to the divergence P-value derived (for example) by maximizing the ratio of the likelihood constrained by H vs. the unconstrained likelihood (see my 2023 SJS article for a review).

    4) Your numeric example is not fleshed out enough for me to answer specifically; e.g., the sample sizes (4 in each group) are too small to justify a t-test via simple asymptotics, so are you assuming (as did Student) that the individual outcomes are Gaussian with common variance across patients and treatments? What is the target population for inference, the 8 patients or some superpopulation they are supposed to represent? Are you assuming a constant mean shift in going from treatment B to A in this target, whatever it is? If not, what is the distribution of effects? I presume you are assuming randomization to groups or some related identification condition, without which nothing could be said about effectiveness even with all the preceding items specified.

    Perhaps though the issue you are concerned about can be captured as follows when all the above specifics are answered:
    Following Neyman (1977) suppose declaring superiority of A is idealized as declaring the "true" mean difference d to be positive, and that declaring A superior when it is not is the error most important to avoid (much more costly than failing to find superiority of A when d is positive). We have specified a one-sided H: d ≤ 0 and thus I would use the corresponding one-sided P-value p to implement the NP decision rule in which H is rejected if p≤alpha, given this set-up, because all other things given, its size (Type-I error rate) would not exceed alpha when d is in H (it would be a valid rule), and it would provide the most power among valid unbiased rules (valid rules whose power never fell below their size). 
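
    Applied to the pilot-RCT numbers posted earlier (and granting all of the assumptions just listed, including equal variances), a minimal sketch of that one-sided rule looks like this; it is my own illustration, not a prescription:

        from scipy import stats

        group_a = [11, 14, 7, 8]
        group_b = [2, 9, 0, 5]
        # One-sided P-value for H: d <= 0, where d = mu(A) - mu(B); reject H if p <= alpha.
        t_stat, p_one_sided = stats.ttest_ind(group_a, group_b, equal_var=True,
                                              alternative="greater")
        print(t_stat, p_one_sided)   # roughly half the two-sided p for these data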

    5) To the best of my knowledge, the habitual use of a two-sided P appears to stem from Fisher. It can be defended on various grounds that I usually don't see given in problem statements and applications in medical research (along with much else often not given from the list above). For example, suppose the side was not really prespecified and instead the data made the choice; then the two-sided penalty of doubling the smaller of the one-sided P-values corrects for that choice in a way familiar in multiple comparisons and information theory: doubling p results in a decrement of one bit in the surprisal, losing the direction bit (the data information used to make the direction choice): s = -log2(2p) = -log2(p)-1. Some of the other discussants here gave other rationales. Yet another rationale is that I would rather see a descriptive P-value rather than an NP-testing P-value, a point I will return to below.
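
    A quick numeric check of that one-bit statement (my own arithmetic, written as code):

        import math

        p = 0.025                            # a hypothetical one-sided P-value
        s_one_sided = -math.log2(p)          # surprisal in bits, about 5.32
        s_two_sided = -math.log2(2 * p)      # doubling p removes exactly one bit, about 4.32
        print(s_one_sided - s_two_sided)     # 1.0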

    6) Whether we take H or the P-value as one or two sided, I should hope we agree that the above NP idealization doesn't begin to capture what would be important in real decision problems, such as the fact that the error costs will be a function of the actual differences made by treatments across patients. This problem is not addressed at all by lowering alpha or switching to Bayes factors as decision criteria. That is because the entire conceptualization of these problems as hypothesis tests or decision rules is deceptive insofar as it ignores such crucial details - details which the data could inform. (NP also encourages decisions based on single studies rather than complete literature review, a massive topic in itself.)

    There is a huge literature on decision theory that takes account of those details, including the extension of NP developed by Wald in the 1940s and its operational Bayesian counterpart developed in the same era. That pair of theories offer an interesting way of fusing frequentist and Bayes decisions via Wald's theorems showing that Bayes rules are frequentist admissible. Yet despite these theories becoming computationally tractable, the medical research literature has barely taken them up. Instead, article conclusions and medical decisions continue to pivot on whether the P-value (usually for some point hypothesis) passed some threshold, typically with no regard to realistic cost structures. I think that is because the statistical specifications and methods required for using such structures are far too demanding for typical research teams to deploy within the resources they have.

    My view is that the practical way to address this problem is to shift training in and use of P-values back to a more modest descriptive role where they are treated as continuous indices of compatibility between hypotheses or models and data. This is much as they were treated in Pearson 1900 and in some of Fisher's own applications, before they became formalized in NP theory as rigid decision criteria. This treatment of P-values can be found in writings by Cox (who used the term "consistency" for what Fisher called compatibility), Kempthorne (who used the term "consonance"), and Bayarri & Berger (who used the term "compatibility", which I adopted).  Other terms I have seen used for the idea include conformity, concordance, accord, and goodness-of-fit, the latter being the traditional term when compatibility is formalized as the proximity or nearness of the data to statistical expectations under H as measured by common divergence measures, such as a sum of squared standardized residuals or a negative log maximum-likelihood ratio (which are examples of Euclidean and Kullback-Leibler divergences, respectively). We have advocated this shift in many papers, including
    Amrhein, V., Trafimow, D., and Greenland, S. (2019). Inferential statistics as descriptive statistics: There is no replication crisis if we don't expect replication. The American Statistician, 73 supplement 1, 262-270, www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1543137
    Greenland, S. (2019). Some misleading criticisms of P-values and their resolution with S-values. The American Statistician, 73, supplement 1, 106-114,
    www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625
    Rafi, Z., and Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20, 244. doi: 10.1186/s12874-020-01105-9, https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01105-9
    Greenland, S., Mansournia, M., and Joffe, M. (2022). To curb research misreporting, replace significance and confidence by compatibility. Preventive Medicine, 164, https://www.sciencedirect.com/science/article/pii/S0091743522001761
    Amrhein, V., and Greenland, S. (2022). Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values. Journal of Information Technology, 37(3), 316-320. https://journals.sagepub.com/doi/full/10.1177/02683962221105904 

    Best,



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 12.  RE: hypothesis formulation

    Posted 07-26-2023 04:15

    Dr. Greenland. Before responding to Items 1 – 6, I want to be clear that I have no quarrel with either Bayesian or Neyman frequentist theory.  At the same time, I do not use a hammer to drive a screw, and I do not use a sledgehammer to drive a nail. I use the best and most efficient tool that gets the job done. 

    Statistical theory is a three-legged stool (Bradley Efron. "R. A. Fisher in the 21st century (Invited paper presented at the 1996 R. A. Fisher Lecture)." Statistical Science, 13(2), 95-122, May 1998. https://doi.org/10.1214/ss/1028905930). I am appalled when a statistical theorist wants to amputate one or two of the legs. Incidentally, the stool has turned into a chair with a backrest since the advent of data science. 

    The null hypothesis (H0) significance test is counterintuitive. Here is an attempt at an intuitive interpretation. My null hypothesis states: When I release a ball from my hand, it will always drop to the ground. To prove this null hypothesis, I must drop the ball infinitely many times, which is impossible. However, if my null hypothesis states: When I release the ball, it will never fall to the ground, then once the ball hits the ground the null hypothesis is rejected. 

    RE: your Item 1) "I'm not sure what you mean by 'your H is not a null hypothesis but a research hypotheses'. There is inconsistency in the use of "null hypothesis"; it appears that Fisher and most statisticians since have used it to mean any set H of constraints on a distribution that will be subject to evaluation." 

    Fisher's (1971) clear-cut description of the null hypothesis: "It is evident that the null hypothesis must be exact, that it is free from vagueness and ambiguity, because it must supply the basis of the 'problem of distribution' of which the test of significance is the solution" (p. 16).

    Fisher (1971). Design of Experiments (8th ed.)  Hafner Publishing. Reproduced in Statistical Methods, Experimental Design and Scientific Inference (1995), Oxford University Press.

    RE: Item 2) "Hence I have seen no place where <Fisher> called his cutoffs alpha or implied a cutoff was a decision mandate."

    Fisher (1973) said: "The value for which P=.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant" (p. 44).

    Fisher R.A. (1973). Statistical Methods for Research Workers (14th Ed.). Hafner Publishing. Reproduced in Statistical Methods, Experimental Design and Scientific Inference (1995). Oxford University Press.

    RE: 3) " You wrote 'in your method apparently alpha is a random variable'. While I would love to have the credit, I'm afraid it is not 'my' method, and in it alpha is not random."

    Thank you for clarifying that alpha is not random in NP theory.

    RE: 4) "Your numeric example is not fleshed out enough for me to answer specifically; e.g., the sample sizes (4 in each group) are too small to justify a t-test via simple asymptotics, so are you assuming (as did Student) that the individual outcomes are Gaussian with common variance across patients and treatments?"

    Your concern about violated assumptions is not a problem. I can think of two other analyses that have minimal assumptions but less power. 

    "What is the target population for inference, the 8 patients or some superpopulation they are supposed to represent?"

    My target population is the relevant sampling distribution of means.

    "Are you assuming a constant mean shift in going from treatment B to A in this target, whatever it is?" 

    I am unfamiliar with the "constant mean shift" concept. Are you asking about a patient by treatment interaction effect? 

    "I presume you are assuming randomization to groups or some related identification condition, without which nothing could be said about effectiveness even with all the preceding items specified."

    I stated that these data came from a pilot RCT. Please show me how you would run a one-tailed interval test, H: d <= 0, where I assume d = Mu(A) – Mu(B).

    RE  5) "To the best of my knowledge, the habitual use of a two-sided P appears to stem from Fisher." 

    Fisher (1973) wrote: "Some little confusion is sometimes introduced by the fact that in some cases we wish to know the probability that the deviation, known to be positive, shall exceed an observed value, whereas in other cases the probability required is that a deviation, which is equally frequently positive and negative, shall exceed an observed value; the latter probability is always half the former" (p. 45).

    RE: 6) "…real decision problems, such as the fact that the error costs will be a function the actual differences made by treatments across patients." 

    Is this another "constant mean shift" problem of variable costs of treating patients?  

    "That pair of theories offer an interesting way of fusing frequentist and Bayes decisions via Wald's theorems showing that Bayes rules are frequentist admissible."

    Einstein failed in deriving a grand unified theory of the forces in the known universe. However, even if he had succeeded, the theory would have to be revised to accommodate a relatively new dark force.  The "knotty, nuanced, and contentious" disputes among the theoretical forces in the statistics universe (e.g., Fisher, N-P, Bayes) have raged for over 100 years. Besides, how would the relatively new data science fit into a theory of everything? 

    "…medical decisions continue to pivot on whether the P-value (usually for some point hypothesis) passed some threshold, typically with no regard to realistic cost structures."

    A license to market a new medication is expensive. Typically, FDA regulators require two statistically significant RCTs. However, FDA is accepting Bayesian submissions for medical device applications and recommending confidence intervals for tests of bioequivalence. Apparently, the FDA shifted their world view from neo positivism to pragmatism.

    "My view is that the practical way to address this problem is to shift training in and use of P-values back to a more modest descriptive role…"

    Returning p-values to a more modest descriptive role is a euphemism for "statistical significance - don't say it and don't use it."



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------



  • 13.  RE: hypothesis formulation

    Posted 07-27-2023 08:52

    I forgot that MS formatting does not work in an HTML environment.  Below is a repeat of my post, edited with Notepad.  Dr. Greenland's comments are in quotes; my responses follow. 

    Dr. Greenland. Before responding to Items 1 – 6, I want to be clear that I have no quarrel with either Bayesian or Neyman frequentist theory.  However, I do not use a hammer to drive a screw into a wall, and I do not use a sledgehammer to drive a nail. I use the best and most efficient tool that gets the job done. Statistical theory is a three-legged stool (Efron, 1998). I am appalled when a statistical theorist wants to amputate one or two of the legs. Incidentally, the stool has turned into a chair with a backrest since the advent of data science. 

    Efron, B. (1998). "R. A. Fisher in the 21st century (Invited paper presented at the 1996 R. A. Fisher Lecture)." Statistical Science, 13(2), 95-122. https://doi.org/10.1214/ss/1028905930.

    The null hypothesis (H0) significance test is counterintuitive. Here is an attempt at an intuitive interpretation. My null hypothesis states: When I release a ball from my hand, it will always drop to the ground. To prove this null hypothesis, I must drop the ball infinitely many times, which is impossible. However, if my null hypothesis states: When I release the ball, it will never drop, then once the ball hits the ground the null hypothesis is rejected. Of course, the setting for this experiment must be in a gravitational field. 

    RE: your Item 1) "I'm not sure what you mean by 'your H is not a null hypothesis but a research hypotheses'. There is inconsistency in the use of "null hypothesis"; it appears that Fisher and most statisticians since have used it to mean any set H of constraints on a distribution that will be subject to evaluation." 

    Fisher's (1971) clear-cut description of the null hypothesis: "It is evident that the null hypothesis must be exact, that it is free from vagueness and ambiguity, because it must supply the basis of the 'problem of distribution' of which the test of significance is the solution" (p. 16). 

    Fisher (1971). Design of Experiments (8th ed.)  Hafner Publishing. Reproduced in Statistical Methods, Experimental Design and Scientific Inference (1995), Oxford University Press.

    RE: Item 2) "Hence I have seen no place where <Fisher> called his cutoffs alpha or implied a cutoff was a decision mandate." 

    Fisher (1973) said: "The value for which P=.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant" (p. 44).

    Fisher R.A. (1973). Statistical Methods for Research Workers (14th Ed.). Hafner Publishing. Reproduced in Statistical Methods, Experimental Design and Scientific Inference (1995). Oxford University Press.

    RE: 3) " You wrote 'in your method apparently alpha is a random variable'. While I would love to have the credit, I'm afraid it is not 'my' method, and in it alpha is not random."

    Thank you for clarifying that alpha is not random in NP theory. 

    RE: 4) "Your numeric example is not fleshed out enough for me to answer specifically; e.g., the sample sizes (4 in each group) are too small to justify a t-test via simple asymptotics, so are you assuming (as did Student) that the individual outcomes are Gaussian with common variance across patients and treatments?" 

    Your concern about violated assumptions is not a problem. I can think of two other analyses that have minimal assumptions but less power. 

    "What is the target population for inference, the 8 patients or some superpopulation they are supposed to represent?" 

    My target population is the relevant sampling distribution of means. 

    "Are you assuming a constant mean shift in going from treatment B to A in this target, whatever it is?"  

    I am unfamiliar with the "constant mean shift" concept. Are you asking about a patient by treatment interaction effect? 

    "I presume you are assuming randomization to groups or some related identification condition, without which nothing could be said about effectiveness even with all the preceding items specified." 

    I stated that these data came from a pilot RCT. 

    RE  5) "To the best of my knowledge, the habitual use of a two-sided P appears to stem from Fisher."  

    Fisher (1973) wrote: "Some little confusion is sometimes introduced by the fact that in some cases we wish to know the probability that the deviation, known to be positive, shall exceed an observed value, whereas in other cases the probability required is that a deviation, which is equally frequently positive and negative, shall exceed an observed value; the latter probability is always half the former" (p. 45). I sincerely doubt that "some little confusion" would have addicted him to two-sided p-values. 

    Fisher R.A. (1973). Statistical Methods for Research Workers (14th Ed.). Hafner Publishing. Reproduced in Statistical Methods, Experimental Design and Scientific Inference (1995). Oxford University Press.

    RE: 6) "…real decision problems, such as the fact that the error costs will be a function the actual differences made by treatments across patients." 

    Is this another "constant mean shift" problem of variable costs of treating patients?  

    "That pair of theories offer an interesting way of fusing frequentist and Bayes decisions via Wald's theorems showing that Bayes rules are frequentist admissible." 

    Einstein failed in deriving a grand unified theory of the forces in the known universe. However, even if he had succeeded, the theory would have to be revised to accommodate a relatively new dark force.

    The "knotty, nuanced, and contentious" disputes among the theoretical forces in the statistics universe (Fisher, N-P, Bayes, etc.) have raged for over 100 years. Besides, I suppose the relatively new Data Science would force a revision of the "Theory of Everything for Statistical Thinking, Reasoning and Literacy."  

    "…medical decisions continue to pivot on whether the P-value (usually for some point hypothesis) passed some threshold, typically with no regard to realistic cost structures." 

    A license to market a new medication is expensive. Typically, FDA regulators require two statistically significant RCTs. However, FDA is accepting Bayesian submissions for medical device applications and recommending confidence intervals for tests of bioequivalence. Apparently, the FDA shifted their world view from neo positivism to pragmatism. 

    "My view is that the practical way to address this problem is to shift training in and use of P-values back to a more modest descriptive role…" 

    Your call to return p-values to a more modest descriptive role is a euphemism for "statistical significance - don't say it and don't use it."

     



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------



  • 14.  RE: hypothesis formulation

    Posted 07-30-2023 16:00

    Dear Eugene,

    I am just now responding to your comments on Weds, reposted Thurs 28 July:

    0. I hope you can now see I am not trying to amputate anything. The medical analogy is apt, however, in that I am advocating improved practice in the form of better statistical hygiene. That includes expanding the selection of basic toolkits, and better communication about what the tools do and don't do. This improvement can be aided by less misleading terminology, and by descriptions that fit better into the application context. As I've already belabored, in this advocacy I sometimes feel like Semmelweis must have felt given the resistance he met for promoting hygienic medical practices.

    BTW I have used the hammer/screwdriver analogy for decades in advocating training and use in a broad spectrum of tools, especially to prevent the enduring misinterpretations of P-values as posterior probabilities (as opposed to bounds on the latter, as in Casella & Berger JASA 1987, or their translation into bounds on Bayes factors, as in Sellke Bayarri Berger TAS 2001 https://www.tandfonline.com/doi/abs/10.1198/000313001300339950), and the misinterpretation of P-values as NP error rates (for which P-values can be badly miscalibrated, especially when near high cutoffs like 0.05; again, see Sellke Bayarri Berger TAS 2001).
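
    For readers who want a feel for those bounds, here is a minimal sketch of the calibration as I recall it from Sellke, Bayarri and Berger (2001); please check the exact formula against the cited paper before relying on it:

        import math

        def sbb_bounds(p):
            """For p < 1/e: a lower bound on the Bayes factor favoring the point null,
            and the corresponding lower bound on the null's posterior probability
            under 50:50 prior odds (calibration as I recall it from SBB 2001)."""
            bf_bound = -math.e * p * math.log(p)
            post_bound = 1.0 / (1.0 + 1.0 / bf_bound)
            return bf_bound, post_bound

        print(sbb_bounds(0.05))   # roughly (0.41, 0.29): p = 0.05 is weaker evidence against the null than it may look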

    1. Thanks for your quotes from Fisher, which seem to show Fisher describing a "null hypothesis" as limited to a set H0 of sharp (equality) constraints on a distribution. I welcome your correction to my impression that he used the term the way later authors did, e.g., Cox & Hinkley, Theoretical Statistics 1974 (the book that was my first exposure to what some call the neoFisherian perspective) where on p. 64 the term "null hypothesis" is introduced for the lead concept of what they call "pure significance tests":

    "The hypothesis H0 is called the null hypothesis or sometimes the hypoth­esis under test; it is said to be simple if it completely specifies the density fY(y) and otherwise composite. Often composite hypotheses specify the density except for the values of certain unknown par­ameters."

    As can be seen from their formulation of composite hypotheses on p. 132, CH did not restrict test hypotheses to sharp equalities: among possible null hypotheses they include hypotheses that "the mean is in some interval". 

    Neyman used the term "tested hypothesis" while Lehmann used "statistical hypothesis" for what CH called "null hypothesis". Like CH, neither of those required that the hypothesis be limited to equalities. As I cited before, Hodges & Lehmann AMS 1954 explicitly discussed and provided NP tests for interval hypotheses (which are defined solely by inequalities); from those one can define NP P-values for interval hypotheses. As reviewed in my recent SJS paper, divergence P-values for inequality hypotheses are directly derivable and easily computed from familiar statistics; with suitable regularity (e.g., that used in GLM theory) they often reduce to the supremum of two-sided P-values over H, as in the example you posed.

    CH described the observed "significance level" pobs of their significance tests "as a measure of the consistency of the data with H0". Kempthorne & Folks (Probability, Statistics, and Data Analysis, 1971) used "consonance" for consistency, while Bayarri & Berger (JASA 2000) used "compatibility", which I adopted. It seems clear that Karl Pearson used "goodness of fit" in the same way, as we do when we use the "value of P" as a measure of the goodness of fit of the model to the data. For me the 0.05 convention has its most convincing rationale for diagnostic purposes, insofar as p>0.05 is taken to indicate that a particular diagnostic did not flag a problem (as long as we bear in mind that other diagnostics may detect important misfit). I find all these terms preferable to "significance" and "confidence" language, which suggest to readers that equivalence intervals and loss (harm-benefit) functions for errors have been used to derive "significance" cutoffs (as would be needed for assessing clinical significance) when they haven't, and that posterior probabilities have been generated when they haven't. 

    As I explained earlier, I came from a Neyman-Lehmann program which rejected use of "null hypothesis" except when the hypothesis was that a parameter or relation is null in the ordinary English sense, as in "null and void". I can now add another objection to Fisher for his forbidding of inequalities in test hypotheses, a restriction rendered unnecessary by other developments both in the neo-Fisherian and Neyman-Pearsonian branches of frequentism. Fisher was of course known for ignoring or dismissing developments that came after his own main contributions, such as both Neyman-Pearson-Wald and Bayesian decision theories (e.g., see Fisher JRSS B 1955 and the separate responses by Neyman and Pearson).

    I thus think Fisher went too far in excluding inequalities from H; but then, being neoFisherian means correcting Fisher's errors or updating his ideas as needed. I also depart from NP in that I think it preferable to describe all P-values as indices of fit to data of the combination of H and the set of background assumptions A used to produce the P-value, rather than in terms of decision rules. That sort of description helps one to see how P-values can always be interpreted as continuous diagnostic measures for hypotheses and models, whether or not any action point (cutoff) is specified. For example, the combination of H with background assumptions (e.g., H: no effect combined with background A: randomization of treatment) leads to predictions about how particular statistics (data summaries) will appear; so p=10^-8 for H indicates that a particular prediction is very far from the data in a familiar statistical sense (usually based on SDs or loglikelihoods as found in t and χ2 statistics), while p=0.60 indicates that the prediction is (in the same sense) not far from the data.
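
    To make that "distance" reading concrete, here is a minimal numeric sketch (assuming Python with scipy, and a plain normal approximation in place of the t and χ2 cases) converting a two-sided P-value back into the number of standard errors separating the prediction from the data:

    ```python
    from scipy.stats import norm

    # How many SEs from the prediction does a given two-sided P-value correspond to,
    # under a normal approximation?
    for p in (1e-8, 0.60):
        z = norm.isf(p / 2)   # |z| whose two-sided normal tail area equals p
        print(f"p = {p:g}  ->  about {z:.2f} SEs between prediction and data")
    # p = 1e-08 -> ~5.73 SEs (very far); p = 0.6 -> ~0.52 SEs (not far)
    ```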

    2. Your further quotes from Fisher did not show where he used any of the NP terms like "alpha level" or "Type-I error" or "test size", so I am not sure what your intent was with those quotes.
    As for "decision mandate", I apparently take that phrase in a stronger sense than you do, for I do not see it in your Fisher quote,

    "The value for which P=.05, or 1 in 20, is 1.96 or nearly 2... is convenient to take ... as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant."

    To me that is not mandating any decision; it is simply recommending a labeling convention for "significant", one to which Fisher himself allowed contextual flexibility, which he called arbitrary at points, and which he later softened when he famously said

    "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas" (Statistical Methods and Scientific Inference, Oliver and Boyd 1956).

    That is in sharp contrast to NP theory in which a test hypothesis H0, an alpha-level (not necessarily 0.05), and an alternative H1 are pre-specified based on context (including error costs) so that, along with power considerations, the theory mandates the decision to reject H0 in favor of H1 if p<α. In contrast to the Fisher quote, NP hypothesis testing is an algorithmic approach that Fisher assailed for research but which caught on nonetheless, perhaps because of its apparent utility for quality control (acceptance sampling).

    3. seems resolved (at least I hope so).

    4. Regarding your example, see my response to your other post in which you repeated your main inquiry in simpler form. I hope that resolves the issue. I also see Constantine posted a helpful answer in response to Ed Gracely.

    5. I did not mean to imply that Fisher had some confusion on the matter. I meant that the use of 2-sided tests seemed the mainstay of the original editions of Statistical Methods for Research Workers that trained so many in applied statistics, and thus became a convention which was defended on various grounds. I am fine if you or anyone can supply a better account of the origin of that convention, or demonstrate instead that pre-WWII editions of SMRW gave a thoroughly balanced coverage of 1-sided and 2-sided tests. (My impression has been that postwar editions were displaced in basic instruction by more NP-influenced texts, at least in the U.S.; in that regard, Neyman recounted that his split with Fisher became very personal and final in the mid-1930s when Neyman declined to use SMRW for teaching). I would of course welcome better-informed comments on the history.

    6. There seem to be several unrelated items listed here, with an undercurrent that struck me as closed to other viewpoints.
     
    For example, the relevance of your comment on Einstein eludes me. I merely mentioned a set of fascinating math-stat decision theoretic results by Wald showing that Bayesian decision procedures were frequentist admissible in a specific sense (Wald also showed a kind of converse, that frequentist-admissible decision procedures were either Bayesian or limits of Bayesian decision procedures). I think Wald's results have not received the attention they deserve. As with most any statistical result, they are useful to the extent one can formalize a real problem enough to apply them sensibly (which I have seen done on rare occasion in medical decision making problems). 


    I did not say Wald's results provide a "theory of everything", for they can't (and nothing can): They are results in formal decision theory, and thus it is not immediately obvious (to me anyway) how they would apply to information-summarization tasks; plus they assume complete utility and probability specifications, so it is not immediately obvious how they would apply to problems lacking such detail (which is to say, most problems I see). But Wald's theory is interesting and worth keeping in mind, at least for foundational thinking, even though Fisher despised it (see Fisher JRSS B 1955) - perhaps because Fisher was ultimately more concerned with information summarization via "fiducial" distributions (which Cox and Kempthorne argued can be properly represented by "confidence" distributions, i.e., P-value functions).

    Finally, you said "Your call to return p-values to a more modest descriptive role is a euphemism for 'statistical significance - don't say it and don't use.' "
    I'd say instead the latter Wasserstein-Schirm-Lazar quote is a quick summary of our call to move P-values back to descriptions of relations of models to data - that is assuming WSL meant don't use the term "statistical significance" for describing your results; give the P-value instead, even if you then note whether it met a pre-specified cutoff, if there was one. This advice does not change any statistical computation; instead, it makes the presentation more accurate in ordinary-language terms, in cognizance of the following issue:

    On close scrutiny, few studies in soft sciences can be certified to closely meet all the assumptions of the statistics they use to generate P-values and compatibility ("confidence") intervals (CIs). Hence those statistics should not be taken to be as definitive as dichotomized "significance" declarations make them out to be. Even well-conducted RCTs are very often subject to uncertainties in endpoint adjudication, adherence, causes of withdrawals, competing risks, and so on that are simply not accounted for by the presented statistics - the models from which those statistics are derived do not account for every single aspect of the real (as opposed to the idealized) mechanisms generating the data. 

    On top of that, there is now a plethora of techniques for analyzing most any data, with competing claims for efficiency, validity, and robustness - claims that are often based on assumptions that cannot be reliably checked at realistic sample sizes. Yet the choices made among them can affect outputs noticeably. For example, how did the trial handle censoring? Ordinary partial likelihood? Or the less efficient, more "sample-hungry" but more robust inverse-probability-of-censoring weighting? It can matter quite a bit for both P-values and CIs.

    It is no surprise then if well-informed sensitivity analyses reveal that the sharp numbers emitted from software should really be seen as fuzzy to some degree, which in turn reveals how decisions based on (say) P falling just above or below a sharp cutoff (whether 0.05 or 0.005 or ...) look haphazard and arbitrary. We can still use those decision rules, but in doing so we should not fool ourselves or our students or readers that this usage tells us the significance of results or the confidence we should have in them. That caution is especially important when there are arbitrary or uncertain assumptions behind outputs, for then a well-informed decision must examine decision robustness - how the decision changes under variations in assumptions, models, and methods.

    In the face of that reality, and shorn of philosophical commitments and mathematical idealizations, are you claiming that p=0.049 and p=0.051 (or 0.0049 and 0.0051) are scientifically different results, regardless of context and method choices, so that the first always counts as "significant" and the second never does? That's where your arguments seem to me to be headed, at least so far.

    (As a footnote, I see more and more FDA approvals where there weren't two RCTs with p<0.05, sometimes not even one.)

    All the Best,

    Sander



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 15.  RE: hypothesis formulation

    Posted 07-21-2023 11:37

    A very interesting discussion and, as always, I learn a great deal from Prof. Greenland's writings.

    I thought I would use this opportunity to put in a short plug for my former student Brian Segal's work (entirely his own) on exceedance intervals, which are confidence intervals for the probability that a parameter estimate will exceed a specified value in an exact replication study.  The idea has its roots in a Bayesian posterior predictive distribution setting, although the development is entirely frequentist.  Although no statistical method is a panacea, I thought this approach deserves more attention than it has received thus far.



    ------------------------------
    Michael Elliott
    University of Michigan
    ------------------------------



  • 16.  RE: hypothesis formulation

    Posted 07-21-2023 20:43

    Thank you Michael for the kind words and for the citation to exceedance intervals (of which I had not been aware).

    On quick glance the TAS paper by Brian Segal looks interesting, albeit demanding. Has Brian followed it up with a more elementary primer illustrated with some toy examples and a simple but real application? Such a primer would help get the method into use. If a primer exists or is forthcoming, please let us know where it is posted.

    I did spot one small aspect of the paper that I would alter: On p. 130 it stated "For point null hypotheses, Bayes factors tend to be more conservative, that is, Bayes factors provide less evidence against the null hypothesis than p-values..." I see this kind of comment often, and I think it is misattributing a property of observers to a mere number obtained from a computation. P-values do not overstate evidence against the hypothesis H from which they are computed; rather, people overstate the evidence against H that p = 0.05 represents, thanks to the entrenchment of the 0.05 cutoff as a criterion for "significance". Bayes factors merely provide one way of seeing how little evidence p=0.05 represents.

    A straightforward non-Bayesian way of seeing that point uses an old teaching exercise: 
    Consider a coin-tossing mechanism and take H to be the hypothesis that the mechanism is not loaded (biased) toward "heads". Let p(n) = 2^-n be the P-value for H from seeing n heads in an experimental test of the mechanism comprising n tosses. Then p(4) = 0.0625 and p(5) = 0.03125, placing p = 0.05 closest in evidence to getting 4 heads in 4 tosses. I think most people would appreciate the weakness of such evidence if asked to bet substantial money against H based only on that result.

    This exercise can be extended to an observed P-value p for any hypothesis H by converting it to the binary surprisal or S-value s = -log2(p); p then equals 2^-s. When s is an integer n, p equals the aforementioned p(n) from seeing n heads in the experiment with n tosses. The S-value s can also be seen as a measure of the Shannon information against H that the P-value conveys; the units of s correspond to bits of information, and p=0.05 represents only about s=4.3 bits of information against H. For contrast, p=0.005 represents 7.6 bits, and the one-sided 5-sigma criterion for "discovery" (rejection of a null H) in particle physics corresponds to about 22 bits against H, or 22 heads out of 22 tosses.
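
    For readers who want to check these numbers, a minimal sketch (Python; scipy assumed only for the 5-sigma tail area):

    ```python
    from math import log2
    from scipy.stats import norm

    def s_value(p):
        """Binary surprisal: bits of information against the tested hypothesis."""
        return -log2(p)

    print(s_value(0.05))        # ~4.3 bits, roughly 4 heads in 4 tosses
    print(s_value(0.005))       # ~7.6 bits
    print(s_value(norm.sf(5)))  # one-sided 5-sigma "discovery" criterion: ~21.7 bits
    ```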

    My colleagues and I have found this conversion of P-values to coin tosses and surprisals to be very useful in stemming common overinterpretations of P-values. Thus, in addition to background theoretical papers justifying and elaborating the usage, we have published a number of introductory treatments for various fields, including (among others)
    Rafi, Z., Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20, 244. doi: 10.1186/s12874-020-01105-9, https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01105-9,
    online supplement at https://arxiv.org/abs/2008.12991
    Cole, S.R., Edwards, J., Greenland, S. (2021). Surprise! American Journal of Epidemiology, 190, 191-193. https://academic.oup.com/aje/advance-article-abstract/doi/10.1093/aje/kwaa136/5869593
    Amrhein, V., Greenland, S. (2022). Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values. Journal of Information Technology, 37(3), 316-320. https://journals.sagepub.com/doi/full/10.1177/02683962221105904

    I will look forward to a similar basic introduction to exceedance probabilities.

    Best,



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 17.  RE: hypothesis formulation

    Posted 07-24-2023 10:30

    Sander,

    Always gets fun when you jump in the waters.

    I've always wondered about the interpretation of S-values,

    Consider a coin-tossing mechanism and take H to be the hypothesis that the mechanism is not loaded (biased) toward "heads". Let p(n) = 2^-n be the P-value for H from seeing n heads in an experimental test of the mechanism comprising n tosses. Then p(4) = 0.0625 and p(5) = 0.03125, placing p = 0.05 closest in evidence to getting 4 heads in 4 tosses. I think most people would appreciate the weakness of such evidence if asked to bet substantial money against H based only on that result. 

    This exercise can be extended to an observed P-value p for any hypothesis H by converting it to the binary surprisal or S-value s = -log2(p); p then equals 2^-s.

    The interpretation seems to correspond to a 1-sided p-value, right? But the example is a typical 2-sided setup, ie, we would be "surprised" if we got either n heads or n tails (in n tosses). So, p(5) = 0.0625 and p(6) = 0.0313, and the S-value is then -log2(p)+1.  From an informational theoretical standpoint, then, the 2-sided p-value would seem to carry 5.3 (not 4.3) bits of info against H.

    Trivial point, but I've found it a bit tricky to explain to (the rare) attentive students/practitioners. 

    Regards.



    ------------------------------
    Constantine Daskalakis, ScD
    Thomas Jefferson University, Philadelphia, PA
    ------------------------------



  • 18.  RE: hypothesis formulation

    Posted 07-24-2023 18:29

    Hi Constantine,

    Small point perhaps, but not trivial, because, as you found and as I did, it raises a trickiness for teaching.

    Before explaining, allow me to correct your example:
    There are several ways to define 2-sided P-values. In the simple case of tossing with Pr(heads) = 0.5 and seeing all heads they all yield twice the 1-sided P-value I used for n heads in n tosses: 2^-n; call that 1-sided P-value p.
    With n tosses all heads, the 2-sided P-value is 2p = 2(2^-n) = 2^-(n-1), whose negative base-2 log is n-1 (not n+1). Thus the S-value from the 2-sided P-value is -log2(p)-1.
    With p = 0.05 we get 2p = 0.10 and s = -log2(2p) = -log2(p)-1 = 3.3; for reference, that is between the probabilities of 3 and 4 heads in a row, p(3) = 0.1250 and p(4) = 0.0625.
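
    A quick numeric check of that correction (plain Python):

    ```python
    from math import log2

    p = 0.05                # one-sided P-value
    s1 = -log2(p)           # ~4.32 bits
    s2 = -log2(2 * p)       # ~3.32 bits: doubling the P-value costs exactly one bit
    print(s1, s2, s1 - s2)  # the difference is 1.0
    ```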

    After some waffling, a few years ago I decided that the most straightforward way of explaining the information content of actual observed P-values was using the all-heads example I posted earlier, as that applies to any input P-value, whether 1-sided or 2-sided or many-sided (like that from a test of model fit): The binary S-value provides one simple measure of the information in that P-value against whatever hypothesis or model is being evaluated. That is so even when adjustments or penalties have been applied to get the actual P-value. The coin-tossing formulation I use converts the actual P-value being evaluated into a 1-sided P-value in a reference experiment on coin tossing. This is exactly as is done in particle physics, in which P-values are converted to the one-sided standard normal cutpoint ("sigma") that would produce them as the upper tail area; here the reference experiment is a single draw from a standard normal distribution.
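
    A small sketch of those two reference experiments side by side (Python with scipy assumed): each P-value is converted both to the particle-physics "sigma" (the one-sided standard-normal cutpoint with that upper tail area) and to the equivalent number of heads in a row from a fair coin:

    ```python
    from math import log2
    from scipy.stats import norm

    def as_sigma(p):
        """One-sided standard-normal cutpoint whose upper tail area equals p."""
        return norm.isf(p)

    def as_heads_in_a_row(p):
        """Equivalent run of heads in the coin-tossing reference experiment."""
        return -log2(p)

    for p in (0.05, 0.005, 2.9e-7):   # 2.9e-7 is roughly the one-sided 5-sigma tail area
        print(f"p = {p:g}: {as_sigma(p):.2f} sigma, {as_heads_in_a_row(p):.1f} heads in a row")
    ```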

    In that description, the reference point for evaluation of a P-value is not described as the P-value testing fairness (which is 2-sided), but rather as the P-value for testing no loading (bias) for heads, which is one-sided. Fairness, Pr(heads) = 0.5, is used for the reference distribution because it is the closest one can come to bias in favor of heads without having that bias, and it is what people think of intuitively for a reference distribution when testing for loading in either direction (we should be grateful for and take advantage of any time that intuition leads to the correct statistical answer!). I went this route in part because using instead the 2-sided P-value for heads brings in complications that arise from disputes about 1-sided vs 2-sided hypotheses and tests, as reflected in the present thread. It is tricky to finesse those disputes and requires a lot of background to appreciate the details, all of which can be avoided if for a moment one forgets statistical theory (or doesn't have any) and just focuses on the probability of getting all heads in an experiment of n tosses to check for bias toward heads. 

    An historical aside: P-values in some form (not by that name) date back to the early 1700s. They were becoming popular and even hacked by researchers by the 1840s; by the 1880s they started to be linked to  the then-new concept of "statistical significance" (see Shafer, G. 2020. On the nineteenth-century origins of significance testing and p-hacking. http://www.probabilityandfinance.com/). Those were all one-sided P-values however, or at least I know of no reference to 2-sided P-values before Fisher, so I'd be curious if any exist; that they seem to have appeared only after two centuries and had to wait for someone like Fisher to popularize them might raise suspicions that they are less intuitive than the original one-sided P-value formulations.

    The problems of interpreting 2-sided P-values can be seen from an information-theory standpoint, where a 1-sided P-value of p = 0.0625 in a coin tossing experiment to check for bias toward heads would become a 2-sided P-value of 2p = 0.1250 from the same experiment, which represents only 3 bits of information against some hypothesis. But which hypothesis? Bias in either direction? Why check the tail direction when we saw all heads? And why this loss of 1 bit of information?

    There are several ways to answer these questions depending on one's preference or dislike for directional hypotheses with their 1-sided P-values vs. point hypotheses with their 2-sided P-values.

    For those who dislike 1-sided hypotheses and P-values, a direct 2-sided explanation takes 2p as giving the information against the point hypothesis H: Pr(heads) = 0.5, bypassing one-sided derivations. A 2-sided P-value of 0.1250 then represents only s = -log2(0.1250) = -log2(0.0625)-1 = 3 bits of information against that H. But this two-sided P-value arose from 4 heads in a row, which has probability 0.0625 under H. I think the 2-fold discrepancy between the P-value and the probability of the observed run of heads is bound to confuse students!

    One-sided explanations for the discrepancy can avoid that immediate confusion at a cost of much more sophisticated arguments, as found for example in Cox's writings (SJS 1977) which expressed a preference for thinking of 2-sided P-values as derived by combining two 1-sided tests.

    To illustrate the informational view of that combination, first suppose we are given only that the 2-sided P-value 2p = 0.1250, not the direction in which the deviation occurred. Then we can only say that one of the directional deviations has p=0.0625 but we don't know which one. With S-values we can say that there is one bit of missing directional information (the sign bit when the boundary point of H is 0, as with the logit of the heads probability). In the coin-tossing example, it is as if we are given only a one-sided p = 0.0625 but not whether that was from all heads or all tails, so we don't know whether the result is information against H: Pr(heads) ≤ 0.5 or against H: Pr(heads) ≥ 0.5. That is a loss of one bit of information.

    We do however ordinarily see the direction; in that case Benjamini described the use of 2p as a Bonferroni-type adjustment or penalty for picking the smaller of the two 1-sided P-values. Extending that to S-values, here is what I posted to Komaroff:

    suppose the side was not really prespecified and instead the data made the choice; then the two-sided penalty of doubling the smaller of the one-sided P-values corrects for that choice in a way familiar in multiple comparisons and information theory: doubling p results in a decrement of one bit in the surprisal, losing the direction bit (the data information used to make the direction choice): s = -log2(2p) = -log2(p)-1.

    I find it satisfying that both preferences lead us to the same answer of one bit for the information loss in going from 1-sided to 2-sided P-values. But again, for basic teaching this all seems to me to be worth bypassing by using the treatment I posted earlier in which the actual P-value being evaluated (regardless of its sidedness) is set equal to the probability of all heads in n tosses, 2^-n, and the equation is solved for n (or more generally, s, the number of bits of information supplied by the actual observed P-value against whatever model was used to compute it).

    Best,



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 19.  RE: hypothesis formulation

    Posted 07-24-2023 20:36

    Haha, yes, I should know better than to try to put my logic into even basic math on the run. Obviously, we can't gain info, only lose (hence more difficult to reject w/ 2-sided than w/ 1-sided). Thank you for the correction.

    Personally, I have a strong aversion to 1-sided testing in biomedical contexts (not least because I find it more amenable to misrepresentation and intellectual cheating). But that's a different topic.

    Cheers.



    ------------------------------
    Constantine Daskalakis, ScD
    Thomas Jefferson University, Philadelphia, PA
    ------------------------------



  • 20.  RE: hypothesis formulation

    Posted 07-25-2023 09:24

    Hi Kostas. We met at the Harvard School of Public Health when I was a Research Scientist at ACTG. I recall your frustration teaching basic statistics to students in a GEN ED program. At that time, I could not commiserate, but now I feel your pain after formally teaching online and in-person classes on basic statistical practice for the past 13 years.  It is very hard to teach the foundational statistical tests, and it becomes harder when it comes to statistical modeling like multiple regression, multivariate analysis, and beyond.
      
    Regarding Professor Greenland's speculation about one and two tailed tests:  "An historical aside: P-values in some form (not by that name) date back to the early 1700s. They were becoming popular and even hacked by researchers by the 1840s; by the 1880s they started to be linked to  the then-new concept of "statistical significance" (see Shafer, G. 2020. On the nineteenth-century origins of significance testing and p-hacking. www.probabilityandfinance.com/). Those were all one-sided P-values however, or at least I know of no reference to 2-sided P-values before Fisher, so I'd be curious if any exist; that they seem to have appeared only after two centuries and had to wait for someone like Fisher to popularize them might raise suspicions that they are less intuitive than the original one-sided P-value formulations."

    First, take a look at the title in Gosset's (1908) brilliant, ground-breaking logic and method for converting a population standard deviation into a standard error, no doubt under the tutelage of Karl Pearson.

    Gosset WS ("Student," 1908). The probable error of a mean. Biometrika 6 (1), 1–25. 

    Now, here is what Fisher (1973) said about the concept called probable error: "The value of the deviation beyond which half the observations lie is called the quartile distance, and bears to the standard deviation the ratio .67449. It was formerly a common practice to calculate the standard error and then, multiplying it by this factor, to obtain the probable error. The probable error is thus about two-thirds of the standard error, and as a test of significance a deviation of three times the probable error is effectively equivalent to one of twice the standard error" (p. 45).

    Fisher R.A. (1973). Statistical Methods for Research Workers (14th Ed.). New York: Hafner Publishing. Reproduced in Statistical Methods, Experimental Design and Scientific Inference (1995). New York: Oxford University Press. 

    Seems to me Professor Greenland wants us to believe that Fisher was nothing more than a social media influencer spreading misinformation. It is now clear to me that the statisticians who banned statistical significance, and I don't know who they are besides the three authors of the TAS (2019) editorial, also disparaged the statistical reasoning that preceded Fisher small sample theory and that certainly includes Pearson's large sample theory. 
     



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------



  • 21.  RE: hypothesis formulation

    Posted 07-25-2023 17:24

    Prof. Komaroff,

    I have looked at Student 1908 once more and see no 2-sided P-value in it. Thus I am not clear as to why you cited it.
    If there is a 2-sided P-value in it, please point us to exactly where it can be found.

    I am also unclear as to the purpose of your Fisher quote. I have often read that passage and others in scholarly articles attempting to explain the origin of the 0.05-cutoff convention. I have seen nothing in them however in which Fisher used the NP terms "alpha", "Type-I error" or "test size" to label or justify such cutoffs, even in his writings (such as your cite) long after they had become established in most Anglo-American statistics literature (apart from in his derogatory remarks about NP-Wald theory). If you have such a cite, please point us to exactly where it can be found.

    As for the 0.05 convention, both Fisher and Neyman separately (in their own terms) described the choice of testing cutoff as context dependent, e.g., Fisher said "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas" (Statistical Methods and Scientific Inference, 2nd ed. 1959, p. 42). When I met Henry Oliver Lancaster in 1985, he recounted to me how, when asked if he regretted anything in his career, Fisher snapped back "Ever mentioning 0.05!".

    Fisher's regret might have differed if in his Statistical Methods for Research Workers he had given more primacy to his earlier usage, for example "If the value of P so calculated, turned out to be a small quantity such as 0·01, we should conclude with some confidence that the hypothesis was not in fact true of the population actually sampled" ("Applications of Student's Distribution", Metron 1925, p. 90). Meanwhile, Neyman's final writings were exceptionally clear about how his fixed alpha-level needed to be based on costs of errors (e.g., Neyman, Synthese 1977). So I think it safe to say they both would have rejected any attempt at a universal claim for 0.05. It is thus completely unclear to me how your quote of Fisher about probable error bears on the issues I have been discussing.

    Regarding Fisher's contributions, it seems you are eager to attribute to me views that I do not have and that are in fact antithetical to what I have been writing here and publishing for years.
    You wrote:

    Seems to me Professor Greenland wants us to believe that Fisher was nothing more than a social media influencer spreading misinformation.

    That comment is so the opposite of true that I had to struggle to understand why you would say that. I think you have misread as if critical of Fisher my statement that the way 2-sided P-values "seem to have appeared only after two centuries, and had to wait for someone like Fisher to popularize them, might raise suspicions that they are less intuitive than the original one-sided P-value formulations". My fault for being unclear: I meant that it took a genius of Fisher's stature to clarify the concept and importance of 2-sided P-values so that they could achieve wide adoption, for (as I explained to Constantine) 2-sided P-values are more difficult to understand correctly than are 1-sided P-values. That difference in difficulty can be seen from the fact that 1-sided P-values are easy to express as limits of (and in fact originated from) Bayesian posterior probabilities (see Casella & Berger, JASA 1987; reviewed in Greenland, S., and Poole, C. 2013. Living with P-values: Resurrecting a Bayesian perspective. Epidemiology, 24, 62-68), whereas 2-sided P-values pose a challenge to Bayesian interpretations (e.g., see Bayarri & Berger, JASA 1987).

    Still, I think you might have read my remark correctly if you had been reading my posts carefully to their end and reading the articles I cite. Those contain quite favorable views of Fisher's ideas, and which start from preferring the informational foundation for statistics he promoted over the decision-theoretic foundation in NP theory, which he vehemently opposed. In fact in other posts I have classified my views as neo-Fisherian! For example, if you had read to the end of my reply to Constantine Daskalakis, you would have seen that for your example I expressed a preference for Fisher's 2-sided P-value as an information summary, even though (as I explained) the 1-sided P-value is dictated as the decision-theoretic summary in the strict NP-testing formulation given by Lehmann in TSH.

    I would point you again to a careful reading of the articles I have cited in this thread, including the recent pair in the Scandinavian Journal of Statistics,
    https://doi.org/10.1111/sjos.12625
    https://doi.org/10.1111/sjos.12645

    which also cite Karl Pearson's theory of statistical model checking as part of the foundation, and build on that and Fisher's concepts of information, reference distributions, and significance levels, and the refinements of those concepts developed by Cox and colleagues. I only depart in taking care to relabel their "significance levels" as P-values (a relabeling which was already starting to happen in the 1920s, as Shafer documents, and was adopted by Cox in his final book in 2011), and in distinguishing their tail-area P-values from the minimum-alpha P-values of NP theory.

    I was forced to understand the Fisher vs. Neyman distinction because I was schooled directly by Neyman himself, and even rebuked by him for expressing preference for the Fisherian approach - although his former students on the department faculty at the time - Lehmann, David, and Scott (my advisor) - tried to shield me from his ire. I also appreciate the analogous Bayesian distinction (operational-Bayesian decision theory is the Bayesian analog of NP-Wald decision theory; reference-Bayes theory is the analog of Fisherian reference frequentism) - in fact for two decades I traveled around the world giving Bayesian workshops. I think both these distinctions should be clarified in all statistical training; the frequentist vs. Bayes split is often emphasized but the information-summarization vs. decision split is typically neglected, leading to much confusion in teaching and practice.

    You also wrote:

    It is now clear to me that the statisticians who banned statistical significance, and I don't know who they are besides the three authors of the TAS (2019) editorial, also disparaged the statistical reasoning that preceded Fisher small sample theory and that certainly includes Pearson's large sample theory.

    With that you seem to confuse calls to relabel observed "significance levels" as P-values with calls to ban statistical tests and P-values.
    P-values are a central statistic in the Pearson-Fisher approach of computing and presenting tail areas of statistics (the "value of P" in Karl Pearson and Fisher) to evaluate statistical models or hypotheses.
    A major problem is that many books and tutorials also use "significance level" for the fixed design alpha of NP theory, resulting in widespread misinterpretation of P-values as if they were pre-specified alphas; such misinterpretations lead to profoundly miscalibrated inferences, e.g., see Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62–71.

    To mix up calls for careful terminology with calls for bans of methods is a complete and unwarranted confusion that many fall prey to, probably because there are many other authors who do want to drop the Pearson-Fisher methodology from usage, replacing it either with orthodox NP hypothesis tests (e.g., Lakens) or the test inversions called "confidence" intervals (e.g., Rothman), or else replacing it with Bayesian measures such as Bayes factors (e.g., Goodman). Among these extremes I find that it is the nonBayesians who seem to most misunderstand and dismiss Fisher and his use of P-values (his reputation among epidemiologists was badly damaged by his skepticism of the smoking link to lung cancer).

    Very few journals have actually enacted any bans, and a cursory examination of prestigious medical journals will show "significance" as code for "p<0.05" is still the dominant convention. The one major improvement that these reform movements have produced is the routine presentation of interval estimates; I hope we would all agree that is good. What is under fierce debate is whether more reform is needed. I and many others say yes, but so far little in the way of further reform has been taken up in practice because there is little agreement on what should be done.

    I have promoted "safe" use of divergence (Pearson-Fisher) P-values, taking the baby steps that we be sure to call them P-values rather than "significance levels", call fixed cutoffs "cutoffs" or "alpha levels", and present P-values in continuous form, without reference to a cutoff - the reader can always insert their own cutoff (whether 0.05 or 0.005 or...). Both Lehmann and Cox recommended continuous presentation of P-values, as one could see by careful reading of their textbooks. Yet these proposals have been attacked by orthodox Neyman-Pearsonians and Bayesians alike, with special invective from the NP orthodoxy (for whom I am an apostate or heretic). My response is to take being attacked from both wings as a sign that I am on the right track, and a suggestion that I am hitting a special nerve in exposing an unscientific rigidity and resistance to reform in a statistical orthodoxy.
     
    Again, I have not called for "banning" anything. Instead, following my favorite statistical thinkers (e.g., Box, Cox, Good, Mosteller, Tukey), I call for understanding and carefully justified use of all approaches, along with wariness of confusions that such a toolkit philosophy can engender. For example, we need to be wary of identifying P-values and "confidence" intervals with posterior probability statements (they are often numerically similar or lead to the same decision, but their interpretations differ in important ways), or confusing Fisher's testing philosophy with Neyman's (they sometimes lead to the same numeric result or decision, but again their interpretations differ in important ways).

    All this means is that we should teach the information vs. decision distinction just as we do the frequentist vs. Bayesian distinction. Crossing these distinctions leads to a 2x2 table of questions and tools for answering them, with Information-summarization vs. Decision goals on one axis and Calibration vs. Predictive goals on the other. Elaborations to more rows,  columns and dimensions will no doubt be needed, but I think that teaching these distinctions is a start toward addressing the practice problems we lament.

    Best,



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 22.  RE: hypothesis formulation

    Posted 07-26-2023 05:54

    Dr. Greenland. Please forgive me if my remarks are unwarranted and offensive. You are a profound, theoretical statistician with an extensive and impressive publication record. You have earned the respect and the well-deserved reputation as a scholar and teacher not only from me, but from an entire lively but contentious world-wide community of statisticians. In fact, I dreamt you were Goliath, and I was David but had no stone in my pocket. I truly am honored but intimidated by your interest in my humble musings.

    I am asking you to simply help me understand your pushback to my statement:  An inequality in the null hypothesis is conceptually understandable as a one-tailed test, but mathematically is impossible. This statement is true because I believe in the theory of sampling distributions. Let's completely remove the equality to minimize confusion and state H: d < 0.  Please show me a computer program or tell me the statistical software that I can use to evaluate your one-tailed inequality hypothesis. 



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------



  • 23.  RE: hypothesis formulation

    Posted 07-28-2023 20:15

    Dear Eugene,

    Please forgive my tardy reply - I have had to attend to other matters over the past few days.
    Also, with regrets I may have to delay response to your other (reposted) list until the weekend...

    I am of course deeply flattered by and thank you heartily for your too-kind remarks. I confess I was most surprised given the earlier parts of our exchange, so I was at a loss as to how to respond. As for being a Goliath, the proper term might instead be dinosaur.

    I should say (and perhaps should have said sooner) that I have seen your work in the past and thought it was eminently sensible (which is the highest compliment I know of for a scientist or engineer, including statisticians among those). Furthermore you seem to be operating from views not far from mine. So I was taken aback at the contentiousness and the confusion of my points with more radical views and proposals, especially as I have been a staunch defender of P-values and neoFisherian (informationalist) ideas against attacks from all sides (NP, likelihoodist, Bayesian).

    Also, I have had trouble understanding some of your statements - it seems as if we speak different dialects, leading to misunderstandings when words are the same but their meanings are shifted (as in "false French friends" or other cognate confusions, illustrating the importance of semantics in discussing statistics):

    I am asking you to simply help me understand your pushback to my statement:  An inequality in the null hypothesis is conceptually understandable as a one-tailed test, but mathematically is impossible. This statement is true because I believe in the theory of sampling distributions. Let's completely remove the equality to minimize confusion and state H: d < 0.  Please show me a computer program or tell me the statistical software that I can use to evaluate your one-tailed inequality hypothesis. 

    I simply could not fathom what you meant by that passage. What is mathematically impossible?
    Also I am unclear why you are dropping the boundary point of zero from H, although assuming continuity of the parameter, statistics, and distributions I believe this only means we'd have to shift from minima to infima in some technical descriptions, so for now I can accommodate it.

    With all that continuity in place, then, as I wrote before and unless I have made a mistake, the standard one-sided P-value for d=0 provides a valid (i.e., size≤α for all d in R(H) = {d: d<0}) test of H: d<0 via the NP test (decision rule) "reject H if p≤α" and its distribution dominates a uniform variate if H holds. Are you claiming that this P-value or test of H is not valid? 

    Under continuity that one-sided P-value is the Lehmann (NP) decision P-value for H; but it's not the divergence P-value, which is instead twice that and thus equals (but is not defined as) the usual Fisherian two-sided P-value for d=0.
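
    In case a numeric check is helpful, here is a small simulation sketch of the validity claim (Python with numpy/scipy assumed; a known-variance z test and made-up settings): the rule "reject H if p≤α", with p the one-sided P-value for d=0, rejects with probability at most α everywhere in {d: d<0}.

    ```python
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    alpha, n, sims = 0.05, 50, 100_000

    for d in (0.0, -0.1, -0.3):              # true mean; the last two lie in H: d < 0
        x = rng.normal(d, 1.0, size=(sims, n))
        z = x.mean(axis=1) * np.sqrt(n)      # z statistic for d = 0 with known SD = 1
        p = norm.sf(z)                       # one-sided P-value for H(0): d = 0 vs d > 0
        print(f"true d = {d:5.2f}   rejection rate = {np.mean(p <= alpha):.4f}")
    # ~0.05 at the boundary d = 0, and smaller for d < 0: size <= alpha over H
    ```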

    The rest of this post just elaborates on points I covered earlier in this thread, offered only in case there are any residual misunderstandings about my goal in answering you with NP theory: it was simply to show that, in terms used by the most entrenched system in American statistics of the latter 20th century and used throughout journals and policy, it is quite possible and often easy to test an H defined by inequalities (as shown by Hodges & Lehmann, AMS 1954).

    My use of NP tests to respond should not however be taken as an endorsement of NP: quite the contrary, I prefer neoFisherian divergence ideas for the kind of problems I have encountered in health and medical sciences. Those ideas can also generate a P-value for H defined by inequalities, often just by switching to maximization over H of two-sided P-values; I don't like to call those P-values "tests" however because that might suggest they are part of NP theory, which is inappropriate here. Divergence P-values produce valid tests but those are less powerful when they differ from NP-optimal decision P-values. I would rather see divergence P-values described as indices of compatibility between H and the data given background assumptions - or from a model-checking view, compatibility between a specific model M and a more general, less restricted model A, in light of the data. For more of the theory see sec. 2 and the Appendix of the Greenland 2023 SJS main paper.

    Now to repeat some laments from earlier in our thread, in tendentious and perhaps tedious detail:

    I was forced into the role of a P-value defender when the journal Epidemiology (of which I was one of the founding editors in 1990 and which has since become one of the top journals in its field, especially for epidemiologic methods) banned display of P-values for parameters, a move I protested without success. Since then I have been involved in dozens of articles aimed at instructors and researchers about how to teach and use P-values in ways that I found help avoid the misuse that P-critics complain about.

    An ideologically diverse and contentious group of colleagues still managed to agree enough to catalog major misuses in TAS 2016 (Greenland, Senn, Rothman, J. Carlin, Poole, Goodman, and Altman). We advised presenting P-values as the numbers they are, not as inequalities like "p<0.05" (which can be done even if their interpretation makes reference to alpha levels), a move advised by authorities both from the NP tradition (e.g. Lehmann) and from the Fisherian tradition (e.g. Cox). 

    We all knew how common it remains that "statistical significance" or lack thereof is confused with practical significance or lack thereof, and how common it remains that P-values are confused with alpha-levels, probably because both get called "significance levels". These confusions can be somewhat mitigated simply by adopting long-standing, more precise terms in place of terms using "significance" or "significant". I later teamed with other colleagues to repeat that advice in several articles starting with Amrhein, Greenland and McShane in Nature 2019. Dishearteningly, that advice to change to less ambiguous yet familiar labels was promptly attacked and confused with calls for banning tests and P-values.

    Among other terminology reforms that we have advised are to replace talk of "significance" and "confidence" with compatibility, a usage that can be found in Fisher, and which by the start of this century could be found in several other worthy sources; and to replace "null hypothesis" with "tested hypothesis" (as Neyman did) or with "test" or "target" hypothesis, unless indeed the hypothesis is that a parameter is zero or that some variables are independent. We have also advocated teaching devices to aid perception of information by plotting P-values, and by transforming probability statements into physical experiments and natural frequencies, as Gigerenzer and colleagues demonstrated effective in many educational experiments - but that is another long story.

    It is discouraging to see how such simple constructive reforms to address calls for bans are resisted, with some critics writing as if we had made up these ideas (we merely compiled and blended them from across a vast literature stretching back to Pearson 1900), and as if the replaced terminology is sacred tradition (imagine defending offensive ethnic terms on grounds that it is only semantics and those terms can be used properly by those who are trained adequately). The result has been little change so far and thus continuing confusion among researchers, hence more calls for bans - some of which have been successful. Regardless of divergent philosophical stances about statistics, we need to constructively address critics with genuine changes, not hold onto what are often arbitrary traditions as if they reflect the soul of statistical science.

    Best Wishes,

    Sander



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 24.  RE: hypothesis formulation

    Posted 07-30-2023 11:55

    Hi, Sander:

    Your main paragraph is this one:

    With all that continuity in place, then, as I wrote before and unless I have made a mistake, the standard one-sided P-value for d=0 provides a valid (i.e., size≤α for all d in R(H) = {d: d<0}) test of H: d<0 via the NP test (decision rule) "reject H if p≤α" and its distribution dominates a uniform variate if H holds. Are you claiming that this P-value or test of H is not valid? 

    This seems to work OK if all we care about is a decision rule for rejecting the null AND if the p value is in fact <= alpha.

    How does your rule work if p > alpha, say p = 0.07, do not reject the null? It seems like it no longer provides a valid decision for cases in which d < 0, say d = -1 (and the corresponding p = 0.01).

    Ed



    ------------------------------
    Edward Gracely
    Associate Professor
    Drexel University
    ------------------------------



  • 25.  RE: hypothesis formulation

    Posted 07-30-2023 12:26

    Ed, my understanding is that it doesn't work that way.

    If we have H0: d<=0 vs. H1: d>0, then that null is equivalent to H0: {d=0 OR d=-1 OR d=-2 etc.}, ie, the union of point nulls for an infinite number of d's. Therefore, the alternative H1 (rejecting the H0) is the intersection of the alternatives for those point nulls.

    The p = 0.01 you cite is the p-value for testing just one of those values, d=-1. Even though it rejects d=-1 (at alpha 0.05), it is not sufficient to reject H0: d<=0. For that, you must reject ALL possible values within the H0 space. Hence, the corresponding p-value is the maximum of the p-values corresponding to testing ALL the d's in the H0 space. In this case, it is the p-value for testing d=0.

    I'm sure Sander or someone else will correct me if I'm wrong.

    Hope to see you in Toronto.

    Best,

    Constantine



    ------------------------------
    Constantine Daskalakis, ScD
    Thomas Jefferson University, Philadelphia, PA
    ------------------------------



  • 26.  RE: hypothesis formulation

    Posted 07-30-2023 13:34

    Hi, Constantine:

    Thanks. That makes sense, assuming we accept the logic that you must reject all values in the null hypothesis to reject it. So if we would reject 99% of the values in that range, but cannot reject 1%, we fail to reject the null. Hm.

    To me it makes more sense to use the null as d = 0 and argue that the logic of a one-tailed test is that d < 0 is a priori ruled out so we aren't testing those values.

    I will not be at JSM this year. Will miss seeing all of you!

    Ed



    ------------------------------
    Edward Gracely
    Associate Professor
    Drexel University
    ------------------------------



  • 27.  RE: hypothesis formulation

    Posted 07-31-2023 04:48

    Hi Ed,

    To answer your latest posts, let me start by saying that what "makes sense" depends on the actual, real-world question being addressed and how that question is made precise enough - that is, how it is mapped into a formalization - so that deductions of the sort found in math stat can be applied to select a computationally feasible data-analysis procedure. Translation of the output of the chosen procedure back to the real world needs to step backwards carefully through the mapping into the math. 

    The translations from the real application to the formal method and back to the application are not addressed by math stat or by most of what is called statistical theory and foundations. Yet the translation steps are critical in determining the extent to which statistical analysis helps or hinders the application (whether the goal is information summarization or decision or whatever).

    Thus, we cannot say a statistical procedure makes sense or is preferable to another without an application context, and a map between real objects and actions in the context and the math objects and rules in the procedure.

    The 1-sided vs 2-sided issue illustrates these points...
    Real questions about treatment effects have forms along the lines of these oversimplified examples:

    1) Context: I don't want to switch to treatment T1 unless it does better than the one I currently use, treatment T0, because switching is costly. So I ask:
    Does treatment T1 do better than treatment T0? This is a 1-sided question.

    2) Context: I have a choice before me between T1 and T0. So I ask:
    Are T1 and T0 equivalent for practical purposes? This is a 2-sided question.

    3) Context: I am not faced with a choice between T1 and T0; instead, I want to gather information to inform choices between T1 and T0. So I ask:
    How much do treatments T1 and T0 differ? This is unsided - it's asking for an estimate.

    In moving from (1) to a formal test, note that there is no assumption in the question that T1 doesn't do worse than T0. So, why insert such an assumption? We don't need it to formalize the question in a way that parallels the words. If I cannot demonstrate that T1 and T0 differ in the direction for preferring T1, I'm done: Stay with T0. There's no need to test anything worse than "no difference".

    This informal application logic is reflected formally when we map "T1 does better than T0" to δ>0:
    Let p(d) be the upper one-sided p-value for H(d): δ=d. Then, given the data, p(d) increases with d.
    Thus p(0) will be the largest p(d) of any d in the interval (-∞,0], that is, p(d)<p(0) for any d<0.
    Hence p(0), the one-sided P-value for H(0): δ=0, is equal to p(-∞,0], the P-value for the interval hypothesis that δ is in the half-interval (-∞,0].
    There is no need to assume that d<0 is ruled out (unless d<0 corresponds to something physically impossible), which is a good thing because that assumption isn't part of the context; but H(-∞,0): d<0 will automatically get rejected if H(0): d=0 gets rejected.
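
    A small sketch of that monotonicity (Python with scipy; the estimate and standard error are made-up numbers):

    ```python
    from scipy.stats import norm

    est, se = 1.2, 0.8                # hypothetical estimate of delta and its standard error

    def p_upper(d):
        """Upper one-sided P-value for H(d): delta = d, under a normal approximation."""
        return norm.sf((est - d) / se)

    for d in (-4.0, -2.0, -1.0, -0.5, 0.0):
        print(f"d = {d:5.2f}   p(d) = {p_upper(d):.5f}")
    # p(d) increases with d, so its supremum over (-inf, 0] is attained at d = 0,
    # making p(0) the P-value for the half-interval hypothesis.
    ```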

    In moving from (2) to a formal test, most of the time I see that done by using a 2-sided P-value 2min(p(0),1-p(0)) to test H(0): δ=0. As Hodges and Lehmann AMS 1954 pointed out, this point hypothesis test doesn't correctly answer question (2) because (2) is framed in terms of practical differences, not "no difference at all" (which allows not even an epsilon-small difference; H(0) is δ=0 exactly). So they proposed instead an NP-optimal decision rule for (or test of) the interval hypothesis H[-r,r]: -r ≤ δ ≤ r, where r is the radius of equivalence.

    Unfortunately the HL rule leads to a type of NP P-value incoherency (Schervish, TAS 1996). One response to that problem is to switch to using divergence P-values as described (for example) in Greenland SJS 2023; but this involves switching from testing to descriptive use of P-values. That in turn leads to recognizing that many applications approached as if they were testing issues as in (1) and (2) are more appropriately treated as estimation questions as in (3). Of course, estimation can be done with P-values: A point estimate is simply the point d at which the 2-sided P-value for H(d): δ=d is maximized (assuming continuity, the maximum P-value is 1), and a 2-sided α-level compatibility interval (CI) is the interval of points d for which the 2-sided P-value for H(d): δ=d exceeds α. When the P-values are from likelihood-ratio (deviance) statistics, the resulting point estimate is the MLE and the CI is a likelihood interval.
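
    To make the estimation reading concrete, a minimal sketch under a normal approximation (Python with numpy/scipy; the same kind of made-up estimate and SE as above):

    ```python
    import numpy as np
    from scipy.stats import norm

    est, se, alpha = 1.2, 0.8, 0.05
    d = np.linspace(est - 5 * se, est + 5 * se, 4001)  # grid of hypothesized values
    p1 = norm.sf((est - d) / se)                       # one-sided P-value function p(d)
    p2 = 2 * np.minimum(p1, 1 - p1)                    # two-sided P-value function

    point = d[np.argmax(p2)]                           # where the 2-sided P-value peaks (~1)
    ci = d[p2 > alpha]                                 # 95% compatibility ("confidence") interval
    print(point, ci.min(), ci.max())                   # ~1.2 and ~1.2 -/+ 1.96*0.8
    ```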

    I hope all that makes sense,
    Sander


    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 28.  RE: hypothesis formulation

    Posted 07-31-2023 08:41

    Hi, Sander:

    Thank you for your thoughtful reply.

    Your scenarios 2 and 3 are interesting, and I read them with attention, but they are more complex than my example, so I will let others comment on them.

    As for scenario 1, I think we need to consider this from two angles. First, is a one-tailed test in this scenario the best approach logically? Second, is it the best approach practically?

    Logically it may be, under ideal theoretical circumstances, notably if it is true that the only meaningful scientific question is whether T1 is better, with T0 being better having no interest, and if this decision was made in advance of the study for purely logical reasons.

    In that case one could define scenario 1 as a one-tailed test. Ultimately, the analysis should be designed to answer the question.

    In practice, I would not be comfortable going there, or encouraging my students to go there. I would worry about data analysts basing the one or two tailed decision on what is of interest, rather than what is scientifically plausible. It involves too much judgment. And unless it was clearly planned in advance (in a protocol that is published) it carries a risk of bias.

    My sense from studies I have seen is that many (perhaps, most) researchers and statisticians use a 2-sided test when there is merely "interest" in one direction. Since it is scientifically plausible that the difference favors T0, they conduct a 2-sided test to take it into account.

    Also, for scenario 1, I think that as was true for (2) few people would see the decision question as merely better or not. They would say, "I will switch to T1 if it is better than T0 by at least r". Then you need a 95% CI that excludes r in favor of T1 to make the switch. I can live with that kind of one-sided difference. It has a more solid feel to it than simply "better"!

    Unrelated question -- is anybody else experiencing multiple timeout errors when trying to reply to the discussion board?

    Ed



    ------------------------------
    Edward Gracely
    Associate Professor
    Drexel University
    ------------------------------



  • 29.  RE: hypothesis formulation

    Posted 07-31-2023 13:31

    Hi Ed,

    Thanks for your thoughtful reply. To respond to 1,3,2 out of my original order, plus add a Bayesian observation:

    1) We agree completely about reformulating (1) as "I will switch to T1 if it is better than T0 by at least r". Nonetheless I would regard a two-sided 95% CI as inadequate for answering the question unless the α-level was clearly and appropriately set at 0.025 given the actual cost of moving to T1 when it isn't better by at least r.
    In the spirit of testing for superiority, I'd instead advise giving the one-sided P-value p(r) for H(r): δ≤r. The recipient can then compare p(r) to any cutoff they want. Again, that's advice given by Lehmann and by Cox among others.

    More generally still, suppose I don't want to presume I know either the cutoff α or the minimum difference r that the recipient wants to use. Then I could supply a graph of the one-sided P-value function p(d) against the difference d. Recipients of this information can now pick whatever cutoff α and minimum improvement r they want: they just look at the d=r point on the d-axis to see if p(r) ≤ α. They can even forego a decision or choice of α or r, thus deferring treatment choice. This general approach goes back at least to the confidence distribution concepts of Cox AMS 1958 and Birnbaum JASA 1961, although both Cox and Kempthorne attribute it to Fisher 1930.

    3) I don't think estimation (3) is more complex than one-sided testing; it just involves reading the graph of the P-value function by entering along the y-axis instead of the x-axis to find d:
    The point estimate is the d at which p(d) = 0.5, and a two-sided 1-α CI has boundaries at the points d for which p(d) = α/2 and p(d) = 1-α/2.

    In light of the preference for 2-sided P-values today, more often we graph instead the two-sided P-values p2(d) = 2min(p(d),1-p(d)). The point estimate is then the d at which p2(d) = 1, and a two-sided 1-α CI has boundaries at the points d for which p2(d) = α. That is graphed for example in Rafi & Greenland 2020 (https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01105-9), along with a profile-likelihood function for comparison.
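
    For anyone who wants to try this, here is a minimal R sketch of the one-sided and two-sided P-value functions and how the point estimate and 95% interval are read off them; the estimate b and standard error s are hypothetical, and a normal (Wald-type) approximation is used rather than the deviance statistics mentioned above:

        # Minimal R sketch (hypothetical numbers, normal approximation).
        b <- 0.40; s <- 0.25                 # assumed estimate and standard error
        d  <- seq(b - 4*s, b + 4*s, length.out = 401)
        p1 <- pnorm((d - b)/s)               # one-sided P-value function p(d)
        p2 <- 2*pmin(p1, 1 - p1)             # two-sided P-value function p2(d)
        plot(d, p2, type = "l", xlab = "difference d", ylab = "P-value")
        lines(d, p1, lty = 2)
        d[which.min(abs(p1 - 0.5))]          # point estimate: the d with p(d) = 0.5 (= b)
        range(d[p2 > 0.05])                  # approximate 95% compatibility interval
        b + c(-1, 1)*qnorm(0.975)*s          # closed form for comparison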

    2) We can also answer question (2) from the graph of p2(d), but that question (and the mirror question seen in equivalence testing) raises distinctions and technicalities that I've alluded to earlier and discussed at great length in the SJS 2023 paper. So for now I'll just say that I think the topic of evaluating practical differences is of such importance that it belongs in basic training. That evaluation can be done by describing an interval of no practical difference [-r,r], leading to an interval hypothesis H[-r,r]: -r≤δ≤r which can be evaluated by providing a P-value that is the maximum two-sided P-value in the interval; that max-p is below α if and only if the two-sided 1-α CI falls entirely outside the interval. Simplistic though it is, that approach is far more realistic than the current oversimplified default of testing H(0): δ=0, which is the same as setting r=0 and thus treating even the slightest difference as having practical importance.
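
    In case a concrete computation helps, here is a minimal R sketch of that max-P evaluation of H[-r,r]; the numbers (b, s, r) are hypothetical and a normal approximation is assumed:

        # Minimal R sketch (hypothetical numbers, normal approximation).
        b <- 0.90; s <- 0.25; r <- 0.30; alpha <- 0.05
        p2 <- function(d) { p1 <- pnorm((d - b)/s); 2*pmin(p1, 1 - p1) }
        # Max two-sided P-value over [-r, r]: equals 1 if the estimate is inside
        # the interval, otherwise it is attained at the nearer boundary.
        maxp <- if (abs(b) <= r) 1 else max(p2(-r), p2(r))
        ci <- b + c(-1, 1)*qnorm(1 - alpha/2)*s
        maxp < alpha                         # reject "no practical difference"?
        (ci[1] > r) | (ci[2] < -r)           # same conclusion: CI entirely outside [-r, r]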

    4) One more point, which connects P-values to Bayesian analyses: We can interpret the one-sided P-value function p(d) as an approximate posterior cumulative distribution for δ under a "relatively flat" prior that carries negligible information compared to the likelihood function. There are many ways to formalize the idea of a negligibly informative prior, an idea which has generated volumes of often highly technical reference-Bayes writing. I have never seen a real problem in which the choice among them made a practical difference (apart from the fact that some of the choices involve far more computation than they are worth). The best known is the Jeffreys invariant prior, but when using logistic, loglinear and proportional-hazards models the symmetric log-F(a,a) priors with degrees of freedom of a=2 or less are easier to interpret, require no special software or posterior sampling to produce posterior intervals, e.g., see Greenland & Mansournia SIM 2015 (Penalization, bias reduction, and default priors), https://onlinelibrary.wiley.com/doi/10.1002/sim.6537
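
    To make the first sentence of 4) concrete, here is a minimal R sketch in which the correspondence is exact; the estimate b and standard error s are hypothetical, and the normal model with a flat prior is an assumption of the sketch:

        # Minimal R sketch (hypothetical numbers): in the normal model, the upper
        # one-sided P-value p(d) equals the posterior Pr(delta <= d) under a flat
        # prior, since that posterior is N(b, s^2).
        b <- 0.40; s <- 0.25
        d <- c(-0.2, 0, 0.2, 0.4, 0.6)
        p_onesided <- pnorm((d - b)/s)            # p(d)
        post_cdf   <- pnorm(d, mean = b, sd = s)  # flat-prior posterior CDF
        cbind(d, p_onesided, post_cdf)            # the two columns coincide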

    Best,
    Sander

    P.S. I'm getting time-out errors too. I guess we should contact the ASA office in charge.



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 30.  RE: hypothesis formulation

    Posted 08-01-2023 14:10

    Professor Greenland. Our discussion started with my objection to an inequality sign in a null hypothesis statement. You offered the equivalence test as an example of an interval null hypothesis and sent me to Hodges and Lehmann (1954) for an explanation. I liked their first sentence from their Summary (abstract) below, but stopped reading after the last sentence.

    "The distinction between statistical significance and material significance in hypotheses testing is discussed. Modifications of the customary tests, in order to test for the absence of material significance, are derived for several parametric problems, for the chi-square test of goodness of fit, and for Student's hypothesis. The latter permits one to test the hypothesis that the means of two normal populations of equal variance, do not differ by more than a stated amount"( Hodges & Lehmann, 1954, p. 165). 

    The first sentence resonates to the present day. The conflation of statistical significance with substantive significance needs to stop immediately. These concepts are related but not identical. At the end, the mention of Student's hypothesis and the words "do not differ by more than a stated amount" are familiar. Student's hypothesis is most likely his innovative small-sample standard error that replaced the population sigma in the large-sample z-test. The "stated amount" is called the "margin of equivalence" today.

    BTW, researchers struggle to postulate a reasonable margin of equivalence for a sample size calculation. They have the same difficulty coming up with a reasonable alternative parameter.  Their response: if I knew that, I would not be working on this grant proposal.

    To debate whether one should use a p-value or a confidence interval to test a point null hypothesis is a waste of precious mental energy. Fisher and Neyman were both right!  I prefer p < α for statistical significance because it is easier than making sure that the point null parameter is not included in a 1-α confidence interval.
     
    It appears your dislike of the point null hypothesis stems from the well-documented (by you and others) blatant abuse/misuse and/or naïve misunderstanding of the concept of statistical significance. However, "statistically significant - don't say it and don't use it" is not the solution - proper education is the cure. On the other hand, a ban on the ridiculous misinterpretation of statistical significance as substantive significance is urgently needed. This flawed conflation has been forcefully magnified by research articles in scholarly, peer-reviewed journals.



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------



  • 31.  RE: hypothesis formulation

    Posted 08-01-2023 18:20

    Dear Eugene,

    From your response below, I'm afraid it looks to me like you continue to not read thoroughly or at least not understand what I am writing here or the citations I provide.

    I suppose we are all too busy with other more important items, so I will try to be brief this time:

    1) I use P-values for point hypotheses (as well as for interval hypotheses) all the time, and not just for point null hypotheses. That can be seen in the citations I sent, for example in my recent SJS articles, where the P-values for point hypotheses turn out to be divergence P-values when we work within everyday models (like GLMs). So please don't confuse my calls to use them properly with some "dislike" of them. P-values and tests for point (dimension-reducing) hypotheses are simply more difficult to use properly than are those for one-sided hypotheses, at least in the continuous models that are the mainstay of everyday statistics (including normal-theory tests like Student's t, as well as GLMs), because in "soft sciences" like ours those models and the point hypotheses in them are almost never exactly correct.

    2) You wrote " 'statistically significant - don't say it and don't use it' is not the solution - proper education is the cure". Once again, it seems you have failed to distinguish what Wasserstein, Schirm & Lazar (TAS 2019) meant by that phrase from the much more radical calls to ban all P-values as well as NP hypothesis tests, which some journals have enacted. At least one journal has even banned CIs, noting how they are simply inversions of statistical tests; I think that ban is quite harmful, even antiscientific.

    In response to such bans, we (my colleagues and I) have in many articles described concrete teaching devices and practice reforms to preserve P-values and CIs while addressing the extensive abuses that we have documented (and which you agree exist). Yet, for decades now, some statisticians (it seems mostly senior ones) have been defending "statistical significance" against such reforms with statements like "proper education is the cure" while offering no hint at what constitutes "proper education", how they have implemented it, and how it could prevent that abuse; thus, perhaps unsurprisingly, the abuse has remained rampant. 

    I have not seen anything from you on how you address the abuse problems. Do you have any constructive reforms and educational devices you wish to share with us?

    On that topic, you have not responded yet to my responses to your list of statements from last week. I ended my response with an important question to you, which slightly expanded is:
    Are you claiming that p=0.049 and p=0.051 (or 0.0049 and 0.0051 or whatever cutpoint straddling you choose) are scientifically different results, regardless of context and method choices, so that the first is always "significant" and the second is always not, given the cutpoint?

    Finally, a more general question for you: Do you agree with anything I've written? If so, I would welcome a list so that any future exchange can focus on matters of disagreement.

    Best Regards,
    Sander



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 32.  RE: hypothesis formulation

    Posted 08-02-2023 15:24

    I have enjoyed this discussion and have, up to now, been delighted to stand on the sidelines and learn. But let me add two small things to the discussion that may be of interest:

    1) one-tailed vs. two-tailed tests - as a graduate student I remember John Tukey once exclaiming "don't ever invent a test, because if you do someone will surely ask for the one-tailed values." He was then asked "do you mean you should never do a one-tailed test?" "No," he replied, "it depends on who you're talking to -- some people will believe anything."

    What was he getting at? The key idea is that if you are willing to reject one hypothesis because it is very unlikely given the data you observed (forgive this Bayesian view -- a more frequentist statement might be that the data observed are unlikely given that hypothesis) you should also reject a similarly unlikely event at the other extreme. Let me offer one example: a chi-square is ordinarily thought of as a naturally one-tailed test, but there is the other tail (a very short one, for sure) that might correspond to the data fitting too well -- better than you would expect. So, for example, had a two-tailed test been done of Cyril Burt's twin data, we might have uncovered his fabrications much sooner.

    More on this is in a 50-year-old paper by my favorite author:

    The other tail. The British Journal of Mathematical and Statistical Psychology, 26, 182-187, 1973.
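
    For anyone who wants to compute "the other tail", here is a minimal R sketch; the statistic and degrees of freedom are invented purely for illustration:

        # Minimal R sketch (invented numbers): both tails of a chi-square
        # goodness-of-fit statistic X2 on df degrees of freedom.
        X2 <- 1.2; df <- 9                   # a suspiciously good fit
        pchisq(X2, df, lower.tail = FALSE)   # usual upper-tail P-value (poor fit)
        pchisq(X2, df)                       # lower-tail P-value: fit "too good to be true"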

    2) If hypothesis testing is not held too rigidly to traditional binary structures, some interesting alternatives emerge. One example comes to mind (its origin was, I think, Fred Mosteller, but it too could've been Tukey). Consider a binary set of hypotheses on population means -- say Ho: mean 1 = mean 2 vs. H1: mean 1 unequal to mean 2. We all know that the likelihood of two means being exactly equal is usually vanishingly small, and if we just had a big enough sample we could show it. So why bother doing the experiment, since we know that with a better (big enough) experiment we could reject Ho? So instead we switch to a trinary set of hypotheses: H1: mean 1 > mean 2, H2: mean 1 < mean 2, H3: we don't have enough data yet to tell. I have long felt that the theoretical flexibility represented by this sort of thinking brings the formal world of hypothesis testing closer to the real world we live in.
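
    In case it helps to see what such a trinary rule can look like in practice, here is a minimal R sketch based on where a two-sided 95% CI for the difference falls relative to zero; this is just one simple way to operationalize the idea, not the only one, and the summary numbers are hypothetical:

        # Minimal R sketch (hypothetical numbers): a trinary conclusion based on
        # the position of the 95% CI for (mean 1 - mean 2) relative to zero.
        diff <- 0.8; se <- 0.5
        ci <- diff + c(-1, 1)*qnorm(0.975)*se
        if (ci[1] > 0) {
          "conclude mean 1 > mean 2"
        } else if (ci[2] < 0) {
          "conclude mean 1 < mean 2"
        } else {
          "not enough data yet to tell"
        }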

    My mechanic was telling me that he had to leave work early to buy a lottery ticket. I told him that his chances of winning were the same whether he bought a ticket or not. This is one example of trinary hypothesis testing.



    ------------------------------
    Howard Wainer
    Extinguished Research Scientist
    ------------------------------



  • 33.  RE: hypothesis formulation

    Posted 08-02-2023 22:26

    Hi Howard,

    Glad to read you have been enjoying these exchanges. Of course they are just a few of the thousands of articles, letters and posts on the same issues since at least Pearson's 1906 complaint about misinterpretation of a large value of P as if it meant "no difference" (a misinterpretation which prevails to this day in various journals) - see Pearson K, 1906. Note on the significant or non-significant character of a subsample drawn from a sample. Biometrika 5, 181-183,
    https://www.jstor.org/stable/2331656

    1) One vs. two-tailed tests: The χ2 goodness-of-fit statistic provides good examples in which the lower tail is relevant and others where its relevance is unclear...

    The lower tail becomes relevant when underdispersion is considered a possibility. The examples of that I've seen are mostly where fraud is suspected, as in Fisher on Mendel's data or your example of Cyril Burt.

    Nonetheless, there are many χ2 goodness-of-fit examples where only the upper tail appears relevant, as I have usually encountered in my work. Example: Checking fit of an outcome model used to estimate a specific hypothesized effect in secondary analysis of a large medical database. The model is just a device for adjusting covariates, and so the χ2 check is only a side diagnostic for lack of fit - if the model fits very poorly the adjustment may be inadequate even if all the necessary covariates have been entered in the model. In this setting, suspicions of underdispersion would be hard pressed to find a person, motive, and means to push the fit statistic far into the lower tail, especially when the main research question was formulated after data collection. Fraud, if present, would not have been aimed at improving the model fit, but rather at shrinking or inflating the targeted effect estimate - the direction depending on whether one was trying to hide or produce evidence of an effect.

    Too bad that 1973 article by your favorite author is paywalled.

    2) I agree completely that the field of statistics would have been far better had the decision-theoretic convention been trinary (accept/indeterminate/reject) instead of the binary tradition that has gripped teaching and practice - a tradition I identify as a form of dichotomania (a term which can be traced to the 1940s). Nonetheless, in the vast majority of real applications I have encountered, a continuum of options ranging from complete compatibility with the data (p=1) through complete contradiction of the data (p=0) looked closer to what would help the consumers most.

    3) Your lottery example raises a few points I've seen in cognitive psych studies of why people buy lottery tickets, among them: 
    Holding the tickets along with subsequent news about winners has entertainment value which is not out of line with the cost of other entertainment purchases;
    the choice of ticket purchases is a nonnegative integer choice rather than a binary choice; and
    in a world where payouts can now reach up to two billion dollars, the expected loss from a purchase is not as great as the ticket cost makes it appear.

    Viewing the last point from an individual perspective, your story might have made a good Seinfeld episode in which the statistician customer offered your remark and the ticket the mechanic then bought anyway won him a billion dollars.

    All the Best,
    Sander


    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 34.  RE: hypothesis formulation

    Posted 08-03-2023 11:27

     

    Dear Howard:

     

    I am not sure about your #3 point.

     

    You say,

     

    So instead we switch to a trinary set of hypotheses H1: Mean 1 > mean 2, H2: mean 1 < mean 2, H3: we don't  have enough data yet to tell. I have long felt that the theoretical flexibility represented by this sort of thinking brings the formal world of hypothesis testing closer to the real world we live in. 

     

    To start off, H3 is not a hypothesis. Hypotheses refer to the state of the universe (true parameter value), not the type of conclusion we draw based on (limited) data. Also, for completeness the = has to go somewhere, although for continuous distributions, it doesn't make any difference whether we put it in H1 or H2.

     

    Perhaps you are thinking of a decision-rule trichotomy, ie, you mean to test hypotheses

       H1: Mean1 >= Mean2   vs

       H2: Mean1 < Mean2

    but, instead of the dichotomous significance yes/no, we should adopt a decision trichotomy (decide H1, undetermined, decide H2).

     

    Even so, you have now just displaced the problem from the significant/non-significant boundary to the 2 boundaries in the trichotomy, ie, decide H1/undetermined and decide H2/undetermined. It's the same problem, but now at 2 places!

     

    IMO, the problem is inherent to the decision analytic perspective. If you want to come up with some sort of decision, you'll always have to draw a line somewhere to distinguish between different types of decision, and then that line invites vigorous debate (on relevance, prior beliefs, errors, costs, etc.). On the other hand, estimation gives each individual a "best" guess and the degree of uncertainty associated with it, but stops there, letting each person take the next step of making a decision or forming a belief individually. Many people typically don't like that because

    (1) they don't have enough skills to take that next step, and

    (2) psychologically, they prefer to be given a hard and fast black/white rule that they can follow.

    Hence the widespread preference for decision rules (eg, statistical significance) vs. pure estimation, IMO.

     

    Finally, I also think your statistician would be rather foolish to make the statement

     

    My mechanic was telling me that he had to leave work early to buy a lottery ticket. I told him that his chances of winning were the same whether he bought a ticket or not.

     

    Obviously, the chance is exactly 0 if you don't buy a ticket and some positive non-zero value if you do. If we want to make a statement about the state of the world, the statistician's statement is patently nonsense.

     

    Furthermore, even if you mean that, FOR YOU, that non-zero chance is close enough to 0, so that you feel it's the "same", the statement seems to assume that your implicit and unstated "equivalence boundary" and your judgment about its relative value in your life are universal and the same as the mechanic's (cost function of the errors). Why would it be so? If you replaced "same" with "not meaningfully different" or even better with "the very small chance of winning is not worth buying a ticket" in your statement, it would be much clearer why the mechanic might (logically and justifiably) disagree.

     

    Best regards,

    Constantine

     

     

    ______________________________________________________________

            

    Constantine Daskalakis, ScD

    he/him/his

    Professor

    Div. of Biostatistics

    Dept. of Pharmacology, Physiology, and Cancer Biology

    Thomas Jefferson University

    Edison Bldg #1749, 130 S 9th St, Philadelphia, PA 19107

    (215) 955-5695

     






  • 35.  RE: hypothesis formulation

    Posted 08-03-2023 13:07

    It is a rare joke that can survive clinical dissection.

    Most people found:

    "his chances of winning were the same whether he bought a ticket or not"

    very funny. I'm sorry you didn't get the joke.

     

    Obviously, I need to learn to write more clearly – my point about trinary hypothesis testing – which, judging from your response, wasn't clearly made – is that we would be rewarded by departing from the too dogmatic adherence to a set of  formal rules established a century ago. Mosteller's (or was it Tukey's?) suggestion that I relayed is but one example. I'm sorry that you missed the point – I'm sure the blame is mine.

     

    Howard Wainer

     






  • 36.  RE: hypothesis formulation

    Posted 08-03-2023 15:22

    Thanks Howard -

    Were you replying to me or to Constantine, or maybe to both of us? I wasn't sure.

    If to me or to both of us:
    I thought I did get the joke, but maybe I didn't...
    My apologies if the humor in my response may have been too dry;
    I can only hope it worked at least for Jerry Seinfeld and Larry David fans.

    I thought I was agreeing with you about trinary testing. I was merely adding that I thought it even better to allow for more potential decisions, for example when one has to choose among treatment doses.

    I certainly agree that we would be rewarded by departing from dogmatic adherence to a set of formal rules established a century ago; I think that's a notion behind what I've written in the earlier posts here and in the citations I've given.

    Finally, I hope we also agree that the ongoing debate would benefit from more of the very practical wisdom of Mosteller, Tukey and the like.

    Best,
    Sander



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 37.  RE: hypothesis formulation

    Posted 08-03-2023 15:49

    Hi Sander,

    I was replying to Constantine (I don't believe I have ever met him).

    I don't think you and I disagree at all on any of this.

     

    Two additional things:
    1. I just remembered Tukey's whole remark about 1 tailed tests (I underline the part I left out previously)

    "exclaiming 'don't ever invent a test, because if you do someone will surely ask for the one-tailed values. If there was such a thing as a half-tailed test they would ask for those values too'."  (I hope no one now starts discussing how a half-tailed test might work – Tukey, in his own way, was making a joke.

     

    2. The focus on rigid adherence to certain statistical testing dogma in the face of the enormous variation in the quality of data gathering reminds me of what we see each week in the NFL. A play is run and the referees unpile a large number of very big men and then plunk down the ball in the place they believe represents its forward progress. Then they haul out a 10-yard-long chain and measure to the nearest millimeter to see if enough yardage has been gained to yield a first down. We statisticians represent the chain and the referees the subject matter scientists. Being overly precise on our end doesn't make a dent in the precision of the entire enterprise. We would be better off trying to adapt our methods to suit the situation and thus provide more light on the problem -- maybe adapting the methods used so successfully in tennis to judge whether a ball is in or out has an analog in football? The idea is to look at the whole picture – not just our little Fisherian tale (tail?).

     

    Recently a correspondent asked me if, when I was a grad student, I adopted Tukey as a career model. I told him no – not because I wouldn't have loved to be just like him, but because that was impossible. It is akin to having Mozart as your piano teacher, or Einstein as your middle school science teacher (he did do that briefly). Tukey's mind was in the orthogonal complement of mine – what he did was often indistinguishable from magic. But one thing we all learned early on was to take whatever he said very seriously indeed (even if it didn't seem to make sense to you initially). You would eventually learn that Tukey was trying to move you in the right direction. Mosteller was possessed of a different sort of genius – one closer to the altitude at which most of us lived – infused with kindness and enormous practical wisdom.

     

    H

     

     

     

     






  • 38.  RE: hypothesis formulation

    Posted 08-04-2023 14:52

    Thanks Howard for the reality check! ...

    Regarding your comment about "the focus on rigid adherence to certain statistical testing dogma in the face of the enormous variation in the quality of data gathering", that football example is great. It reminded me of how "statistical significance" as a publication criterion has distorted so much of the scientific literature and helped fuel the "replication crisis", as seen in Figure 1 of van Zwet & Cator 2021, https://onlinelibrary.wiley.com/doi/full/10.1111/stan.12241; yet defenses of that criterion continue, bringing to mind Daniel Kahneman's observation that

    "…illusions of validity and skill are supported by a powerful professional culture. We know that people can maintain an unshakeable faith in any proposition, however absurd, when they are sustained by a community of like-minded believers."

    Continuing on the topic of reforms to basic statistical training, I had earlier called for adding dimensions for classifying statistical procedures by goals. The well-known frequentist-Bayes spectrum might be viewed as ranging from calibration to predictive goals. Pure likelihood is sometimes placed toward the middle but it feels to me a bit forced placing it there. Adding a dimension staked out by information-summarization on one side and decision on the other enables seeing pure likelihood as falling on the summarization end alongside concepts like divergence P-value functions (compatibility distributions) and reference ("objective") Bayes, while decision theories like NP hypothesis testing and operational (betting or personalistic) Bayes fall on the other end. Of course there is a continuum across these dimensions as can be seen for example with hierarchical (multilevel) models.

    My UCLA colleague Neal Fultz pointed out a third dimension that has become prominent in recent decades and worthy of inclusion in basic education, ranging from purely descriptive goals as in surveys to causal-inference goals as in experiments. The formal distinction can be traced at least back a century to Neyman 1923 (translation in Statistical Science 1990) with its use of what we now call potential outcomes (his potential yields from a given crop variety; see p. 466-467 of the 1990 translation). His potential-outcome model began appearing in the English biometry literature by the 1930s and was a standard tool there by the time I was taking stats (e.g., in Biometrika see Welch 1937, Wilk 1955, Copas 1973). Then too, informal discussions of causation as a counterfactual concept can be found earlier in Fisher and as far back as Hume in the mid-18th century (Pearl, Causality 2009 2nd ed. has a nice history); a formal bridge across the spectrum from survey description to causal modeling was provided by Rubin's recognition (Ann Stat 1978) that counterfactual treatments can be mapped into missing potential outcomes. So I think it safe to say the inclusion of the descriptive-causal dimension has long and sound historical and mathematical footings.

    My one caution in adding the descriptive/causal dimension is that all real-world applications of probability and statistics depend on causal elements: Use of probabilities requires some sort of justification in terms of the probabilities having been deduced from information about the actual causal process (physical mechanism) generating the data. That would include physical "objective" quantum-mechanical distributions as well as rational "subjective" personal betting schedules: Both are or should be determined from the observed data-generating set up. This dependency of probabilities on mechanisms makes it all the more imperative that causal concepts and models be integrated into basic statistical training. A more detailed argument for that view can be found at https://arxiv.org/abs/2011.02677.



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 40.  RE: hypothesis formulation

    Posted 08-05-2023 11:53

    Hi Sander,

    I think I must leave this conversation – although I am enjoying it immensely – for I have work to do and limited time and energy.

    But let me add one final observation.

     

    I completely agree with your assessment of the importance of Don Rubin's adjoining of the study of missing data with the critical problem of causal inference. I think it is the most important contribution on this topic since Hume.  Don and I (mostly Don) showed how this formulation can be used in difficult circumstances (in this case when the data are censored by death) by thinking carefully:

    Causal Inference and Death, Chance, 28(2), 58-64, 2015 – attached.

     

    That said, I am not sure whether your dimensional metaphor is necessarily the only way to think about this.

    I am very fond of Steve Stigler's book on this topic (The Seven Pillars of Statistical Wisdom – see attached), and his biblical representation works very well indeed.

     

    H

     

     




    Attachment(s)



  • 41.  RE: hypothesis formulation

    Posted 08-05-2023 14:26

    Thanks Howard for the links and further comments...

    Allow me to clear up what may be a misunderstanding: You wrote

    "I am not sure whether your dimensional metaphor is necessarily the only way to think about this".

    I don't see where I or anyone suggested it was the only way to think about it. On the contrary, I welcome all reasonable perspectives, and believe that (up to some number rarely seen in statistics) the more the better. Each perspective is one of an unlimited number, and each is limited, conveying only the information available from that perspective. This notion can be traced back to ancient India yet seems routinely forgotten in human debates, including philosophical and scientific ones:

    "The parable of the blind men and an elephant is a story of a group of blind men who have never come across an elephant before and who learn and imagine what the elephant is like by touching it. Each blind man feels a different part of the elephant's body, but only one part, such as the side or the tusk. They then describe the elephant based on their limited experience and their descriptions of the elephant are different from each other. In some versions, they come to suspect that the other person is dishonest and they come to blows. The moral of the parable is that humans have a tendency to claim absolute truth based on their limited, subjective experience as they ignore other people's limited, subjective experiences which may be equally true." https://en.wikipedia.org/wiki/Blind_men_and_an_elephant

    Thus I think Stigler's view as in 7 pillars is great; my main quibble is I would have placed design (his #6) first and foremost, assuming it includes design of surveys and of nonexperimental studies of causation as well as of experiments. Based on the fact that Don Rubin has written how "Design trumps Analysis" I think he might concur with that improvement. 

    Going further, I think all 7 pillars could be translated into dimensions. Nonetheless, because the pillars are more often points on a dimension, we'd have to add elements, for example to pillar 2 to capture the dimension of information-summarization vs. decision; to pillar 3 to capture the dimension of frequentist vs. Bayes; and to pillars 4-6 to capture the dimension of passive prediction (pure regression) vs. causation (predicting outcomes after mutually exclusive interventions or decisions).

    I'll forego details as the point is only that, far too often (as illustrated by endless frequentist vs. Bayesian controversies), alternative viewpoints are treated as if they were competitors when more often they are complementary reality checks that can be used in tandem and even merged together profitably.

    As I hope that makes clear, I very much agree that we should view statistics as a living science, as you mention in your review of Stigler. That means it should not be cemented to approaches that have caused harms, and it should seek to upgrade or replace those approaches to reduce harms and improve benefits. We expect as much of medical training and practice; we should hold statistics to the same commitment to continuing progress and reform rather than to immutable tradition and doctrinal authority.

    All the Best,
    Sander



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 42.  RE: hypothesis formulation

    Posted 08-05-2023 17:14

    I agree – there are many paths to salvation.

    The exploration of alternative, viable, ways of thinking about things is what makes this sort of conversation so habit-forming.

    But I have a book to write and miles to go before I sleep.

    Thanks for allowing me to join in.

    H

     






  • 43.  RE: hypothesis formulation

    Posted 08-07-2023 15:07

    Thanks Howard for your input - much appreciated! And good luck with your book, which I shall be interested to hear more about.

    For those interested, due to some technical issues the debate between me and Eugene Komaroff has been continued in another thread, "Cut Points," at https://community.amstat.org/discussion/cut-point



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 44.  RE: hypothesis formulation

    Posted 07-21-2023 08:12

    I'm sure that the Bayesian clinical trialists reading this (e.g., perhaps, Jason Connor, Don and Scott Berry, Ben Saville) are smirking (yet again!) about how frequentism leads to such knotty, nuanced, and contentious discussions, discussions that even the most rigorous subject-matter scientists would see as hair-splitting.

    And please don't plunge into the blue-green algae of the confusing TOST strategy for equivalence testing. Computing the ordinary 100*(1 - 2*alpha)% CI (e.g. 90%) is simpler and more informative, and only a simple probability argument is needed to show that the maximum Type I error rate is alpha (e.g. 0.05).



    ------------------------------
    Ralph O'Brien
    Professor of Biostatistics (officially retired; still keenly active)
    Case Western Reserve University
    http://rfuncs.weebly.com/about-ralph-obrien.html
    ------------------------------



  • 45.  RE: hypothesis formulation

    Posted 07-21-2023 08:48

    Hello Ralph.  We once chatted briefly at a coffee dispenser at an ASA conference hotel about the peculiar practice of young PhD (or candidate) statisticians who present their work by reading every letter and symbol on their slide. 

    I have no quarrel with Bayesians.  J.G. Ibrahim gave a compelling presentation at the October 2017 American Statistical Association (ASA) Symposium on Statistical Inference about the valuable utility of priors when the data are very sparse, as in clinical trials with end-stage cancer patients. As a result, I now lean towards the pragmatist world view of the Mixed Methodologists who are trying to integrate QUAL and QUAN methods.  It is perhaps impossible to merge the Frequentist and Bayesian paradigms theoretically, but driving a screw with a hammer will not do the job.



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------



  • 46.  RE: hypothesis formulation

    Posted 07-21-2023 15:12

    Ralph: You wrote that relative to TOST, "computing the ordinary 100*(1 - 2*alpha)% CI (e.g. 90%) is simpler and more informative, and only a simple probability argument is needed to show that the maximum Type I error rate is alpha (e.g. 0.05)."
    -I find that comment puzzling because, according to literature I have seen, the two procedures are mathematically the same for NP α-level testing of a nonequivalence hypothesis. Also, I cannot see what is so complicated about presenting the TOST procedure simply as its name suggests: Do two one-sided tests, one for each boundary of the equivalence interval [-r,r] (where r is the radius of equivalence).
    Can you provide more detailed reasoning for your remark? 

    In doing so, please note I would agree that a CI is much more informative than a single test... But there is a teaching problem here in that the TOST procedure gives a level-α test while the CI is at the 1-2α level. Thus I think the CI version of the equivalence test could confuse students about the level of the test, as the CI version checks whether the 1-2α interval lies within the equivalence interval [-r,r]. TOST avoids this apparent conflict of α-levels within the test.

    Of course, even if one stays with the TOST procedure, confusion might be unavoidable once students see that the 1-α interval may extend beyond [-r,r] even if the α-level TOST procedure rejects nonequivalence. Explaining that will require careful coverage of the distinction between two-sided and one-sided intervals and tests, since the CI for the test inverts two-sided tests but the TOST procedure is based on one-sided tests whose P-values are half the two-sided P-values that determine whether -r and r are in the CI. 
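
    Since the identity may be easier to see in numbers, here is a minimal R sketch in a simple one-sample setting; the summary numbers (b, s, r, alpha) are hypothetical and a normal approximation is assumed (a real t-based analysis would use the t distribution instead):

        # Minimal R sketch (hypothetical numbers, normal approximation):
        # TOST for the equivalence interval [-r, r] vs. the 1-2*alpha CI rule.
        b <- 0.10; s <- 0.08; r <- 0.30; alpha <- 0.05
        p_upper <- pnorm((b - (-r))/s, lower.tail = FALSE)  # one-sided test of H: delta <= -r
        p_lower <- pnorm((b - r)/s)                         # one-sided test of H: delta >= r
        tost_p  <- max(p_upper, p_lower)                    # TOST P-value
        tost_p <= alpha                                     # TOST declares equivalence?
        ci90 <- b + c(-1, 1)*qnorm(1 - alpha)*s             # the 1 - 2*alpha (90%) CI
        (ci90[1] > -r) & (ci90[2] < r)                      # same decision from the CI
        # The one-sided P at the boundary r is half the two-sided P there:
        p2_at_r <- 2*min(pnorm((r - b)/s), 1 - pnorm((r - b)/s))
        c(p_lower, p2_at_r/2)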

    I might agree that the whole problem could be seen as showing "how frequentism leads to such knotty, nuanced, and contentious discussions." I would respond however that Bayesian statistics has its own knotty, nuanced, and contentious problems as seen in the realms of choice of priors for reference ("objective") Bayes, for Bayesian significance tests, and for calibration of Bayes procedures. And then there's the knotty, nuanced aspects of reference vs. personalistic ("subjective") Bayes debates, where incredibly contentious issues of prior specification dominate and even the very axioms of probability get called into question (e.g., countable additivity). If the amount of knottiness, nuance and contention seems any less in the Bayesian sphere, I suspect it is only because Bayesians were for a long time (and may still be) a far smaller portion of academic statisticians than were frequentists (during the last half of the 20th century it was hard to find a U.S. stat division that had even one Bayesian!).

    For the record I adhere to the toolkit view of statistics in which the contextual questions at hand should determine the methods used, different questions in the same study may require different methods, and often it is helpful to use different methods to answer the same question. In a real interval-testing problem, in addition to a P-value for the interval, I would certainly be interested in posterior probabilities for the hypothesized interval if contextually reasonable priors could be formulated within the allotted resources.

    Best,



    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 47.  RE: hypothesis formulation

    Posted 07-24-2023 12:33

    I am preparing a solid response to Sander's request that I "provide more detailed reasoning" to my quip that relative to TOST, "computing the ordinary 100*(1 - 2*alpha)% CI (e.g. 90%) is simpler and more informative, and only a simple probability argument is needed to show that the maximum Type I error rate is alpha (e.g. 0.05)."

    However, because this is an outgrowth of this thread ("hypothesis formulation"), I'll post it separately, with a subject line of "A primer on equivalence testing, delivered via an R function designed for self-learning and teaching." Or something shorter.

    Please stay tuned.



    ------------------------------
    Ralph O'Brien
    Professor of Biostatistics (officially retired; still keenly active)
    Case Western Reserve University
    http://rfuncs.weebly.com/about-ralph-obrien.html
    ------------------------------



  • 48.  RE: hypothesis formulation

    Posted 07-24-2023 13:33

    Dr. O'Brien.  Before you start with R, take a look at Daniel Lakens' web site, where he said in 2017: "I've created my first R package, TOSTER (as in Two One-Sided Tests for Equivalence in R). Don't worry, there is also an old-fashioned spreadsheet available as well (see "TOSTER Materials," below)."  http://www.psychologicalscience.org/observer/equivalence-testing-with-toster

    I submitted an abstract to the ASA Conference on Statistical Practice 2022 to demonstrate/explain how it can be done with SPSS syntax.  It was rejected because they received many papers and can invite only a very small fraction for presentation. 



    ------------------------------
    Eugene Komaroff
    Professor of Education
    Keiser University Graduate School
    ------------------------------