Thanks Howard for your input - much appreciated! And good luck with your book, which I shall be interested to hear more about.
For those interested: due to some technical issues, the debate between me and Eugene Komaroff has been continued on another thread, "Cut Points", at https://community.amstat.org/discussion/cut-point
Original Message:
Sent: 08-05-2023 17:14
From: Howard Wainer
Subject: hypothesis formulation
I agree – there are many paths to salvation.
The exploration of alternative, viable, ways of thinking about things is what makes this sort of conversation so habit-forming.
But I have a book to write and miles to go before I sleep.
Thanks for allowing me to join in.
H
Original Message:
Sent: 8/5/2023 2:26:00 PM
From: Sander Greenland
Subject: RE: hypothesis formulation
Thanks Howard for the links and further comments...
Allow me to clear up what may be a misunderstanding: You wrote
"I am not sure whether your dimensional metaphor is necessarily the only way to think about this".
I don't see where I or anyone suggested it was the only way to think about it. On the contrary, I welcome all reasonable perspectives, and believe that (up to some number rarely seen in statistics) the more the better. Each perspective is one of an unlimited number, and each is limited, conveying only the information available from that perspective. This notion can be traced back to ancient India yet seems routinely forgotten in human debates, including philosophical and scientific ones:
"The parable of the blind men and an elephant is a story of a group of blind men who have never come across an elephant before and who learn and imagine what the elephant is like by touching it. Each blind man feels a different part of the elephant's body, but only one part, such as the side or the tusk. They then describe the elephant based on their limited experience and their descriptions of the elephant are different from each other. In some versions, they come to suspect that the other person is dishonest and they come to blows. The moral of the parable is that humans have a tendency to claim absolute truth based on their limited, subjective experience as they ignore other people's limited, subjective experiences which may be equally true."
https://en.wikipedia.org/wiki/Blind_men_and_an_elephant
Thus I think Stigler's view as in 7 Pillars is great; my main quibble is that I would have placed design (his #6) first and foremost, assuming it includes design of surveys and of nonexperimental studies of causation as well as of experiments. Given that Don Rubin has written that "Design trumps Analysis", I think he might concur with that improvement.
Going further, I think all 7 pillars could be translated into dimensions. Nonetheless, because the pillars are more often points on a dimension, we'd have to add elements, for example to pillar 2 to capture the dimension of information-summarization vs. decision; to pillar 3 to capture the dimension of frequentist vs. Bayes; and to pillars 4-6 to capture the dimension of passive prediction (pure regression) vs. causation (predicting outcomes after mutually exclusive interventions or decisions).
I'll forgo details, as the point is only that, far too often (as illustrated by endless frequentist vs. Bayesian controversies), alternative viewpoints are treated as competitors when more often they are complementary reality checks that can be used in tandem and even merged together profitably.
As I hope that makes clear, I very much agree that we should view statistics as a living science, as you mention in your review of Stigler. That means it should not be cemented to approaches that have caused harms, and it should seek to upgrade or replace those approaches to reduce harms and improve benefits. We expect as much of medical training and practice; we should hold statistics to the same commitment to continuing progress and reform rather than to immutable tradition and doctrinal authority.
All the Best,
Sander
------------------------------
Sander Greenland
Department of Epidemiology and Department of Statistics
University of California, Los Angeles
Original Message:
Sent: 08-05-2023 11:53
From: Howard Wainer
Subject: hypothesis formulation
Hi Sander,
I think I must leave this conversation – although I am enjoying it immensely – for I have work to do and limited time and energy.
But let me add one final observation.
I completely agree with your assessment of the importance of Don Rubin's adjoining of the study of missing data with the critical problem of causal inference. I think it is the most important contribution on this topic since Hume. Don and I (mostly Don) showed how this formulation can be used in difficult circumstances (in this case when the data are censored by death) by thinking carefully:
Causal Inference and Death, Chance, 28(2), 58-64, 2015 – attached.
That said, I am not sure whether your dimensional metaphor is necessarily the only way to think about this.
I am very fond of Steve Stigler's book on this topic (The Seven Pillars of Statistical Wisdom – see attached), and his biblical representation works very well indeed.
H
Original Message:
Sent: 8/4/2023 2:55:00 PM
From: Sander Greenland
Subject: RE: hypothesis formulation
Thanks Howard for the reality check! ...
Regarding your comment about "the focus on rigid adherence to certain statistical testing dogma in the face of the enormous variation in the quality of data gathering", that football example is great. It reminded me of how "statistical significance" as a publication criterion has distorted so much of the scientific literature and helped fuel the "replication crisis", as seen in Figure 1 of van Zwet & Cator 2021, https://onlinelibrary.wiley.com/doi/full/10.1111/stan.12241; yet defenses of that criterion continue, bringing to mind Daniel Kahneman's observation that
"…illusions of validity and skill are supported by a powerful professional culture. We know that people can maintain an unshakeable faith in any proposition, however absurd, when they are sustained by a community of like-minded believers."
Continuing on the topic of reforms to basic statistical training, I had earlier called for adding dimensions for classifying statistical procedures by goals. The well-known frequentist-Bayes spectrum might be viewed as ranging from calibration to predictive goals. Pure likelihood is sometimes placed toward the middle, but placing it there feels a bit forced to me. Adding a dimension staked out by information-summarization on one side and decision on the other lets pure likelihood fall on the summarization end alongside concepts like divergence P-value functions (compatibility distributions) and reference ("objective") Bayes, while decision theories like NP hypothesis testing and operational (betting or personalistic) Bayes fall on the other end. Of course there is a continuum across these dimensions, as can be seen for example with hierarchical (multilevel) models.
My UCLA colleague Neal Fultz pointed out a third dimension that has become prominent in recent decades and is worthy of inclusion in basic education, ranging from purely descriptive goals as in surveys to causal-inference goals as in experiments. The formal distinction can be traced back at least a century to Neyman 1923 (translation in Statistical Science 1990), with its use of what we now call potential outcomes (his potential yields from a given crop variety; see pp. 466-467 of the 1990 translation). His potential-outcome model began appearing in the English biometry literature by the 1930s and was a standard tool there by the time I was taking stats (e.g., in Biometrika see Welch 1937, Wilk 1955, Copas 1973). Then too, informal discussions of causation as a counterfactual concept can be found earlier in Fisher and as far back as Hume in the mid-18th century (Pearl, Causality 2009 2nd ed. has a nice history); a formal bridge across the spectrum from survey description to causal modeling was provided by Rubin's recognition (Ann Stat 1978) that counterfactual treatments can be mapped into missing potential outcomes. So I think it safe to say the inclusion of the descriptive-causal dimension has long and sound historical and mathematical footings.
My one caution in adding the descriptive/causal dimension is that all real-world applications of probability and statistics depend on causal elements: use of probabilities requires some sort of justification in terms of the probabilities having been deduced from information about the actual causal process (physical mechanism) generating the data. That would include physical "objective" quantum-mechanical distributions as well as rational "subjective" personal betting schedules: both are, or should be, determined from the observed data-generating setup. This dependency of probabilities on mechanisms makes it all the more imperative that causal concepts and models be integrated into basic statistical training. A more detailed argument for that view can be found at https://arxiv.org/abs/2011.02677.
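To see Rubin's mapping in miniature, here is a toy sketch (my own illustration, not from any of the cited papers; the numbers are made up): each unit carries two potential outcomes, treatment assignment reveals exactly one, and the counterfactual outcome is literally a missing value in Rubin's sense.

import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Potential outcomes for each unit: y0 if untreated, y1 if treated.
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 2.0                      # true causal effect of 2 for every unit

# Treatment assignment reveals one potential outcome; the other is missing.
z = rng.integers(0, 2, n)          # randomized treatment indicator
y_obs = np.where(z == 1, y1, y0)   # observed outcome; the counterfactual is unseen

# Under randomization the difference in observed group means recovers the
# average causal effect despite half the potential outcomes being missing.
print(y_obs[z == 1].mean() - y_obs[z == 0].mean())  # close to 2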
------------------------------
Sander Greenland
Department of Epidemiology and Department of Statistics
University of California, Los Angeles
Original Message:
Sent: 08-03-2023 15:48
From: Howard Wainer
Subject: hypothesis formulation
Hi Sander,
I was replying to Constantine (I don't believe I have ever met him).
I don't think you and I disagree at all on any of this.
Two additional things:
1. I just remembered Tukey's whole remark about one-tailed tests (the second sentence of the quote is the part I left out previously):
"exclaiming 'don't ever invent a test, because if you do someone will surely ask for the one-tailed values. If there was such a thing as a half-tailed test they would ask for those values too'." (I hope no one now starts discussing how a half-tailed test might work – Tukey, in his own way, was making a joke.
2. The focus on rigid adherence to certain statistical testing dogma in the face of the enormous variation in the quality of data gathering reminds me of what we see each week in the NFL. A play is run and the referees unpile a large number of very big men and then plunk down the ball in the place they believe represents its forward progress. Then they haul out a 10-yard-long chain and measure to the nearest millimeter to see if enough yardage has been gained to yield a first down. We statisticians represent the chain and the referees the subject matter scientists. Being overly precise on our end doesn't make a dent in the precision of the entire enterprise. We would be better off trying to adapt our methods to suit the situation and thus provide more light on the problem -- maybe adapting the methods used so successfully in tennis to judge whether a ball is in or out has an analog in football? The idea is to look at the whole picture – not just our little Fisherian tale (tail?).
Recently a correspondent asked me whether, when I was a grad student, I adopted Tukey as a career model. I told him no – not because I wouldn't have loved to be just like him, but because that was impossible. It is akin to having Mozart as your piano teacher, or Einstein as your middle school science teacher (he did do that briefly). Tukey's mind was in the orthogonal complement of mine – what he did was often indistinguishable from magic. But one thing we all learned early on was to take whatever he said very seriously indeed (even if it didn't seem to make sense to you initially). You would eventually learn that Tukey was trying to move you in the right direction. Mosteller was possessed of a different sort of genius – one closer to the altitude at which most of us lived – infused with kindness and enormous practical wisdom.
H
Original Message:
Sent: 8/3/2023 3:22:00 PM
From: Sander Greenland
Subject: RE: hypothesis formulation
Thanks Howard -
Were you replying to me or to Constantine, or maybe to both of us? I wasn't sure.
If to me or to both of us:
I thought I did get the joke, but maybe I didn't...
My apologies if the humor in my response may have been too dry;
I can only hope it worked at least for Jerry Seinfeld and Larry David fans.
I thought I was agreeing with you about trinary testing. I was merely adding that I thought it even better to allow for even more possible potential decisions, for example as when one has to choose among treatment doses.
I certainly agree that we would be rewarded by departing from dogmatic adherence to a set of formal rules established a century ago; I think that's a notion behind what I've written in the earlier posts here and in the citations I've given.
Finally, I hope we also agree that the ongoing debate would benefit from more of the very practical wisdom of Mosteller, Tukey and the like.
Best,
Sander
------------------------------
Sander Greenland
Department of Epidemiology and Department of Statistics
University of California, Los Angeles
Original Message:
Sent: 08-03-2023 13:06
From: Howard Wainer
Subject: hypothesis formulation
It is a rare joke that can survive clinical dissection.
Most people found:
"his chances of winning were the same whether he bought a ticket or not"
very funny. I'm sorry you didn't get the joke.
Obviously, I need to learn to write more clearly – my point about trinary hypothesis testing – which, judging from your response, wasn't clearly made – is that we would be rewarded by departing from the too dogmatic adherence to a set of formal rules established a century ago. Mosteller's (or was it Tukey's?) suggestion that I relayed is but one example. I'm sorry that you missed the point – I'm sure the blame is mine.
Howard Wainer
Original Message:
Sent: 8/3/2023 11:27:00 AM
From: Constantine Daskalakis
Subject: RE: hypothesis formulation
Dear Howard:
I am not sure about your #3 point.
You say,
So instead we switch to a trinary set of hypotheses H1: Mean 1 > mean 2, H2: mean 1 < mean 2, H3: we don't have enough data yet to tell. I have long felt that the theoretical flexibility represented by this sort of thinking brings the formal world of hypothesis testing closer to the real world we live in.
To start off, H3 is not a hypothesis. Hypotheses refer to the state of the universe (true parameter value), not the type of conclusion we draw based on (limited) data. Also, for completeness the = has to go somewhere, although for continuous distributions, it doesn't make any difference whether we put it in H1 or H2.
Perhaps you are thinking of a decision rule tritomy, i.e., you mean to test hypotheses
H1: Mean1 >= Mean2 vs
H2: Mean1 < Mean2
but, instead of the dichotomous significance yes/no, we should adopt a decision tritomy (decide H1, undetermined, decide H2).
Even so, you have now just displaced the problem from the significant/non-significant boundary to the two boundaries in the tritomy, i.e., decide-H1/undetermined and decide-H2/undetermined. It's the same problem, but now at two places!
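To make the tritomy concrete, here is a minimal sketch of such a rule (my own toy version, assuming a Welch t interval and scipy >= 1.10 for the confidence_interval method):

import numpy as np
from scipy import stats

def trinary_decision(x1, x2, alpha=0.05):
    """Three-way call on mean1 - mean2 from a two-sided Welch t interval."""
    res = stats.ttest_ind(x1, x2, equal_var=False)
    lo, hi = res.confidence_interval(confidence_level=1 - alpha)
    if lo > 0:
        return "decide mean1 > mean2"
    if hi < 0:
        return "decide mean1 < mean2"
    return "undetermined: not enough data yet to tell"

rng = np.random.default_rng(1)
print(trinary_decision(rng.normal(0.3, 1, 50), rng.normal(0.0, 1, 50)))

The two hard boundaries are now the interval endpoints, which is exactly the displaced problem just described.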
IMO, the problem is inherent to the decision analytic perspective. If you want to come up with some sort of decision, you'll always have to draw a line somewhere to distinguish between different types of decision, and then that line invites vigorous debate (on relevance, prior beliefs, errors, costs, etc.). On the other hand, estimation gives each individual a "best" guess and the degree of uncertainty associated with it, but stops there, letting each person take the next step of making a decision or forming a belief individually. Many people typically don't like that because
(1) they don't have enough skills to take that next step, and
(2) psychologically, they prefer to be given a hard and fast black/white rule that they can follow.
Hence the widespread preference for decision rules (e.g., statistical significance) vs. pure estimation, IMO.
Finally, I also think your statistician would be rather foolish to make the statement
My mechanic was telling me that he had to leave work early to buy a lottery ticket. I told him that his chances of winning were the same whether he bought a ticket or not.
Obviously, the chance is exactly 0 if you don't buy a ticket and some positive non-zero value if you do. If we want to make a statement about the state of the world, the statistician's statement is patent nonsense.
Furthermore, even if you mean that, FOR YOU, that non-zero chance is close enough to 0 that you feel it's the "same", the statement seems to assume that your implicit and unstated "equivalence boundary" and your judgment about its relative value in your life are universal and the same as the mechanic's (the cost function of the errors). Why would it be so? If you replaced "same" with "not meaningfully different", or even better with "the very small chance of winning is not worth buying a ticket", the statement would make it much clearer why the mechanic might (logically and justifiably) disagree.
Best regards,
Constantine
______________________________________________________________
Constantine Daskalakis, ScD
he/him/his
Professor
Div. of Biostatistics
Dept. of Pharmacology, Physiology, and Cancer Biology
Thomas Jefferson University
Edison Bldg #1749, 130 S 9th St, Philadelphia, PA 19107
(215) 955-5695
Original Message:
Sent: 8/2/2023 3:24:00 PM
From: Howard Wainer
Subject: RE: hypothesis formulation
I have enjoyed this discussion and have, up to now, been delighted to stand on the sidelines and learn. But let me add two small things to the discussion that may be of interest:
1) one-tailed vs. two-tailed tests - as a graduate student I remember John Tukey once exclaiming "don't ever invent a test, because if you do someone will surely ask for the one-tailed values." He was then asked "do you mean you should never do a one-tailed test?" "No," he replied, "it depends on who you're talking to -- some people will believe anything."
What was he getting at? The key idea is that if you are willing to reject one hypothesis because it is very unlikely given the data you observed (forgive this Bayesian view -- a more frequentist statement might be because the data observed are unlikely given that hypothesis) you should also reject a similarly unlikely event at the other extreme. Let me offer one example: a chi-square is ordinarily thought of as a naturally one-tailed test, but there is the other tail (a very short one, for sure) that might correspond to the data fitting too well -- better than you would expect. So, for example, had a two-tailed test been done of Cyril Burt's twin data we might have uncovered his fabrications much sooner.
More on this is in a 50 year old paper by my favorite author:
The other tail. The British Journal of Mathematical and Statistical Psychology, 26, 182-187, 1973.
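For anyone who wants to try that other tail, a small sketch (my own toy illustration, assuming a goodness-of-fit chi-square with 10 degrees of freedom):

from scipy import stats

df = 10
x2 = 2.0  # a suspiciously small goodness-of-fit statistic for 10 df

p_upper = stats.chi2.sf(x2, df)   # the usual one-tailed P-value (poor fit)
p_lower = stats.chi2.cdf(x2, df)  # the other, short tail: fitting too well
print(p_upper, p_lower)           # p_lower ~ 0.004: a surprisingly good fit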
2) If hypothesis testing is not held too rigidly to traditional binary structures, some interesting alternatives emerge. One example comes to mind (its origin was, I think, Fred Mosteller, but it could've been Tukey too). Consider a binary set of hypotheses on population means -- say H0: mean 1 = mean 2 vs. H1: mean 1 unequal to mean 2. We all know that the likelihood of two means being exactly equal is usually vanishingly small, and if we just had a big enough sample we could show it. So why bother doing the experiment, since we know that with a better (big enough) experiment we could reject H0? So instead we switch to a trinary set of hypotheses H1: mean 1 > mean 2, H2: mean 1 < mean 2, H3: we don't have enough data yet to tell. I have long felt that the theoretical flexibility represented by this sort of thinking brings the formal world of hypothesis testing closer to the real world we live in.
My mechanic was telling me that he had to leave work early to buy a lottery ticket. I told him that his chances of winning were the same whether he bought a ticket or not. This is one example of trinary hypothesis testing.
------------------------------
Howard Wainer
Extinguished Research Scientist
Original Message:
Sent: 08-01-2023 14:10
From: Eugene Komaroff
Subject: hypothesis formulation
Professor Greenland: Our discussion started with my objection to an inequality sign in a null hypothesis statement. You offered the equivalence test as an example of an interval null hypothesis and sent me to Hodges and Lehmann (1954) for an explanation. I liked the first sentence of their Summary (abstract) below, but stopped reading after its last sentence.
"The distinction between statistical significance and material significance in hypotheses testing is discussed. Modifications of the customary tests, in order to test for the absence of material significance, are derived for several parametric problems, for the chi-square test of goodness of fit, and for Student's hypothesis. The latter permits one to test the hypothesis that the means of two normal populations of equal variance, do not differ by more than a stated amount"( Hodges & Lehmann, 1954, p. 165).
The first sentence resonates to the present day. The conflation of statistical significance with substantive significance needs to stop immediately. These concepts are related but not identical. At the end, the mention of Student's hypothesis and the words "do not differ by more than a stated amount" are familiar. Student's hypothesis most likely refers to his innovative small-sample standard error that replaced the population sigma in the large-sample z-test. The "stated amount" is called the "margin of equivalence" today.
BTW, researchers struggle to postulate a reasonable margin of equivalence for a sample size calculation. They have the same difficulty coming up with a reasonable alternative parameter. Their response: if I knew that, I would not be working on this grant proposal.
To debate whether one should use a p-value or a confidence interval to test a point null hypothesis is a waste of precious mental energy. Fisher and Neyman were both right! I prefer p < α for statistical significance because it is easier than making sure that the point null parameter is not included in a 1-α confidence interval.
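To see that the two always agree, a quick sketch (my own illustration, assuming a one-sample z-test with known sigma):

import numpy as np
from scipy import stats

def z_test_and_ci(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of mu = mu0 and the matching 1 - alpha CI."""
    se = sigma / np.sqrt(n)
    p = 2 * stats.norm.sf(abs((xbar - mu0) / se))
    crit = stats.norm.ppf(1 - alpha / 2)
    lo, hi = xbar - crit * se, xbar + crit * se
    # p < alpha exactly when mu0 falls outside the 1 - alpha interval:
    return p < alpha, not (lo <= mu0 <= hi)

print(z_test_and_ci(xbar=0.4, mu0=0.0, sigma=1.0, n=25))  # (True, True)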
It appears your dislike of point null hypotheses stems from the well-documented (by you and others) blatant abuse/misuse and/or naïve misunderstanding of the concept of statistical significance. However, "statistically significant - don't say it and don't use it" is not the solution - proper education is the cure. On the other hand, a ban on the ridiculous misinterpretation of statistical significance as substantive significance is urgently needed. This flawed conflation has been forcefully magnified by research articles in scholarly, peer-reviewed journals.
------------------------------
Eugene Komaroff
Professor of Education
Keiser University Graduate School
Original Message:
Sent: 07-28-2023 20:15
From: Sander Greenland
Subject: hypothesis formulation
Dear Eugene,
Please forgive my tardy reply - I have had to attend to other matters over the past few days.
Also, with regrets I may have to delay response to your other (reposted) list until the weekend...
I am of course deeply flattered by and thank you heartily for your too-kind remarks. I confess I was most surprised given the earlier parts of our exchange, so I was at a loss as to how to respond. As for being a Goliath, the proper term might instead be dinosaur.
I should say (and perhaps should have said sooner) that I have seen your work in the past and thought it was eminently sensible (which is the highest compliment I know of for a scientist or engineer, including statisticians among those). Furthermore you seem to be operating from views not far from mine. So I was taken aback at the contentiousness and the confusion of my points with more radical views and proposals, especially as I have been a staunch defender of P-values and neoFisherian (informationalist) ideas against attacks from all sides (NP, likelihoodist, Bayesian).
Also, I have had trouble understanding some of your statements - it seems as if we speak different dialects, leading to misunderstandings when words are the same but their meanings are shifted (as in "false French friends" or other cognate confusions, illustrating the importance of semantics in discussing statistics):
I am asking you to simply help me understand your pushback to my statement: An inequality in the null hypothesis is conceptually understandable as a one-tailed test, but mathematically is impossible. This statement is true because I believe in the theory of sampling distributions. Let's completely remove the equality to minimize confusion and state H: d < 0. Please show me a computer program or tell me the statistical software that I can use to evaluate your one-tailed inequality hypothesis.
I simply could not fathom what you meant by that passage. What is mathematically impossible?
Also, I am unclear why you are dropping the boundary point of zero from H, although assuming continuity of the parameter, statistics, and distributions, I believe this only means we'd have to shift from minima to infima in some technical descriptions, so for now I can accommodate it.
With all that continuity in place, then, as I wrote before and unless I have made a mistake, the standard one-sided P-value for d=0 provides a valid (i.e., size ≤ α for all d in R(H) = {d: d<0}) test of H: d<0 via the NP test (decision rule) "reject H if p ≤ α", and its distribution dominates a uniform variate if H holds. Are you claiming that this P-value or test of H is not valid? (A small simulation sketch at the end of this post illustrates the size claim.) Under continuity that one-sided P-value is the Lehmann (NP) decision P-value for H; but it's not the divergence P-value, which is instead twice that and thus equals (but is not defined as) the usual Fisherian two-sided P-value for d=0.
The rest of this post just elaborates on points I covered earlier in this thread, offered only in case there are any residual misunderstandings about my goal in answering you with NP theory: it was simply to show that, in terms used by the most entrenched system in American statistics of the latter 20th century and used throughout journals and policy, it is quite possible and often easy to test an H defined by inequalities (as shown by Hodges & Lehmann, AMS 1954).
My use of NP tests to respond should not, however, be taken as an endorsement of NP: quite the contrary, I prefer neoFisherian divergence ideas for the kind of problems I have encountered in the health and medical sciences. Those ideas can also generate a P-value for H defined by inequalities, often just by switching to maximization over H of two-sided P-values; I don't like to call those P-values "tests", however, because that might suggest they are part of NP theory, which is inappropriate here. Divergence P-values produce valid tests, but those are less powerful when they differ from NP-optimal decision P-values. I would rather see divergence P-values described as indices of compatibility between H and the data given background assumptions - or, from a model-checking view, compatibility between a specific model M and a more general, less restricted model A, in light of the data. For more of the theory see sec. 2 and the Appendix of the Greenland 2023 SJS main paper.
Now to repeat some laments from earlier in our thread, in tendentious and perhaps tedious detail:
I was forced into the role of a P-value defender when the journal Epidemiology (of which I was one of the founding editors in 1990, and which has since become one of the top journals in its field, especially for epidemiologic methods) banned display of P-values for parameters, a move I protested without success. Since then I have been involved in dozens of articles aimed at instructors and researchers about how to teach and use P-values in ways that I have found help avoid the misuse that P-critics complain about. An ideologically diverse and contentious group of colleagues still managed to agree enough to catalog major misuses in TAS 2016 (Greenland, Senn, Rothman, J. Carlin, Poole, Goodman, and Altman). We advised presenting P-values as the numbers they are, not as inequalities like "p<0.05" (which can be done even if their interpretation makes reference to alpha levels), a move advised by authorities both from the NP tradition (e.g., Lehmann) and from the Fisherian tradition (e.g., Cox).
We all knew how common it remains that "statistical significance" or the lack thereof is confused with practical significance or the lack thereof, and how common it remains that P-values are confused with alpha levels, probably because both get called "significance levels". These confusions can be somewhat mitigated simply by adopting long-standing, more precise terms in place of terms using "significance" or "significant". I later teamed with other colleagues to repeat that advice in several articles, starting with Amrhein, Greenland and McShane in Nature 2019. Dishearteningly, that advice to change to less ambiguous yet familiar labels was promptly attacked and confused with calls for banning tests and P-values.
Among other terminology reforms that we have advised are to replace talk of "significance" and "confidence" with compatibility, a usage that can be found in Fisher and which by the start of this century could be found in several other worthy sources; and to replace "null hypothesis" with "tested hypothesis" (as Neyman did) or with "test" or "target" hypothesis, unless indeed the hypothesis is that a parameter is zero or that some variables are independent. We have also advocated teaching devices to aid perception of information by plotting P-values, and by transforming probability statements into physical experiments and natural frequencies, as Gigerenzer and colleagues demonstrated to be effective in many educational experiments - but that is another long story.
It is discouraging to see how such simple constructive reforms to address calls for bans are resisted, with some critics writing as if we had made up these ideas (we merely compiled and blended them from across a vast literature stretching back to Pearson 1900), and as if the replaced terminology were sacred tradition (imagine defending offensive ethnic terms on the grounds that it is only semantics and those terms can be used properly by those who are trained adequately). The result has been little change so far and thus continuing confusion among researchers, hence more calls for bans - some of which have been successful. Regardless of divergent philosophical stances about statistics, we need to constructively address critics with genuine changes, not hold onto what are often arbitrary traditions as if they reflect the soul of statistical science.
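As promised above, here is a small simulation sketch of the size claim for the one-sided test of H: d<0 (my own toy check, assuming a known-variance normal mean with sd 1 and n = 25):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n, reps = 0.05, 25, 200_000
se = 1 / np.sqrt(n)  # standard error of the mean, known variance

# H: d < 0, tested with the one-sided P-value computed at the boundary d = 0.
for d in (-0.5, -0.1, -0.01, 0.0):   # 0.0 is the boundary (infimum) case
    dbar = rng.normal(d, se, reps)   # simulated estimates of d
    p = stats.norm.sf(dbar / se)     # one-sided P-value for d = 0
    print(d, (p <= alpha).mean())    # rejection rate stays <= alpha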
Best Wishes,
Sander
------------------------------
Sander Greenland
Department of Epidemiology and Department of Statistics
University of California, Los Angeles
Original Message:
Sent: 07-26-2023 05:54
From: Eugene Komaroff
Subject: hypothesis formulation
Dr. Greenland. Please forgive me if my remarks are unwarranted and offensive. You are a profound, theoretical statistician with an extensive and impressive publication record. You have earned the respect and the well-deserved reputation as a scholar and teacher not only from me, but from an entire lively but contentious world-wide community of statisticians. In fact, I dreamt you were Goliath, and I was David but had no stone in my pocket. I truly am honored but intimidated by your interest in my humble musings.
I am asking you to simply help me understand your pushback to my statement: An inequality in the null hypothesis is conceptually understandable as a one-tailed test, but mathematically is impossible. This statement is true because I believe in the theory of sampling distributions. Let's completely remove the equality to minimize confusion and state H: d < 0. Please show me a computer program or tell me the statistical software that I can use to evaluate your one-tailed inequality hypothesis.
------------------------------
Eugene Komaroff
Professor of Education
Keiser University Graduate School
Original Message:
Sent: 07-25-2023 17:24
From: Sander Greenland
Subject: hypothesis formulation
Prof. Komaroff,
I have looked at Student 1908 once more and see no 2-sided P-value in it. Thus I am not clear as to why you cited it. If there is a 2-sided P-value in it, please point us to exactly where it can be found.
I am also unclear as to the purpose of your Fisher quote. I have often read that passage and others in scholarly articles attempting to explain the origin of the 0.05-cutoff convention. I have seen nothing in them, however, in which Fisher used the NP terms "alpha", "Type-I error" or "test size" to label or justify such cutoffs, even in his writings (such as your cite) long after those terms had become established in most of the Anglo-American statistics literature (apart from his derogatory remarks about NP-Wald theory). If you have such a cite, please point us to exactly where it can be found.
As for the 0.05 convention, both Fisher and Neyman separately (in their own terms) described the choice of testing cutoff as context dependent; e.g., Fisher said "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas" (Statistical Methods and Scientific Inference, 2nd ed. 1959, p. 42). When I met Henry Oliver Lancaster in 1985, he recounted to me how, when asked if he regretted anything in his career, Fisher snapped back "Ever mentioning 0.05!".
Fisher's regret might have differed if in his Statistical Methods for Research Workers he had given more primacy to his earlier usage, for example "If the value of P so calculated, turned out to be a small quantity such as 0·01, we should conclude with some confidence that the hypothesis was not in fact true of the population actually sampled" ("Applications of Student's Distribution", Metron 1925, p. 90). Meanwhile, Neyman's final writings were exceptionally clear about how his fixed alpha-level needed to be based on costs of errors (e.g., Neyman, Synthese 1977). So I think it safe to say they both would have rejected any attempt at a universal claim for 0.05. It is thus completely unclear to me how your quote of Fisher about probable error bears on the issues I have been discussing.
Regarding Fisher's contributions, it seems you are eager to attribute to me views that I do not have and that are in fact antithetical to what I have been writing here and publishing for years.
You wrote:
Seems to me Professor Greenland wants us to believe that Fisher was nothing more than a social media influencer spreading misinformation.
That comment is so much the opposite of the truth that I had to struggle to understand why you would say it. I think you have misread, as if it were critical of Fisher, my statement that 2-sided P-values "seem to have appeared only after two centuries, and had to wait for someone like Fisher to popularize them, [which] might raise suspicions that they are less intuitive than the original one-sided P-value formulations". My fault for being unclear: I meant that it took a genius of Fisher's stature to clarify the concept and importance of 2-sided P-values so that they could achieve wide adoption, for (as I explained to Constantine) 2-sided P-values are more difficult to understand correctly than are 1-sided P-values. That difference in difficulty can be seen from the fact that 1-sided P-values are easy to express as limits of (and in fact originated from) Bayesian posterior probabilities (see Casella & Berger, JASA 1987; reviewed in Greenland, S., and Poole, C. 2013. Living with P-values: Resurrecting a Bayesian perspective. Epidemiology, 24, 62-68), whereas 2-sided P-values pose a challenge to Bayesian interpretations (e.g., see Bayarri & Berger, JASA 1987).
Still, I think you might have read my remark correctly if you had been reading my posts carefully to their end and reading the articles I cite. Those contain quite favorable views of Fisher's ideas, and start from preferring the informational foundation for statistics he promoted over the decision-theoretic foundation in NP theory, which he vehemently opposed. In fact, in other posts I have classified my views as neo-Fisherian! For example, if you had read to the end of my reply to Constantine Daskalakis, you would have seen that for your example I expressed a preference for Fisher's 2-sided P-value as an information summary, even though (as I explained) the 1-sided P-value is dictated as the decision-theoretic summary in the strict NP-testing formulation given by Lehmann in TSH.
I would point you again to a careful reading of the articles I have cited in this thread, including the recent pair in the Scandinavian Journal of Statistics,
https://doi.org/10.1111/sjos.12625
https://doi.org/10.1111/sjos.12645
which also cite Karl Pearson's theory of statistical model checking as part of the foundation, and build on that and Fisher's concepts of information, reference distributions, and significance levels, and the refinements of those concepts developed by Cox and colleagues. I depart only in taking care to relabel their "significance levels" as P-values (a relabeling which was already starting to happen in the 1920s, as Shafer documents, and was adopted by Cox in his final book in 2011), and in distinguishing their tail-area P-values from the minimum-alpha P-values of NP theory.
I was forced to understand the Fisher vs. Neyman distinction because I was schooled directly by Neyman himself, and even rebuked by him for expressing preference for the Fisherian approach - although his former students on the department faculty at the time - Lehmann, David, and Scott (my advisor) - tried to shield me from his ire. I also appreciate the analogous Bayesian distinction (operational-Bayesian decision theory is the Bayesian analog of NP-Wald decision theory; reference-Bayes theory is the analog of Fisherian reference frequentism) - in fact, for two decades I traveled around the world giving Bayesian workshops. I think both these distinctions should be clarified in all statistical training; the frequentist vs. Bayes split is often emphasized, but the information-summarization vs. decision split is typically neglected, leading to much confusion in teaching and practice.
You also wrote:
It is now clear to me that the statisticians who banned statistical significance, and I don't know who they are besides the three authors of the TAS (2019) editorial, also disparaged the statistical reasoning that preceded Fisher small sample theory and that certainly includes Pearson's large sample theory.
With that you seem to confuse calls to relabel observed "significance levels" as P-values with calls to ban statistical tests and P-values.
P-values are a central statistic in the Pearson-Fisher approach of computing and presenting tail areas of statistics (the "value of P" in Karl Pearson and Fisher) to evaluate statistical models or hypotheses.
A major problem is that many books and tutorials also use "significance level" for the fixed design alpha of NP theory, resulting in widespread misinterpretation of P-values as if they were pre-specified alphas; such misinterpretations lead to profoundly miscalibrated inferences, e.g., see Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62–71.
To mix up calls for careful terminology with calls for bans of methods is a complete and unwarranted confusion that many fall prey to, probably because there are many other authors who do want to drop the Pearson-Fisher methodology from usage, replacing it either with orthodox NP hypothesis tests (e.g., Lakens) or the test inversions called "confidence" intervals (e.g., Rothman), or else with Bayesian measures such as Bayes factors (e.g., Goodman). Among these extremes I find that it is the non-Bayesians who seem to most misunderstand and dismiss Fisher and his use of P-values (his reputation among epidemiologists was badly damaged by his skepticism of the smoking link to lung cancer).
Very few journals have actually enacted any bans, and a cursory examination of prestigious medical journals will show that "significance" as code for "p<0.05" is still the dominant convention. The one major improvement that these reform movements have produced is the routine presentation of interval estimates; I hope we would all agree that is good. What is under fierce debate is whether more reform is needed. I and many others say yes, but so far little in the way of further reform has been taken up in practice because there is little agreement on what should be done.
I have promoted "safe" use of divergence (Pearson-Fisher) P-values, taking the baby steps that we be sure to call them P-values rather than "significance levels", call fixed cutoffs "cutoffs" or "alpha levels", and present P-values in continuous form, without reference to a cutoff - the reader can always insert their own cutoff (whether 0.05 or 0.005 or...). Both Lehmann and Cox recommended continuous presentation of P-values, as one could see by careful reading of their textbooks. Yet these proposals have been attacked by orthodox Neyman-Pearsonians and Bayesians alike, with special invective from the NP orthodoxy (for whom I am an apostate or heretic). My response is to take being attacked from both wings as a sign that I am on the right track, and a suggestion that I am hitting a special nerve in exposing an unscientific rigidity and resistance to reform in a statistical orthodoxy.
Again, I have not called for "banning" anything. Instead, following my favorite statistical thinkers (e.g., Box, Cox, Good, Mosteller, Tukey), I call for understanding and carefully justified use of all approaches, along with wariness of confusions that such a toolkit philosophy can engender. For example, we need to be wary of identifying P-values and "confidence" intervals with posterior probability statements (they are often numerically similar or lead to the same decision, but their interpretations differ in important ways), or confusing Fisher's testing philosophy with Neyman's (they sometimes lead to the same numeric result or decision, but again their interpretations differ in important ways).
All this means is that we should teach the information vs. decision distinction just as we do the frequentist vs. Bayesian distinction. Crossing these distinctions leads to a 2x2 table of questions and tools for answering them, with Information-summarization vs. Decision goals on one axis and Calibration vs. Predictive goals on the other. Elaborations to more rows, columns and dimensions will no doubt be needed, but I think that teaching these distinctions is a start toward addressing the practice problems we lament.
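For concreteness, one possible way to fill in the cells of that table, using the same examples given in this thread (a sketch only; other placements are defensible):

                             Information-summarization         Decision
  Calibration (frequentist)  divergence P-values               NP hypothesis tests
                             (compatibility distributions)
  Predictive (Bayesian)      reference ("objective") Bayes     operational (betting) Bayes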
Best,
------------------------------
Sander Greenland
Department of Epidemiology and Department of Statistics
University of California, Los Angeles
Original Message:
Sent: 07-25-2023 09:23
From: Eugene Komaroff
Subject: hypothesis formulation
Hi Kostas. We met at the Harvard School of Public Health when I was a Research Scientist at ACTG. I recall your frustration teaching basic statistics to students in a GEN ED program. At that time, I could not commiserate, but now I feel your pain after formally teaching online and in-person classes on basic statistical practice for the past 13 years. It is very hard to teach the foundational statistical tests, and it becomes harder when it comes to statistical modeling like multiple regression, multivariate analysis, and beyond.
Regarding Professor Greenland's speculation about one and two tailed tests: "An historical aside: P-values in some form (not by that name) date back to the early 1700s. They were becoming popular and even hacked by researchers by the 1840s; by the 1880s they started to be linked to the then-new concept of "statistical significance" (see Shafer, G. 2020. On the nineteenth-century origins of significance testing and p-hacking. www.probabilityandfinance.com/). Those were all one-sided P-values however, or at least I know of no reference to 2-sided P-values before Fisher, so I'd be curious if any exist; that they seem to have appeared only after two centuries and had to wait for someone like Fisher to popularize them might raise suspicions that they are less intuitive than the original one-sided P-value formulations."
First, take a look at the title of Gosset's (1908) brilliant, groundbreaking paper, with its logic and method for converting a population standard deviation into a standard error, no doubt developed under the tutelage of Karl Pearson.
Gosset WS ("Student," 1908). The probable error of a mean. Biometrika 6 (1), 1–25.
Now, here is what Fisher (1973) said about the concept called probable error: "The value of the deviation beyond which half the observations lie is called the quartile distance, and bears to the standard deviation the ratio .67449. It was formerly a common practice to calculate the standard error and then, multiplying it by this factor, to obtain the probable error. The probable error is thus about two-thirds of the standard error, and as a test of significance a deviation of three times the probable error is effectively equivalent to one of twice the standard error" (p. 45).
Fisher R.A. (1973). Statistical Methods for Research Workers (14th Ed.). New York: Hafner Publishing. Reproduced in Statistical Methods, Experimental Design and Scientific Inference (1995). New York: Oxford University Press.
Seems to me Professor Greenland wants us to believe that Fisher was nothing more than a social media influencer spreading misinformation. It is now clear to me that the statisticians who banned statistical significance, and I don't know who they are besides the three authors of the TAS (2019) editorial, also disparaged the statistical reasoning that preceded Fisher small sample theory and that certainly includes Pearson's large sample theory.
------------------------------
Eugene Komaroff
Professor of Education
Keiser University Graduate School
Original Message:
Sent: 07-24-2023 18:29
From: Sander Greenland
Subject: hypothesis formulation
Hi Constantine,
Small point perhaps, but not trivial, because as you found (as did I) it raises a trickiness for teaching.
Before explaining, allow me to correct your example:
There are several ways to define 2-sided P-values. In the simple case of tossing with Pr(heads) = 0.5 and seeing all heads, they all yield twice the 1-sided P-value I used for n heads in n tosses, 2^(-n); call that 1-sided P-value p.
With n tosses all heads, the 2-sided P-value is 2p = 2(2^(-n)) = 2^(-n+1), whose negative base-2 log is n-1 (not n+1). Thus the S-value from the 2-sided P-value is -log2(p) - 1.
With p = 0.05 we get 2p = 0.10 and s = -log2(2p) = -log2(p) - 1 = 3.3; for reference, that is between the probabilities of 3 and 4 heads in a row, p(3) = 0.1250 and p(4) = 0.0625.
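In code the conversion is a one-liner; a tiny sketch (assuming base-2 logarithms throughout):

import math

def s_value(p):
    """Binary surprisal: bits of information against the tested hypothesis."""
    return -math.log2(p)

p = 0.05
print(s_value(p))      # ~4.32 bits: between 4 and 5 heads in a row
print(s_value(2 * p))  # ~3.32 bits: doubling p costs exactly one bit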
After some waffling, a few years ago I decided that the most straightforward way of explaining the information content of actual observed P-values was using the all-heads example I posted earlier, as that applies to any input P-value, whether 1-sided or 2-sided or many-sided (like that from a test of model fit): The binary S-value provides one simple measure of the information in that P-value against whatever hypothesis or model is being evaluated. That is so even when adjustments or penalties have been applied to get the actual P-value. The coin-tossing formulation I use converts the actual P-value being evaluated into a 1-sided P-value in a reference experiment on coin tossing. This is exactly as is done in particle physics, in which P-values are converted to the one-sided standard normal cutpoint ("sigma") that would produce them as the upper tail area; here the reference experiment is a single draw from a standard normal distribution.
In that description, the reference point for evaluation of a P-value is not described as the P-value testing fairness (which is 2-sided), but rather as the P-value for testing no loading (bias) for heads, which is one-sided. Fairness, Pr(heads) = 0.5, is used for the reference distribution because it is the closest one can come to bias in favor of heads without having that bias, and it is what people think of intuitively for a reference distribution when testing for loading in either direction (we should be grateful for and take advantage of any time that intuition leads to the correct statistical answer!). I went this route in part because using instead the 2-sided P-value for heads brings in complications that arise from disputes about 1-sided vs 2-sided hypotheses and tests, as reflected in the present thread. It is tricky to finesse those disputes and requires a lot of background to appreciate the details, all of which can be avoided if for a moment one forgets statistical theory (or doesn't have any) and just focuses on the probability of getting all heads in an experiment of n tosses to check for bias toward heads.
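That sigma conversion is just an inverse normal tail area; a one-line sketch:

from scipy import stats

p = 2.87e-7                 # a one-sided P-value
print(stats.norm.isf(p))    # ~5.0: the "5 sigma" of particle physics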
An historical aside: P-values in some form (not by that name) date back to the early 1700s. They were becoming popular and even hacked by researchers by the 1840s; by the 1880s they started to be linked to the then-new concept of "statistical significance" (see Shafer, G. 2020. On the nineteenth-century origins of significance testing and p-hacking. http://www.probabilityandfinance.com/). Those were all one-sided P-values however, or at least I know of no reference to 2-sided P-values before Fisher, so I'd be curious if any exist; that they seem to have appeared only after two centuries and had to wait for someone like Fisher to popularize them might raise suspicions that they are less intuitive than the original one-sided P-value formulations.
The problems of interpreting 2-sided P-values can be seen from an information-theory standpoint, where a 1-sided P-value of p = 0.0625 in a coin-tossing experiment to check for bias toward heads would become a 2-sided P-value of 2p = 0.1250 from the same experiment, which represents only 3 bits of information against some hypothesis. But which hypothesis? Bias in either direction? Why check the tail direction when we saw all heads? And why this loss of 1 bit of information?
There are several ways to answer these questions depending on one's preference or dislike for directional hypotheses with their 1-sided P-values vs. point hypotheses with their 2-sided P-values.
For those who dislike 1-sided hypotheses and P-values, a direct 2-sided explanation takes 2p as giving the information against the point hypothesis H: Pr(heads) = 0.5, bypassing one-sided derivations. A 2-sided P-value of 0.1250 then represents only s = -log2(0.1250) = -log2(0.0625) - 1 = 3 bits of information against that H. But this two-sided P-value arose from 4 heads in a row, which has probability 0.0625 under H. I think the 2-fold discrepancy between the P-value and the probability of the observed run of heads is bound to confuse students!
One-sided explanations for the discrepancy can avoid that immediate confusion at a cost of much more sophisticated arguments, as found for example in Cox's writings (SJS 1977) which expressed a preference for thinking of 2-sided P-values as derived by combining two 1-sided tests.
To illustrate the informational view of that combination, first suppose we are given only the 2-sided P-value 2p = 0.1250, not the direction in which the deviation occurred. Then we can only say that one of the directional deviations has p = 0.0625, but we don't know which one. With S-values we can say that there is one bit of missing directional information (the sign bit when the boundary point of H is 0, as with the logit of the heads probability). In the coin-tossing example, it is as if we are given only a one-sided p = 0.0625 but not whether that was from all heads or all tails, so we don't know whether the result is information against H: Pr(heads) ≤ 0.5 or against H: Pr(heads) ≥ 0.5. That is a loss of one bit of information.
We do however ordinarily see the direction; in that case Benjamini described the use of 2p as a Bonferroni-type adjustment or penalty for picking the smaller of the two 1-sided P-values. Extending that to S-values, here is what I posted to Komaroff:
suppose the side was not really prespecified and instead the data made the choice; then the two-sided penalty of doubling the smaller of the one-sided P-values corrects for that choice in a way familiar in multiple comparisons and information theory: doubling p results in a decrement of one bit in the surprisal, losing the direction bit (the data information used to make the direction choice): s = -log2(2p) = -log2(p)-1.
I find it satisfying that both preferences lead us to the same answer of one bit for the information loss in going from 1-sided to 2-sided P-values. But again, for basic teaching this all seems to me to be worth bypassing by using the treatment I posted earlier, in which the actual P-value being evaluated (regardless of its sidedness) is set equal to the probability of all heads in n tosses, 2^(-n), and the equation is solved for n (or more generally, s, the number of bits of information supplied by the actual observed P-value against whatever model was used to compute it).
Best,
------------------------------
Sander Greenland
Department of Epidemiology and Department of Statistics
University of California, Los Angeles
Original Message:
Sent: 07-24-2023 10:29
From: Constantine Daskalakis
Subject: hypothesis formulation
Sander,
Always gets fun when you jump in the waters.
I've always wondered about the interpretation of S-values. You wrote:
Consider a coin-tossing mechanism and take H to be the hypothesis that the mechanism is not loaded (biased) toward "heads". Let p(n) = 2^-n be the P-value for H from seeing n heads in an experimental test of the mechanism comprising n tosses. Then p(4) = 0.0625 and p(5) = 0.03125, placing p = 0.05 closest in evidence to getting 4 heads in 4 tosses. I think most people would appreciate the weakness of such evidence if asked to bet substantial money against H based only on that result.
This exercise can be extended to an observed P-value p for any hypothesis H by converting it to the binary surprisal or S-value s = -log2(p); p then equals 2^(-s).
The interpretation seems to correspond to a 1-sided p-value, right? But the example is a typical 2-sided setup, i.e., we would be "surprised" if we got either n heads or n tails (in n tosses). So, p(5) = 0.0625 and p(6) = 0.0313, and the S-value is then -log2(p)+1. From an information-theoretic standpoint, then, the 2-sided p-value would seem to carry 5.3 (not 4.3) bits of info against H.
Trivial point, but I've found it a bit tricky to explain to (the rare) attentive students/practitioners.
Regards.
------------------------------
Constantine Daskalakis, ScD
Thomas Jefferson University, Philadelphia, PA
Original Message:
Sent: 07-21-2023 20:43
From: Sander Greenland
Subject: hypothesis formulation
Thank you Michael for the kind words and for the citation to exceedance intervals (of which I had not been aware).
On quick glance the TAS paper by Brian Segal looks interesting, albeit demanding. Has Brian followed it up with a more elementary primer illustrated with some toy examples and a simple but real application? Such a primer would help get the method into use. If a primer exists or is forthcoming, please let us know where it is posted.
I did spot one small aspect of the paper that I would alter: On p. 130 it stated "For point null hypotheses, Bayes factors tend to be more conservative, that is, Bayes factors provide less evidence against the null hypothesis than p-values..." I see this kind of comment often, and I think it is misattributing a property of observers to a mere number obtained from a computation. P-values do not overstate evidence against the hypothesis H from which they are computed; rather, people overstate the evidence against H that p = 0.05 represents, thanks to the entrenchment of the 0.05 cutoff as a criterion for "significance". Bayes factors merely provide one way of seeing how little evidence p=0.05 represents.
A straightforward non-Bayesian way of seeing that point uses an old teaching exercise:
Consider a coin-tossing mechanism and take H to be the hypothesis that the mechanism is not loaded (biased) toward "heads". Let p(n) = 2^-n be the P-value for H from seeing n heads in an experimental test of the mechanism comprising n tosses. Then p(4) = 0.0625 and p(5) = 0.03125, placing p = 0.05 closest in evidence to getting 4 heads in 4 tosses. I think most people would appreciate the weakness of such evidence if asked to bet substantial money against H based only on that result.
This exercise can be extended to an observed P-value p for any hypothesis H by converting it to the binary surprisal or S-value s = -log2(p); p then equals 2^(-s). When s is an integer n, p equals the aforementioned p(n) from seeing n heads in the experiment with n tosses. The S-value s can also be seen as a measure of the Shannon information against H that the P-value conveys; the units of s correspond to bits of information, and p = 0.05 represents only about s = 4.3 bits of information against H. For contrast, p = 0.005 represents 7.6 bits, and the one-sided 5-sigma criterion for "discovery" (rejection of a null H) in particle physics corresponds to about 22 bits against H, or 22 heads out of 22 tosses.
My colleagues and I have found this conversion of P-values to coin tosses and surprisals to be very useful in stemming common overinterpretations of P-values. Thus, in addition to background theoretical papers justifying and elaborating the usage, we have published a number of introductory treatments for various fields, including (among others)
Rafi, Z., Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20, 244. doi: 10.1186/s12874-020-01105-9, https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01105-9,
online supplement at https://arxiv.org/abs/2008.12991
Cole, S.R., Edwards, J., Greenland, S. (2021). Surprise! American Journal of Epidemiology, 190, 191-193. https://academic.oup.com/aje/advance-article-abstract/doi/10.1093/aje/kwaa136/5869593
Amrhein, V., Greenland, S. (2022). Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values. Journal of Information Technology, 37(3), 316-320. https://journals.sagepub.com/doi/full/10.1177/02683962221105904
I will look forward to a similar basic introduction to exceedance probabilities.
Best,
------------------------------
Sander Greenland
Department of Epidemiology and Department of Statistics
University of California, Los Angeles
Original Message:
Sent: 07-21-2023 11:36
From: Michael Elliott
Subject: hypothesis formulation
A very interesting discussion and, as always, I learn a great deal from Prof. Greenland's writings.
I thought I would use this opportunity to put in a short plug for my former student Brian Segal's work (entirely his own) on exceedance intervals, which are confidence intervals for the probability that a parameter estimate will exceed a specified value in an exact replication study. The idea has its roots in a Bayesian posterior predictive distribution setting, although the development is entirely frequentist. Although no statistical method is a panacea, I think this approach deserves more attention than it has received thus far.
------------------------------
Michael Elliott
University of Michigan
Original Message:
Sent: 07-20-2023 12:56
From: Sander Greenland
Subject: hypothesis formulation
Dear Eugene Komaroff:
There is a huge literature on testing interval hypotheses in practice, including textbooks; some key references are given in the articles I cited earlier, in Wellek (Testing statistical hypotheses of equivalence and noninferiority. Chapman and Hall/CRC, 2010, which provides real examples of application), and in the Wikipedia entry on equivalence tests.
Any common one-sided P-value for the constraint θ ≤ r (i.e., θ is in the half interval bounded by r) will provide a valid (size ≤ α) NP decision rule or test of H: θ ≤ r by comparing the P-value to α. There are several straightforward adaptations of familiar tests that arise from conjunctions or disjunctions of such one-sided hypotheses. Among them are noninferiority and superiority tests, which test special one-sided hypotheses; minimum-important difference (MID) tests, which test whether θ is inside an interval of radius r around 0, H: -r ≤ θ ≤ r (the intersection of the half interval above -r and the half interval below r); and equivalence tests, which are actually tests of nonequivalence in that their test hypothesis is that θ is outside the interval, H: θ ≤ -r or r ≤ θ (the union of the half interval below -r and the half interval above r).
A view shared by many familiar with this literature is that such tests are long overdue for incorporation into basic training. One reason is that they help prevent common misinterpretations of conventional point-hypothesis tests of the sort described in
Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., Altman, D.G. (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 at https://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5368.pdf
As the Wiki entry states, equivalence tests may "prevent common misinterpretations of p-values larger than the alpha level as support for the absence of a true effect. Furthermore, equivalence tests can identify effects that are statistically significant but practically insignificant, whenever effects are statistically different from zero, but also statistically smaller than any effect size deemed worthwhile."
Interval tests also have many important applications. Although they go back at least to 1954 when Hodges & Lehmann introduced a general method for NP testing of interval hypotheses, they began to get close attention from applied statisticians in the 1970s when the aforementioned interval hypotheses arose in the biopharmaceutical literature. As the Wiki entry states: "Equivalence tests were originally used in areas such as pharmaceutics, frequently in bioequivalence trials. However, these tests can be applied to any instance where the research question asks whether the means of two sets of scores are practically or theoretically equivalent. As such, equivalence analyses have seen increased usage in almost all medical research fields. Additionally, the field of psychology has been adopting the use of equivalence testing...equivalence tests have recently been introduced in evaluation of measurement devices,[7][8] artificial intelligence[9] as well as exercise physiology and sports science.[10] Several tests exist for equivalence analyses; however, more recently the two-one-sided t-tests (TOST) procedure has been garnering considerable attention. As outlined below, this approach is an adaptation of the widely known t-test."
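To make the TOST procedure mentioned in that quotation concrete, here is a hedged one-sample sketch in Python (the function name and toy data are mine, and a known equivalence margin r is assumed):

import numpy as np
from scipy import stats

def tost_one_sample(x, r):
    # Tests the nonequivalence hypothesis H: mu <= -r or mu >= r with two
    # one-sided t-tests; both must reject, so the larger P-value is reported.
    n = len(x)
    m = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(n)
    p_lower = stats.t.sf((m + r) / se, df=n - 1)   # one-sided test of mu <= -r
    p_upper = stats.t.cdf((m - r) / se, df=n - 1)  # one-sided test of mu >= r
    return max(p_lower, p_upper)

rng = np.random.default_rng(0)
x = rng.normal(0.1, 1.0, size=50)  # toy sample with a small true mean
print(tost_one_sample(x, r=0.5))   # small P favors rejecting nonequivalence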
My applied area is health and medical research where these interval-hypothesis methods are sorely needed. I am unfamiliar with the field of education but I would guess it is akin to psychology in having good use for interval methods; if so I shall hope you can get them incorporated into basic educational statistics if they are not already there.
As for the distinction between P-values from NP tests of intervals and Fisherian P-values for divergences from intervals, that is taken up at length in the Greenland 2023 SJS article. Briefly, divergence P-values are a type of summary description of how data diverge from the region of expectations that conform perfectly to a hypothesis H. These summary divergences take on familiar forms such as squared Z-statistics, squared t-statistics, chi-squared statistics, and likelihood-ratio statistics.

Suppose for example that μ is a normal (Gaussian) mean, and H defines a simple closed interval around 0, H: -r ≤ μ ≤ r, as in an MID problem. The divergence statistic d for H is then the squared distance of the sample mean m from the interval divided by the squared standard error of m; thus d = 0 when -r ≤ m ≤ r (i.e., when the sample mean conforms perfectly to H). The divergence P-value for H is the largest two-sided P-value for all means μ that are in the interval (i.e., it is the maximum two-sided P-value over all H: μ = c where -r ≤ c ≤ r). Thus if m is in the hypothesized interval (-r ≤ m ≤ r), the divergence P-value will be 1, because the two-sided P-value for H: μ = m is 1. In contrast, if the interval is many standard errors wide and m falls on an interval boundary (m = -r or m = r), the UMPU (Hodges-Lehmann) P-value from NP testing of the interval will approach 0.5. As an extreme case, the P-value from the NP test of H: μ ≤ r is the ordinary one-sided P-value, which is always strictly less than 1 and is 0.5 when m = r; whereas the divergence P-value for the same H: μ ≤ r equals 1 whenever m ≤ r.
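A small numeric sketch of that MID example (my own illustration, assuming a normal mean with known standard error):

from scipy import stats

def divergence_p(m, se, r):
    # Largest two-sided P-value over all mu in [-r, r]; it equals 1 when the
    # sample mean m lies inside the interval, else uses the nearest boundary.
    z = max(abs(m) - r, 0.0) / se  # distance from m to the interval, in SEs
    return 2 * stats.norm.sf(z)

print(divergence_p(m=0.3, se=0.1, r=0.5))  # inside the interval -> 1.0
print(divergence_p(m=0.5, se=0.1, r=0.5))  # on the boundary -> 1.0 (UMPU P near 0.5)
print(divergence_p(m=0.7, se=0.1, r=0.5))  # 2 SEs beyond the boundary -> about 0.046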
Best,
------------------------------
Sander Greenland
Department of Epidemiology and Department of Statistics
University of California, Los Angeles
Original Message:
Sent: 07-20-2023 07:44
From: Eugene Komaroff
Subject: hypothesis formulation
P-values are continuous random variables; it is therefore perfectly sensible to talk about the probability density function of a p-value distribution. In Fisher's time the probability of a p-value was derived by integration over an infinitesimally small interval as an area under the standard normal curve. The interval between the limits of integration is the only interval that makes sense in a discussion about the meaning of a p-value.
In the T-test procedure in SAS, there is an option H0 = m, where m can be any specific parameter value – it does not have to be zero. However, H0 <= m or H0 >= m is not an option, so it is impossible to run such a test in practice, although it is apparently fun to think about in theory with words and statistical notation. If you know of a practical way to test a null hypothesis parameter that is defined by an interval, please share.
------------------------------
Eugene Komaroff
Professor of Education
Keiser University Graduate School
Original Message:
Sent: 07-19-2023 12:48
From: Sander Greenland
Subject: hypothesis formulation
You might find of interest this recent article with discussion and rejoinder (unfortunately, all printed piecemeal; sorry I don't have the DOIs for the discussant contributions but they are cited in the rejoinder):
Greenland S (2023). Divergence vs. decision P-values: A distinction worth making in theory and keeping in practice. Scandinavian Journal of Statistics, https://onlinelibrary.wiley.com/doi/10.1111/sjos.12625
Rejoinder to discussants: Greenland S (2023). Connecting simple and precise p-values to complex and ambiguous realities. Scandinavian Journal of Statistics, https://doi.org/10.1111/sjos.12645
While all that may be much too involved for the present discussion, I think the point in the main title is relevant here. That point (with which all the journal discussants agreed) was that there are two logically and mathematically distinct ways of conceptualizing, defining, deriving and interpreting P-values. For simple point hypotheses the two coincide numerically, and hence are usually not distinguished and are even thought to be identical concepts. The two conceptualizations can nonetheless lead to different P-values when the tested hypothesis specifies that the distribution generating the data is in a model subspace defined in part by inequalities, as with interval hypotheses (as one-sided hypotheses are often formulated).
In the present discussion, I have the sense that some of the consternation reflects a clash between intuitions arising from the separate conceptualizations. Thus, while the following formulation is far removed from elementary statistics, I think that it could explain the differences among views of point hypotheses and one-sided hypotheses.
The first type of P-value corresponds to a geometric treatment of chi-squared tests of model families as introduced by Karl Pearson (1900), and later adopted for point hypotheses by R.A. Fisher. This P-value is simply the ordinal location in a reference distribution of a measure of divergence between the data and the hypothesized model subspace. In this conceptualization there is no mention or use of error types; the P-value simply serves as part of a description of the sample discrepancy from what would be expected under the nearest distribution in the hypothesized model subspace.
The second type of P-value arises from "optimal" Neyman-Egon Pearson (NP) decision (hypothesis testing) rules; it is the minimum alpha level at which rejection of the model subspace can be declared. Error control over repeated sampling (rather than description of a sample discrepancy from an expectation) is the paramount consideration. A consequence of this focus can be a type of incoherent single-sample property of UMPU (Hodges-Lehmann, HL) P-values for interval hypotheses, as described by Schervish (TAS 1996) - a problem not shared by divergence P-values. For interval hypotheses, the summary divergence P-value can be as much as twice the UMPU P-value; this difference can appear dramatic when the two P-values straddle a sharp cutoff (e.g., if the divergence p = 0.06 but the decision p = 0.03, and alpha = 0.05), but is quite small in information-theoretic terms (representing at most one bit of information difference).
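To spell out that closing "one bit" remark with the straddling example (my arithmetic, not Schervish's): if the divergence P-value is at most twice the decision P-value, then their S-values differ by at most -log2(p) - (-log2(2p)) = log2(2) = 1 bit. Here -log2(0.03) = 5.06 bits versus -log2(0.06) = 4.06 bits, a dramatic-looking gap around the 0.05 cutoff that amounts to exactly one bit of information.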
------------------------------
Sander Greenland
Department of Epidemiology and Department of Statistics
University of California, Los Angeles
Original Message:
Sent: 07-14-2023 11:49
From: James Hawkes
Subject: hypothesis formulation
Sometime a little after 2000, introductory stat books started changing the null hypothesis to a strict equality, with the alternative always strictly >, <, or not equal. Before 2000, most intro stat books used the opposite inequality in the null when the alternative was expressed as > or <. Aside from the fact that the strict equality in the null is the "worst case" for the null, are there any other reasons underlying this change? I would appreciate it if someone could point me to any published discussion on this topic. Also, I would appreciate hearing any thoughts on the subject.
Thanks
Jim Hawkes
------------------------------
James Hawkes
Retired
------------------------------