I agree that this test (for paired differences) is pesky. I think it's annoying that the test is so often described wrongly, especially in texts with the "cookbook" approach ("use this test if your data are non-normal", "this test is about differences in medians"). Ironically, Wilcoxon assumed a normal distribution of the differences in his original paper.
I happened to spend (too) much time on the logic of this test in the past. Here's what I remember:
The test statistic is based on an ordering of the absolute values of the differences. Then, in this ordering we replace the absolute values by + or - , depending on the sign of the differences. This results in a sequence of + and - signs. Now we assume that every possible sequence of + and - signs is equally likely (assumption A).
After that, sums of signed ranks and p-values are calculated. The null hypothesis for the two-sided test is this assumption A, technically speaking.
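A minimal sketch of the mechanics described above, in R. The differences in d are made-up numbers (no zeros, no ties), and the names r, V, signs and V_all are only illustrative. The sketch ranks the absolute differences, sums the ranks that carry a + sign, and then gets the exact two-sided p-value by enumerating all 2^n sign sequences under assumption A:

```r
# Hypothetical paired differences (no zeros, no ties), for illustration only
d <- c(1.2, -0.4, 2.5, 0.9, -1.7, 3.1, 0.3, 2.2)
n <- length(d)

# Rank the absolute differences, then sum the ranks that carry a "+" sign
r <- rank(abs(d))
V <- sum(r[d > 0])

# Assumption A: all 2^n sequences of +/- signs are equally likely.
# Enumerate them (1 = "+", 0 = "-") and compute the positive-rank sum for each.
signs <- as.matrix(expand.grid(rep(list(c(0, 1)), n)))
V_all <- as.vector(signs %*% r)

# Exact two-sided p-value: twice the smaller tail probability, capped at 1
p_two_sided <- min(1, 2 * min(mean(V_all <= V), mean(V_all >= V)))

# Compare with R's own computation
wt <- wilcox.test(d, exact = TRUE)
c(V_by_hand = V, V_from_R = unname(wt$statistic))
c(p_by_hand = p_two_sided, p_from_R = wt$p.value)
```

Nothing in the enumeration refers to means or medians; the only ingredient is the equal probability of the sign sequences.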
Which distributions satisfy assumption A? Clearly, continuous distributions that are symmetric around zero. (For discrete distributions with non-zero probability for 0, we would have to exclude the 0s from the data.) Are there other (asymmetric) distributions that satisfy assumption A? I don't know, probably not, but maybe one can construct some weird distribution with this property. Anyway, the usual formulation of the null hypothesis is therefore "differences have a symmetric distribution with mean zero". Because of the symmetry, this is equivalent to "differences have a symmetric distribution with median zero".
If the null is not true, this means "differences do not have a symmetric distribution with mean zero". What does this mean? It could be
a) Differences have a symmetric distribution, but not around zero
b) Differences have mean zero, but are not symmetric
c) Differences have median zero, but are not symmetric
d) Differences have mean different from zero and are not symmetric
e) Differences have median different from zero and are not symmetric
f) Differences have median different from zero and mean different from zero and are not symmetric.
Pretty inconclusive. So we pretty much don't know anything, when the null is rejected.
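Case c) can be illustrated with a small simulation. This is only a sketch under made-up settings: the sample size, the number of replications, the seed and the shifted exponential distribution are all arbitrary choices. The differences have median exactly zero but a right-skewed shape, so about half of them are positive, yet the two-sided test rejects far more often than 5% of the time:

```r
set.seed(1)   # arbitrary seed, for reproducibility only

n    <- 500   # hypothetical sample size
reps <- 200   # hypothetical number of simulated data sets

# An Exp(1) variable has median log(2); subtracting it gives differences
# with median zero that are not symmetric about zero.
sim_once <- function() {
  d <- rexp(n) - log(2)
  c(p_value = wilcox.test(d)$p.value, share_positive = mean(d > 0))
}
res <- replicate(reps, sim_once())

rejection_rate     <- mean(res["p_value", ] < 0.05)
avg_share_positive <- mean(res["share_positive", ])

rejection_rate       # far above the nominal 0.05 in this setup
avg_share_positive   # close to 0.5
```

So a rejection here reflects the asymmetry, not a shift of the median and not a majority of positive differences.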
It gets worse if we want an interpretation in the before-after framework. If we think that differences are not symmetric about zero, does this say something about the "number of positive before - after comparisons"? I'd argue: nothing at all.
Ok, so we could raise a flag: Only use this test, if we can assume (beyond a reasonable doubt) that differences are symmetric. Then the (additional) null hypothesis would simply be "differences have mean (or median) 0". If we reject the null, we conclude that the differences have a mean (and median) different from zero.
Since the difference of means is the mean of the differences, this results in "mean before" and "mean after" being different. However, we cannot safely conclude that "median before" and "median after" are different, because the median of the differences is not equal to the difference of the medians.
Unless we also assume that the before values and the after values are symmetrically distributed, which does not follow from the symmetry of the differences. Or unless we assume that the shapes of the distributions of the before values and the after values are exactly the same.
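A tiny numeric illustration of the point about means versus medians, with made-up scores for three subjects (the values are hypothetical and chosen so that the differences 1, 4, 7 are symmetric about 4):

```r
# Hypothetical before/after scores for three subjects, for illustration only
before <- c(1, 5, 6)
after  <- c(8, 6, 10)
d <- after - before              # 7, 1, 4

mean(d)                          # 4
mean(after) - mean(before)       # 4: the difference of means equals the mean of differences

median(d)                        # 4
median(after) - median(before)   # 3: the difference of medians does not
```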
These are a lot of assumptions for a test that is often sold with a tag "no assumptions needed".
Now for the one-sided test. If we include symmetry of the differences in the null hypothesis, I'd argue again that a rejection of the null doesn't tell us anything useful.
So we state the assumption "differences have a symmetric distribution" outside of the null. Then, the null might be "mean of differences <= 0". Rejection of the null then means "mean of differences > 0". Because of symmetry, this is equivalent to "median of differences > 0". Again, this is not equivalent to "difference of medians > 0".
I'd agree with you that this means "the majority of after-scores were an improvement". But it all depends on the assumption of symmetry of differences, which is an assumption outside of the null.
My own question is: if the data are symmetrically distributed, the sample mean approaches normality rather quickly. So why is the Wilcoxon signed-rank test preferable to the t-test for moderate sample sizes? If the differences are non-symmetric, neither the Wilcoxon signed-rank test nor the t-test is appropriate. If the differences are symmetric, the sample mean is approximately normal in moderate sample sizes, so the t-test should be approximately fine. (And if the sample size is 5 or 10, should we really use inferential statistics at all?)
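One way to probe this question empirically is a small simulation; the following is only a sketch under assumed settings (n = 30, 2000 replications, Laplace-distributed differences generated as the difference of two exponentials, arbitrary seed). It compares the type I error rates of the two tests when the differences are symmetric about zero but not normal:

```r
set.seed(2)    # arbitrary seed, for reproducibility only

n    <- 30     # hypothetical "moderate" sample size
reps <- 2000   # hypothetical number of simulated data sets

p_vals <- replicate(reps, {
  d <- rexp(n) - rexp(n)   # Laplace differences: symmetric about 0, non-normal
  c(t_test = t.test(d)$p.value, wilcoxon = wilcox.test(d)$p.value)
})

# Share of two-sided p-values below 0.05 for each test (the null is true here)
type1 <- rowMeans(p_vals < 0.05)
type1
```

This only looks at the level of the two tests under one symmetric distribution. It says nothing about power, which would need a shifted alternative and could differ between the tests.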
------------------------------
Hans Kiesl
Regensburg University of Applied Sciences
Germany
------------------------------
Original Message:
Sent: 06-22-2021 13:37
From: Isabella Ghement
Subject: The pesky Wilcoxon signed-rank test and its interpretation
Hi everyone,
In an unrelated matter, I must confess that I am stumped by how to formulate hypotheses for a directional Wilcoxon signed-rank test and how to interpret the ensuing results.
I am using R's wilcox.test() function to analyze data collected in a before-after study. The function syntax that I used is like this:
wilcox.test(x = after_scores, y = before_scores, alternative = "greater", mu = 0, paired = TRUE,
exact = NULL, correct = TRUE, conf.int = TRUE, conf.level = 0.95)
As an example:
set.seed(146)
before_scores <- rnorm(n = 10, mean = 20, sd = 3)
after_scores <- rnorm(n = 10, mean = 24, sd = 3)
wt <- wilcox.test(x = after_scores, y = before_scores,
alternative = "greater", mu = 0, paired = TRUE,
exact = NULL, correct = TRUE,
conf.int = TRUE, conf.level = 0.95)
wt
> wt
Wilcoxon signed rank exact test
data: after_scores and before_scores
V = 50, p-value = 0.009766
alternative hypothesis: true location shift is greater than 0
95 percent confidence interval:
1.014856 Inf
sample estimates:
(pseudo)median
4.010822
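The (pseudo)median in this output can be checked by hand. The sketch below assumes the exact, tie-free case, where the reported estimate should be the median of the Walsh averages (d_i + d_j) / 2 over all pairs i <= j of the paired differences. The names d, walsh and hl are only illustrative:

```r
set.seed(146)
before_scores <- rnorm(n = 10, mean = 20, sd = 3)
after_scores  <- rnorm(n = 10, mean = 24, sd = 3)
d <- after_scores - before_scores

# Walsh averages: (d_i + d_j) / 2 for all pairs with i <= j
walsh <- outer(d, d, "+") / 2
walsh <- walsh[upper.tri(walsh, diag = TRUE)]
hl <- median(walsh)

wt <- wilcox.test(x = after_scores, y = before_scores,
                  alternative = "greater", paired = TRUE, conf.int = TRUE)

c(by_hand = hl, from_R = unname(wt$estimate))
```

So the estimate is a location summary of the differences (and of their pairwise averages), not of the before or after scores separately.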
In my mind, I thought I could set up the null and alternative hypotheses like this:
Ho: The after scores are NOT generally greater than the before scores.
Ha: The after scores ARE generally greater than the before scores.
so that, when I reject Ho, I could state that the data provide evidence that the after scores are generally greater than the before scores.
But then I looked at R's help file for wilcox.test() and got all confused:
If only x is given, or if both x and y are given and paired is TRUE, a Wilcoxon signed rank test of the null that the distribution of x (in the one sample case) or of x - y (in the paired two sample case) is symmetric about mu is performed.
How do I translate this into directional hypotheses? For non-directional hypotheses, I would state:
Ho: The distribution of after - before scores IS symmetric about 0;
Ha: The distribution of after - before scores IS NOT symmetric about 0.
But what I really want to know is whether there has been an improvement in after scores compared to before scores. So, for directional hypotheses, can I state something like:
Ho: The majority of after - before scores were less than or equal to 0.
Ha: The majority of after - before scores were greater than 0 (aka the majority of after scores were an improvement over their before counterparts)?
I am thinking that a directional Ha would state that the distribution of "after - before" differences is asymmetric about zero with the majority of that distribution sitting above 0 but I am not sure that is how a directional hypothesis for the Wilcoxon signed-rank test should actually be stated. Any clarity on this would be greatly appreciated.
Many thanks,
Isabella
------------------------------
Isabella R. Ghement, Ph.D.
Ghement Statistical Consulting Company Ltd.
E-mail: isabella@ghement.ca
Phone: 604-767-1250
------------------------------