First, I agree with Edith's comment that there is no such thing as a 'true' p-value. No statistics book ever posited such a thing. There are population mean differences and there are population variances. p-values are estimated from samples. I'll come back to population values of p-values in a bit.
Second, variability estimates for small samples ARE adjusted for by the t-test or other statistics. Just look at the difference between the t-test critical values (when one estimates variability) and the z-score critical value (when the variability is known). Using the ubiquitous two-sided 0.05 level, the critical t has an asymptote at 1.96, which it approaches when N~30; with 6 d.f. the critical t is 2.45.
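To make the t versus z comparison concrete, here is a minimal sketch using only the Python standard library. The helper names (t_pdf, two_sided_p, t_critical) are my own, not from any package; the t-distribution tail is obtained by Simpson's rule on the density and the critical value by bisection, which is adequate for illustration but is not how a statistics package would compute it.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    t = abs(t)
    if t == 0.0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0


def t_critical(alpha, df):
    """Two-sided critical value, found by bisection on two_sided_p."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_sided_p(mid, df) > alpha:
            lo = mid  # p still too large: the critical value lies above mid
        else:
            hi = mid
    return (lo + hi) / 2


# Variability known: z critical value for two-sided 0.05.
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)
print(f"z critical:            {z_crit:.3f}")

# Variability estimated: t critical values shrink toward z as d.f. grow.
for df in (6, 30, 120):
    print(f"t critical, {df:>3} d.f.: {t_critical(0.05, df):.3f}")
```

Run as is, this should show about 1.96 for z and about 2.45, 2.04 and 1.98 for 6, 30 and 120 d.f., i.e. the small-sample penalty is built into the critical value.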
Third, whether N is small or large, p=0.01 says that the means are quite likely to be different; the difference is not zero. Both results indicate 'statistical significance'. End of discussion.
Fourth, there are (and I hate to be so pejorative) ignorant users of statistics who confuse p-values and effect size. Effect size can be simply measured by (mean difference)/(standard deviation). Along with alpha and beta, it is the only input a power analysis needs to estimate N. Effect size is independent of sample size (unlike the standard error or p-values [see below]), although its confidence interval is directly affected by N. Given a sample size of 20 or 500, we can ask of a power analysis (using a power of 50%) what treatment difference would yield a significant (at 0.01 two-sided) result. When the standard deviations are set to a standardized 1.0, the mean difference (effect size) is 0.852 when N=20 and 0.163 when N=500. What do we conclude? When the sample size is small, the means would be almost a standard deviation different - a large treatment difference. When the sample size is large, the means need only be trivially different (0.16) to achieve statistical significance.
Conclusion: If two studies with different N's (N=20 or 500) had the same p-value (0.01), the SMALLER study indicates a VERY LARGE CLINICAL DIFFERENCE, an important finding. However, the CI on this difference will be quite large. The larger study has a smaller CI, but indicates a trivial clinical difference. I would tell my client that the small study indicates a real effect which might be quite large and worthy of future follow-up studies, and that the large study indicates a real effect which is likely quite small and not worth further investment. [Although if the small effect size is the ONLY treatment available, then I'd recommend they do a huge Phase IIIb trial.]
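A rough check of the N=20 versus N=500 comparison, as a sketch under stated assumptions: N is taken to be the size of each of two independent groups, the SD is fixed at 1.0, and 50% power is approximated by the effect whose t statistic sits exactly at the critical value. That is an approximation to an exact noncentral-t power calculation, so the N=20 figure comes out near 0.86 rather than the 0.852 quoted above. The function names are illustrative only.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    t = abs(t)
    if t == 0.0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0


def t_critical(alpha, df):
    """Two-sided critical value, found by bisection on two_sided_p."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_sided_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def threshold_effect(n_per_group, alpha=0.01):
    """Approximate effect size (SD = 1) that lands exactly on p = alpha."""
    df = 2 * n_per_group - 2
    return t_critical(alpha, df) * math.sqrt(2 / n_per_group)


def ci95(effect, n_per_group):
    """95% CI for a mean difference with pooled SD assumed equal to 1."""
    df = 2 * n_per_group - 2
    half_width = t_critical(0.05, df) * math.sqrt(2 / n_per_group)
    return effect - half_width, effect + half_width


results = {}
for n in (20, 500):
    d = threshold_effect(n)
    low, high = ci95(d, n)
    results[n] = (d, low, high)
    print(f"N/group={n:>3}: effect ~ {d:.3f}, 95% CI ~ ({low:.3f}, {high:.3f})")
```

Both intervals exclude zero, as they must when p = 0.01, but the small study's interval is several times wider and sits around a much larger difference than the large study's.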
Fifth: To return to population estimates of p-values, we need to understand what we are talking about: the null hypothesis. We all should know that the null hypothesis is that the treatment difference is zero (or that the mean difference minus a constant is zero). The difference could be in any statistic (e.g., distribution, variability), but I'll focus on mean differences. Let me ask a general question: Can anyone think of any research question for which any scientist/researcher ever believed the observed mean difference would be EXACTLY zero? Let me operationalize that: a huge, huge study observes a difference smaller than 1/10^1,000,000,000,000,000,000,000. Let me further elaborate on 'huge': a study with not 500, not 10,000, not a million, billion, or trillion, but an even larger sample size (e.g., a centillion). While we act as if we are testing whether the difference is exactly zero, in practice no difference in treatments is EXACTLY zero. Think of the number line as measuring the difference between two treatments. What is the likelihood that this difference is ever the single point zero? It may be practically quite small, but the difference between two variables is almost never exactly zero.
When the difference, albeit small, is not exactly zero, the t statistic (and hence the p-value) is a function of the square root of N/group. Let me illustrate this with a very small effect size (0.10), the common 2-sided 0.05 alpha level, and two independent samples. With a small sample size of 4, the p-value is 0.90. When N/group=92 the p-value is 0.50; when N/group=771 the p-value is 0.05; when N/group=3,036 the p-value is 0.0001.
In sum, when any non-zero difference exists, p-values could be anything. As N increases, any non-zero difference will become statistically significant at any level of significance (< 0.05, < 0.01, ... , < 0.000000000000001).
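The dependence of the p-value on N/group for a fixed effect size of 0.10 can be sketched as follows. Assumptions on my part: two equal-sized independent groups, an observed standardized difference of exactly 0.10, so that t = 0.10 * sqrt(N/2) with 2N - 2 d.f., and the smallest case read as 4 per group. The helpers are hypothetical names, with the t tail computed by Simpson's rule rather than by a statistics library.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    t = abs(t)
    if t == 0.0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0


def p_for_effect(effect, n_per_group):
    """p-value of a two-sample t-test when the observed effect size is fixed."""
    t_stat = effect * math.sqrt(n_per_group / 2)  # grows with sqrt(N/group)
    df = 2 * n_per_group - 2
    return two_sided_p(t_stat, df)


p_values = {n: p_for_effect(0.10, n) for n in (4, 92, 771, 3036)}
for n, p in p_values.items():
    print(f"N/group={n:>5}: p = {p:.4f}")
```

The same 0.10 effect moves from nowhere near significance to p of about 0.0001 purely because N grows; the printed values should land close to the 0.90, 0.50, 0.05 and 0.0001 quoted above.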
Sixth: p-values only answer whether the difference is not zero. They COMPLETELY ignore the most important questions: What is the difference? Is the difference (clinically) meaningful? If one understands the metric, those answers can only be obtained from the CI of the difference. We all should know that if the p-value is < 0.05 then the 95% CI will not include zero. The CI is how we understand the magnitude of the difference, NOT THE P-VALUE. If we don't understand the metric of the parameter, I recommend computing the CI on the effect size (using the non-centrality parameter).
-------------------------------------------
Allen Fleishman
Allen Fleishman Biostatistics Inc.
-------------------------------------------