Let's Set Half a Percent as the Standard for Statistical Significance


My many-times-over coauthor Dan Benjamin is the lead author of a very interesting short paper, "Redefine Statistical Significance." He gathered luminaries from many disciplines to jointly advocate tightening the standard for the words "statistically significant": they would apply only to results that have less than a half-percent probability of occurring by chance when nothing is really there, rather than to all results that, on their face, have less than a 5% probability of occurring by chance. Results with more than a half-percent probability of occurring by chance could be called, at most, "statistically suggestive."

In my view, this is a marvelous idea. It could (a) help enormously and (b) really happen. It can really happen because it is at heart a linguistic rule. Even if rigorously enforced, it just means that editors would force authors to say "statistically suggestive" for a p-value a little less than .05, and would allow the phrase "statistically significant" in a paper only if the p-value is .005 or less. As a well-defined policy, it is nothing more than that. Everything else is general equilibrium effects.

I previewed the paper, and some of the reasons tightening the standard for statistical significance could help enormously, in "Does the Journal System Distort Scientific Research?" In the last few years, discipline after discipline has faced a "replication crisis" as results that were considered important could not be backed up by independent researchers.

Here is a key part of the argument in "Redefine Statistical Significance":

Multiple hypothesis testing, P-hacking, and publication bias all reduce the credibility of evidence. Some of these practices reduce the prior odds of [the alternative hypothesis] relative to [the null hypothesis] by changing the population of hypothesis tests that are reported. Prediction markets and analyses of replication results both suggest that for psychology experiments, the prior odds of [the alternative hypothesis] relative to [the null hypothesis] may be only about 1:10. A similar number has been suggested in cancer clinical trials, and the number is likely to be much lower in preclinical biomedical research. ...

A two-sided P-value of 0.05 corresponds to Bayes factors in favor of [the alternative hypothesis] that range from about 2.5 to 3.4 under reasonable assumptions about [the alternative hypothesis] (Fig. 1). This is weak evidence from at least three perspectives. First, conventional Bayes factor categorizations characterize this range as “weak” or “very weak.” Second, we suspect many scientists would guess that P ≈ 0.05 implies stronger support for [the alternative hypothesis] than a Bayes factor of 2.5 to 3.4. Third, using equation (1) and prior odds of 1:10, a P-value of 0.05 corresponds to at least 3:1 odds (i.e., the reciprocal of the product 1/10 × 3.4) in favor of the null hypothesis!

... In biomedical research, 96% of a sample of recent papers claim statistically significant results with the P < 0.05 threshold. However, replication rates were very low for these studies, suggesting a potential for gains by adopting this new standard in these fields as well.

In other words, as things are now, something declared "statistically significant" at the 5% level is much more likely to be false than true.
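The paper's equation (1) is just Bayes' rule in odds form: posterior odds equal the Bayes factor times the prior odds. Here is a quick sketch of the arithmetic in the quoted passage, using the quoted numbers (prior odds of 1:10, Bayes factor of 3.4):

```python
# Bayes' rule in odds form: posterior odds = Bayes factor x prior odds.
# Numbers from the quoted passage: prior odds of 1:10 for the alternative
# hypothesis, and a Bayes factor of about 3.4 for a just-significant
# two-sided P-value of 0.05.
prior_odds = 1 / 10        # odds of H1 relative to H0 before seeing the data
bayes_factor = 3.4         # evidence contributed by P = 0.05

posterior_odds = bayes_factor * prior_odds   # odds of H1 after the data
odds_for_null = 1 / posterior_odds           # flip to odds in favor of H0

print(f"posterior odds for H1: {posterior_odds:.2f}")  # 0.34
print(f"odds in favor of H0: about {odds_for_null:.1f}:1")  # about 2.9:1
```

So even a result that just clears the 5% bar leaves the odds roughly 3:1 in favor of the null, exactly as the quoted passage says.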

By contrast, the authors argue, results declared significant at the 1/2% level are at least as likely to be true as false, in the sense of being replicable about 50% of the time in psychology and about 85% of the time in experimental economics:

Empirical evidence from recent replication projects in psychology and experimental economics provide insights into the prior odds in favor of [the alternative hypothesis]. In both projects, the rate of replication (i.e., significance at P < 0.05 in the replication in a consistent direction) was roughly double for initial studies with P < 0.005 relative to initial studies with 0.005 < P < 0.05: 50% versus 24% for psychology, and 85% versus 44% for experimental economics.

What about the costs of a stricter standard for declaring statistical significance? The authors of "Redefine Statistical Significance" write:

For a wide range of common statistical tests, transitioning from a P-value threshold of [0.05] to [0.005] while maintaining 80% power would require an increase in sample sizes of about 70%. Such an increase means that fewer studies can be conducted using current experimental designs and budgets. But Figure 2 shows the benefit: false positive rates would typically fall by factors greater than two. Hence, considerable resources would be saved by not performing future studies based on false premises. Increasing sample sizes is also desirable because studies with small sample sizes tend to yield inflated effect size estimates, and publication and other biases may be more likely in an environment of small studies. We believe that efficiency gains would far outweigh losses.
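The 70% figure can be checked with a back-of-the-envelope calculation. For a two-sided z-test, required sample size is proportional to (z_alpha/2 + z_beta)^2, so the ratio of sample sizes at two alpha levels, holding power fixed, does not depend on the effect size. A sketch under that textbook approximation, using only the Python standard library:

```python
from statistics import NormalDist

norm = NormalDist()

def z_upper(p):
    """Upper-tail critical value: the z such that P(Z > z) = p."""
    return norm.inv_cdf(1 - p)

# Holding 80% power fixed, compare sample sizes needed at the
# alpha = 0.05 and alpha = 0.005 significance thresholds.
power = 0.80
z_beta = norm.inv_cdf(power)   # about 0.84

n_ratio = ((z_upper(0.005 / 2) + z_beta) /
           (z_upper(0.05 / 2) + z_beta)) ** 2

print(f"sample size ratio: {n_ratio:.2f}")  # about 1.70, i.e. ~70% larger
```

The ratio comes out to about 1.70, matching the roughly 70% increase the authors report.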

They are careful to say that in some disciplines, even the half-percent standard for statistical significance is not strict enough:

For exploratory research with very low prior odds (well outside the range in Figure 2), even lower significance thresholds than 0.005 are needed. Recognition of this issue led the genetics research community to move to a “genome-wide significance threshold” of 5×10^{-8} over a decade ago. And in high-energy physics, the tradition has long been to define significance by a “5-sigma” rule (roughly a P-value threshold of 3×10^{-7} ). We are essentially suggesting a move from a 2-sigma rule to a 3-sigma rule.

Our recommendation applies to disciplines with prior odds broadly in the range depicted in Figure 2, where use of P < 0.05 as a default is widespread. Within those disciplines, it is helpful for consumers of research to have a consistent benchmark. We feel the default should be shifted.
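The sigma rules the authors mention translate into P-value thresholds via the normal tail probability. A quick check of the quoted numbers (using a one-sided tail for the physics convention, and two-sided thresholds for the p-value cutoffs):

```python
from statistics import NormalDist

norm = NormalDist()

def one_sided_p(z):
    """Upper-tail probability P(Z > z) for a standard normal."""
    return 1 - norm.cdf(z)

# High-energy physics "5-sigma" rule (one-sided tail, as quoted):
p_5sigma = one_sided_p(5)
print(f"5-sigma: p ~ {p_5sigma:.1e}")  # ~ 2.9e-07, i.e. roughly 3e-7

# The z-scores behind the two P-value thresholds (two-sided):
z_old = norm.inv_cdf(1 - 0.05 / 2)    # ~ 1.96, roughly "2 sigma"
z_new = norm.inv_cdf(1 - 0.005 / 2)   # ~ 2.81, roughly "3 sigma"
print(f"P < 0.05  means |z| > {z_old:.2f}")
print(f"P < 0.005 means |z| > {z_new:.2f}")
```

This is why the authors describe the proposal as a move from a 2-sigma rule to (essentially) a 3-sigma rule: the two-sided cutoff rises from about 1.96 to about 2.81 standard deviations.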

To me, one of the biggest benefits of this shift might be a greater ability to publish results that do not reject the null hypothesis at conventional levels. These results, too, are an important part of the evidence base. The authors of "Redefine Statistical Significance" are careful to say that people should be able to publish papers that have no statistically significant results:

We emphasize that this proposal is about standards of evidence, not standards for policy action nor standards for publication. Results that do not reach the threshold for statistical significance (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods. This proposal should not be used to reject publications of novel findings with 0.005 < P < 0.05 properly labeled as suggestive evidence. We should reward quality and transparency of research as we impose these more stringent standards, and we should monitor how researchers’ behaviors are affected by this change. Otherwise, science runs the risk that the more demanding threshold for statistical significance will be met to the detriment of quality and transparency.

I myself was shocked when I read my own words above on the screen:

... people should be able to publish papers that have no statistically significant results: ...

That it seems shocking to say a paper should be publishable with no statistically significant results is a symptom of how corrupt the system has become. A stronger standard of statistical significance is needed in order to fight that corruption, both by making results that are declared statistically significant more likely to be true and by making results that are not declared statistically significant more publishable.


Update: Also useful is this article by Valentin Amrhein, Fränzi Korner-Nievergelt and Tobias Roth on "significance thresholds and the crisis of unreplicable research."