Over the years Major League Baseball has tweaked the dimensions of the field, specifically the height of the pitcher’s mound, its distance from home plate, and the area of the strike zone. It has done this to adjust the balance between pitchers and hitters, mostly shifting it toward hitters to make games more exciting for the fans.
Scientists are debating a similar tweak to the threshold for statistical significance, one that would adjust the balance between false positives and false negatives. As with pitchers and batters, some changes are a zero-sum game – if you lower false positives, you increase false negatives, and vice versa. Where the perfect balance lies is a complicated question and increasingly the subject of debate.
A recent paper (available in preprint) by a long list of authors, including some heavy hitters like John P.A. Ioannidis, suggests that the p-value threshold typically used to define statistical significance in the psychology and biomedical fields be changed from 0.05 to 0.005. They write:
For fields where the threshold for defining statistical significance for new discoveries is P < 0.05, we propose a change to P < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields. Results that would currently be called “significant” but do not meet the new threshold should instead be called “suggestive.” While statisticians have known the relative weakness of using P≈0.05 as a threshold for discovery and the proposal to lower it to 0.005 is not new, a critical mass of researchers now endorse this change.
The p-value is defined as the probability that the results of an experiment would deviate from the null hypothesis by as much as they did, or more, if the null hypothesis is true. If that sounds difficult to parse, don’t feel bad. Many scientists cannot give the correct technical definition. To put it more simply: what are the odds that you would have gotten results at least as extreme as you did if there were no real effect? In medicine this usually refers to an effect, such as the difference in pain reduction between a placebo and an experimental treatment. Is that difference statistically significant? A p-value of 0.05, the traditional threshold, means that there is a 5% chance that you would have obtained those results (or more extreme) without there being a real effect. A p-value of 0.005 means there is a 0.5% chance – or a change from 1/20 to 1/200.
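To make this concrete, here is a small sketch in Python. The pain-score numbers are invented, and the permutation test is just one convenient way (not any particular published method) to act out the definition directly: shuffle the group labels to simulate a world with no real effect, and count how often a difference at least as large as the observed one turns up.

```python
# Illustration of what a p-value measures: the probability of seeing a
# difference at least as extreme as the observed one if the null hypothesis
# (no real effect) were true. All numbers below are made up.
import numpy as np

rng = np.random.default_rng(0)

placebo   = np.array([5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.2, 6.1])  # hypothetical pain scores
treatment = np.array([4.2, 4.9, 3.8, 4.5, 5.0, 3.9, 4.4, 4.7])

observed_diff = placebo.mean() - treatment.mean()

# Permutation test: shuffle the group labels many times to simulate "no real
# effect" and count how often the shuffled difference is at least as extreme.
combined = np.concatenate([placebo, treatment])
n_perm = 20_000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(combined)
    diff = shuffled[:len(placebo)].mean() - shuffled[len(placebo):].mean()
    if abs(diff) >= abs(observed_diff):
        count += 1

p_value = count / n_perm
print(f"Observed difference: {observed_diff:.2f}, p-value ~ {p_value:.4f}")
```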
There are major problems with over-reliance on the p-value. It was never intended to be the one measure of whether or not an effect is real, but unfortunately the human desire for simplicity has pushed it into that role. Also, people tend to flip its meaning, interpreting it as the odds that the effect is real rather than the odds of seeing such data if there were no effect. This reversal of meaning is not valid. A study with a p-value of 0.05 does not mean that there is a 95% chance the effect is real. There could still be only a tiny probability the effect is real, depending on other factors.
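A quick back-of-the-envelope calculation shows why. The probability that a "significant" result reflects a real effect depends on how plausible the hypothesis was to begin with and on the power of the study. The numbers below are my own assumptions, not from the editorial, and power is held fixed for simplicity:

```python
# Bayes' theorem applied to study results: P(effect is real | significant)
# depends on the prior probability of a true effect and the study's power,
# not just on the significance threshold. Assumed numbers for illustration.
prior = 0.10   # assumed: 1 in 10 tested hypotheses is actually true
power = 0.80   # assumed: chance a real effect reaches significance
alpha = 0.05   # threshold for "statistical significance"

true_positives  = power * prior
false_positives = alpha * (1 - prior)
ppv = true_positives / (true_positives + false_positives)
print(f"P(effect is real | p < {alpha}) = {ppv:.2f}")      # ~0.64, not 0.95

# Same calculation with the proposed stricter threshold (power held constant,
# which is roughly what the suggested larger sample sizes would accomplish).
alpha_new = 0.005
ppv_new = (power * prior) / (power * prior + alpha_new * (1 - prior))
print(f"P(effect is real | p < {alpha_new}) = {ppv_new:.2f}")  # ~0.95
```

With only one in ten tested hypotheses being true, a result just under p = 0.05 still leaves roughly a one-in-three chance that the effect is not real.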
Placing so much importance on the p-value has also demonstrably led to what is called p-hacking. There are subtle (and sometimes not-so-subtle) ways in which researchers can bias the outcome of a study to cross over the magical threshold of 0.05, declare their results significant, and get them published. This, in turn, has fueled the reproducibility problem and flooded the literature with a mass of dubious studies.
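One simple way to see how p-hacking works: if a study measures ten independent outcomes and reports whichever one happens to cross 0.05, the false positive rate is nowhere near 5%. This little simulation (my own toy example, not from the paper) makes the point:

```python
# Simulating one common form of p-hacking: measuring many outcomes and
# reporting whichever crosses p < 0.05. Both groups are drawn from the same
# distribution, so every "significant" result here is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies, n_outcomes, n_per_group = 10_000, 10, 30

false_positive_studies = 0
for _ in range(n_studies):
    a = rng.normal(size=(n_outcomes, n_per_group))   # null is true everywhere
    b = rng.normal(size=(n_outcomes, n_per_group))
    p_values = stats.ttest_ind(a, b, axis=1).pvalue   # one t-test per outcome
    if p_values.min() < 0.05:                         # report the "best" outcome
        false_positive_studies += 1

print(f"Studies with at least one p < 0.05: "
      f"{false_positive_studies / n_studies:.1%}")    # roughly 40%, not 5%
```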
Essentially the authors of the new editorial are pointing out that the balance between false positives and false negatives has drifted far from an optimal point. Over the years researchers have figured out how to game the p-value. Combined with tremendous pressure to publish positive findings, and the usual biases we all have, this has led to a glut of preliminary findings that are mostly false positives.
What the authors propose would certainly shift the balance away from false positives. It is a straightforward fix, but I have concerns that it may not be optimal, or even enough by itself. I do like their suggestion that we consider 0.005 to be statistically significant, and anything between 0.05 and 0.005 to be “suggestive.” This is closer to the truth, and would probably help shift the way scientists and the public think about p-values. I have already made this mental shift myself. I do not get excited about results with a p-value near 0.05. It just doesn’t mean that much.
The downside, of course, is that this will increase the number of false negatives. Given how overwhelmed the literature is with false positive studies, however, I think this is a good trade-off. Further, the p-value threshold is not the only variable. The authors suggest that you could increase the size of a study by 70% to essentially keep the false negative rate where it is. In this way research is not entirely a zero-sum game. You can decrease both false positives and false negatives by increasing the study size, or the power of the study.
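The roughly 70% figure is easy to check with a standard power calculation. Here is a rough sketch using the normal approximation for a two-sample comparison; this is my own check, not the authors' calculation, the effect size is an arbitrary assumption, and the ratio barely depends on it:

```python
# Approximate sample size per group needed for a two-sided two-sample test
# at 80% power, comparing the old (0.05) and proposed (0.005) thresholds.
from scipy.stats import norm

def n_per_group(effect_size, alpha, power):
    """Normal-approximation sample size per group for a two-sample test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

d = 0.5  # assumed standardized effect size (Cohen's d)
n_old = n_per_group(d, alpha=0.05, power=0.80)
n_new = n_per_group(d, alpha=0.005, power=0.80)

print(f"n per group at alpha=0.05:  {n_old:.0f}")
print(f"n per group at alpha=0.005: {n_new:.0f}")
print(f"Increase: {(n_new / n_old - 1):.0%}")   # about 70%
```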
While true, this can be difficult for many researchers, especially those with marginal funding, such as young researchers. For rare diseases, or for questions where it is difficult to recruit patients even with good funding, it could be hard to reach the numbers needed to achieve p < 0.005. But to this I say – so what? You can still do your small study, and if you get marginal p-values you can even still publish. Just don’t call your results “significant.” Call them “suggestive” instead.
There may be unintended consequences to this change, but given the huge problem with false positive studies I say we make the change and see what happens. We can always do further tweaks if necessary.
Also, I don’t want the focus on where to set the p-value to distract from the deeper question of how useful the p-value is at all. Some journals have gone as far as banning p-values entirely, in favor of other methods of statistical analysis. I think this is draconian, but they have the right idea: put p-values in their place.
For example, effect sizes are extremely important but often neglected. More important than the p-value is some measure of the signal-to-noise ratio: how large is the effect compared to what is being measured and to the uncertainty in the outcome? Further, Bayesian analysis can be very useful. A Bayesian analysis asks the question most researchers think they are asking – what is the probability of my hypothesis given this new data? A Nature commentary on this issue suggests that many researchers don’t have the statistical chops to do a Bayesian analysis. Again I say, so what? The answer is to improve the statistical chops of the average researcher.
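As a toy illustration of that Bayesian question (the numbers are hypothetical), suppose a treatment helped 14 of 20 patients and we want the probability that its true response rate beats a coin flip. A simple conjugate Beta-binomial update answers that directly, and the posterior also yields a credible interval that doubles as an effect-size summary:

```python
# Bayesian answer to "what is the probability of my hypothesis given this
# data?" for a hypothetical treatment that helped 14 of 20 patients,
# starting from a flat Beta(1, 1) prior on the response rate.
from scipy.stats import beta

successes, failures = 14, 6
prior_a, prior_b = 1, 1                                     # flat prior

posterior = beta(prior_a + successes, prior_b + failures)   # Beta(15, 7)

# Probability that the true response rate exceeds 50%, given the data
p_hypothesis = 1 - posterior.cdf(0.5)
print(f"P(response rate > 0.5 | data) = {p_hypothesis:.3f}")

# A 95% credible interval for the response rate (an effect-size summary)
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")
```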
That is, in fact, the core problem here. Many researchers do not understand the limitations of the p-value, or they succumb to the temptation of relying heavily on this one measure because it is the fast track to statistical significance and publication. Many also do not fully understand the nature of p-hacking and how to avoid it. We need to raise the bar for the minimum acceptable statistical analysis and methodology in medical research.
All indications are that the balance has shifted unacceptably toward false positives. We probably need a thorough culture change within the medical research community – publishing fewer but more rigorous studies, clearly labeling which studies are preliminary, and eliminating publication and citation bias against negative results.
This one shift in the threshold for statistical significance won’t be enough, but I do think it is a move in the right direction.