The greatest strength of science is that it is self-critical. Scientists are not only critical of specific claims and the evidence for those claims, but they are critical of the process of science itself. That criticism is constructive – it is designed to make the process better, more efficient, and more reliable.
One aspect of the process of science that has received intense criticism in the last few years is an over-reliance on P-values, a specific statistical method for analyzing data. This may seem like a wonky technical point, but it actually cuts to the heart of science-based medicine. In a way the P-value is the focal point of much of what we advocate for at SBM.
Recently the American Statistical Association (ASA) put out a position paper in which they specifically warn against misuse of the P-value. This is the first time in their 177 years of existence that they have felt the need to put out such a position paper. The reason for this unprecedented act was their concern that abuse of the P-value is taking the practice of science off course, and that a much-needed course correction is overdue.
What is a P-value?
A P-value is a specific statistical analysis of data which addresses the following question: what is the probability of obtaining data at least as extreme as the data actually observed, assuming the null hypothesis (the assumption that there is no real difference between the groups or phenomena being compared) is true? One of the problems with the P-value is that few scientists and practitioners who use science can give a thoroughly accurate description of what it actually is.
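To make that definition concrete, here is a minimal sketch in Python, with made-up numbers, of the quantity a P-value captures. It uses a simple permutation test: under the null hypothesis the group labels are interchangeable, so we can shuffle them repeatedly and count how often a difference at least as extreme as the observed one arises by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements from two groups (made-up numbers, purely for illustration).
treatment = np.array([5.1, 6.3, 5.8, 7.0, 6.1, 5.9, 6.8, 6.4])
control = np.array([5.0, 5.4, 5.2, 6.0, 5.6, 5.3, 5.7, 5.5])

observed_diff = treatment.mean() - control.mean()

# Under the null hypothesis the group labels are interchangeable, so shuffle
# them many times and count how often a difference at least as extreme as the
# observed one shows up by chance alone.
pooled = np.concatenate([treatment, control])
n = len(treatment)
n_perm = 100_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[:n].mean() - pooled[n:].mean()
    if abs(diff) >= abs(observed_diff):
        count += 1

p_value = count / n_perm  # P(data at least this extreme | null hypothesis is true)
print(f"observed difference = {observed_diff:.2f}, permutation p-value ~ {p_value:.3f}")
```

The result is a statement about the data under the assumption of the null hypothesis, not a statement about whether any hypothesis is true. That distinction is the one the rest of this post turns on.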
The P-value was developed by Ronald Fisher in the 1920s as a shortcut for determining when data should be taken seriously. It was intended to be a simple, understandable statistical measure, and unfortunately it worked too well. People, even scientists, crave simplicity, and so the P-value came to be used far beyond its intended role, and is now erroneously and simplistically seen as the one measure of whether or not a hypothesis is likely to be true.
It is like thinking that the megapixel count is the only thing you need to know to judge the quality of a digital camera, or that processor speed is the only number you need to know about a computer.
P-values range from 0 to 1; the lower the number, the less compatible the data are with the null hypothesis. Traditionally a P-value below 0.05 has been the threshold for determining that results are robust enough to be published. This threshold, of course, is entirely arbitrary.
The problem with P-values
The main problem with P-values is that people use them as a substitute for a thorough analysis of the overall scientific rigor of a study. If the P-value dips below the 0.05 level then people assume the hypothesis is likely to be true – but that is not what the P-value means.
In essence, it is common to turn the P-value on its head, interpreting it as meaning the probability that the hypothesis is true, rather than the probability of the data given the null hypothesis. Let me quote now from the ASA:
Well-reasoned statistical arguments contain much more than the value of a single number and whether that number exceeds an arbitrary threshold. The ASA statement is intended to steer research into a ‘post p<0.05 era.’
Further, an accompanying press release states:
Good statistical practice is an essential component of good scientific practice, the statement observes, and such practice “emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean.”
Those statements are actually a reasonable summary of the principles of science-based medicine.
Let me expand upon the need for “complete reporting.” The ASA is specifically recommending full transparent reporting by scientists of all statistical and data analyses that they performed in doing their study. This is essential, because the P-value assumes that the data being analyzed is the only data created in the study and that it was analyzed in only one way.
Looked at another way, every time researchers take another measurement, make a different comparison, make any decision about how to gather or analyze data, or include data from a prior observation, they are giving themselves another roll of the statistical dice. A P-value of 0.05 becomes entirely meaningless if everything the researchers did is not taken into consideration.
In 2011 Simmons, Nelson, and Simonsohn demonstrated this nicely with what they called "researcher degrees of freedom." They showed how easy it is to manufacture P-values <0.05 just by exploiting common choices researchers make, choices that are often not reported. This phenomenon has been called p-hacking.
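The effect is easy to reproduce in a quick simulation. The sketch below is a toy example, not the Simmons et al. protocol itself: it generates data in which the null hypothesis is true by construction, lets a hypothetical researcher measure five different outcomes, and counts a "positive" study whenever any one of them happens to cross P<0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_studies = 5_000   # simulated studies in which there is no real effect at all
n_per_group = 20
n_outcomes = 5      # the researcher measures five outcomes and reports the best one

false_positives = 0
for _ in range(n_studies):
    p_values = []
    for _ in range(n_outcomes):
        a = rng.normal(size=n_per_group)  # both groups drawn from the same distribution,
        b = rng.normal(size=n_per_group)  # so the null hypothesis is true by construction
        p_values.append(stats.ttest_ind(a, b).pvalue)
    # "Researcher degrees of freedom": pick whichever comparison looks best.
    if min(p_values) < 0.05:
        false_positives += 1

print("nominal false-positive rate: 5%")
print(f"actual rate when cherry-picking the best of {n_outcomes} outcomes: "
      f"{100 * false_positives / n_studies:.1f}%")
```

The nominal 5% false-positive rate inflates to roughly 20-25%, and that is only one degree of freedom; flexible stopping rules, outlier exclusions, and subgroup analyses compound the problem.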
P-hacking appears to be pervasive, and is partly to blame for the fact that many published results have a hard time being replicated. Exact replications tend to reduce degrees of freedom, because many of the choices have already been made, and so the p-hacked results vanish.
The ASA offers the following six principles as a fix to the current “P-value crisis”:
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
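The fifth principle is easy to demonstrate with another short simulation: a clinically trivial effect measured in an enormous sample will typically earn a very small P-value, while a much larger effect in a small sample may not reach significance at all, even though the latter is the more interesting finding. The numbers below are invented purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# A clinically trivial effect (0.03 standard deviations) in a huge sample...
big_a = rng.normal(0.00, 1.0, size=50_000)
big_b = rng.normal(0.03, 1.0, size=50_000)

# ...versus a substantial effect (0.8 standard deviations) in a small sample.
small_a = rng.normal(0.0, 1.0, size=15)
small_b = rng.normal(0.8, 1.0, size=15)

for label, a, b in [("trivial effect, n = 50,000 per group", big_a, big_b),
                    ("large effect, n = 15 per group", small_a, small_b)]:
    res = stats.ttest_ind(a, b)
    # Cohen's d: the difference in means in units of the pooled standard deviation.
    d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    print(f"{label}: p = {res.pvalue:.2g}, Cohen's d = {d:.2f}")
```

The P-value tracks sample size as much as it tracks the effect itself, which is exactly why it cannot stand in for effect size or importance.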
To get back to my digital camera analogy, it would be nice to also know how big the CCD is, the size and quality of the lens, whether or not it has a shutter, the capacity and speed of the memory, the features offered in the menu, and the overall quality of the pictures. Megapixels alone do not tell you much.
Likewise, when evaluating a scientific study you should look at the overall rigor of the design and execution: Was it properly blinded? Was the blinding measured and successful? What was the effect size compared to the variability (the signal-to-noise ratio)? Was the effect clinically significant and important? What was the dropout rate? Were the comparisons fair and reasonable? Were the statistics properly done? What was the power of the study?
But you are also just getting started – how plausible is the hypothesis, what did other similar studies find, have the results ever been independently replicated, and are there any systematic reviews?
For researchers and journal editors this also means requiring that more than just P-values are reported. How about effect sizes, and disclosure of every variable measured and every comparison considered? Additional statistical analyses, such as a Bayesian analysis, would also supplement the P-value and put the data into a much more thorough context.
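As a sketch of what that fuller reporting might look like, here is a hypothetical two-group comparison (made-up numbers) summarized with the mean difference and a 95% confidence interval alongside the P-value, rather than the P-value alone.

```python
import numpy as np
from scipy import stats

# Hypothetical outcome scores for two groups (made-up numbers for illustration).
treatment = np.array([12.1, 14.3, 13.8, 15.0, 14.1, 13.9, 14.8, 14.4, 13.2, 14.6])
control = np.array([12.0, 12.4, 13.2, 13.0, 12.6, 13.3, 12.7, 13.5, 12.9, 13.1])

diff = treatment.mean() - control.mean()
res = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test

# 95% confidence interval for the difference in means (Welch-Satterthwaite).
va = treatment.var(ddof=1) / len(treatment)
vb = control.var(ddof=1) / len(control)
se = np.sqrt(va + vb)
df = (va + vb) ** 2 / (va**2 / (len(treatment) - 1) + vb**2 / (len(control) - 1))
t_crit = stats.t.ppf(0.975, df)

print(f"p = {res.pvalue:.4f}, mean difference = {diff:.2f}, "
      f"95% CI = ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f})")
```

A reader of that one line knows not just whether an effect was detected, but how big it is and how precisely it was measured.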
Basically what the ASA and others are recommending is that scientists do what we do here at SBM – thoroughly analyze every aspect of a scientific claim to come up with the best conclusion we can about whether or not a claimed phenomenon is likely to be real. P-values are one tiny aspect of that analysis – a highly overrated aspect.
Some journals have gone so far as to ban the use of P-values in papers they publish. I agree with those who think this is unnecessarily draconian, but I certainly get their point. They want to force the issue, and perhaps they are right that such extreme measures are necessary to wake the scientific community out of their P-value coma.
Conclusion
The ASA’s bold action to put out their first position paper on a fundamental statistical method in their 177 years of existence may signal, finally, the end of the P-value era. I’m not holding my breath because deeply ingrained culture changes only slowly and with difficulty. It is an important step, however, and adds to a host of other steps in that direction.
Moving away from over-reliance on P-values is also in line with what we advocate at SBM. In fact, one of our major criticisms of the practice of evidence-based medicine (EBM) is that it relies too heavily on P-values in clinical studies, without (as the ASA advocates) putting the claims in a full scientific context.
This is precisely why we have been strong advocates of Bayesian analysis, which asks the correct question – what is the probability that the hypothesis is true given this new data?
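A back-of-the-envelope version of that calculation shows why prior plausibility matters so much. The sketch below treats a "significant" study simply as a yes/no event, assumes 80% power and a 5% false-positive rate, and uses made-up prior probabilities; it is an illustration of the logic, not an analysis of any particular study.

```python
def posterior_probability(prior, power=0.8, alpha=0.05):
    """P(hypothesis is true | the study came out 'significant'),
    treating a significant result as a simple yes/no event."""
    true_positive = prior * power          # hypothesis true and the study is positive
    false_positive = (1 - prior) * alpha   # hypothesis false, yet p < 0.05 by chance
    return true_positive / (true_positive + false_positive)

# Made-up priors: a reasonably plausible hypothesis vs. an extremely implausible one.
for label, prior in [("plausible hypothesis, prior = 0.5", 0.5),
                     ("highly implausible hypothesis, prior = 0.001", 0.001)]:
    print(f"{label}: P(true | p < 0.05) ~ {posterior_probability(prior):.3f}")
```

With a reasonable prior, a significant result is strong evidence; with a vanishingly small prior, the same P-value leaves the hypothesis almost certainly false. And this calculation is generous, because it ignores p-hacking and publication bias, which push the effective false-positive rate well above 5%.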
Relying on an EBM approach and P-values has led to absurd outcomes, such as recommending homeopathic treatments (or at least further study) based upon a few clinical studies. Given the extreme scientific implausibility of homeopathic potions, the most likely explanation for any P-value <0.05 is p-hacking, or simply publication bias. Only consistently positive studies with rigorous design, clinically significant results, and independent replication deserve our attention.
In medicine over-reliance on P-values has arguably led to the adoption of many treatments that in retrospect were worthless or even counterproductive. Where we set the threshold for adopting new treatments is a vital question for the medical profession, and we need to pay very close attention to how such decisions are made.
Putting the P-value in its place is a much needed correction to this process.