[Editor’s note: Scott is busy this week, so we bring you instead a post from past contributor Dr. David Weinberg.]
The topic of p-hacking has come up frequently in the recent writings on Science-Based Medicine. These posts are frequently postmortems of flawed studies, pointing out a variety of practices collectively referred to as p-hacking. The term is working its way into popular vocabulary. But what is p-hacking? Why is it bad?
P-hacking is a misuse of a statistical metric known as a p-value. P-values have a very specific meaning, but even experts have difficulty explaining the concept of p-values to laymen. In this article I will try to explain the methods and perils of p-hacking for those without a background in statistics.
Producing novel, impactful research is how investigators get published, get funded, get promoted, and make names for themselves in academia. It is also how medical entrepreneurs gather evidence and regulatory approval for their products. For better or worse, one of the factors that makes research more interesting and impactful is attaining findings that are “statistically significant.” Studies with statistically significant results are more likely to be submitted for publication and, if submitted, more likely to be accepted. Statistical significance is considered a proxy for scientific significance or clinical significance. If conducting research for regulatory approval (such as for the FDA), achieving statistically significant results is likely to be essential for success. The metric that is used as a yardstick for statistical significance is known as a “p-value.”
The obsession with p-values and statistical significance has come under scrutiny by some critics. I will not go into all the criticisms of traditional use of p-values and significance testing but, by following the link here, you can learn more. The pressure to create statistically significant research motivates some investigators to gently “massage” the data or analyses to transform a statistically insignificant result into a statistically significant one. The term “p-hacking” has been coined to collectively describe a variety of dubious practices that investigators may employ to achieve statistically significant results when statistical significance was not truly earned. These techniques violate sound statistical principles, increase false positive results, and exaggerate real positive results, a bias known as truth inflation.
P-hacking techniques may be employed naively by well-meaning investigators who believe they are polishing and presenting data in the best light. The motives may be innocent, but the consequence is the proliferation of false and exaggerated conclusions.
To understand p-hacking one needs a basic understanding of the null hypothesis, statistical testing, and p-values.
The null hypothesis
Medical research often looks for differences between variables, or changes in variables over time: do smokers have a greater risk of lung cancer than nonsmokers? Do diabetics treated with Drug A have lower blood sugars than those treated with Drug B? Do children raised in Springfield have higher IQs than children raised in Shelbyville? Under most circumstances the statistical tests exploring these differences start with the default assumption that there is no difference between the variables of interest. This assumption of no difference is known as “the null hypothesis.” The objective of research is then to collect data and run appropriate statistical analysis. If the results demonstrate a difference between the groups and are persuasive enough, the difference between the groups is declared to be “statistically significant.” The null hypothesis is rejected and differences among the variables are accepted.
Statistical testing and p-values
What findings are “persuasive enough” to reject the null hypothesis? This question is the essence of statistical testing. If we set the bar for persuasiveness too low, we risk rejecting the null hypothesis too easily, resulting in unjustified conclusions. We might incorrectly conclude that Springfield High students are smarter than Shelbyville students when they are truly the same. This is known as a false positive (or Type I) statistical error. If we set the bar for persuasiveness exceedingly high, we make it difficult to reject the null hypothesis when real differences exist (a false negative or Type II error). Although significance testing is employed as a safeguard against false positive results, p-values have a very specific meaning, and are widely misunderstood.
P-values are expressed as decimals on a scale from 0 to 1. The p-value is the likelihood that a particular or a more extreme result would be obtained IF the null hypothesis is true. Let’s pretend we run an experiment and measure a small difference between the IQs of high school seniors in Springfield and Shelbyville. We run an appropriate statistical test and find a p-value of 0.45. This means that IF students in Springfield and Shelbyville have equal IQs, we would expect to find this particular result or a more extreme difference 45% of the time. For most purposes this would be insufficient evidence for rejecting the null hypothesis and we would not be confident declaring that students in Springfield are smarter than those at Shelbyville. Smaller p-values indicate that a given result is less likely to occur if the null hypothesis is true. It is worth mentioning that failing to obtain a p-value low enough to reject the null hypothesis does not allow us to conclude that the null hypothesis is true. In other words, despite a high p-value, Springfield seniors might actually be smarter than their Shelbyville counterparts, but for whatever reason our tests did not confirm this.
There are many reasons that a study might fail to reject the null hypothesis. It could be the null hypothesis is true, but it could also be a poorly designed study, or insufficient sample size, or a very small but real difference, or just bad luck. The p-value does NOT tell you how likely it is that the null hypothesis is really false or that the alternative hypothesis is really true. In most cases, there is insufficient information to make a reliable calculation of those propositions.
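For readers who like to see the idea in code, here is a minimal sketch of what “this result or a more extreme one, IF the null hypothesis is true” means. It uses a permutation test, which is only one of several ways to obtain a p-value and is my choice for illustration. The IQ scores and the function name permutation_p_value are hypothetical; no real Springfield or Shelbyville data exist.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    If the null hypothesis is true, the group labels are interchangeable,
    so we shuffle the labels many times and count how often the shuffled
    difference is at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed - 1e-12:  # tolerance for float rounding
            at_least_as_extreme += 1
    # Adding 1 to both counts keeps the estimate from ever being exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical IQ scores, invented purely for illustration.
    springfield = [98, 104, 110, 95, 101, 107, 99, 112, 103, 96]
    shelbyville = [100, 97, 105, 93, 102, 99, 108, 94, 101, 98]
    print("estimated p-value:", permutation_p_value(springfield, shelbyville))
```

Note what the function does not return: it says nothing about how likely it is that the two towns really differ. It only reports how often a gap this large turns up when the labels are meaningless.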
Before performing an experiment researchers are obligated to define a threshold p-value. If the experiment yields results that differ sufficiently from the null hypothesis to generate a p-value at or below the preselected threshold, the null hypothesis is “rejected” and the deviation from the null is declared statistically significant.
The designation of the threshold p-value is somewhat arbitrary but depends on the researchers’ tolerance for false positive and false negative results. A more stringent (lower) threshold will decrease the likelihood of false positive results (i.e. finding a difference when there is not one), but will also increase the likelihood of false negative results (i.e. failing to find a difference that actually exists). For medical research the threshold p-value is almost universally P≤.05.
The p≤.05 standard
Selecting a threshold p-value is based on a variety of philosophical and practical considerations. Philosophically, we want to avoid false positive results, but there is a tradeoff. A very stringent threshold p-value decreases the chances of false positive findings, but effectively raises the hurdle to validate a truly positive finding. In other words it also decreases true positive results and increases false negative results. This can be overcome by designing larger, more powerful studies. Unfortunately, there are practical limitations to the funding and other resources for biomedical research, so larger studies are not always possible or practical. For better or for worse, a threshold p-value of .05 has become the de facto standard for much of medical research. The consequence of a threshold p-value of .05 is that in situations where the null hypothesis is true, the research will erroneously reject the null hypothesis in 5% (1 in 20) of studies.
A dicey metaphor
Let us explore the implications of a threshold p-value of .05. P-values can range from 0 to 1. We can divide this range into 20 increments like this:
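1. ≤ .05
2. > .05-.10
3. > .10-.15
4. > .15-.20
5. > .20-.25
6. > .25-.30
7. > .30-.35
8. > .35-.40
9. > .40-.45
10. > .45-.50
11. > .50-.55
12. > .55-.60
13. > .60-.65
14. > .65-.70
15. > .70-.75
16. > .75-.80
17. > .80-.85
18. > .85-.90
19. > .90-.95
20. > .95-1.0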
We assign each of these increments to one side of a 20-sided die, as depicted below.
If we are comparing 2 groups that are, in fact, equal (the null hypothesis is true), utilizing a threshold p-value of .05, every study is like a roll of that 20-sided die. One out of every 20 rolls of the die will land on the ≤.05 side, and we will erroneously reject the null hypothesis and declare the two groups to be different.
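The die metaphor can be checked with a small simulation. The sketch below runs thousands of imaginary studies in which the two groups really are identical, then tallies which of the 20 “faces” each p-value lands on. To keep it short I assume the spread of the measurements is known (standard deviation 15) and use a simple z-test; the group size, the number of studies, and the function names are all my own illustrative choices.

```python
import random
from statistics import NormalDist


def null_study_p_value(rng, n_per_group=50, sd=15.0):
    """Run one study in which both groups come from the same population."""
    group_a = [rng.gauss(100.0, sd) for _ in range(n_per_group)]
    group_b = [rng.gauss(100.0, sd) for _ in range(n_per_group)]
    difference = sum(group_a) / n_per_group - sum(group_b) / n_per_group
    # Assumption: the standard deviation is known, so a z-test applies.
    standard_error = sd * (2.0 / n_per_group) ** 0.5
    z = difference / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-sided p-value


def roll_the_die(n_studies=4000, seed=1):
    """Count how many null studies land on each of the 20 faces."""
    rng = random.Random(seed)
    faces = [0] * 20
    for _ in range(n_studies):
        p = null_study_p_value(rng)
        face = min(int(p / 0.05), 19)  # face 0 holds the smallest p-values
        faces[face] += 1
    return faces


if __name__ == "__main__":
    faces = roll_the_die()
    total = sum(faces)
    for index, count in enumerate(faces):
        print(f"face {index + 1:2d}: {count / total:.3f}")
```

Each face should collect roughly one twentieth of the studies, including the first one. That first face is the false positive rate: about 1 study in 20 looks “significant” even though nothing is going on.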
P-hacking in action
Let’s say I run a startup company and have a promising vaccine to prevent those afflicted by a zombie bite from being transformed into the walking dead. I design a clinical trial comparing Zombivax vs placebo. The results of this study will determine the success or demise of my company. At the end of the study 43% of the Zombivax group became zombies compared to 68% of the Placebo group. How confident can I be that the difference between the treatments is real? In the case of our study, the null hypothesis is that Zombivax and Placebo are equally effective (or ineffective) in preventing zombïism. If our numbers for Zombivax vs placebo achieve a P≤.05 we can then declare Zombivax superior to placebo.
We analyze our results and achieve a p-value of 0.09. This means that IF Zombivax and placebo are equal (the null hypothesis) and we were able to run our clinical trial over and over again, we could expect, by chance alone, our results (43% for Zombivax vs 68% for placebo) or a more extreme result 9% of the time. This result does not meet the traditional threshold of P≤.05, so using conventional standards, we would not be able to declare Zombivax more effective than placebo.
As CEO of the company that makes Zombivax, I am very disappointed that the clinical trial did not achieve statistical significance. I instruct my statisticians to go back and review the study design and analysis to see if any details were done “improperly” that might have resulted in the disappointing P=.09 result. They notice that some of the subjects in the Zombivax group missed one of 3 doses of the vaccine. If they omit those subjects from the analysis, the Zombivax group does a little better, now achieving a p-value of .07! Also, we suspect that some of the placebo-treated subjects may not have really been bitten. These subjects are omitted and the data reanalyzed. This changes the p-value to .11, so this analysis is abandoned.
Now the statisticians detect that the vaccine doesn’t seem to work as well in older subjects. If they limit the analysis to subjects 50 or younger, the results look much better, yielding a p-value of .04! As CEO I give my statisticians a bonus and issue a press release declaring Zombivax a medical breakthrough.
So what is wrong with exploring changes in the data and analyses to optimize the results? Once the data are known, there are many ways things can be adjusted and manipulated that will change the p-value. If one is so motivated it is possible to explore alternatives, accept those that move the results in a desirable direction and reject those that do not. This enables investigators to transform negative or borderline results into positive ones. This is the essence of p-hacking.
Using the 20-sided die as a metaphor, the clinical trial of Zombivax rolled the die. Unfortunately for our company, the die did not land on the ≤.05 side. It landed on the adjacent side for p-values between .05 and .10. What I instructed my statisticians to do is to kick, nudge, and tilt the table until the die rolls over to the desired result. If the die rolls in the wrong direction, they can just reset the die to the original roll and try something else. With enough motivation and creativity, it is likely that they can get the die to fall on the desired side and declare statistical significance.
If Zombivax was truly worthless, our clinical trial and subsequent p-hacking would be an example of a pure false positive result. If Zombivax was slightly effective, our p-hacking would be an example of “truth inflation,” transforming a small, statistically insignificant result into a larger, statistically significant one.
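Truth inflation can be illustrated with a simulation as well. In the sketch below a treatment has a small real effect (3 points on a hypothetical scale), and the same small study is repeated thousands of times. We then look only at the studies that reached p≤.05, as a reader of the published literature would. All of the numbers, the known-standard-deviation z-test, and the function name are assumptions made for illustration and have nothing to do with the Zombivax figures.

```python
import random


def simulate_truth_inflation(true_effect=3.0, sd=15.0, n_per_group=30,
                             n_studies=5000, seed=2):
    """Compare the average estimate in all studies vs significant ones only."""
    rng = random.Random(seed)
    # Assumption: sd is known, so |difference| > 1.96 * SE means p <= .05.
    standard_error = sd * (2.0 / n_per_group) ** 0.5
    critical_difference = 1.96 * standard_error
    all_estimates = []
    significant_estimates = []
    for _ in range(n_studies):
        treated = [rng.gauss(100.0 + true_effect, sd) for _ in range(n_per_group)]
        control = [rng.gauss(100.0, sd) for _ in range(n_per_group)]
        estimate = sum(treated) / n_per_group - sum(control) / n_per_group
        all_estimates.append(estimate)
        if abs(estimate) > critical_difference:
            significant_estimates.append(estimate)
    mean_all = sum(all_estimates) / len(all_estimates)
    mean_significant = (sum(significant_estimates) / len(significant_estimates)
                        if significant_estimates else float("nan"))
    share_significant = len(significant_estimates) / n_studies
    return mean_all, mean_significant, share_significant


if __name__ == "__main__":
    mean_all, mean_significant, share = simulate_truth_inflation()
    print(f"true effect:                     3.0")
    print(f"average estimate, all studies:   {mean_all:.2f}")
    print(f"average estimate, p<=.05 only:   {mean_significant:.2f}")
    print(f"share of studies reaching p<=.05: {share:.2f}")
```

Averaged over every study, the estimate sits close to the true effect. Averaged over only the “winners,” it is several times larger, because a small study can only cross the significance threshold when chance has exaggerated the difference.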
There are many options in the p-hacker’s toolkit; too many to mention in this article. I will discuss a couple of the more common ones.
Flexible sample size
When doing research it is traditional to pre-specify the sample size (e.g., number of patients, specimens, test runs, etc.) for the study. Under ideal settings this would be done based on existing clues about the behavior of the groups being compared, and through the use of power calculations to ensure that the planned study has a reasonable chance of finding a real positive, if one exists. Often sample sizes are based on more practical considerations, such as the number of subjects available for study, funding, etc. The p-hacker’s way to do it is to enroll a few patients, analyze the results, enroll a few more and repeat the analysis. This cycle is repeated until a statistically significant result is achieved. Then enrollment is halted. At first glance this seems like a very efficient way to do a study. Only the minimum number of patients needed to achieve statistical significance are enrolled.
Here’s the problem.
If you want to minimize false positives, you have to roll the die and accept the final lie. During the course of the roll, the die will inevitably roll over multiple sides before it ultimately comes to rest. By repeatedly enrolling and re-analyzing, it is as if we take intermittent snapshots of the die in motion. If we happen to catch the die with the ≤.05 side face up, the die is stopped mid-roll, and victory is declared. In order to avoid excess false positives you have to set the parameters in advance and accept the outcome of the roll.
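Here is a rough simulation of those snapshots. Every simulated study compares two identical groups, so any “significant” result is a false positive. The p-hacker checks the p-value after every batch of 10 subjects per group and would stop at the first p≤.05; the honest investigator looks only once, at the planned final sample size. The batch size, the number of looks, the known-standard-deviation z-test, and the function names are assumptions I chose for illustration.

```python
import random
from statistics import NormalDist


def z_test_p_value(sum_a, sum_b, n_per_group, sd):
    """Two-sided p-value for a difference in means, assuming sd is known."""
    standard_error = sd * (2.0 / n_per_group) ** 0.5
    z = (sum_a / n_per_group - sum_b / n_per_group) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def run_null_study(rng, batch_size=10, max_batches=20, sd=15.0, alpha=0.05):
    """Return (significant at any peek, significant at the planned end)."""
    sum_a = sum_b = 0.0
    n_per_group = 0
    significant_at_some_peek = False
    p = 1.0
    for _batch in range(max_batches):
        for _subject in range(batch_size):
            sum_a += rng.gauss(100.0, sd)  # both groups share one population
            sum_b += rng.gauss(100.0, sd)
        n_per_group += batch_size
        p = z_test_p_value(sum_a, sum_b, n_per_group, sd)
        if p <= alpha:
            significant_at_some_peek = True  # the p-hacker would stop here
    return significant_at_some_peek, p <= alpha


def false_positive_rates(n_studies=2000, seed=3):
    """False positive rate with peeking vs with a single planned analysis."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_studies):
        peeked, fixed = run_null_study(rng)
        peeking_hits += int(peeked)
        fixed_hits += int(fixed)
    return peeking_hits / n_studies, fixed_hits / n_studies


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"false positives with repeated peeking: {peeking:.3f}")
    print(f"false positives with one planned look: {fixed:.3f}")
```

The single planned analysis stays near the advertised 5%, while stopping at the first favorable snapshot produces false positives several times as often, and the rate keeps climbing as more looks are allowed.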
Other researcher degrees of freedom
There are many factors that can be tweaked to manipulate study outcomes and p-values. These have been called “researcher degrees of freedom.” An amusing but cautionary paper demonstrated that motivated manipulation of researcher degrees of freedom can dramatically alter research conclusions to such a degree that even absurd conclusions can be “proven” with statistical significance. Researchers make many decisions when they design a study. What kind of patients will be enrolled, in what age range, and how many? How long will they be followed, what parameters will be measured, and at what points in time? If some patients miss exams or doses of medicine, how will that be handled during data analysis? What statistical tests will be used? The list goes on and on. Ideally, these parameters will be defined before the study is begun. Any deviation from the predefined study plan would have to be disclosed and justified when presenting the study results.
Data dredging and HARKing
I can think of no better example of so-called data dredging than this gem from xkcd:
In the Great Jelly Bean study, authors report the shocking result that green jelly beans are linked to acne, complete with a statistically significant p-value. What they failed to disclose in their press release is that they ran analyses on 20 colors of jelly beans and obtained a “significant” p-value once. If one has a large enough database, and runs enough analyses, one is almost certain to stumble on a relationship that is statistically significant. It is just a matter of numbers. Rolling the 20-sided die over and over is bound to produce “statistically significant” results by chance alone.
There are legitimate ways to test multiple hypotheses, but they require more stringent p-values to declare statistical significance. Had the jelly bean authors disclosed the multitude of analyses they performed, their results would have earned a yawn, not a headline.
This is closely related to the practice known as HARKing (hypothesizing after results are known). In HARKing, investigators look at the data, run multiple analyses until they find something interesting (and probably statistically significant), then pretend that the results they found were what they had been looking for in the first place. If the Jelly Bean Study authors constructed a rationale that green jelly beans were uniquely suspected to cause acne, and reported their results as a confirmation of this hypothesis while conveniently neglecting to report the other 19 analyses, they would be guilty of HARKing.
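The jelly bean arithmetic is simple enough to work out directly. The sketch below computes the chance of at least one false positive when 20 independent tests are run at a threshold of .05, and then applies the Bonferroni correction, which is one common way (not the only one) of setting the more stringent threshold that multiple testing requires. Treating the 20 tests as independent is a simplifying assumption, and the function names are mine.

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests hits p <= alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false positive rate near alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    n_colors = 20
    uncorrected = chance_of_any_false_positive(n_colors)
    stricter = bonferroni_threshold(n_colors)
    corrected = chance_of_any_false_positive(n_colors, alpha=stricter)
    print(f"chance of a 'significant' color with no real effect: {uncorrected:.2f}")
    print(f"Bonferroni per-test threshold: {stricter:.4f}")
    print(f"chance of a false positive after correction: {corrected:.3f}")
```

With 20 colors and no real effect anywhere, the odds of finding at least one “significant” color are roughly 64%. Requiring each color to reach p≤.0025 instead brings the overall false positive rate back under 5%.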
Conclusion: The significance of insignificant results
The extent to which p-hacking can manufacture false positive results or exaggerate otherwise insignificant results is limited only by the p-hacker’s persistence and imagination. The results of p-hacking are much more consequential than simply padding an investigator’s resume or accelerating an academic promotion. Research resources are limited. There is not enough funding, laboratory space, investigator time, patients to participate in clinical trials, etc., to investigate every hypothesis. P-hacked data leads to the misappropriation of resources to follow leads that appear promising, but ultimately cannot be replicated by investigators doing responsible research and appropriate analysis.
Provocative, p-hacked data can be the “shiny thing” that gets undeserved attention from the public, the press, and Wall Street. Of even greater concern, compelling but dubiously-obtained results may be prematurely accepted into clinical practice. And within the CAM world, charlatans can use sloppy research to promote worthless and irrational treatments.
There is no clear solution to the problem of p-hacking. Better education of investigators could reduce some of the more innocent instances. Greater transparency in reporting research results would disclose potential p-hacking. Deviations from planned data gathering and analysis should be disclosed and justified. For clinical trials, registries such as clinicaltrials.gov and alltrials.net are intended to provide transparency in the conduct and reporting of clinical trials. Investigators are supposed to “register” their studies in advance, including critical features of study design and an analysis plan. If registries were used as intended, deviations between the registered and reported study details would be evident and would serve as a red flag for potential p-hacking. For my specialty, colleagues and I compared published studies to clinical trial registries. We found that registries are being underutilized in ways that greatly undermine their intended value. Sadly, my specialty is not unique in this regard.
Reducing or eliminating the reliance on p-values and the arbitrary dichotomy of statistically significant or insignificant results has been proposed by the American Statistical Association. Some journals have gone as far as banning p-values and significance testing in papers they publish.
Greater understanding of p-hacking among investigators, journals, peer-reviewers, and consumers of scientific literature will promote more responsible research methodologies and analyses.