In 2015 Brian Nosek and 269 co-authors published their attempt to replicate 100 psychological studies from high-impact journals. They reported that only 39 of the 100 attempts were successful, which means that 61% of the original results could not be replicated.
These results kicked off what has become known as the “Replication Crisis” in science. Only three years later, however, a more nuanced picture has emerged. As our own David Gorski has also noted, there is more of a problem than a crisis.
I think this reflects the position we often find ourselves taking here at SBM. We are somewhere between the two extremes. At one end are those who are naive or true-believers, who will cite one study as if it establishes a claim. At the other end are those who would dismiss the findings of science as hopelessly flawed.
The reality is somewhere in between. Science is hard and requires rigorous methods and critical questioning. It is also highly flawed. But the results can be meaningful, and progress does grind forward. What we need is a high standard of quality and thoughtful, thorough analysis of results. When applied to medicine, that is SBM in a nutshell.
Background on the Replication Crisis
Replication is the cornerstone of quality control in science, and so failure to replicate studies is definitely a concern. How big a problem is replication, and what can and should be done about it?
As a technical point, there is a difference between the terms “replication” and “reproduction,” although I often see the terms used interchangeably (and I probably have myself). Results are said to be reproducible if you analyze the same data again and get the same results. Results are replicable when you repeat the study to obtain fresh data and get the same results.
There are also different kinds of replication. An exact replication, as the name implies, is an effort to exactly repeat the original study in every detail. But scientists acknowledge that “exact” replications are always approximate. There are always going to be slight differences in the materials used and the methodology.
So, what is really meant by “exact” replication is that there are no differences that should reasonably affect the outcome.
There are also conceptual replications, in which different methods are used to look for the same basic phenomenon.
Regardless of where a replication is on the spectrum from exact to conceptual, there are significant technical challenges. Whenever any reagents or substances are used in research, they have to be obtained from a source, and there may be meaningful differences in the quality or exact nature of the material. Whenever animals are used, it is possible that there are differences in the genetics or health of the animals. There are reports of suppliers providing the wrong animal lines, invalidating the work of any researchers who relied upon them. The same is true of cell lines for basic science research.
In short, there are many things that can technically go wrong. Some research also involves great technical skill, and failure to replicate may simply reflect the lack of technicians with the proper skill or experience.
Further, papers do not always provide enough detail in the methods section for someone else to follow cookbook style. They should, but they often don’t.
When evaluating the results of replications, the outcomes are also not binary (success or failure). Results may still be positive, but with a smaller effect size. This is a common outcome, so common it has a name – the “decline effect.” Or only part of the results may replicate, or only under certain conditions.
The more the original results replicate, however, the more “robust” they are considered to be. This gets to how much they are likely to be generalizable to the real world.
Research on replications
The 2015 Nosek study caused quite a stir that still resonates in the public consciousness. However, a later reanalysis by Gilbert et al. found fatal flaws in Nosek’s methodology. For example, if you separate the replications into high fidelity (studies that accurately reproduced the original study, as judged by the original authors) and low fidelity, Gilbert found that the low fidelity studies had four times the failure rate of the high fidelity studies. Looking at all the data, he concluded that the Nosek study actually showed that when you don’t do replications correctly you don’t get the same results, but when you do replications correctly you mostly do get the same results.
Gilbert also found significant sampling error in the choice of studies to be replicated. Finally, he found that the authors did not properly correct for statistical variation. In other words, if you replicate 100 true positive studies, some of the replications will be negative by chance alone. What you really need to know is – do the studies examined fail to replicate at a higher rate than chance would predict? Gilbert and his coauthors found that when you properly examine Nosek’s evidence, the answer is no. The results were consistent, they conclude, with all 100 original studies being true positives. (You cannot draw that conclusion, but it is consistent with the data, within statistical noise.)
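To make that statistical point concrete, here is a minimal simulation sketch. It assumes (hypothetically) that every original study is a true positive and that each replication has 80% statistical power – i.e., an 80% chance of reaching significance on fresh data even though the effect is real. The power figure is an illustrative assumption, not a number from Gilbert’s reanalysis.

```python
import random

random.seed(42)  # fixed seed so the simulation is itself reproducible

def simulate_replications(n_studies=100, power=0.8, trials=10_000):
    """Simulate replicating n_studies true-positive findings.

    Each replication succeeds with probability `power` (the chance a
    real effect reaches significance in a new sample). Returns the
    average number of *failed* replications per batch of n_studies.
    """
    total_failures = 0
    for _ in range(trials):
        # A replication fails by chance whenever the draw exceeds power.
        failures = sum(1 for _ in range(n_studies) if random.random() > power)
        total_failures += failures
    return total_failures / trials

avg = simulate_replications()
print(f"Average failed replications out of 100 true positives: {avg:.1f}")
```

Under these assumptions, roughly 20 of 100 perfectly real findings fail to replicate by chance alone, which is why a raw failure count cannot be read directly as a rate of false positives.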
The “replication crisis” bell, however, cannot be unrung. This is not necessarily a bad thing, if it brings attention to the replication “problem” that is real.
How have other replication efforts fared (or – how replicable are replication studies)? One of the more famous replication efforts is The Reproducibility Project: Cancer Biology. David Gorski has been following the results here.
This is an effort to replicate (despite the name) basic science cancer studies to see how replicable they are. The first batch of five studies showed that two replicated, one did not, and the other two could not be interpreted because of technical failures to pull off the replications. So – a 40% success rate, but really 2/3 of the studies that could be completed.
Now a second batch of replication studies has been published:
“Overall, he adds, independent labs have now “reproduced substantial aspects” of the original experiments in four of five replication efforts that have produced clear results.”
That’s good, but notice the caveats. The studies replicated “substantial aspects” of the original experiments, but not necessarily all. Also, this excludes studies that could not produce clear results.
But still, that is pretty good. I think this is consistent with the overall impression shared by David and me – the replication problem is not as bad as the sensational reporting has suggested. But it is still a legitimate issue that needs to be addressed.
The concerns are also not limited to psychology or biomedical research. When surveyed, scientists in every specialty report that they sometimes fail to replicate the work of their colleagues. A recent study attempting to replicate economics research found that only 49% of the studies could be replicated, even with help from the original authors.
It should also be noted that most replication assessments focus on high impact journals. The problem is presumably worse if you sample studies from progressively less prestigious scientific journals, but that has not been directly examined.
The way forward
There are some obvious solutions that will at least improve the situation and which have broad support, but will still be challenging to implement, largely because of cultural and institutional inertia.
First, we need to do and publish more replications. Some researchers have suggested that internal replications should simply be built into research methodology.
In other words – a research lab should follow up their preliminary results with an exact replication of their own before attempting to publish. This would weed out many false positives before they ever contaminate the literature. Some labs already do this, but it should become standard, which means that journals should require it.
There should also be more academic focus on exact replications, meaning that researchers should be encouraged to do such research, and be given grants, credit, and promotions based on such research. Instead there is an acknowledged focus on new and sexy research (the exact kind that is least likely to be replicable).
It is also interesting to think about a specialty in experimental replications. In other words, imagine a lab of researchers with high technical skill, and the knowledge and expertise necessary to execute extremely rigorous research. Their primary focus could be identifying important published results, then doing exact replications to see if the results are real and robust. Such labs might be a “rite of passage” for any research results before they are considered reliable.
This kind of thing already exists in some contexts. For example, pharmaceutical companies have labs (or contract with private labs) whose job is to replicate academic studies to see if they are real before the company invests in trying to develop a new drug based on that research. The financial stakes are high for the company, so it is cost effective for them to invest in exact replications.
There are, of course, downsides to all such recommendations. Doing all these replications takes resources (there is a cost to everything). Further, doing an expert replication requires not only generic expertise in research methodology, but specific expertise in the often narrow topic of interest. As I stated above, sometimes research requires specific technical skills and knowledge, and only researchers dedicated to a narrow area of research might have those skills.
In the end there is no perfect solution. But what we want to consider is – where is the sweet spot of trade-offs to maximize the return on our research investment? It seems that the balance is currently shifted toward false positives and innovation, with insufficient priority given to replication and confirmation.
Perhaps we just need to tweak this balance to get more in the sweet spot and optimize scientific advance. It’s a great conversation to have, and represents the true strength of science. No other human endeavor spends so much time wringing its hands about its own methodology and validity.