In 1975 economist Charles Goodhart observed, “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” This basic idea has had many formulations, such as by Marilyn Strathern who stated the principle as, “When a measure becomes a target, it ceases to be a good measure.”
This is a subset of the more general principle of unintended consequences, sometimes called the “cobra effect”. During British colonial rule in India, the authorities instituted a bounty on cobras to reduce the number of venomous snakes in the streets. This worked at first, but then locals started breeding cobras to turn in for the bounty. When the authorities figured this out, they ended the bounty. Most of the captive cobras were then released, resulting in a net increase in loose cobras.
It has been known for some time that the journal impact factor (JIF), which is used as the primary measure of a scientific journal’s academic value, has been distorting the publication ecosystem. In 2016 Mario Biagioli wrote an editorial pointing out the problem of “cheats in the citation game”. He argued:
It is no longer enough for scientists to publish their work. The work must be seen to have an influential shelf life. This drive for impact places the academic paper at the centre of a web of metrics — typically, where it is published and how many times it is cited — and a good score on these metrics becomes a goal that scientists and publishers are willing to cheat for.
The result is that researchers are gaming the system, just like people game every system that has any value. For example, a researcher may pad the reference section of one paper with self-citations to increase the impact of articles they have previously published. This sort of post-publication cheating, as Biagioli discusses, does not compromise the science, but it does reduce the integrity of the publication system.
However, there are distorting effects that do affect the quality and utility of science that gets published. Journals are more likely to publish articles that they think will be highly cited in order to improve their own JIF. This means they are biased toward new and interesting findings or those which seem to contradict conventional wisdom (the scientific equivalent of man bites dog). But these are also precisely the kinds of studies that are most likely to be erroneous in their ultimate conclusions, and even retracted. There is also a bias against publishing replications, because these are considered boring, even though they play a critical role in the legitimacy of science.
Misconduct, in turn, is the greatest cause of papers being retracted. (About two thirds of retractions are due to misconduct, and most of those are due to fraud.) This implies that the JIF system not only creates an incentive to focus on research likely to get published in high impact factor journals (rather than research that is inherently interesting or useful) but also provides an incentive for fraud. Obviously there are other incentives in academia that contribute to this, but the pressure not only to publish but also to generate citations is part of the picture.
So what is the solution? A commentary published recently in Nature proposes, if not a solution, at least an approach to developing one. It’s interesting that the authors point to the historical emergence of JIF as the sole measure of the academic value of a journal:
The Journal Citation Reports, presenting the JIF and other journal indicators, were conceived in 1975 as a summary of journals’ citation activity in the Science Citation Index (now owned by Clarivate Analytics in Philadelphia, Pennsylvania). It was specifically intended to support librarians who wanted to evaluate their collections and researchers who wished to choose appropriate publication venues, as well as to provide insights for scholars, policymakers and research evaluators. Its inventors never expected the broad use and rampant misuse that developed.
This reminded me of the p-value. As I have pointed out here before, the p-value was never intended to be the one measure of whether or not a studied effect is likely to be real, but it became that. The result was p-hacking, gaming the system in order to have a positive study, which is easier to get published in a high impact factor journal.
The authors of the Nature commentary correctly, in my opinion, point out that a core of the problem is overreliance on a single metric. Relying on a single metric is tempting Goodhart’s law. It is simply too easy to game. Therefore, it is better to rely on multiple metrics. At least then it is harder to game, and the unintended consequences will be more diffuse and may balance out somewhat.
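To make that last point concrete, here is a toy sketch of why a composite is harder to move than a single number. Everything in it is an assumption for illustration: the indicator names, the 0 to 1 scale, and the unweighted average are hypothetical and are not taken from the Nature commentary.

```python
# Hypothetical journal indicators, each assumed to be normalized to 0..1.
# The names are illustrative only; they do not come from the commentary.
honest = {
    "citation_impact": 0.40,
    "correction_responsiveness": 0.50,
    "review_rigor": 0.60,
    "archiving": 0.50,
}

def single_metric_score(indicators):
    """Score a journal on citation impact alone."""
    return indicators["citation_impact"]

def composite_score(indicators):
    """Score a journal as the unweighted mean of all its indicators."""
    return sum(indicators.values()) / len(indicators)

def game_citations(indicators, boost):
    """Return a copy in which only the citation indicator is inflated (capped at 1.0)."""
    gamed = dict(indicators)
    gamed["citation_impact"] = min(1.0, gamed["citation_impact"] + boost)
    return gamed

gamed = game_citations(honest, boost=0.40)

# How much does the same amount of citation gaming move each score?
single_gain = single_metric_score(gamed) - single_metric_score(honest)
composite_gain = composite_score(gamed) - composite_score(honest)

print(f"single-metric gain: {single_gain:.2f}")
print(f"composite gain:     {composite_gain:.2f}")
```

In this toy setup the same citation padding moves the single-metric score four times as much as the composite, simply because it touches only one of four indicators. A real system would of course need weights, and gamers would eventually target the other indicators too.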
I like that they started by asking some basic questions, like what scientific journals are even for:
We delineated the key functions of journals, which remain largely unchanged since their inception more than 350 years ago. These are to register claims to original work, to curate the research record (including issuing corrections and retractions), to organize critical review and to disseminate and archive scholarship.
So how are these various goals best achieved? The metrics used to evaluate the academic quality of journals should reflect all of these goals. Impact factor alone doesn’t.
They also outline the features of good metrics. For example, they should be relevant, reproducible, contextualized, justified, and informed. So, they objectively measure something real and important about the quality of a journal, they are objective enough to be reproducible, and people know how to use the metrics and how not to cheat.
But perhaps most importantly – the system of metrics must be adaptable. This is because, no matter what metrics you use, people will try to game them. You need to measure the metrics, and then make adjustments to avoid abuse.
This is essentially what Google does, and why it has remained an industry leader in search engines. There is also an entire industry of “search engine optimization” (SEO). Optimization consists of methods to game Google’s ranking algorithms – its metrics of the relative value of websites. So Google has a team of people whose job it is to make SEO not work.
The Nature authors propose a “governing body” of stakeholders whose task is to institute a system of journal metrics, and then monitor those metrics and make adjustments to minimize abuse. The governing body would make the metrics adaptable.
I think this is a great idea, and I put it alongside the same proposal for statistical analysis in articles. Get rid of the one metric to rule them all – demote JIF and p-values from their perches as sole and overly powerful metrics. Replace them with systems of metrics that are more thorough in their analysis, more properly reflect what we are trying to measure, and minimize hacking of the systems.
This is a basic concept that needs to be incorporated into the science of science itself.