Recently a study was published in The BMJ that purports to show a link between ambient pesticides (essentially mothers living near a farm that uses pesticides) and the risk of autism. This is feeding the anti-glyphosate frenzy that is currently in vogue, but it’s actually a good example of how slippery research methods result in dubious findings that get completely misreported.
Meanwhile, a commentary was also recently published in Nature essentially arguing that the concept of statistical significance should be abandoned. I discussed the paper at length on my other blog, but here is the quick version.
The primary method for determining significance is the p-value – a measure of the probability of obtaining results that deviate as much as they do, or more, from a null result, assuming the null hypothesis is true. This is not the same as the probability that the hypothesis is false, but it is often treated that way. Also, studies often assign a cutoff for “significance” (usually a p-value of 0.05); if the p-value is equal to or less than the cutoff the results are considered significant, and if not the study is considered negative.
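If you want to see what that definition means in practice, here is a quick simulation (my own sketch in Python, purely illustrative, not from either paper): when the null hypothesis really is true, p-values land all over the place between 0 and 1, and about 5% of studies will still sneak under the 0.05 cutoff by chance alone.

```python
# Illustration only: simulate many studies in which the null hypothesis
# is true, and see how often the 0.05 cutoff is crossed by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_group = 10_000, 50

p_values = []
for _ in range(n_studies):
    # Both groups drawn from the SAME distribution, so any difference is noise.
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
print(f"Null studies with p < 0.05: {np.mean(p_values < 0.05):.1%}")  # ~5%
```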
There are multiple problems with the way in which the p-value and significance have come to be used in modern science and especially the reporting of science to the public. First, the p-value was never intended to be the one measure of how interesting scientific results are. A low p-value does not prove the hypothesis, and can occur even if the hypothesis being tested is false. Using a cutoff for significance is also a false dichotomy – in reality p-values are a continuum. There are many other aspects of the data that are as or more important than p-values, such as effect sizes, reproducibility, confidence intervals, and a Bayesian analysis of probability.
Relying on an arbitrary cutoff for significance also encourages p-hacking – exploiting researcher degrees of freedom to nudge the results over the finish line. P-values themselves are not very reproducible. If you do the identical study multiple times, you will get a range of p-values, so which is the correct one? In the end, p-values alone don’t tell us very much about the reality of the hypothesis.
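Here is what that jumpiness looks like in a quick simulation (again my own sketch, not anything from the Nature commentary): the exact same study of a real but modest effect, repeated 1,000 times, produces p-values ranging from clearly “significant” to nowhere close.

```python
# Sketch: the identical study of a real but modest effect, repeated 1,000
# times. The effect size (0.4 SD) and sample size are assumptions for
# illustration, not values from any real study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, effect, runs = 50, 0.4, 1_000

p_values = np.array([
    stats.ttest_ind(
        rng.normal(0.0, 1.0, n_per_group),
        rng.normal(effect, 1.0, n_per_group),
    ).pvalue
    for _ in range(runs)
])

print(f"smallest p: {p_values.min():.5f}   largest p: {p_values.max():.3f}")
print(f"'significant' at 0.05 in {np.mean(p_values < 0.05):.0%} of runs")
```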
In response to increasing awareness of the many problems with p-values and the overreliance on arbitrary “significance,” several fixes have been proposed. Some journals have simply banned p-values. They will not publish studies that rely on them, and encourage the use of more thorough statistical analysis instead. Others have proposed lowering the cutoff for significance, from 0.05 to 0.005 for example, in order to at least reduce false positives (although this would also increase false negatives).
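To see that trade-off, here is a rough sketch (simulated numbers with assumed sample and effect sizes, not from any real study): with a modest true effect and typical study sizes, dropping the cutoff from 0.05 to 0.005 slashes the false positive rate but also misses many more real effects.

```python
# Sketch of the trade-off: a stricter cutoff (0.005) reduces false positives
# but misses more real effects. Sample size and effect size are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_per_group, effect, runs = 50, 0.4, 5_000

def simulated_pvalues(true_effect):
    """p-values from repeated two-group studies with the given true effect."""
    return np.array([
        stats.ttest_ind(
            rng.normal(0.0, 1.0, n_per_group),
            rng.normal(true_effect, 1.0, n_per_group),
        ).pvalue
        for _ in range(runs)
    ])

null_p, real_p = simulated_pvalues(0.0), simulated_pvalues(effect)
for cutoff in (0.05, 0.005):
    print(f"cutoff {cutoff}: "
          f"false positives {np.mean(null_p < cutoff):.1%}, "
          f"missed real effects {np.mean(real_p >= cutoff):.1%}")
```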
This new commentary proposes yet another solution – keep the p-value, but get rid of an arbitrary cutoff for statistical significance. In its place, add a more thorough statistical analysis.
What does all of this wonky statistical arguing mean for the average lay person who just wants to know if eating chocolate will help them lose weight (it won’t)? The big take-away is – do not equate statistical significance with the hypothesis being true. Realize that most such studies are ultimately wrong. Wait for an expert to put the results into context. Think about things like plausibility – do the results even make sense?
With all this in mind let’s get back to the pesticide and autism study. What they did was gather data from registries on pesticide use and autism diagnoses, and then ran some fancy shmancy statistical analysis (if you must know, a multivariable logistic regression analysis). They found:
Risk of autism spectrum disorder was associated with prenatal exposure to glyphosate (odds ratio 1.16, 95% confidence interval 1.06 to 1.27), chlorpyrifos (1.13, 1.05 to 1.23), diazinon (1.11, 1.01 to 1.21), malathion (1.11, 1.01 to 1.22), avermectin (1.12, 1.04 to 1.22), and permethrin (1.10, 1.01 to 1.20). For autism spectrum disorder with intellectual disability, estimated odds ratios were higher (by about 30%) for prenatal exposure to glyphosate (1.33, 1.05 to 1.69), chlorpyrifos (1.27, 1.04 to 1.56), diazinon (1.41, 1.15 to 1.73), permethrin (1.46, 1.20 to 1.78), methyl bromide (1.33, 1.07 to 1.64), and myclobutanil (1.32, 1.09 to 1.60); exposure in the first year of life increased the odds for the disorder with comorbid intellectual disability by up to 50% for some pesticide substances.
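For anyone wondering where a number like “odds ratio 1.16, 95% confidence interval 1.06 to 1.27” comes from, here is a toy logistic regression on made-up data (my own sketch using the statsmodels library, nothing to do with the study’s actual dataset): the odds ratio is just the exponentiated regression coefficient on the exposure, and the confidence interval is the exponentiated interval around that coefficient.

```python
# Toy example only: how an odds ratio and its 95% CI fall out of a
# logistic regression. The data are simulated; the exposure effect
# (a log odds ratio of 0.15) is an arbitrary assumption.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50_000

exposed = rng.binomial(1, 0.3, n)              # hypothetical exposure indicator
covariate = rng.normal(0.0, 1.0, n)            # stand-in for a measured confounder
log_odds = -3.0 + 0.15 * exposed + 0.2 * covariate
outcome = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

X = sm.add_constant(np.column_stack([exposed, covariate]))
fit = sm.Logit(outcome, X).fit(disp=0)

odds_ratio = np.exp(fit.params[1])             # exponentiated exposure coefficient
ci_low, ci_high = np.exp(fit.conf_int()[1])    # exponentiated 95% CI
print(f"odds ratio {odds_ratio:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```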
Mainstream media reports boiled all this down to – glyphosate causes autism. But there are numerous caveats with this study, and they add up to the fact that we cannot conclude much from this data. The first thing I noticed was that the odds ratios were all pretty close to 1, so the effect sizes here are small. Given that this is a population-based observational study, that means there are many potential confounding factors. The authors try to control for the obvious ones they can think of, but it’s impossible to think of them all. Also, the smaller the effect size, the more subtle the confounding factor needed to account for it, and the harder such a factor is to detect and control for.
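To make the confounding point concrete, here is a sketch with invented numbers (again my own illustration, not the study’s data): a hidden factor that modestly raises both the chance of exposure and the chance of the outcome will, by itself, generate an odds ratio in the same 1.1 to 1.3 ballpark even though the exposure truly does nothing.

```python
# Sketch: a spurious odds ratio produced by an unmeasured confounder.
# The exposure has NO real effect here; the association comes entirely
# from a hidden factor U that raises both exposure and outcome rates.
# All of the probabilities below are invented for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200_000

u = rng.binomial(1, 0.3, n)                    # hidden confounder
exposed = rng.binomial(1, 0.2 + 0.2 * u)       # U makes exposure more likely
outcome = rng.binomial(1, 0.01 + 0.01 * u)     # U makes the outcome more likely

X = sm.add_constant(exposed.astype(float))
fit = sm.Logit(outcome, X).fit(disp=0)
print(f"apparent odds ratio for exposure: {np.exp(fit.params[1]):.2f} "
      "(the true odds ratio is exactly 1.00)")
```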
However – there is a bigger problem, which brings the statistical results themselves into question. Looking at the data, another question quickly came to mind – how many different comparisons did they actually make? Well, the first response to the study published by The BMJ had the answer. John Tucker, PhD, noted:
In a textbook example of multiple hypothesis testing, the authors examined the effects of estimated exposure to 11 different pesticides during 3 different developmental periods against two different adverse developmental outcomes. From among the 66 evaluated endpoints, they conclude that prenatal exposure to 6 of these pesticides is associated with 10-20% increases in risk of autism disorder, and that prenatal exposure to a partially overlapping list (3 of 6) is associated with autism disorder with intellectual disability.
This is one of the “researcher degrees of freedom” that Simmons et al. warned about in their seminal paper. This is also extremely common in scientific papers – researchers look at lots of data from many different angles until they find something interesting, and then publish that. This is OK if you are just generating hypotheses in a preliminary study, but the data actually mean nothing until you confirm the results with fresh data. In fact, there is increasingly a call for researchers to more routinely do just that. Don’t bother publishing until you have done an “internal replication”. Otherwise, we just flood the scientific literature with false positives from mining data for any coincidental correlations.
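The back-of-the-envelope arithmetic (mine, not the paper’s) shows why this matters: 66 comparisons each tested at the 0.05 level gives roughly a 97% chance of at least one “significant” result, and an expectation of about three, even if none of the pesticides has any real effect.

```python
# Back-of-the-envelope: 66 comparisons, each tested at the 0.05 level,
# assuming (roughly) independent tests and no real effects at all.
alpha, n_tests = 0.05, 66
print(f"chance of at least one false positive: {1 - (1 - alpha) ** n_tests:.0%}")  # ~97%
print(f"expected number of false positives:    {alpha * n_tests:.1f}")             # ~3.3
```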
So in the end we have a paper that used multiple comparisons to find very small effect sizes – findings that are highly likely to be spurious and unlikely to replicate. But it doesn’t matter, because the ideologues have already seized upon this paper as vindication for the evils of glyphosate and even GMOs (even though this has nothing to do with GMOs). Explaining why the results are likely not reliable mostly causes people’s eyes to glaze over. (People have told me that as soon as I mention p-values they tune out.)
Clearly these statistical issues need to be sorted out by scientists and statisticians, not by the public at large or even the media. At least the problem is recognized and being discussed, and potential solutions are being offered. The institutions of science need to take a thorough look at the whole issue of p-values and statistical significance and how they are used and reported in research.