Does a recent study demonstrate that being kind to yourself has benefits for your mental and physical health? The authors think so. I see another thinly disguised null trial evaluating meditation with the usual confirmation bias.
Maybe we are left being kind to ourselves because it feels nice to do so, because there is currently no research showing substantial benefits for our health. Probing this study, we can learn a lot about research on meditation, and how hard it is to conduct a decent study. We can think about the pressures on authors to report benefits for meditating and to keep looking when these benefits are not found at first glance. I end with some suggestions for doing simpler, better studies of meditation, though these studies would carry the risk of discovering that meditation is not all that it is cracked up to be.
What if the effects of practicing standard meditation cannot be distinguished from something deliberately constructed as “sham meditation”, akin to sham acupuncture? What if effects cannot be distinguished from asking people to stop for a moment to think, or for them to exhale and make their breathing more rhythmic for just a bit? The huge program of meditation research is broken. Progress is not being made in showing meditating is anything more than a ritualized placebo.
You would never know the scientific status of meditation from the way it is portrayed in media. Although some cracks are developing in the façade, the media is almost uniformly positive in extolling the benefits of meditation for our mental and physical health, our close relationships, and our work performance. There is a strong message that we are not doing all that we can do to be all we can be unless we are meditating. Yet, the evidence for practicing meditation, rather than doing something else to get the same benefits, is weak.
How the study of meditation is broken
The mother of all reviews of mindfulness, a comprehensive systematic review and meta-analysis prepared for the US Agency for Healthcare Research and Quality (AHRQ), identified 18,753 citations for preliminary review, but only 47 trials (3%) of them with a total of 3,515 participants included an active control treatment.
The review found:
Low evidence of no effect or insufficient evidence of any effect of meditation programs on positive mood, attention, substance use, eating habits, sleep, and weight.
No evidence that meditation programs were better than any active treatment (ie, drugs, exercise, and other behavioral therapies).
The authors of another review included a leading proponent and researcher of mindfulness meditation, Richard Davidson. The review addressed “Is mindfulness research methodology improving over time?” The authors created a 6-item checklist for evaluating trials. Items included whether there was ample sample size to detect moderate treatment effects if they are present, whether there was an active control group, and whether all patients who had enrolled in the trial were included in analyses (the gold standard intent-to-treat analyses).
Across the 142 studies published between 2000 and 2016, there was no evidence for increases in any study quality indicator, although changes were generally in the direction of improved quality.
The typical randomized trial of meditation remains a small, unblinded study without an adequate, active treatment comparison/control group. Almost all studies depend on subjective self-report to assess outcomes. A trial with these design elements would likely be unable to detect differences between a treatment with established efficacy and homeopathy or sham acupuncture. How do we account for the excess of statistically significance findings reported in the peer-reviewed literature, particularly when you take into account the small size and methodological quality of the typical study?
The choice of comparison group stacks the deck almost to guarantee positive results, if there is a large enough sample. There is also confirmation bias, as analyses and papers that obtain positive results are more likely to get published. There is also liberal use of what are euphemistically called investigator degrees of freedom in picking particular outcome variables and analytic strategies from among many to ensure that there are positive results to be reported, even in those studies with inadequate sample size.
The new study
The recent study of the benefits of practicing self-compassion meditation was designed with the intention of fixing a number of the problems in a typical study of meditation. Ultimately, it fell victim to the authors’ grand ambitions. The study ended up a disguised null trial, not telling us very much – except demonstrating once again the strong influence of confirmation bias and the confidence that the physical and mental health benefits of meditation have already been established.
We can learn a lot delving into the details of how the illusion of a positive study was constructed, good enough to get through peer review.
It is also a tale of how difficult it is to conduct an adequate trial of meditation, given existing methods and tools, and how easy it is for authors to convince themselves that they have positive findings, when they rely on the acceptable flexibility in conducting and reporting a trial of meditation that is so common in the literature.
But first, we need to appreciate a pervasive pattern in the meditation literature, namely:
- Authors typically start with an unrealistic portrayal of what we already know about benefits of meditation and the strength of evidence for the mechanisms by which these benefits are achieved.
- The authors then conduct a study that doesn’t really extend this evidence, and usually does not even provide a critical test. The authors nonetheless make claims that the study succeeds on both counts, and rely on commonly used ways of making weak or null findings look strong.
- Spurred on by hyped press releases, the media continues its strong confirmation bias for the benefits of meditation, citing what is portrayed as new corroborative evidence.
The pressure remains on the authors of the next trial to show again that mediation “works” and the ways by which we know it works.
Investigators of meditation do not necessarily consciously distort their findings to make meditation appear effective, although some promoters of meditation may be prone to do just that. Existing ways of conducting and interpreting a study are biased towards finding a positive effect. If investigators do not find benefits, they are convinced that they have done something wrong by what they compare with what is already in the literature. They can then get it right by trying another analysis or adopting another interpretation. Maybe even reviewers of their manuscript will coax them to try something different.
Readers need to appreciate how self-evident certain assumptions are held to be by the meditation community about the nature of stress processes and how meditation works to relieve stress. Whether because of conviction or as a marketing strategy, proponents of meditation take these assumptions to be no longer be in need of evidence. Research is largely a matter of demonstrating these assumptions, not testing them in a way that they can be disconfirmed. Not confirming these assumptions is taken to be a failure of the researcher.
The peer-reviewed article
The study’s authors included Willem Kuyken, an accomplished Oxford University researcher who is one of the leading proponents of mindfulness in the UK. The peer-reviewed paper stated:
The aim of this study was to investigate the mechanisms whereby self-compassion confers benefits, using a novel experimental paradigm employing carefully designed experimental and control manipulations and psychophysiological measures complementing self-report.
The press release for the study that echoed through the media explained:
Our study is helping us understand the mechanism of how being kind to yourself when things go wrong could be beneficial in psychological treatments. By switching off our threat response, we boost our immune systems and give ourselves the best chance of healing.
The press release explained some seemingly exciting results:
The findings suggest being kind to oneself switches off the threat response and puts the body in a state of safety and relaxation that is important for regeneration and healing.
The researchers said the threat system comprises increased heart rate and sweating, release of the stress hormone cortisol and over-activity of the amygdala, an integral part of the brain’s emotional network. And a persistent threat response can impair the immune system.
Perhaps, but this study did not assess the immune system, regeneration, cortisol, the amygdala, or the brain’s emotional network.
The abstract described what seems to be exceptionally well-designed study.
Self-compassion and its cultivation in psychological interventions are associated with improved mental health and wellbeing. However, the underlying processes for this are not well understood. We randomly assigned 135 participants to study the effect of two short-term self-compassion exercises on self-reported-state mood and psychophysiological responses compared to three control conditions of negative (rumination), neutral, and positive (excitement) valence. Increased self-reported-state self-compassion, affiliative affect, and decreased self-criticism were found after both self-compassion exercises and the positive-excitement condition. However, a psychophysiological response pattern of reduced arousal (reduced heart rate and skin conductance) and increased parasympathetic activation (increased heart rate variability) were unique to the self-compassion conditions. This pattern is associated with effective emotion regulation in times of adversity. As predicted, rumination triggered the opposite pattern across self-report and physiological responses. Furthermore, we found partial evidence that physiological arousal reduction and parasympathetic activation precede the experience of feeling safe and connected.
Seemingly a big study
The sample size of 135 was considerably larger than the typical study of the effects of meditation.
But a hint at what is ahead: what really matters is not the size of the study, but the size of the individual groups being compared.
Seeming better controlled than most meditation studies
It is exceptional for a study to have not one, but two active meditation groups, practicing different types of meditation exercise, along with not one, but three comparison/control groups.
This study compares five groups, dividing up 135 patients into equal sized groups. Maybe this is at first glance impressive, but why are there so many groups and what research question will guide comparing them?
Two kinds of self-compassion exercises
- In the CBS [compassionate body scan], participants were guided to direct kind and compassionate attention to their body sensations, starting from the top of the head and going down to the feet.
- In the LKM-S [loving-kindness meditation for the self] condition, participants were first guided to bring to mind a person they felt a natural sense of warmth toward and to direct friendly wishes toward this person. After this, participants were invited to offer the same friendly wishes toward themselves.
Inductions for these practices, along with those comparison/control groups, were delivered in 11 minutes of taped instruction.
How different are these two meditation exercises? Put aside the different sounding names. In the first instance, participants are directed to pay kind and compassionate attention to their body sensations. In the second, they are directed to offer friendly wishes to themselves. Would the recipients of one versus the other set of instructions actually be doing something radically different? Could we be confident that self-report of subjective mental states of meditators would reflect those differences?
Three comparison/control groups
In the self-critical rumination condition, participants were asked to dwell on something they felt they had not managed or achieved as they would have wanted to.
In the positive excitement condition, participants were asked to think about certain aspects of a positive event or situation in which they were working through or achieving something great.
In the control condition, participants were guided through a routine supermarket shopping scenario.
Control groups should control for something. The rationale behind having a control group is that treatment and control groups are identical in every respect except as to whether they receive the treatment.
The ideal blinded, randomized controlled pill-placebo trial controls for the ritual and support that goes along with being provided the active treatment in the context of a clinical trial. The patients being given a placebo also get the same nonspecific factors as the patients getting the active medication. Neither they nor their provider knows that they are being given only an inert pill.
What exactly is controlled in having these three comparison/control groups?
Capturing the subjective experience of self-compassion meditation exercises
Seven visual analog scales or prompts were used to measure the effects of the inductions on the participants’ mood state. Here is an example.
The total score on the 26-item Self-Compassion Scale (SC) was used to assess self-reported self-compassion. Two forms of self-criticism were assessed with the Forms of Self-Criticizing/Attacking & Self-Reassuring Scale (FSCRS).
The Self-Compassion Scale (SC) sounds like something quite different than the Criticizing/Attacking & Self-Reassuring. But are they too closely opposite, so that one is the flip of the other? Some items on the Self-Compassion Scale (SC) are worded in a positive direction, with more self-compassion indicated by a higher score.
I am kind to myself when I am experiencing suffering.
Other items are worded in a negative direction, so that a lower score indicates more self-compassion.
When I think about my inadequacies it tends to make me feel more separate and cut off from the rest of the world.
One does not need formal psychometric training to detect that something could go wrong here.
Other researchers have separated out the negative-word items on the SC scale, and found the items, not surprisingly, correlated moderately to strongly with negative emotion, depressive symptoms, perceived stress, as well as to rumination and neuroticism. A large psychometric study concluded results “do not justify the common use of the SCS total score as an overall indicator of self-compassion, and provide support for the idea…that it is important to make a distinction between self-compassion and self-criticism”.
A meditation exercise is supposed to induces a psychological state, which participants report in the outcome measure. What if the participant is merely interpreting the induction as instructions on how to fill out the questionnaire, as an experimental demand? The transparent wording of the items on the two questionnaires and seven analogue scales makes it obvious how well-behaving participants should answer.
The state of the art and science of whether meditative states have been achieved is not good.
Psychophysiological measures, not just assessments of the subjective impressions of participants
This study supplemented the usual subjective self-report questionnaires with psychophysiological measures. Presumably the psychophysiological measures would be less susceptible to bias than the self-report measures and could get at the biological mechanisms that underlie self-reported subjective states.
The psychophysiological measures were heart rate, high-frequency heart rate variability (HRV), and galvanic skin response (GSR).
Let’s have no illusions. The psychophysiological measures are not biomarkers of participants being in a meditative state, they are potentially modest correlates. Researchers are not yet in a position to dispute a study participant’s report that they are or are not in some meditative state. Subjective self-report remains the gold standard for determining whether a person is in a meditative state. It is worrisome for the validity of these measures that studies find people who are asked to indicate whether they are meditating can be confused to provide similar reports for sham versus standard meditation.
Grounding self-report in the biology tapped by the psychophysiological variables is a matter of relating two sets of variables that are each messy and subject to extraneous influences. The psychophysiological variables have their own problems, with skewed distributions and large variation within and between people. The measurements are supposed to be immediately registering the effects of 11 minutes of instruction, but are also responsive to all the noise of complex things that are going on in and around each participant, including what they brought to a brief experimental session.
From a much larger literature correlating self-report with biological variables, I would estimate the noise in each would only allow only a correlation of .20 or a shared variance of 4%. That small figure diminishes any expectation that these trajectories of measures assessed over the 11 minutes could strongly reflect pre-post differences in self-report. Mediational analyses intended to determine the biological mechanism behind effects of meditation depend on partitioning these correlations. I would not expect impressive results.
Digression: I remember when I was studying hot flashes in men receiving hormone therapy for prostate cancer. I was frustrated that subjective self-report of discomfort was so susceptible to placebo effects, so I bought a fancy way of measuring thermal skin conductance, quantifying how much the men were sweating. The results were a lot more complicated and harder to interpret than I bargained for and I lost my enthusiasm for biological measures saving me from the messiness of self-report.
This study has an exceptionally complex design for a meditation study, with its two different meditation exercises, three comparison/control groups, and both subjective self-report and objectives psychophysiological data. Before we wade into the details, it be best to summarize the point I will be making:
The authors resolved a tough decision as to how best to analyze their data with a simple solution that wiped out the main benefit of having a control group. They examined whether each of the two meditation groups were statistically different on a measure administered before the induction versus the same measure given after the induction. They examined all nine outcome measures. They also did these analyses for the three comparison control groups for the same nine outcomes. So, 5 x 9 = 45 analyses. It is mind-boggling to consider all the permutations and combinations of meditation and control group and outcome measure. The authors considered their expectations confirmed if there was statistical significance for one of the meditation exercises, but not for a control group, or vice versa.
What these authors did is different from what is typically done. The typical strategy of statistical analysis for a clinical trial evaluating meditation is to conduct an overall repeated measure analysis of variance to see if the meditation group and a control comparison differed over time, pre- to post- treatment assessment. If there is more than a pair of groups being compared, the investigators can do an overall test if there are any differences and then test a hypothesis concerning which pair differs over time. Authors need to be careful not to fall into a fishing expedition, capitalizing on chance with repeated tests.
In this trial there is no clear indication why the two meditation groups would be different from each other or which comparison/control group – self-critical rumination condition, positive excitement condition, or shopping condition – would provide the best comparison. The authors started by testing if a group-by-time interaction effect existed for a given outcome variable. If significant, this analysis would indicate at least one significant difference among the groups over time was lurking somewhere. Fine, but…I would guess that maybe the self-critical rumination condition would be the best candidate to be different from the rest. But that induction did not seem particularly powerful, none of the three inductions were. Anyway, I am not sure what it would mean if self-critical rumination condition were the only group differing from the others over time.
The authors resolved the complexity by testing all seven groups to see if participants in the group had a significant difference from pre- to post-assessment on any of the nine self-report measures. They then looked to see if there was a difference within each of the two mediation groups and within each of the three control groups.The authors would consider their hypotheses confirmed if there was a pattern like “the pre-/post- differences for the meditation group were significant, but those for a control group were not”.
The psychophysiological measures were continuously monitored, so something different had to be done than for the pre-/post- self-report measures, but the logic making within-group comparisons was the same.
A novel statistical technique, latent growth curve modelling, was used to examine whether there were differences in the trajectories of psychophysiological response over an 11-minute session associated with the five groups.
Results of analyses for the psychophysiological measures for the five groups were discussed in terms of whether or not there were patterns of significant differences within groups for psychophysiological measures. The authors considered their hypotheses confirmed if a particular meditation technique that had a significant effect on self-report measures also had a significant change in a psychophysiological measure.
The authors also conducted some correlational mediation analyses to determine if changes in the psychophysiological measures mediated changes in the self-report measures. Basically, they were examining whether the psychophysiological measures might be the biological basis of any observed changes in self-report measures by seeing if the change in the psychophysiological measure was correlated with the change in self-report for a particular group.
Stacking the deck
The usual meditation randomized trial is designed so even homeopathy would appear effective. Could these authors overcome that problem? Typical studies compare one group of participants getting meditation to one poorly matched control group of participants either remaining on a wait list, or getting nothing. That comparison is stacked towards finding benefits for meditation.
How so? People enroll in meditation studies with the hope they will get training in meditation that they otherwise would not get outside of the clinical trial. Half of the people enrolling will typically be assigned to meditation, but half of the people will get put on a waiting list or simply get no treatment instead. People obviously know to which group they have been assigned. Treatment assignment cannot be not blinded. Those assigned to the control group are presumably disappointed.
Not only do the people assigned to getting meditation get meditation, they also get their positive expectations met, and they get a lot more support and attention than the control group. So, in an important sense, the typical meditation trial compares participants getting a combination of meditation plus nonspecific (i.e., placebo) factors, to getting nothing but maybe being disappointed.
Typical studies rely solely on subjective self-report questionnaires as measures of outcome. These measures are susceptible to the bias arising in an unblinded comparison of an active treatment and unmatched control group. Such a study design compares an active treatment to a mismatched control groups with subjective self-report outcome measures and would make even inert or homeopathic remedies look effective. Here is a striking demonstration.
In an important sense, randomized trials do not evaluate pre-post differences in the treatments. Rather, trials evaluate whether there are differences in pre-post outcomes, i.e. differences over time between the treatment and comparison/control group. Within-group differences for a particular treatment exaggerate its effect because it combines the active treatment and nonspecific effects.
You can now see the bias of the authors in the study conducting within-group analyses and trying to make sense of whether these differences were statistically significant within the meditation group, but not the control group.
The authors are evaluating meditation plus nonspecific factors versus nothing except perhaps a disappointing experience. They are doing that many times.
By focusing on within-group differences, the authors avoided confrontation with null findings, but any positive findings are likely being inflated and due to chance. Their sample size was too small and their multiple analyses capitalized on chance. Recall that the authors had nine self-report outcomes variables: three psychophysiological being compared in two meditation conditions and three control conditions.
The groups were too small to expect detection of even moderately-sized effects
Earlier I suggested that the impressive number of 135 participants was less important than numbers in the size comparison cells, which was 27 participants. The authors are banking on being able to detect the key differences in which they are interested with comparisons of two groups of 27 participants, not 135.
The authors state that they conducted a formal power analysis to determine the number of participants required to detect a statistically significant difference in outcome, if a difference were indeed present. But the formal power analysis is not presented in the paper or supplementary materials.
Having only 27 participants in a group is far less than the 50 needed to have 70% probability of detecting a moderate-size (d = .50) effect. Across studies evaluating psychological interventions, a modest effect size d = .30 is more reasonable, which leaves a sample of 27 with less than a 50/50 chance of detecting a difference. Most findings that would be significant in a larger sample would be missed. Because of the higher threshold needed to achieve statistical significance with n=27, most statistically significant differences will prove exaggerated or even false when replication is attempted. The mean outcome scores for a treatment group of only 27 are also unstable, subject to sampling noise. Dropping or rescoring one participant could substantially change the mean for the group, resulting in a gain or loss of statistical significance.
I am skeptical whether there is any difference more than d = 0.1 between the two highly similar meditation exercises. The benefits of evaluating the two in a single trial would require hundreds of participants to demonstrate. What possible clinical significance would there be to any such differences?
The interpretability of correlations with 27 participants per group is even more dubious. Analyses depending on a likely correlation of self-report of ~.20 between self-report and psychophysiological measures would require 167 participants per group. Differences between correlations would require even more participants. Furthermore, changes in correlation due to the data of one or two participants could fluctuate greatly. It would take 100s of participants for the correlation to stabilize, far more than the largest of trials of meditation.
We should not be chasing statistical significance in this study anyway
It is never a good idea to determine the outcome of a clinical trial by comparing whether within-group differences are significant in treatments versus controls. Two conditions could differ in whether they achieved statistically significant differences (p< .05) versus the null hypothesis of zero effect without differing from each other. The problem is particularly severe in small trials where a large effect, usually obtained by chance, has achieved statistical significance.
A finding of statistical significance (p<.05) is meaningless if there are so many comparisons without a specific hypothesis. With the amount of noise in play in this study, the probability is high that random fluctuation will be misclassified as a meaningful result. Especially with small groups, within-group differences contaminated with placebo effects, ineffective control groups, and multiple subjective self-report measures that are susceptible to participants wanting to give the investigators what they seem to be asking for in the instructions for the inductions.
But more generally:
P-values have taken quite a beating lately. These widely used and commonly misapplied statistics have been blamed for giving a veneer of legitimacy to dodgy study results, encouraging bad research practices and promoting false-positive study results.
Fortunately, the American Statistical Association has provided a great service by recently issuing an Official Statement on Statistical Significance and P-Values, along with some commentaries by experts in its great press releases with great quotes.
The p-value was never intended to be a substitute for scientific reasoning.
Over time it appears the p-value has become a gatekeeper for whether work is publishable, at least in some fields. This apparent editorial bias leads to the ‘file-drawer effect,’ in which research with statistically significant outcomes are much more likely to get published, while other work that might well be just as important scientifically is never seen in print.
Why I will not give a serious look at the pattern of significant results claimed by the authors
Readers can check the authors’ presentation and discussion of results in the open access article.
Given multiple tests to choose from and the role of chance, the authors obtain both plausible and implausible results. But plausible does not equal robust and generalizable. Furthermore, what is plausible is made so by invoking received assumptions about the stress process and how meditation can reduce its effect on mental and physical health outcomes. Such assumptions are not put to a strong test in this study, but can be used to filter the noisy results today into a seemingly strong positive study and a confirmation of these assumptions, but that is one illusion giving credibility to another.
But the exercise can have an amazing effect on a reader, persuading even one who has been armed with appropriate skepticism. Go ahead, give a look, and be surprised at how you get taken into accepting “marginally significant” when to say “null” would spoil the good story being told.
That is how the appearance of progress is created in social media about a huge meditation literature that really is not going anywhere fast.
What could have been done differently?
We are again reminded that underpowered studies of meditation can be mined for significant results and published. It is a recipe for success that is so common and formulaic that many authors follow it without realizing anything is wrong. We need a lot fewer randomized trials evaluating meditation and more trials that are adequately powered, have appropriate control groups, a suitable primary outcome, and features like intent-to-treat analyses to reduce bias.
If they are interested in furthering the understanding of meditation and risking being proven wrong, the authors would be much better off testing a simple hypothesis with a simpler design and more participants. Given that we have not established that meditation is more than nonspecific factors, how about examining whether one of these self-compassion exercises has stronger effects than simply having participants sit quietly with instructions to relax for 11 minutes?
Then there is the possibility of testing self-compassion against a sham meditation. One study compared the widely used Headscape app for meditation with such a sham. Participants were told they were in a meditation study, but whereas half of them received the usual Headscape protocol, in the other condition the protocol provided a sham meditation. Each session began by inviting the participants to sit with their eyes closed. These exercises were referred to as meditation but participants were not given guidance on how to control their awareness of their body or breath. Participants in the two conditions were indistinguishable across a variety of self-report measures.
Alternatively, participants could be instructed to simply exhale and let the natural rhythm of breathing take over, paying attention to it. I would not be surprised if both simple instructions were no different in the effects on the physiological measures used in this study from the self-compassion exercises. Investigators could set some a priori criteria for rejecting the likelihood of a clinically significant difference.
An inability to demonstrate a clinically significant difference between self-compassion meditation exercises and one of these active control groups would be disappointing to promoters of meditation, but think how liberating such results could be for everybody else. We don’t need weekend retreats; we don’t need lessons nor daily practice. We can simply stop and think for a moment or maybe exhale…