WE WILL HAVE doubts about that outcome: total victory. Dogma-based controversies tend to persist. Besides, each meta-analysis is an experiment, and experiments involve choices.
Glass’s approach, including every trial ever reported, sounds uncontroversial. The ideal in research is to respect the contribution of each subject who enters any study. Count everyone.
But even Glass had applied judgment. For example, he found so many trials testing treatments for snake phobia that this specialized therapy threatened to dominate his results. He omitted some of the data sets.
Glass chose not to correct other similar problems. Because psychology departments are housed in colleges, and because college students are curious and in need of petty cash, the easiest studies to perform are of undergraduates made to experience and then recover from unease. When you embrace every experiment, your conclusion, that psychotherapy works, may mean only that counseling helps venturesome young people with artificially induced problems. What has that finding to do with treatment of depression, anorexia, and the rest? The “count everyone” approach puts meta-analysis at the mercy of what happens to be in the literature. If selecting studies is suspect, using a complete set of trials can be, too.
In 1983, admirers of Hans Eysenck published their own meta-analysis. Psychologists at Wesleyan University found that in Glass’s collection only thirty-two studies contrasted conventional psychotherapy to a placebo condition. Taking results from those trials and using methods slightly different from Glass’s, the researchers calculated an effect size of 0.15—small impact—and the benefit came largely from studies involving recruited subjects. With real psychiatric outpatients, the effect size was basically zero.
When they employed methods closer to Glass’s, the Wesleyan group found an effect size of 0.42. These results, effect sizes of 0.85, 0.42, and 0.15 (or zero), cover the range. Either psychotherapy is astonishingly effective, or as good as many treatments doctors use, or a complete bust.
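How can the same collection of trials yield figures as far apart as 0.85 and 0.15? Part of the answer is that “effect size” here means a standardized mean difference, and there is more than one way to standardize. A minimal sketch, with invented trial numbers: Glass’s own measure (sometimes called Glass’s delta) divides the difference between group means by the control group’s standard deviation, while the now-common Cohen’s d divides by a pooled standard deviation. The same trial can give two answers.

```python
# Two common ways to standardize one trial's result.
# All numbers below are invented, purely for illustration.

def glass_delta(mean_t, mean_c, sd_control):
    # Glass's delta: difference in means over the CONTROL group's SD.
    return (mean_t - mean_c) / sd_control

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    # Cohen's d: difference in means over a POOLED SD from both groups.
    pooled_sd = (((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                 / (n_t + n_c - 2)) ** 0.5
    return (mean_t - mean_c) / pooled_sd

# Hypothetical trial: therapy group improves 6 points, placebo 3 points;
# therapy group is more variable (SD 9) than the placebo group (SD 5).
print(round(glass_delta(6.0, 3.0, 5.0), 2))            # divides by 5
print(round(cohens_d(6.0, 3.0, 9.0, 5.0, 30, 30), 2))  # divides by ~7.3
```

With these made-up figures, the two conventions report roughly 0.6 versus 0.4 for the identical data, which is one concrete way that teams “using methods slightly different from Glass’s” can land on different totals before any dispute over which studies to admit.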
As for Eysenck, meta-analysis left him unimpressed. He wrote, “A good review is based on intimate personal knowledge of the field, the participants, the problems that arise, the reputation of different laboratories, the likely trustworthiness of individual scientists, and other partly subjective but extremely relevant considerations.”
To Eysenck, most overviews amounted to an abuse of data.
Glass had conducted one analysis that resembled the Wesleyan group’s. He examined trials with three arms, contrasting psychotherapy, medication, and placebo. These studies enroll patients with real disorders—no imipramine for healthy undergrads. Also, research can suffer from “allegiance bias”: psychotherapy trials conducted by interested parties show inflated results. (Where experimenters had an allegiance to the therapy under study, Glass found an effect size close to one—a lot.) In three-armed trials, pharmacologists’ preferences balance therapists’.
Treating serious patients in more neutral studies, psychotherapy showed an effect size of 0.3, low to medium. Medication yielded an effect size just over 0.5, medium, and Glass argued for cutting that figure to 0.4. For antidepressants such as imipramine, effect sizes ran between 0.4 and 0.5—typical for treatments in clinical medicine. How bad was therapy’s 0.3? Glass considered psychotherapy to be “scarcely any less effective than drug therapy in the treatment of serious psychological disorders.” For Glass, effect sizes between 0.3 and 0.5 in imperfect trials signaled useful treatments.
As we shall see later, that range encompasses most of the recent estimates of antidepressant efficacy, with psychotherapy perhaps running slightly lower. Critics of medical model psychiatry complain that antidepressants were hyped from early on. But Glass’s meta-analyses, at the birth of the method, look to be on target.
As for how much faith we should put in meta-analysis, the debate over Glass’s work might make us wonder. What does it mean that a collection of studies can give rise to three efficacy estimates: high, medium, and low?
Let’s think about what meta-analysis tries to do. The ideal in outcome research is a randomized trial large enough to settle an issue definitively, what’s called a gold standard trial. One thousand has come to be accepted as the minimum number of patients for gold standard designation. Many efforts are larger. When researchers worried about the effects of hormone replacement in menopause, they ran controlled trials involving between ten thousand and sixteen thousand women. High-enrollment programs are convincing but expensive. There are no gold standard trials in psychiatry—and not enough in the rest of medicine either. Meta-analysis purports to achieve similar certainty by combining small trials.
By the early 1990s, meta-analysis had moved into the medical mainstream. In one influential effort, from 1992, researchers examined the use of streptokinase, a medicine that dissolves blood clots, in the wake of heart attacks. They found that if statisticians had performed a cumulative meta-analysis each time a new small trial entered the literature, they would have confirmed the effectiveness of the intervention before gold standard trials on the topic were conducted—and saved lives by speeding the acceptance of streptokinase use.
This inquiry was less about levels of efficacy—How much?—than statistical significance. The amalgamated data gave an early indication that streptokinase improved survival. The studies under analysis were of two sorts: large trials, involving five hundred or seven hundred patients, and small trials in which the effects of treatment showed up consistently, time after time. Large trials and trials with consistent results give clear signals in meta-analyses—but, of course, even without a special statistical technique doctors might find that accumulation of results convincing.
Is there value in combining results from small trials with conflicting findings? And can meta-analysis answer How much?
In 1997, doctors from the University of Montreal, writing in The New England Journal of Medicine, contrasted meta-analysis and the gold standard. The team identified a dozen large, randomized trials, each testing at least a thousand patients, where the same question—Should the treatment be used?—had been answered in prior meta-analyses, forty in all. The topics were diverse: streptokinase, but also magnesium in the treatment of heart attacks, chemotherapies for breast cancer, and so on. The outcomes measured were straightforward. The commonest was death.
The news from Montreal was discouraging. When a gold standard trial showed that a treatment worked, a third of the time the prior meta-analyses had failed to find efficacy. When the gold standard trial found insufficient or negative evidence, a third of the time “the meta-analysis would have led to the adoption of an ineffective treatment.”
Meta-analyses lead doctors in the right direction four times in six. Coin flips would lead them in the right direction three times in six. But then, doctors can do better than fifty-fifty, as when they make note of areas where many small studies and a large one point in the same direction, areas where meta-analyses only confirm the obvious.
In the Montreal study, what meta-analysis proved worst at was what it was designed for, measuring the magnitude of effects. The Canadian researchers came down where Hans Eysenck had. They looked favorably on the suggestion that doctors read trials individually and exercise judgment.
Why can’t meta-analysis perform alchemy, turning a heterogeneous mixture (of data) into gold? There are many reasons, some highly technical, but one that is easy to understand concerns what is effectively a loss of randomization.
Meta-analyses are studies of controlled trials. Each trial arrives at the triage desk and is accepted or rejected. And most meta-analyses are performed after the fact. Experts who know the research literature propose entry criteria, admitting this sort of study and not that one, and then run the numbers. Like psychotherapy trials, meta-analyses display allegiance bias. The outcomes favor views held by authors who have a professional or financial stake in the result. Researchers found this pattern in competing meta-analyses on straightforward questions such as whether formaldehyde exposure causes leukemia.
Even where everyone’s hands are on the table, meta-analysis can prove tricky. A recent example involves surgical safety checklists. Reformers had hoped that if before, during, and after an operation, nurses catechized the surgical team on issues such as whether the incision site was properly marked, then patients would fare better. A celebrated meta-analysis of twenty-two small trials—it led to widespread adoption of the operating-room discipline—found that checklists reduced complications and deaths dramatically, by 40 percent. But a subsequent study of 200,000 surgical operations in Ontario, half performed before and half after the implementation of checklists, found no benefit: no fewer deaths, no fewer return emergency-room visits, and no fewer surgical complications.
Here, probably the problem was what gets published. Hospitals whose innovations fail bury the result. Successful efforts get written up. Meta-analysis is at the mercy of what makes its way into the literature.
Public health advocates had celebrated the early meta-analysis, and the disappointing large-scale research tested their devotion to evidence. Their tendency was to defend checklists, based on a detailed examination of the virtues of this or that program—again, the close reading of small studies. In truth, medicine relies on judgment: doctors have a good sense of what works, they find confirmation in well-performed research, and they feel justified in discounting even large-scale studies, based on possible flaws. In Ontario, was the training adequate? Public health experts still support checklists, although with acknowledgment that they are unlikely to offer the impressive benefits first reported. On the How much? question, the meta-analysis had been misleading.
Few discussions of the antidepressant controversy acknowledge this truth: In psychiatry, we lack large trials, and meta-analyses are imperfect substitutes. The particular shortcomings are significant. The antidepressant controversy is about How much? On the simple question of whether antidepressants outperform placebos, the meta-analyses are in agreement: Medications work. The debate is over the magnitude of the effect size.
This dispute has a fantastical aspect. Meta-analysis does not constitute sublime guidance. Even in neutral hands, it provides only suggestive findings. Often, it amounts to argumentation.
Since meta-analyses are what we have, in this book I will refer to them repeatedly. But often I will try to supplement their findings through attention to the constituent trials. That’s where the fun is, and the promise of wisdom as well, in the quirks of carefully conducted small-scale research.
Still, whatever the limitations of Glass’s invention, its influence would be hard to overstate. In mental health care, when experts argue over the worth of treatments, almost always it is meta-analysis that drives the debate.