Ordinarily Well: The Case for Antidepressants


No Myth

DISCUSSING THE INSPIRATION for this book, I mentioned Nora, whose life was restored when she began taking antidepressants and who entertained doubts about their worth. In my practice, encounters of this sort became common in the wake of Kirsch’s “Emperor” article. But the other trend that concerned me, doctors’ dismissal of antidepressants, gained momentum only later. I date that change to 2008, when a justly influential study appeared in The New England Journal of Medicine.

Efficacy was not the main topic. The lead author, Erick Turner, from Oregon Health and Science University, would later express concern that his contribution had been misinterpreted as a blanket rejection of antidepressants. But Turner had analyzed the FDA data, and like Kirsch, he had found modest drug effects.

Turner was reviewing publication practices. Considering research submitted to the FDA, he found that medical journals had given space only to reports of studies that validated antidepressants. Results from unfavorable trials either did not find their way into print or were presented in combination with other research, in ways that might obscure the disappointing outcomes. The exceptions involved Prozac and a form of Paxil, for which full data had long been available because of prior protests about the manufacturers, Eli Lilly and GlaxoSmithKline.

Some trials, Turner understood, might have been so flawed that they did not merit publication. Also unclear was how journal editors ought to have proceeded; preliminary results showing modest efficacy for an experimental drug, one that might never make it to market, had little news value. Still, no one doubted that the pharmaceutical industry was inclined to conceal unfavorable information. Surely, for every drug that did gain approval, doctors and the public should have had access to all the data.

New research rules were promulgated in 2007, before the Turner exposé. Drug companies were required to register trials in advance and post results to a public database. But Turner’s paper reinforced the movement for transparency.

For our purposes, the important figure in Turner’s paper was an overall effect size, 0.31 (signaling a modest impact), that matched Kirsch’s calculation. Turner took pains to avoid misinterpretation: each antidepressant remained demonstrably superior to placebo. Discussing whether antidepressants were worth taking, Turner cited the quality-of-life research—patients whose depressive symptoms linger still enjoy improved well-being. And he rejected the NICE standard, saying that it was “doubtful that [Jacob Cohen] would have endorsed NICE’s use of an effect size of 0.5 as a litmus test for drug efficacy.”
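For readers who want to see where a number like 0.31 comes from, here is a minimal sketch of the standardized mean difference (Cohen's d) that Kirsch and Turner report. The improvement scores, standard deviations, and sample sizes below are invented for illustration; they are not drawn from the FDA data set.

```python
# Illustrative only: computing Cohen's d for a hypothetical
# drug-versus-placebo trial. All numbers are made up for the example.
import math

def cohens_d(mean_drug, mean_placebo, sd_drug, sd_placebo, n_drug, n_placebo):
    """Standardized mean difference, using the pooled standard deviation."""
    pooled_sd = math.sqrt(
        ((n_drug - 1) * sd_drug**2 + (n_placebo - 1) * sd_placebo**2)
        / (n_drug + n_placebo - 2)
    )
    return (mean_drug - mean_placebo) / pooled_sd

# Suppose the drug group improves 10.0 points on the Hamilton scale and
# the placebo group 7.5 points, both with a standard deviation of 8.1,
# 150 patients per arm:
d = cohens_d(10.0, 7.5, 8.1, 8.1, 150, 150)
print(round(d, 2))  # about 0.31 -- "modest" in Cohen's terms
```

The point of the sketch: an effect size in this range means the average drug-treated patient improves only a fraction of a standard deviation more than the average placebo-treated patient, which is why a single summary figure like 0.31 can look unimpressive even when every drug beats placebo.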

Disclaimers notwithstanding, because of that 0.31 figure, Turner’s paper lent support to the view that antidepressants have little to offer. But his analysis, too, had been dominated by drugs that had accumulated disappointing trials, and the FDA’s standards guaranteed that there would be many.

I have never considered the FDA data to be a good source of information about drug efficacy. In “candidate-drug trials,” as the research on proposed new medicines is called, contact with raters is extensive and supportive, so placebo response rates run high. The patients who sign on rarely resemble the depressed patients I have seen, now or in my training. In some cases, drug doses are held low. The research simply does not address the question that interests us: to what extent medication, thoughtfully administered, is likely to help a typical depressed person.

That said, when the Turner report appeared, I wondered whether the sample he had assembled—of trials never meant to show full drug efficacy—might nonetheless contain some usable information on the way antidepressants might work in clinical practice.

Because the FDA believes that it is difficult to recruit representative patient samples—here, groups with typical cases of depression—the agency favors trials with comparators. If in a candidate-drug trial, an antidepressant with known reliability (such as imipramine) does not outperform placebo, the trial is said to have “failed.” The presumption is that the experiment was poorly executed or involved an unrepresentative group of patients. In that case, if the new antidepressant stumbles, too, the FDA will be especially forgiving.

Kuhn’s opinion was that what trials need most is suitable patients. The FDA was in accord. I wondered what would happen if we adopted the agency’s view that the most informative tests are those where a reliable antidepressant such as imipramine succeeds and “validates the sample.”

Comparator-validated trials might be interesting for a second reason. Psychologists who consider the classic placebo effect important in depression treatment have been proposing a new sort of evidence. They say that trials with three arms—two antidepressants plus dummy pills—produce elevated placebo response rates. The thought is that when participants know that they have a two-in-three chance of being on an antidepressant, they will have heightened expectations and do especially well—better than in trials where the odds are fifty-fifty. In the more complex trials, drug response rates rise, too, although to a lesser degree. Because of the high placebo response rates, trials with comparators should pose an especially tough challenge for medication.

Knowing that two colleagues, Michael Thase, of the University of Pennsylvania, and Arif Khan, of Duke University, had the FDA material banked in their computer systems, I asked them to do a run of trials with comparators. The results have not been published, which means that they have not gone through editorial and peer review. Still, the statistical procedure is straightforward, based on files available to many research teams.

The FDA collection contained thirty-four candidate-drug trials (with 8,134 patients) that had three arms: placebo, comparator, and a new drug that was eventually approved. Imipramine and Elavil played the benchmarking role most often, but, as patients began to avoid studies with older drugs, new antidepressants, often Paxil, filled in.

In nineteen trials, the comparator failed. In them, the effect size for the new drug over placebo was 0.29, about what Kirsch and Turner found in their analyses. In aggregate, then, the FDA studies—most conducted without a comparator—resemble failed three-arm trials.

In fifteen trials, the comparator succeeded. Its benefit over placebo reached statistical significance. There, where the patient sample proved valid, the effect size for the candidate drug was 0.45.

That figure is familiar. It parallels the effect size for imipramine in Gene Glass’s research. It corresponds to a number needed to treat of 4.
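The jump from an effect size to a number needed to treat can be made explicit. The sketch below uses the Furukawa-Kraemer approximation, NNT = 1 / (2Φ(d/√2) − 1), where Φ is the standard normal cumulative distribution; the choice of formula is my illustration, not something specified in the text.

```python
# Sketch: converting an effect size (Cohen's d) to a number needed to
# treat (NNT), via the approximation NNT = 1 / (2 * Phi(d / sqrt(2)) - 1).
import math

def nnt_from_d(d):
    # Standard normal CDF, written with math.erf to stay in the stdlib.
    phi = 0.5 * (1 + math.erf((d / math.sqrt(2)) / math.sqrt(2)))
    return 1 / (2 * phi - 1)

print(round(nnt_from_d(0.45)))  # about 4, matching the comparator-validated trials
print(round(nnt_from_d(0.31)))  # about 6, for the smaller pooled estimate
```

An NNT of 4 means that for every four patients given the drug rather than placebo, roughly one additional patient responds; the weaker 0.31 effect size translates into about six.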

In this complete collection of studies, response rates were effectively identical in three- and two-armed trials of the same antidepressants. If patients do calculate the odds of receiving medication, then drug trials don’t show much in the way of classic placebo effects. The rise in placebo responses over the years is more likely due to the supportive factors in drug trials—we will get a glimpse of them shortly—and increasing problems with enrollment. Typical depressed patients do not sign on as subjects in candidate-drug trials. In that context, comparator-validated trials, ones in which imipramine performs as expected, become all the more important.

When I mentioned the comparator-validated results to John Davis, he shared analyses from his own wide-ranging review, also as yet unpublished. They told the same story. Davis had examined an extensive collection of trials, some conducted in academic settings, some unpublished, in which imipramine had been used as a comparator for Prozac, Zoloft, or Celexa. In studies involving thousands of patients, the new drugs matched the old consistently, and both outperformed placebo, with effect sizes overall in the 0.5 range, typical for medical treatments in general. (Because Davis’s collection included some high-quality trials, the effect sizes ran higher than those calculated from the FDA collection.) As Kuhn’s curse would predict, decade by decade the measured efficacy has fallen for all drugs, but in each time period, the new antidepressants kept pace with imipramine.

Davis’s analysis should prove important. In its graphs, each of the new drugs tracks imipramine, equaling it in efficacy, but with a steady falloff over time, a sign that it is the trials, not the drugs, that are failing. The FDA’s methods, whatever their flaws, have provided doctors with a set of medications that perform as well as imipramine.

Unpublished studies can be only so convincing, so we are lucky to have published analyses to bring to bear. Per Bech and his colleagues have taken a different approach to extracting valid information from the FDA files: applying their compact Hamilton scale. That choice might mitigate a flaw in the data. If raters game the system, admitting patients whose main claims for entry are headache and constipation, the Bech scale will ignore widely prevalent irrelevancies.

Bech examined a collection of studies similar to Turner’s, including the unpublished research. Wherever Bech could break down the data—wherever he could run the numbers on core depression factors—antidepressants worked, so long as they were used in full doses. Bech found effect sizes ranging from about 0.4 for Prozac to 0.6 for Lexapro.

Even looking at failed and unsuccessful trials, sweeping in all the data, if you limit your attention to core symptoms, you find that antidepressants perform at acceptable levels. (The strength of the numbers may reflect medications’ ability to ameliorate what Bech calls the dimension of depression, the cluster of core depressive symptoms as they appear even in sloppily diagnosed patients.) Our doubts about efficacy reflect artifacts of measurement—confounds in trials that evidence-based medicine tends to embrace. Pare down Max Hamilton’s scale to the essentials, and, to quote Bech’s charming conclusion about Prozac and its fellows, you see that “no such myth of mere placebo activity is in operation for second-generation antidepressants.”

In his New England Journal paper, for each antidepressant Turner had contrasted the high apparent effect size that a reader with access only to the published studies might have arrived at with the lower effect size that emerged from the complete data set, unpublished studies included. Applying the shortened scale, Bech found effect sizes for Prozac, Celexa, Lexapro, and Cymbalta higher than those that Turner had derived from the published, favorable studies. (One antidepressant, Remeron, fell just shy of matching the published FDA number.) The result held for full drug doses and most often for lesser doses as well. The apparently inflated estimates of antidepressant efficacy, the ones calculated based on uniformly positive published studies, were arguably too low.

Bech’s results suggest that while psychiatry had corporate, professional, academic, publishing, and oversight scandals, as far as treatment efficacy was concerned, there was no clinical scandal. Arguably, the most relevant effect sizes for antidepressants were higher than the published ones. Doctors don’t go through life mentally integrating trial results; but any automaton who had regulated his practice that way would have been in danger of underprescribing.

There’s no excuse for drug companies’ and the FDA’s failing to inform doctors and the public fully. At the same time, I don’t think that it’s coincidence when a scale representing the clinical viewpoint documents levels of efficacy that correspond to doctors’ impressions of how well antidepressants work.

These approaches—looking at core symptoms, comparator-validated trials, trajectories, and response rates—all find ordinary efficacy for antidepressants, even in the candidate-drug trials.

Advocates of evidence-based medicine complain about doctors’ tendency to stick with what has worked for their patients; the profession should be more responsive to research results. But clinical experience acts as ballast, often usefully. Kirsch’s “Emperor” paper and Turner’s critique of publication practices arrived in the early years of the new millennium. The reports, with their calculations of modest effect sizes, may have led some doctors to stop prescribing antidepressant medication or to cut back on its use. I feared (and observed) that result. But subsequent analyses have been reassuring. Antidepressants work as well as they ever have. Doctors’ steady judgment may have been more accurate, more useful in the service of good patient care, than the fluctuating research results.

Finally, here’s my take on the FDA studies: They performed their job, allowing the agency to identify useful medicines, ones that patients can live with, ones that have transformed the face of depression. Beyond that, the trials are a lousy source of information about antidepressant efficacy, and it’s shocking that an important medical question, about the proper treatment of mood disorder, has been debated using them as a reflection of reality.

As for how flawed the testing process is, we'll see next, when we take a trip to a clinical center to observe commercially sponsored drug trials. They're disturbing. Reading outcome studies, we may envisage patients coming to see their doctors and getting either medication or identical-looking placebo pills. Industry drug testing is no longer like that. It is an industry in itself, one that produces results ever less reflective of typical clinical encounters.
