IT MAY APPEAR that with the antidepressant, the rating scale, the randomized trial, and the expert literature review, psychiatry had entered the era of science-based medicine for good and all. Instead, there was resistance, and from unexpected quarters.
By the mid-1960s, Sir Austin Bradford Hill, the principal designer of the British streptomycin study, had become uneasy about the increasing emphasis on the sort of research he had initiated. The pendulum had swung too far. Certain important issues would never be settled through randomized trials. Hill had a special interest in the link between smoking and lung cancer. No one was going to randomly assign people to smoke cigarettes for decades. Judging causes and outcomes, doctors would need to consider other factors, such as biological plausibility and the overall coherence of a body of evidence.
In a lecture that touched on mental health research, Hill went further: “Any belief that the controlled trial is the only way would mean not that the pendulum had swung too far but that it had come right off its hook.” Hill doubted that blinding was appropriate in assessing treatments for disorders, like anxiety, with subjective symptoms. To optimize outcomes, doctors would need to adjust doses and observe responses, aware of who was on what—and the clinician’s perception might be the most accurate gauge of results. Taking what looks like a swipe at the Hamilton scale, Hill quoted a colleague’s comment that it is “ridiculous to scorn subjective assessments in subjective symptoms, and it is unrealistic to make artificially objective assessments.” As an example of a task that would require a flexible combination of approaches, experimental and clinical, Hill chose the evaluation of antidepressants.
Roland Kuhn, too, expressed mistrust of rating scales and controlled trials: “In clinical research, most of the statistics are useless…” In the 1990s, looking back on psychiatry’s failures—no one had found an antidepressant more effective than imipramine—Kuhn said:
My methods were entirely different from those which are nowadays applied in clinical research. I have never used “controlled double-blind studies” with “placebos,” “standardised rating scales” or the statistical treatment of records of large numbers of patients.
Instead I examined each patient individually even every day, often on several occasions, and questioned him or her again and again.
Kuhn had been interested in social functioning as much as symptom relief. What impressed him about imipramine was its capacity to give patients back their lives.
More, Kuhn argued that clinical trials on depressed patients had become impossible because potential research subjects would already have been treated (for instance, with imipramine) by the family physician.
As we have seen, not all doctors or patients believed in antidepressant use. But for those open to drug treatment, imipramine and similar medicines were readily available. Internists likely to diagnose depression generally prescribed for it on their own. Increasingly, the patients who entered trials were those who had already failed on medication. As a result, research was conducted on a population that did not represent the full range of depressed people. Patients who have proved “refractory to treatment” will be less likely to respond to the next remedy tried. The result might be a failure to identify useful drugs. Psychiatric research was a victim of its own past successes.
The situation, Kuhn complained, had been different in the 1950s: “The cases then were much better suited for trials because those today who are suitable for trials don’t come anymore to the psychiatrist and even less into clinic. The clinical picture has completely changed because of treatment.” I call this difficulty—the inability to recruit representative patients into trials—the curse of Roland Kuhn. We will encounter it repeatedly.
Both Hill and Kuhn wondered whether psychiatry was the right domain for rating scales. This objection may sound antique. For half a century and more, researchers have used the Hamilton. Much of what we know about depression—how prevalent it is, what harm it causes, where it registers in the brain, and, yes, which treatments mitigate it—comes thanks to the Hamilton. But it may be that depression is so distinctive and so harmful that its effects emerge even with imperfect measurement.
Because the vast majority of research on antidepressants has used the Hamilton, we will want to know whether the scale is more accurate than Kuhn's method of coming to understand the patient through conversation over long acquaintance. Many researchers complain that the scale has become a lead weight—that it interferes with our ability to evaluate the efficacy of antidepressants. Statisticians' objections to the Hamilton tend to be technical, but we can understand some problems readily enough.
The scale can compact two ailments into one, with confusing results. Consider Max Hamilton's emphasis on physical complaints, such as constipation. If a patient is hypochondriacal and mildly depressed, the rater's summed symptom scores will depict the depression as severe. If a treatment works for that patient, we may conclude that it works for severe depression—when mild depression was at issue. Equally, a medicine that eliminates only the depression (and not the hypochondriasis) will appear half-effective, even though the mood disorder is gone.
Also, the scale equates unequal aspects of the illness. Four points assigned for insistent suicidality count the same as four points spread across four scattered mild symptoms, so that one patient with a middling Hamilton score may be much sicker than another. By 1975, researchers had found that in hospitalized patients, the Hamilton could no longer distinguish between moderate and severe depression, as rated by attending physicians.
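To make the arithmetic concrete, here is a toy sketch of additive scoring. The item names and point values are invented for illustration and are not the actual Hamilton items or weights; the point is only that a sum treats every point alike, whatever its source.

```python
# Toy illustration of additive rating-scale scoring.
# Item names and point values are invented; they are not the real
# Hamilton items or weights.

patient_a = {              # mildly depressed, heavily hypochondriacal
    "depressed mood": 1,
    "constipation": 2,
    "bodily preoccupation": 2,
    "insomnia": 2,
}

patient_b = {              # severely depressed, few bodily complaints
    "depressed mood": 4,
    "suicidality": 3,
}

def total_score(items):
    """Additive scoring: every point counts the same, whatever produced it."""
    return sum(items.values())

print(total_score(patient_a))  # 7
print(total_score(patient_b))  # 7 -- identical totals, very different illnesses
```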
The Hamilton is a prisoner of history. The scale captures the depression that Max Hamilton encountered, often an agitated state accompanied by insomnia. Questioned years later, Alan Broadhurst conceded that he had chosen a scale suited to showcasing imipramine, an antidepressant that tends to be calming and sedating. Newer, more energizing antidepressants are disadvantaged. That’s another area where the Hamilton scale introduces inaccuracies: comparisons of antidepressants.
These defects are pretty bad. We want a rating scale to compare treatments accurately. We want it to measure and track patients' level of depression. We don't want patients with mild mood disorders admitted into studies of severe depression.
The scale survived and flourished largely because of its priority. To allow for comparison with old trials, new ones used the scale. The FDA played a role. To prevent pharmaceutical houses from gaming the system by selecting measures matched to the drug being tested, the agency favored the Hamilton. But that the field was paying a price—relying on data that had an ever-more-approximate relationship to the course of patients' illness—was an open secret.
Working in the 1970s, a Danish group headed by the young psychiatrist Per Bech decided to rework the Hamilton so that it would once more serve its original intended role, representing depression as doctors see it. The team asked respected clinicians to rate patients they knew well. Comparing those impressions to Hamilton ratings, Bech and his colleagues found the essence of depression to reside in six factors: depressed mood, guilt, functioning at work and similar tasks, psychomotor retardation (again, a slowing of mind and body), psychic anxiety, and a cluster of negative feelings that included tiredness and pain. The scale did yet better if you expanded “guilt” to include low self-esteem.
Those patients who scored high on the six factors were the ones doctors had called most depressed. When depression became more severe, scores on those items rose. As depression abated, scores fell. The factors clustered, often rising and falling in sync, defining an entity that Bech called core depression. The six items corresponded to depression as doctors see it, and the collection behaved well statistically. This “good behavior” has held up over thirty-five years and more.
The ill-behaved Hamilton factors included suicidality, agitation, weight loss, insomnia, diminished sexual interest, poor insight, and a host of bodily symptoms and concerns. Some items were hard to rate, so that observers disagreed about their severity. Some rose as core items fell. (Notoriously, suicide risk can increase as patients begin to recover from depression.) Some were persistent, seemingly independent of the level of depression.
It’s not that the peripheral items bore no relationship to depression. Suicidality can signal mood disorder. But their fluctuations did not do the job that Hamilton had proposed: to represent the changing burden of illness.
Max Hamilton traveled to Copenhagen in 1977 and, in a gracious lecture, largely accepted the Danes’ verdict. Of his scale’s factors, “Six of them did all the work and the other eleven were, so to speak, passengers which interfered with the work.”
Bech called his pared-down scale the HAM-D6; the original Hamilton became the HAM-D17. The six-factor scale had numerous successes. For tricyclics, it gave a cleaner account of the relationship between dose and response—more medication meant greater efficacy, up to a point—where the full scale did not. The short scale turned out to be sensitive, picking up improvements in mood disorder early in trials, before the full scale registered the change. And the scale eliminated factors, such as insomnia and agitation, that might respond incidentally to imipramine. On the HAM-D6, tricyclics continued to look effective, but arguably (the statistics were inconclusive) at a slightly lower level than the full Hamilton had made them out to be.
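In arithmetic terms, the HAM-D6 is simply a subscore: the rater still conducts the full interview, but only the six core items are summed. Here is a minimal sketch, using the core factors named above and a sampling of the peripheral items; the ratings themselves are hypothetical, and the full scale has seventeen items, not the handful shown.

```python
# Hypothetical single-patient ratings. Item names follow the core and
# peripheral factors discussed in the text; only a subset of the seventeen
# Hamilton items is shown, and the numbers are invented.

ratings = {
    "depressed mood": 3,
    "guilt / low self-esteem": 2,
    "work and activities": 2,
    "psychomotor retardation": 1,
    "psychic anxiety": 2,
    "somatic symptoms (tiredness, pain)": 1,
    # peripheral ("passenger") items that the HAM-D6 drops:
    "insomnia": 2,
    "agitation": 1,
    "suicidality": 1,
    "weight loss": 0,
}

CORE_SIX = {
    "depressed mood",
    "guilt / low self-esteem",
    "work and activities",
    "psychomotor retardation",
    "psychic anxiety",
    "somatic symptoms (tiredness, pain)",
}

ham_d6 = sum(score for item, score in ratings.items() if item in CORE_SIX)
full_total = sum(ratings.values())

print(ham_d6)      # 11 -- the core score Bech found to track clinicians' judgments
print(full_total)  # 15 -- the full total mixes in the passenger items
```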
I have often wondered why Bech’s simplification has not replaced the original Hamilton. Perhaps the problem is that very virtue, simplicity. The Bech is too reflective of the doctor’s quick insight: sick or not. If the six items measure the disorder, they do not define it—do not highlight the signal roles of insomnia and suicidality in producing the burden of depression. Hamilton’s longer scale seems more mechanical and therefore scientific. Regulators may imagine that, as researchers work their way through seventeen items, in the tedium they lose track of their private judgment and produce something free of opinion, something precise and trustworthy.
All the same, we will need to remember Per Bech's contribution. Developed years before the appearance of serotonin-based antidepressants like Prozac, the short scale would later capture the strengths of the newer medications. They alleviate core symptoms of depression. We're getting ahead of ourselves here, but the contemporary antidepressant controversy arises largely from the peculiarities of the Hamilton scale, with its focus on symptoms such as headache that wax and wane spontaneously or stay fixed no matter the treatment. In compiling his instrument, Max Hamilton legitimated the tricyclics and incidentally set in place a mechanism for casting the worth of later antidepressants in doubt.
Despite the preference for complexity, there's much to be said for the gestalt, for doctors' summary impression of their patients' level of depression. When statisticians limit their attention to a symptom collection that tracks the informed gut call, they emerge with coherent data that confirm what doctors see. The result is not happenstance. Bech intended the revised scale to capture the clinical perspective.
Here’s where things stood when I began my training, in the late 1970s. Antidepressants had shown their efficacy in early trials, but the glory days were over. Research would become ever harder to conduct. Outcome data—so long as the full-scale Hamilton remained in place—would prove ever less valid. And as Hill had predicted, rigorous trials would miss truths apparent to practicing doctors free to work flexibly with patients they knew well.