KEY CONCEPTS
PRESENTING PROBLEMS
Presenting Problem 1
A 57-year-old man presents with a history of low back pain. The pain is aching in quality, persists at rest, and is made worse by bending and lifting. The pain has been getting progressively worse, and in the past 6 weeks has been awakening him at night. Within the past 10 days he has noticed numbness in the right buttock and thigh and weakness in the right lower extremity. He denies fever, but has had a slight loss of appetite and a 10-lb weight loss over a period of 4 months. He has no prior history of low back pain, and his general health has been good. The physical examination reveals a temperature of 99.6°F, tenderness in the lower lumbar spine, a decrease in sensation over the dorsal and lateral aspect of the right foot, and weakness of right ankle eversion. Deep tendon reflexes are normal.
Based on your review of the literature and on the patient's history and physical examination, you suspect the man has a 20–30% chance of a spinal malignancy. You must decide whether to order an erythrocyte sedimentation rate (ESR) or to order imaging studies, such as a lumbar MRI, directly. Joines and colleagues (2001) compared several strategies for diagnosing cancer in patients with low back pain. They reported the sensitivity and specificity of different diagnostic procedures, including an ESR ≥ 20 mm/h and several imaging studies: a sensitivity of 78% and a specificity of 67% for an ESR ≥ 20 mm/h, and a sensitivity and specificity of 95% each for lumbar MRI. They developed several diagnostic strategies, or decision trees, for investigating the possibility of cancer in primary care outpatients with low back pain and determined the cost of each diagnostic strategy using Medicare reimbursement data from the year 2000. Strategies were arranged in order of cost per patient and compared on the number of cases of cancer found per 1000 patients. We use information from their study to illustrate sensitivity and specificity of a diagnostic procedure and again to illustrate the use of decision trees to compare strategies.
Presenting Problem 2
The electrocardiogram (ECG) is a valuable tool in the clinical prediction of an acute myocardial infarction (MI). In patients with ST segment elevation and chest pain typical of an acute MI, the chance that the patient has experienced an acute MI is greater than 90%. In patients with left bundle-branch block (LBBB) that precludes detection of ST segment elevation, however, the ECG has limited usefulness in the diagnosis of acute MI. An algorithm based on ST segment changes in patients with acute MI in the presence of LBBB showed a sensitivity of 78% for the diagnosis of an MI. A true-positive rate of 78% means that a substantial proportion of patients (22%) with LBBB who presented with acute MI would have a false-negative test result and possibly be denied acute reperfusion therapy.
Shlipak and his colleagues (1999) conducted a historical cohort study of patients with acute cardiopulmonary symptoms who had LBBB to evaluate the diagnostic test characteristics and clinical utility of this ECG algorithm for patients with suspected MI. They used their results to develop a decision tree to estimate the outcome for three different clinical approaches to these patients: (1) treat all such patients with thrombolysis, (2) treat none of them with thrombolysis, and (3) use the ECG algorithm as a screening test for thrombolysis.
Eighty-three patients with LBBB who presented 103 times with symptoms suggestive of MI were studied. Nine individual ECG predictors of acute MI were evaluated. None of the nine predictors effectively distinguished the 30% of patients with MI from those with other diagnoses. The ECG algorithm had a sensitivity of only 10%. The decision analysis estimated that 92.9% of patients with LBBB and chest pain would survive if all received thrombolytic therapy, whereas 91.8% would survive if treated according to the ECG algorithm. Data summarizing some of their findings are given in the section titled, “Measuring the Accuracy of Diagnostic Procedures.” We use some of these findings to illustrate sensitivity and specificity.
Presenting Problem 3
Congestive heart failure (CHF) is often difficult to diagnose in the acute care setting because symptoms are nonspecific and physical findings are not sensitive enough. B-type natriuretic peptide is a cardiac neurohormone secreted from the ventricles in response to volume expansion and pressure overload. Previous studies suggest it may be useful in distinguishing between cardiac and noncardiac causes of acute dyspnea. Maisel and colleagues (2002) with investigators from nine medical centers conducted a multinational trial to evaluate the use of B-type natriuretic peptide measurements in the diagnosis of CHF. A total of 1586 patients with the primary complaint of shortness of breath were evaluated in the emergency departments of the participating study centers; physicians assessed the probability that the patient had CHF without knowledge of the results of measurement of B-type natriuretic peptide. They used receiver operating characteristic (ROC) curves to evaluate the diagnostic value of B-type natriuretic peptide and concluded it was the single best predictor of the presence or absence of CHF. It was more accurate than either the NHANES criteria or the Framingham criteria for CHF, the two most commonly used sets of criteria for diagnosing CHF.
Presenting Problem 4
Lead poisoning is an important disease among children. Exposure, often from house dust contaminated by crumbling old lead-based paint, may be associated with a range of health effects, from behavioral problems and learning disabilities to seizures and death. An estimated 21% of housing in the United States has deteriorated lead-based paint and is home to one or more children under 6 years of age. Nearly 900,000 children have blood lead (BPb) levels > 10 µg/dL, the level of concern established by the CDC. What are the costs and benefits of housing policy strategies developed to prevent additional cases of childhood lead poisoning?
Dr. M. J. Brown of the Harvard School of Public Health developed a cost–benefit analysis comparing two policy strategies for reducing lead hazards in housing of lead-poisoned children (2002). She used data from a historical cohort study she had undertaken previously that analyzed data on all lead-poisoned children in two adjacent urban areas over a 1-year period. The two areas were similar except that one employed a strict enforcement of housing code in residential buildings where lead-exposed children lived and the other had limited enforcement of codes. She used a decision tree model to compare costs and benefits of “strict” versus “limited” enforcement of measures to reduce residential lead hazards. Outcome measures were: (1) short-term medical and special education costs associated with an elevated BPb level in one or more additional children after the initial case and (2) the long-term costs of decreased employment and lower occupational status associated with loss of IQ points as result of lead exposure.
She found that the risk of finding additional children with lead poisoning in the same building was 4.5 times greater when the “limited” building code enforcement strategy was used. The cost to society of recurrent BPb level elevations in residential units where lead-poisoned children were identified was greater than the cost of abatement.
Presenting Problem 5
Invasive carcinoma of the cervix occurs in about 15,000 women each year in the United States. About 40% ultimately die of the disease. Cervical carcinoma in situ is diagnosed in about 56,000 women annually, resulting in approximately 4800 deaths. Papanicolaou (Pap) smears play an important role in the early detection of cervical cancer at a stage when it is almost always asymptomatic.
Although the American Cancer Society recommends annual Pap smears for at least 3 years beginning at the onset of sexual activity or 18 years of age, then less often at the discretion of the physician, only 12–15% of women undergo this procedure. The Pap smear is considered to be a cost-effective tool, but certainly imperfect—it has a sensitivity rate of only 75–85%. New technologies have improved the sensitivity of Pap testing but at an increased cost per test. Brown and Garber (1999) assessed the cost-effectiveness of three new technologies in the prevention of cervical cancer morbidity and mortality.
INTRODUCTION
“Decision making” is a term that applies to the actions people take many times each day. Many decisions—such as what time to get up in the morning, where and what to eat for lunch, and where to park the car—are often made with little thought or planning. Others—such as how to prepare for a major examination, or whether to purchase a new car and, if so, what make and model—require some planning and may even include a conscious outlining of the steps involved. This chapter addresses the second type of decision making as applied to problems within the context of medicine. These problems include evaluating the accuracy of diagnostic procedures, interpreting the results of a positive or negative procedure in a specific patient, modeling complex patient problems, and selecting the most appropriate approach to the problem. These topics are very important in using and applying evidence-based medicine; they are broadly defined as methods in medical decision making or analysis. They are applications of probabilistic and statistical principles to individual patients, although they are not usually covered in introductory biostatistics textbooks.
Medical decision making has become an increasingly important area of research in medicine for evaluating patient outcomes and informing health policy. More and more quality-assurance articles deal with topics such as evaluating new diagnostic procedures, determining the most cost-effective approach for dealing with certain diseases or conditions, and evaluating options available for treatment of a specific patient. These methods also form the basis for cost–benefit analysis.
Correct application of the principles of evidence-based medicine helps clinicians and other health care providers make better diagnostic and management decisions. Kirkwood and colleagues (2002) discuss the abuse of statistics in evidence-based medicine.
Those who read the medical literature and wish to evaluate new procedures and recommended therapies for patient care need to understand the basic principles discussed in this chapter.
We begin the presentation with a discussion of the threshold model of decision making, which provides a unified way of deciding whether to perform a diagnostic procedure. Next, the concepts of sensitivity and specificity are defined and illustrated. Four different methods that lead to equivalent results are presented. Then, an extension of the diagnostic testing problem in which the test results are numbers, not simply positive or negative, is given using the ROC curves. Finally, more complex methods that use decision trees and algorithms are introduced.
EVALUATING DIAGNOSTIC PROCEDURES WITH THE THRESHOLD MODEL
Consider the patient described in Presenting Problem 1, the 57-year-old man who is concerned about increasing low back pain. Before deciding how to proceed with diagnostic testing, the physician must consider the probability that the man has a spinal malignancy. This probability may simply be the prevalence of a particular disease if a screening test is being considered. If a history and a physical examination have been performed, the prevalence is adjusted, upward or downward, according to the patient's characteristics (eg, age, gender, and race), symptoms, and signs. Physicians use the term “index of suspicion” for the probability of a given disease prior to performing a diagnostic procedure; it is also called the prior probability. It may also be considered in the context of a threshold model (Pauker and Kassirer, 1980).
The threshold model is illustrated in Figure 12-1A. The physician's estimate that the patient has the disease, from information available without using the diagnostic test, is called the probability of disease. It helps to think of the probability of disease as a line that extends from 0 to 1. According to this model, the testing threshold, Tt, is the point on the probability line at which no difference exists between the value of not treating the patient and performing the test. Similarly, the treatment threshold, Trx, is the point on the probability line at which no difference exists between the value of performing the test and treating the patient without doing a test. The points at which the thresholds occur depend on several factors: the risk of the diagnostic test, the benefit of the treatment to patients who have the disease, the risk of the treatment to patients with and without the disease, and the accuracy of the test.
Figure 12-1. Threshold model of decision making. A: Threshold model. B: Accurate or low-risk test. C: Inaccurate or high-risk test. (Adapted and reproduced, with permission, from Pauker SG, Kassirer JP: The threshold approach to clinical decision making. N Engl J Med 1980; 302: 1109–1117.)
Figure 12-1B illustrates the situation in which the test is quite accurate and has very little risk to the patient. In this situation, the physician is likely to test at a lower probability of disease as well as at a high probability of disease. Figure 12-1C illustrates the opposite situation, in which the test has low accuracy or is risky to the patient. In this case, the test is less likely to be performed. Pauker and Kassirer further show that the test and treatment thresholds can be determined for a diagnostic procedure if the risk of the test, the risk and the benefit of the treatment, and the accuracy of the test are known.
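The logic of the threshold model is simple enough to express as a small rule. The following sketch is ours (not from Pauker and Kassirer), written in Python; the threshold values shown are hypothetical, and in practice Tt and Trx would be derived from the test and treatment characteristics described above.

```python
def choose_action(p_disease, test_threshold, treatment_threshold):
    """Threshold model: below Tt, neither test nor treat; between Tt and
    Trx, perform the diagnostic test; above Trx, treat without testing."""
    if p_disease < test_threshold:
        return "do not test, do not treat"
    if p_disease < treatment_threshold:
        return "perform the diagnostic test"
    return "treat without testing"

# Hypothetical thresholds for an accurate, low-risk test (wide testing zone):
print(choose_action(0.20, test_threshold=0.05, treatment_threshold=0.85))
# -> perform the diagnostic test
```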
MEASURING THE ACCURACY OF DIAGNOSTIC PROCEDURES
The accuracy of a diagnostic test or procedure has two aspects. The first is the test's ability to detect the condition it is testing for, thus being positive in patients who actually have the condition; this is called the sensitivity of the test. If a test has high sensitivity, it has a low false-negative rate; that is, the test does not falsely give a negative result in many patients who have the disease.
Sensitivity can be defined in many equivalent ways: the probability of a positive test result in patients who have the condition; the proportion of patients with the condition who test positive; the true-positive rate. Some people use aids such as positivity in disease or sensitive to disease to help them remember the definition of sensitivity.
The second aspect of accuracy is the test's ability to identify those patients who do not have the condition, called the specificity of the test. If the specificity of a test is high, the test has a low false-positive rate; that is, the test does not falsely give a positive result in many patients without the disease. Specificity can also be defined in many equivalent ways: the probability of a negative test result in patients who do not have the condition; the proportion of patients without the condition who test negative; 1 minus the false-positive rate. The phrases for remembering the definition of specificity are negative in health or specific to health.
Sensitivity and specificity of a diagnostic procedure are commonly determined by administering the test to two groups: a group of patients known to have the disease (or condition) and another group known not to have the disease (or condition). The sensitivity is then calculated as the proportion (or percentage) of patients known to have the disease who test positive; specificity is the proportion of patients known to be free of the disease who test negative. Of course, we do not always have a gold standard immediately available or one totally free from error. Sometimes, we must wait for autopsy results for definitive classification of the patient's condition, as with Alzheimer's disease.
In Presenting Problem 2, Shlipak and colleagues (1999) wanted to evaluate the accuracy of several ECG findings in identifying patients with an MI. They identified 83 patients who had presented 103 times between 1994 and 1997 with chest pain. It was subsequently found that an MI had occurred in 31 of these presentations and not in the other 72. The investigators reviewed the ECG findings and noted the features present; the information is given in Table 12-1.
Let us use the information associated with ST segment elevation ≥ 5 mm in discordant leads to develop a 2 × 2 table from which we can calculate sensitivity and specificity of this finding. Table 12-2 illustrates the basic setup for the 2 × 2 table method. Traditionally, the columns represent the disease (or condition), using D+ and D- to denote the presence and absence of disease (MI, in this example). The rows represent the tests, using T+ and T- for positive and negative test results, respectively (ST segment elevation ≥ 5 mm or < 5 mm).
True-positive (TP) results go in the upper left cell, the T+D+ cell. False-positives (FP) occur when the test is positive (ST segment elevation present) but the patient did not have an MI; they go in the upper right T+D- cell. Similarly, true-negatives (TN) occur when the test is negative in patient presentations without an MI, the T-D- cell in the lower right; and false-negatives (FN) are in the lower left T-D+ cell, corresponding to a negative test in patient presentations with an MI.
Table 12-1. Number of patients having the specified electrocardiogram criteria for acute myocardial infarction among the 31 patients with MI and the 72 without.
Table 12-2. Basic setup for 2 × 2 table.
|                    | D+ (disease present) | D- (disease absent) |
| T+ (test positive) | TP (true-positive)   | FP (false-positive) |
| T- (test negative) | FN (false-negative)  | TN (true-negative)  |
In Shlipak and colleagues' study, an MI occurred in 31 patient presentations; therefore, 31 goes at the bottom of the first column, headed by D+. Seventy-two patient presentations were without an MI, and this is the total of the second (D-) column. Because 6 ECGs had an ST elevation ≥ 5 mm in discordant leads among the 31 presentations with MI, 6 goes in the T+D+ (true-positive) cell of the table, leaving 25 of the 31 presentations as false-negatives. Among the 72 presentations without MI, 59 did not have the defined ST elevation, so 59 is placed in the true-negative cell (T-D-). The remaining 13 presentations are called false-positives and are placed in the T+D- cell of the table. Table 12-3 shows the completed table.
Using Table 12-3, we can calculate sensitivity and specificity of the ECG criterion for development of an MI. Try it before reading further. (The sensitivity of an ST elevation ≥ 5 mm in discordant leads is the proportion of presentations with MI that exhibit this criterion, 6 of 31, or 19%. The specificity is the proportion of presentations without MI that do not have the ST elevation, 59 of 72, or 82%.)
Table 12-3. 2 × 2 table for evaluating sensitivity and specificity of test for ST elevation.
|       | D+ (MI) | D- (no MI) | Total |
| T+    | 6       | 13         | 19    |
| T-    | 25      | 59         | 84    |
| Total | 31      | 72         | 103   |
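These calculations are easy to verify with a few lines of code. The following Python sketch is ours, using the counts from Table 12-3:

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Compute sensitivity and specificity from the cells of a 2 x 2 table."""
    sensitivity = tp / (tp + fn)   # true-positive rate among those with disease
    specificity = tn / (tn + fp)   # true-negative rate among those without
    return sensitivity, specificity

sens, spec = sensitivity_specificity(tp=6, fp=13, fn=25, tn=59)
print(f"Sensitivity = {sens:.0%}, Specificity = {spec:.0%}")
# -> Sensitivity = 19%, Specificity = 82%
```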
USING SENSITIVITY & SPECIFICITY TO REVISE PROBABILITIES
The values of sensitivity and specificity cannot be used alone to determine the value of a diagnostic test in a specific patient; they are combined with a clinician's index of suspicion (or the prior probability) that the patient has the disease to determine the probability of disease (or nondisease) given knowledge of the test result. An index of suspicion is not always based on probabilities determined by experiments or observations; sometimes, it must simply be a best guess, an estimate lying somewhere between the prevalence of the disease being investigated in this particular patient population and certainty. A physician's best guess generally begins with the baseline prevalence and is then revised upward (or downward) based on clinical signs and symptoms. Some vagueness is acceptable in the initial estimate of the index of suspicion; in the section titled, “Decision Analysis,” we discuss a technique called sensitivity analysis for evaluating the effect of the initial estimate on the final decision.
We present four different methods because some people prefer one method to another. We personally find the first method, using a 2 × 2 table, to be the easiest in terms of probabilities. The likelihood ratio method is superior if you can think in terms of odds, and it is important for clinicians to understand because it is used in evidence-based medicine. You can use the method that makes the most sense to you or is the easiest to remember and apply.
The 2 × 2 Table Method
In Presenting Problem 1, a decision must be reached on whether to order an ESR or proceed directly with imaging studies (lumbar MRI). This decision depends on three pieces of information: (1) the probability of spinal malignancy (index of suspicion) prior to performing any tests; (2) the accuracy of the ESR in detecting malignancies among patients who are subsequently shown to have spinal malignancy (sensitivity); and (3) the frequency of a negative result for the procedure in patients who subsequently do not have spinal malignancy (specificity).
What is your index of suspicion for spinal malignancy in this patient before the ESR? Considering the age and history of symptoms in this patient, a reasonable prior probability is 20–30%; let us use 20% for this example.
How will this probability change with the positive ESR? With a negative ESR? To answer these questions, we must know how sensitive and specific the ESR is for spinal malignancy and use this information to revise the probability. These new probabilities are called the predictive value of a positive test and the predictive value of a negative test, also called the posterior probabilities. If positive, we order a lumbar MRI and, if negative, a radiograph, according to the decision rules used by Joines and colleagues (2001). Then we must repeat the process by determining the predictive values of the lumbar MRI or radiograph to revise the probability after interpreting the ESR.
The first step in the 2 × 2 table method for determining the predictive values of a diagnostic test incorporates the index of suspicion (or prior probability) of disease. We find it easier to work with whole numbers rather than percentages when evaluating diagnostic procedures. Another way of saying that the patient has a 20% chance of having a spinal malignancy is to say that 200 out of 1000 patients like this one would have a spinal malignancy. In Table 12-4, this number (200) is written at the bottom of the D+ column. Similarly, 800 patients out of 1000 would not have a spinal malignancy, and this number is written at the bottom of the D- column.
The second step is to fill in the cells of the table by using the information on the test's sensitivity and specificity. Table 12-4 shows that the true-positive rate, or sensitivity, corresponds to the T+D+ cell (labeled TP). Joines and colleagues (2001) reported 78% sensitivity and 67% specificity for the ESR in detecting spinal malignancy. Based on their data, 78% of the 200 patients with spinal malignancy, or 156 patients, are true-positives, and 200 − 156 = 44 are false-negatives (Table 12-5). Using the same reasoning, we find that a test that is 67% specific results in 536 true-negatives among the 800 patients without spinal malignancy, and 800 − 536 = 264 false-positives.
The third step is to add across the rows. From row 1, we see that 156 + 264 = 420 people like this patient would have a positive ESR (Table 12-6). Similarly, 580 patients would have a negative ESR.
Table 12-4. Step one: Adding the prior probabilities to the 2 × 2 table.
|       | D+  | D-  | Total |
| T+    |     |     |       |
| T-    |     |     |       |
| Total | 200 | 800 | 1000  |
Table 12-5. Step 2: Using sensitivity and specificity to determine number of true-positives, false-negatives, true-negatives, and false-positives in 2 × 2 table.
|       | D+       | D-       | Total |
| T+    | TP = 156 | FP = 264 |       |
| T-    | FN = 44  | TN = 536 |       |
| Total | 200      | 800      | 1000  |
Table 12-6. Step 3: Completed 2 × 2 table for calculating predictive values.
|       | D+  | D-  | Total |
| T+    | 156 | 264 | 420   |
| T-    | 44  | 536 | 580   |
| Total | 200 | 800 | 1000  |
The fourth step involves the calculations for predictive values. Of the 420 people with a positive test, 156 actually have spinal malignancy, giving 156/420 = 37%. Similarly, 536 of the 580 patients with a negative test, or 92%, do not have spinal malignancy. The percentage 37% is called the predictive value of a positive test, abbreviated PV+, and gives the percentage of patients with a positive test result who actually have the condition (or the probability of spinal malignancy, given a positive ESR). The percentage 92% is the predictive value of a negative test, abbreviated PV-, and gives the probability that the patient does not have the condition when the test is negative. Two other probabilities can be estimated from this table as well, although they do not have specific names: 264/420 = 0.63 is the probability that the patient does not have the condition, even though the test is positive; and 44/580 = 0.08 is the probability that the patient does have the condition, even though the test is negative.
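The four steps can be collected into a short function. This Python sketch is ours and simply automates the table arithmetic, using the ESR figures from Joines and colleagues (prior probability 20%, sensitivity 78%, specificity 67%):

```python
def predictive_values(prior, sensitivity, specificity, n=1000):
    """2 x 2 table method: fill in a hypothetical cohort of n patients,
    then read the predictive values off the completed table."""
    diseased = prior * n                   # step 1: column totals
    nondiseased = n - diseased
    tp = sensitivity * diseased            # step 2: fill in the cells
    fn = diseased - tp
    tn = specificity * nondiseased
    fp = nondiseased - tn
    positives = tp + fp                    # step 3: row totals
    negatives = fn + tn
    return tp / positives, tn / negatives  # step 4: PV+ and PV-

pv_pos, pv_neg = predictive_values(prior=0.20, sensitivity=0.78, specificity=0.67)
print(f"PV+ = {pv_pos:.0%}, PV- = {pv_neg:.0%}")   # -> PV+ = 37%, PV- = 92%
```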
To summarize so far, the ESR is moderately sensitive and specific for detecting spinal malignancy when used with a low index of suspicion. It provides only a fair amount of information: it increases the probability of spinal malignancy from 20% to 37% when positive, and it increases the probability of no spinal malignancy from 80% to 92% when negative. Thus, in general, tests that have high sensitivity are useful for ruling out a disease when the test is negative; for that reason, most screening tests have high sensitivity.
Now we repeat the previous reasoning for the subsequent procedure, assuming that the man's ESR was positive; from Table 12-6, we know that the probability of spinal malignancy with a positive ESR is 37%. When a second diagnostic test is performed, the results from the first test determine the prior probability. Based on a positive ESR, 37%, or 370 out of 1000 patients, are likely to have spinal malignancy, and 630 are not. These numbers are the column totals in Table 12-7. Lumbar MRI was shown by Joines and colleagues to be 95% sensitive and 95% specific for spinal malignancy; applying these statistics gives (0.95)(370), or 351.5, true-positives and (0.95)(630), or 598.5, true-negatives. Subtraction gives 18.5 false-negatives and 31.5 false-positives. After adding the rows, the predictive value of a positive lumbar MRI is 351.5/383, or 91.8%, and the predictive value of a negative lumbar MRI is 97% (see Table 12-7).
Table 12-7. Completed 2 × 2 table for the lumbar MRI from Presenting Problem 1.
|       | D+    | D-    | Total |
| T+    | 351.5 | 31.5  | 383   |
| T-    | 18.5  | 598.5 | 617   |
| Total | 370   | 630   | 1000  |
Joines and colleagues (2001) concluded that the ESR with a cutoff of ≥ 20 mm/h is of minimal utility. We will refer to their study again in the section titled, “Using Decision Analysis to Compare Strategies.”
The Likelihood Ratio
An alternative method for incorporating the information provided by the sensitivity and specificity of a test is the likelihood ratio; it uses odds rather than probabilities. The likelihood ratio is being used with increasing frequency in the medical literature, especially within the context of evidence-based medicine. Even if you decide not to use this particular approach to revising probabilities, you need to know how to interpret the likelihood ratio. Because it makes calculating predictive values very simple, many people prefer it after becoming familiar with it.
The likelihood ratio expresses the odds that the test result occurs in patients with the disease versus the odds that the test result occurs in patients without the disease. Thus, a positive test has one likelihood ratio and a negative test another. For a positive test, the likelihood ratio is the sensitivity divided by the false-positive rate. The likelihood ratio is multiplied by the prior, or pretest, odds to obtain the posttest odds of a positive test. Thus,

LR(+) = Sensitivity / (1 − Specificity)  and  Posttest odds = Pretest odds × LR
In Presenting Problem 1, the sensitivity of the ESR for spinal malignancy is 78%, and the specificity is 67%, giving a false-positive rate of 100% − 67% = 33%. The likelihood ratio (LR) for a positive test is therefore

LR = 0.78 / 0.33 = 2.36
To use the likelihood ratio, we must convert the prior probability into prior odds. The prior probability of spinal malignancy is 0.20, and the odds are found by dividing the probability by 1 minus the probability, giving

Prior odds = 0.20 / (1 − 0.20) = 0.25
It helps to keep in mind that the probability is a proportion: It is the number of times a given outcome occurs divided by all the occurrences. If we take a sample of blood from a patient five times, and the sample is positive one time, we can think of the probability as being 1 in 5, or 0.20. The odds, on the other hand, is a ratio: It is the number of times a given outcome occurs divided by the number of times that specific outcome does not occur. With the blood sample example, the odds of a positive sample is 1 to 4, or 1/(5 – 1). This interpretation is consistent with the relative risk and odds ratio, which indicate the risk in a population with a risk factor divided by the risk in a population without the risk factor.
Continuing with the ESR example, we multiply the pretest odds by the likelihood ratio to obtain the posttest odds:

Posttest odds = 0.25 × 2.36 = 0.59
Because the posttest odds are really 0.59 to 1, although the “to 1” part does not appear in the preceding formula, these odds can be used by clinicians simply as they are. Because the odds are less than 1, or less than 50–50, we know the probability will be less than 0.5. Alternatively, the odds can be converted back to a probability by dividing the odds by 1 plus the odds. That is,

Probability = Odds / (1 + Odds) = 0.59 / 1.59 = 0.37
The posterior probability is, of course, the predictive value of a positive test and is the same result we found earlier.
Many journal articles that present likelihood ratios use the LR as just defined, which is actually the likelihood ratio for a positive test. The evidence-based medicine literature sometimes uses the notation +LR to distinguish it from the LR for a negative test, generally denoted by -LR. The negative likelihood ratio can be used to find the odds of disease when the test is negative. It is the ratio of the false-negative rate to the true-negative rate.
To illustrate the use of the negative likelihood ratio, let us find the probability of spinal malignancy if the ESR is negative. The -LR in this example is 0.22 (1 minus the sensitivity) divided by 0.67 (the specificity, or true-negative rate), or 0.328. Multiplying the -LR by the prior odds of the disease gives 0.328 × 0.25 = 0.082, the posttest odds of disease with a negative test. We can again convert the odds to a probability by dividing 0.082 by 1 + 0.082 to obtain 0.076, or 7.6%. This value tells us that a person such as our patient, about whom our index of suspicion is 20%, has a posttest probability of spinal malignancy of approximately 7.6%, even with a negative test. This result is consistent with the predictive value of a negative test of 92.4% (see the section titled, “The 2 × 2 Table Method”); the posttest probability obtained from the -LR is simply 1 minus PV-.
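The odds arithmetic is easy to script. The sketch below is ours, in Python, and reproduces both posttest probabilities computed above from the ESR figures:

```python
def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(odds):
    return odds / (1 + odds)

sens, spec, prior = 0.78, 0.67, 0.20
lr_pos = sens / (1 - spec)        # likelihood ratio for a positive test
lr_neg = (1 - sens) / spec        # likelihood ratio for a negative test

post_pos = odds_to_prob(prob_to_odds(prior) * lr_pos)
post_neg = odds_to_prob(prob_to_odds(prior) * lr_neg)
print(f"+LR = {lr_pos:.2f}, posttest probability = {post_pos:.1%}")  # 2.36, 37.1%
print(f"-LR = {lr_neg:.3f}, posttest probability = {post_neg:.1%}")  # 0.328, 7.6%
```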
Absolutely nothing is wrong with thinking in terms of odds instead of probabilities. If you are comfortable using odds instead of probabilities, the calculations are really quite streamlined. One way to facilitate conversion between probability and odds is to recognize the simple pattern that results. We list some common probabilities in Table 12-8, along with the odds and the action to take with the likelihood ratio to find the posttest odds.
To use the information in Table 12-8, note that the odds in column 2 are < 1 when the probability is < 0.50, are equal to 1 when the probability is 0.50, and are > 1 when the probability is > 0.50. The last column shows that we divide the likelihood ratio to obtain the posttest odds when the prior probability is < 0.50 and multiply when it is > 0.50. To illustrate, suppose your index of suspicion, prior to ordering a diagnostic procedure, is about 25%. A probability of 0.25 gives 1 to 3 odds, so the likelihood ratio is divided by 3. If your index of suspicion is 75%, the odds are 3 to 1, and the likelihood ratio is multiplied by 3. Once the posttest odds are found, columns 2 and 1 can be used to convert back to probabilities.
Table 12-8. Conversion table for changing probabilities to odds and action to take with likelihood ratio to obtain posttest odds.
| Prior probability | Odds   | Action to take with LR |
| 0.10              | 1 to 9 | Divide LR by 9         |
| 0.20              | 1 to 4 | Divide LR by 4         |
| 0.25              | 1 to 3 | Divide LR by 3         |
| 0.50              | 1 to 1 | Use LR as is           |
| 0.75              | 3 to 1 | Multiply LR by 3       |
| 0.80              | 4 to 1 | Multiply LR by 4       |
| 0.90              | 9 to 1 | Multiply LR by 9       |
A major advantage of the likelihood ratio method is the need to remember only one number, the ratio, instead of two numbers, sensitivity and specificity. Sackett and colleagues (1991) indicate that likelihood ratios are much more stable (or robust) for indicating changes in prevalence than are sensitivity and specificity; these authors also give the likelihood ratios for some common symptoms, signs, and diagnostic tests.
Figure 12-2. Nomogram for using Bayes' theorem. (Adapted and reproduced, with permission, from Fagan TJ: Nomogram for Bayes' theorem. (Letter.) N Engl J Med 1975; 293: 257.)
Figure 12-3. Decision tree with test result branches.
A nomogram published by Fagan (1975) makes the likelihood ratio somewhat simpler to use. In this nomogram, reproduced in Figure 12-2, the pretest and posttest odds are converted to prior and posterior probabilities, eliminating the need to perform this extra calculation.
To use the nomogram, place a straightedge at the point of the prior probability, denoted P(D), on the right side of the graph and the likelihood ratio in the center of the graph; the revised probability, or predictive value P(D|T), is then read from the left-hand side of the graph. In our example, the prior percentage of 20 and the likelihood ratio of 2.36 result in a revised percentage near 40, consistent with the previous calculations.
The Decision Tree Method
Using Presenting Problem 1 again, we illustrate the decision tree method for revising the initial probability, a 20% chance of spinal malignancy in this example. Trees are useful for diagramming a series of events, and they can easily be extended to more complex examples, as we will see in Presenting Problem 4. Figure 12-3 illustrates that, prior to ordering a test, the patient can be in one of two conditions: with the disease (or condition) or without it. These alternatives are represented by two branches, one labeled D+, indicating disease present, and the other labeled D-, representing no disease.
The prior probabilities are included on each branch, 20% on the D+ branch and 100% – 20% or 80% on the D- branch. The test can be either positive or negative, regardless of the patient's true condition. These situations are denoted by T+ for a positive test and T- for a negative test and are illustrated in the decision tree by the two branches connected to both the D+ and D- branches.
In the next step, information on sensitivity and specificity of the test is added to the tree. Concentrating on the D+ branch, an ESR is positive in approximately 78% of these patients. Figure 12-4 shows the 78% sensitivity of the test written on the T+ line. In 22% of the cases (100% – 78%), the test is negative, written on the T- line. This information is then combined to obtain the numbers at the end of the lines: The result for 78% of the 20% of the men with spinal malignancy, or (20%)(78%) = 15.6%, is written at the end of the D+T+ branch; the result (20%)(22%) = 4.4% for men with spinal malignancy who have a negative test is written at the end of the D+T- branch.
Similar calculations are done for the 80% of men who do not have spinal malignancy. Note that the percentages at the ends of the four branches add to 100%. At this point, the decision tree is complete. However, the tree can also be used to find predictive values by “reversing” the tree. Figure 12-4 is replicated in Figure 12-5, showing that the first two lines are the results of the tests, T+ and T-, with the disease states D+ and D- associated with each. The values at the end of each line are simply transferred; that is, the 15.6% corresponding to D+T+ is placed on the T+D+ line, and so on. The two outcomes related to a positive test are added to obtain 42%. The predictive values can now be determined. If this man has a positive ESR, the revised probability is found by dividing 15.6% by 42%, giving a 37.1% chance, the same conclusion reached by using the 2 × 2 table method. Other predictive values are equally easy to find.
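The tree reversal can also be written out numerically. A minimal Python sketch of ours, using the branch probabilities of Figures 12-4 and 12-5:

```python
prior, sens, spec = 0.20, 0.78, 0.67

# Forward tree: joint probability at the end of each branch
p_dpos_tpos = prior * sens                # D+T+: 0.156
p_dpos_tneg = prior * (1 - sens)          # D+T-: 0.044
p_dneg_tpos = (1 - prior) * (1 - spec)    # D-T+: 0.264
p_dneg_tneg = (1 - prior) * spec          # D-T-: 0.536

# Reversed tree: group the branches by test result, then divide
p_tpos = p_dpos_tpos + p_dneg_tpos        # 0.420
pv_pos = p_dpos_tpos / p_tpos             # 0.371, as in Figure 12-5
print(f"P(T+) = {p_tpos:.3f}, PV+ = {pv_pos:.1%}")
```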
Bayes' Theorem
Another method for calculating the predictive value of a positive test involves the use of a mathematical formula. Bayes' theorem is not new; it was developed in the 18th century by an English clergyman and mathematician, Thomas Bayes, but was not published until after his death. It had little influence at the time, but two centuries later it became the basis for a different way to approach statistical inference, called Bayesian statistics. Although it was used in the early clinical epidemiology literature, the 2 × 2 table and likelihood ratio are seen today with much greater frequency and are therefore worth knowing.
Figure 12-4. Decision tree with test result information.
The formula for Bayes' theorem gives the predictive value of a positive test, or the chance that a patient with a positive test has the disease. The symbol P stands for the probability that an event will happen (see Chapter 4), and P(D+|T+) is the probability that the disease is present, given that the test is positive. As we discussed in Chapter 4, this probability is a conditional probability in which the event of the disease being present is dependent, or conditional, on having a positive test result. The formula, known as Bayes' theorem, can be rewritten from the form we used in Chapter 4, as follows:

P(D+|T+) = [P(T+|D+) × P(D+)] / [P(T+|D+) × P(D+) + P(T+|D-) × P(D-)]
This formula specifies the probability of disease, given the occurrence of a positive test. The two probabilities in the numerator are (1) the probability that a test is positive, given that the disease is present (or the sensitivity of the test) and (2) the best guess (or prior probability) that the patient has the disease to begin with. The denominator is simply the probability that a positive test occurs at all, P(T+), which can occur in one of two ways: a positive test when the disease is present, and a positive test when the disease is not present, each weighted by the prior probability of that outcome. The first quantity in this term is simply the false-positive rate, and the second can be thought of as 1 minus the probability that the disease is present.
Rewriting Bayes' theorem in terms of sensitivity and specificity, we obtain

PV+ = (Sensitivity × Prior probability) / [Sensitivity × Prior probability + (1 − Specificity) × (1 − Prior probability)]
We again use Presenting Problem 1 to illustrate the use of Bayes' formula. Recall that the prior probability of spinal malignancy is 0.20, and the sensitivity and specificity of the ESR for spinal malignancy are 78% and 67%, respectively. In the numerator, the sensitivity times the probability of disease is (0.78)(0.20). In the denominator, that quantity is repeated and added to the false-positive rate, 0.33, times 1 minus the probability of malignancy, 0.80. Thus, we have

PV+ = (0.78)(0.20) / [(0.78)(0.20) + (0.33)(0.80)] = 0.156 / 0.420 = 0.37
This result, of course, is exactly the same as the result obtained with the 2 × 2 table and the decision tree methods.
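As a numerical check, Bayes' theorem can be evaluated directly; this one-function Python sketch is ours:

```python
def bayes_pv_pos(prior, sensitivity, specificity):
    """Predictive value of a positive test from Bayes' theorem."""
    numerator = sensitivity * prior
    denominator = numerator + (1 - specificity) * (1 - prior)
    return numerator / denominator

print(f"{bayes_pv_pos(0.20, 0.78, 0.67):.2f}")   # -> 0.37
```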
Figure 12-5. Reversing tree to correspond with situation facing physicians.
A similar formula may be derived for the predictive value of a negative test:

PV- = [Specificity × (1 − Prior probability)] / [Specificity × (1 − Prior probability) + (1 − Sensitivity) × Prior probability]
Calculation of the predictive value using Bayes' theorem for a negative test in the ESR example is left as an exercise.
Using Sensitivity and Specificity in Clinical Medicine
Fairly typical in medicine is the situation in which a very sensitive test (95–99%) is used to detect the presence of a disease with low prevalence (or prior probability); that is, the test is used in a screening capacity. By themselves, such tests have little diagnostic meaning. When they are used indiscriminately to screen for diseases that have low prevalence (eg, 1 in 1000), the rate of false positivity is high. Tests with these statistical characteristics become more helpful in making a diagnosis when used in conjunction with clinical findings that suggest the possibility of the suspected disease. To summarize, when the prior probability is very low, even a very sensitive and specific test increases the posttest probability only to a moderate level. For this reason, a positive result on a very sensitive test is often followed by a very specific test, such as following a positive antinuclear antibody (ANA) test for systemic lupus erythematosus with the anti-DNA antibody procedure.
Another example of a test with high sensitivity is the serum calcium level. It is a good screening test because it is almost always elevated in patients with primary hyperparathyroidism—meaning, it rarely “misses” a person with primary hyperparathyroidism. Serum calcium level is not specific for this disease, however, because other conditions, such as malignancy, sarcoidosis, multiple myeloma, or vitamin D intoxication, may also be associated with elevated serum calcium. A more specific test, such as radioimmunoassay for parathyroid hormone, may therefore be ordered after finding an elevated level of serum calcium. The posterior probability calculated by using the serum calcium test becomes the new index of suspicion (prior probability) for analyzing the effect of the radioimmunoassay.
The diagnosis of HIV in low-risk populations provides an example of the important role played by prior probability. Some states in the United States require premarital testing for the HIV antibody in couples applying for a marriage license. The enzyme-linked immunosorbent assay (ELISA) test is highly sensitive and specific; some estimates range as high as 99% for each. The prevalence of HIV antibody in a low-risk population, such as people getting married in a Midwestern community, however, is very low; estimates range from 1 in 1000 to 1 in 10,000. How useful is a positive test in such situations? For the higher estimate of 1 in 1000 for the prevalence and 99% sensitivity and specificity, 99% of the people with the antibody test positive (99% × 1 = 0.99 person), as do 1% of the 999 people without the antibody (9.99 people). Therefore, among those with a positive ELISA test (0.99 + 9.99 = 10.98 people), only about 1 is truly positive; the positive predictive value is about 9%.
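The same arithmetic can be run for any prevalence. A short Python sketch of ours:

```python
def positive_predictive_value(prevalence, sensitivity, specificity, n=100_000):
    """PPV of a screening test applied to n people at the given prevalence."""
    tp = sensitivity * prevalence * n
    fp = (1 - specificity) * (1 - prevalence) * n
    return tp / (tp + fp)

ppv = positive_predictive_value(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"PPV = {ppv:.1%}")   # -> PPV = 9.0%
```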
The previous examples illustrate three important points:
1. To rule out a disease, we want to be sure that a negative result is really negative; therefore, not very many false-negatives should occur. A sensitive test is the best choice to obtain as few false-negatives as possible if factors such as cost and risk are similar; that is, high sensitivity helps rule out if the test is negative. As a handy acronym, if we abbreviate sensitivity by SN, and use a sensitive test to rule OUT, we have SNOUT.
2. To find evidence of a disease, we want a positive result to indicate a high probability that the patient has the disease—that is, a positive test result should really indicate disease. We therefore want few false-positives. The best method for achieving this is a highly specific test—that is, high specificity helps rule in if the test is positive. Again, if we abbreviate specificity by SP, and use a specific test to rule IN, we have SPIN.
3. To make accurate diagnoses, we must understand the role of prior probability of disease. If the prior probability of disease is extremely small, a positive result does not mean very much and should be followed by a test that is highly specific. The usefulness of a negative result depends on the sensitivity of the test.
Assumptions
The methods for revising probabilities are equivalent, and you should feel free to use the one you find easiest to understand and remember. All the methods are based on two assumptions: (1) the diseases or diagnoses being considered are mutually exclusive and include the actual diagnosis; and (2) the results of each diagnostic test are independent of the results of all other tests.
The first assumption is easy to meet if the diagnostic hypotheses are stated in terms of the probability of disease, P(D+), versus the probability of no disease, P(D-), as long as D+ refers to a specific disease.
The second assumption of mutually independent diagnostic tests is more difficult to meet. Two tests, T1 and T2, for a given disease are independent if the result of T1 does not influence the chances associated with the result of T2. When applied to individual patients, independence means that if T1 is positive in patient A, T2 is no more likely to be positive in patient A than in any other patient in the population that patient A represents. Even though the second assumption is sometimes violated in medical applications of decision analysis, the methods described in this chapter appear to be fairly robust.
Finally, it is important to recognize that the values for the sensitivity and specificity of diagnostic procedures assume the procedures are interpreted without error. For example, variation is inherent in determining the value of many laboratory tests. As mentioned in Chapter 3, the coefficient of variation is used as a measure of the replicability of assay measurements. In Chapter 5 we discussed the concepts of intrarater and interrater reliability, including the statistic kappa to measure interjudge agreement. Variability in test determination or in test interpretation is ignored in calculating sensitivity, specificity, and the predictive values of tests.
ROC CURVES
The preceding methods for revising the prior (pretest) probability of a disease or condition on the basis of information from a diagnostic test are applicable if the outcome of the test is simply positive or negative. Many tests, however, have values measured on a numerical scale. When test values are measured on a continuum, sensitivity and specificity levels depend on where the cutoff is set between positive and negative. This situation can be illustrated by two normal (gaussian) distributions of laboratory test values: one distribution for people who have the disease and one for people who do not have the disease. Figure 12-6 presents two hypothetical distributions corresponding to this situation in which the mean value for people with the disease is 75 and that for those without the disease is 45. If the cutoff point is placed at 60, about 10% of the people without the disease are incorrectly classified as abnormal (false-positive) because their test value is greater than 60, and about 10% of the people with the disease are incorrectly classified as normal (false-negative) because their test value is less than 60. In other words, this test has a sensitivity of 90% and a specificity of 90%.
Suppose a physician wants a test with greater sensitivity, meaning that the physician prefers to have more false-positives than to miss people who really have the disease. Figure 12-7 illustrates what happens if the sensitivity is increased by lowering the cutoff point to 55 for a normal test. The sensitivity is increased, but at the cost of a lower specificity.
Figure 12-6. Two hypothetical distributions with cutoff at 60. TN = true-negative; TP = true-positive; FN = false-negative; FP = false-positive.
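The tradeoff in Figures 12-6 and 12-7 can be computed directly from the two distributions. The sketch below is ours and assumes both distributions are normal with the means shown and a common standard deviation of about 11.7, chosen so that a cutoff of 60 yields the 90% sensitivity and 90% specificity described in the text:

```python
from statistics import NormalDist

diseased = NormalDist(mu=75, sigma=11.7)     # assumed SD; gives ~90%/90% at 60
nondiseased = NormalDist(mu=45, sigma=11.7)

for cutoff in (60, 55):
    sensitivity = 1 - diseased.cdf(cutoff)   # diseased patients above the cutoff
    specificity = nondiseased.cdf(cutoff)    # nondiseased patients below it
    print(f"cutoff {cutoff}: sensitivity {sensitivity:.0%}, "
          f"specificity {specificity:.0%}")
# cutoff 60: sensitivity 90%, specificity 90%
# cutoff 55: sensitivity 96%, specificity 80%
```

Lowering the cutoff from 60 to 55 raises the sensitivity but, as Figure 12-7 shows, at the cost of specificity.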
A more efficient way to display the relationship between sensitivity and specificity for tests that have continuous outcomes is with receiver operating characteristic, or ROC, curves. ROC curves were developed in the communications field as a way to display signal-to-noise ratios. If we think of true-positives as the correct signal from a diagnostic test and false-positives as noise, we can see how this concept applies. The ROC curve is a plot of the sensitivity (or true-positive rate) against the false-positive rate. The dotted diagonal line in Figure 12-8 corresponds to a test that is positive or negative just by chance. The closer an ROC curve is to the upper left-hand corner of the graph, the more accurate the test, because at that corner the true-positive rate is 1 and the false-positive rate is 0. As the criterion for a positive test becomes more stringent, the point on the curve corresponding to sensitivity and specificity (point A) moves down and to the left (lower sensitivity, higher specificity); if less evidence is required for a positive test, the point on the curve corresponding to sensitivity and specificity (point B) moves up and to the right (higher sensitivity, lower specificity).
ROC curves are useful graphic methods for comparing two or more diagnostic tests or for selecting cutoff levels for a test. For example, in Presenting Problem 3, Maisel and colleagues (2002) performed a prospective study to validate the use of B-type natriuretic peptide (BNP) measurements in the diagnosis of CHF in patients with shortness of breath seen in emergency departments. They had to decide where to put the cutoff level for BNP. These investigators' findings are reproduced in Box 12-1. The ROC curve illustrates that lowering the level of BNP that is considered a positive finding results in higher sensitivity but increasing false-positives (or decreasing specificity).
Figure 12-7. Two hypothetical distributions with cutoff at 55. TN = true-negative; TP = true-positive; FN = false-negative; FP = false-positive.
Figure 12-8. Receiver operating characteristic curve.
A statistical test can be performed to evaluate an ROC curve or to determine whether two ROC curves are significantly different. A commonly used procedure involves determining the area under each ROC curve and uses a modification of the Wilcoxon rank sum procedure to compare them. Box 12-1 shows that the area under the curve for BNP is 0.91 with a 95% confidence interval from 0.90 to 0.93. Is the curve significantly better than chance? The answer is yes, because the confidence interval does not contain 0.5, the area under the diagonal line corresponding to chance.
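The area under an ROC curve equals the probability that a randomly chosen diseased patient has a higher test value than a randomly chosen nondiseased patient, which is exactly what the rank sum procedure measures. A small Python sketch of ours, with made-up test values for illustration only:

```python
def auc(diseased_values, nondiseased_values):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the proportion of diseased/nondiseased pairs ordered correctly,
    counting ties as one half."""
    pairs = [(d, n) for d in diseased_values for n in nondiseased_values]
    score = sum(1.0 if d > n else 0.5 if d == n else 0.0 for d, n in pairs)
    return score / len(pairs)

# Hypothetical BNP-like values (pg/mL), not Maisel and colleagues' data:
print(auc([820, 640, 95, 510], [60, 150, 40, 75]))   # -> 0.9375
```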
DECISION ANALYSIS
Decision analysis can be applied to any problem in which a choice is possible among different alternative actions. Any decision has two major components: specifying the alternative actions and determining the possible outcomes of each alternative action. Figure 12-9 illustrates both components for the study of enforcing a strict lead poisoning policy (Brown, 2002). The alternative actions are to enforce or not to enforce a strict policy. If no enforcement is selected, two outcomes are possible: recurrence of lead poisoning in one or more additional children (with three different blood lead level ranges used to determine the cost), and no additional cases. A similar set of outcomes occurs if the decision is to strictly enforce the policy.
Box 12-1. Receiver operating characteristic curve for cutoff levels of B-type natriuretic peptide in differentiating between dyspnea due to congestive heart failure and dyspnea due to other causes.
The point at which a branch occurs is called a node; in Figure 12-9, nodes are identified by either a square or a circle. The square denotes a decision node—a point at which the decision is under the control of the decision maker: whether or not to enforce the strict policy in this example. The circle denotes a chance node, or point at which the results occur by chance; thus, whether lead poisoning recurs is a chance outcome of the decision to enforce or not to enforce the policy.
Determining Probabilities
The next step in developing a decision tree is to assign a probability to each branch leading from each chance node. For example, what is the probability of recurrence if the policy is not enforced? What is the probability of recurrence if the policy is enforced? What is the probability that a recurrence involves a BPb level of 10–14 µg/dL? Of 15–24 µg/dL?
To determine the probabilities for the decision tree, the investigators may survey the medical literature and incorporate the results from previous studies. Brown analyzed a set of data to determine the probability of each occurrence of lead poisoning and the resulting blood lead levels. These probabilities are written in the boxes under the branches in Figure 12-10. Note that the combined probabilities for any set of outcomes, such as for recurrence and no recurrence, equal 1.
Figure 12-9. Decision tree for blood lead level enforcement. BPb = blood lead level in µg/dL. (Reproduced, with permission, from Figure 1 in Brown MJ: Costs and benefits of enforcing housing policies to prevent childhood lead poisoning. Med Decis Making 2002; 22: 482–492.)
Deciding the Value of the Outcomes
The final step in defining a decision problem requires assigning a value, or utility, to each outcome. With some decision problems, the outcome is cost, and dollar amounts can be used as the utility of each outcome. In the Brown study, the investigator reviewed the literature and consulted various agencies for estimates of cost.
Objective versus subjective outcomes
Outcomes that have an inherent numeric value, such as costs, number of years of life, or quality-adjusted life years (QALYs), can be used directly as the utilities for a decision. When outcomes are subjective, investigators must find a way to give them a value. This process is known as assigning a utility to each outcome. The scale used for utilities is arbitrary, although a scale from 0 for the least desirable outcome, such as death, to 1 or 100 for the most desirable outcome, such as perfect health, is frequently used.
An example of determining subjective utilities
Subjective utilities can be obtained informally or by a more rigorous process called the lottery technique, which draws on game theory. To illustrate, suppose we ask you to play a game in which you can choose a prize of $50 or you can play the game with a 50–50 chance of winning $100 (and nothing if you lose). Here, the expected value of playing the game is 0.50 × $100 = $50, the same as the prize. Do you take the sure $50 or play the game? If you choose not to gamble and take $50 instead, then we ask whether you will play the game if the chance of winning increases from 50% to 60%, resulting in an expected value of $60, $10 more than the prize. If you still take the $50, then we increase the chance to 70%, and so on, until we reach a point at which you cannot decide whether to play the game or take the prize, called the point of indifference. This is the value you attach to playing this game. We say you are risk-averse when you refuse to gamble even when the odds are in your favor, that is, when the expected value of the game is more than the prize.
Suppose now that a colleague plays the game and chooses the $50 prize when the chance of winning $100 is 50–50. Then, we ask whether the colleague will still play if the chance of winning $100 is only 40%, and so on, until the point of indifference is reached. We describe your colleague as risk-seeking when he or she is willing to gamble even when the odds are unfavorable and the expected value of the game is less than the prize.
Figure 12-10. Decision tree for blood lead level enforcement with probabilities and costs. BPb = blood lead level in µg/dL. (Adapted and reproduced, with permission, from Figure 1 in Brown MJ: Costs and benefits of enforcing housing policies to prevent childhood lead poisoning. Med Decis Making 2002; 22: 482–492.)
Completing the Decision Tree
The analysis of the decision tree involves combining the probabilities of each action with the utility of each so that the optimal decision can be made at the decision nodes. To decide whether to enforce a strict policy for preventing lead poisoning, the Brown study in Presenting Problem 4 defined the decision alternatives and possible outcomes, along with estimated probabilities and costs, as illustrated in Figure 12-10.
The decision tree is analyzed by a process known as calculating the expected utilities. Calculations begin with the outcomes and work backward through the tree to the point where a decision must be made. In our example, the first step is to determine the expected cost (EC) of the blood lead level outcomes related to a recurrence, obtained by multiplying the probability of each outcome by the cost for that outcome and summing the relevant products. In words, we add the probability of BPb 10–14 (0.46) times the associated cost ($74,166), the probability of BPb 15–24 (0.43) times the associated cost ($156,151), and the probability of BPb ≥ 25 (0.11) times the associated cost ($349,660):

EC(recurrence) = (0.46)($74,166) + (0.43)($156,151) + (0.11)($349,660) = $139,724
The result, $139,724, indicates the average cost over all outcomes from the decision not to enforce the policy in children who subsequently develop lead poisoning. This process is repeated for each chance node, one step at a time, back through the tree. Continuing with the example, the EC of not enforcing the policy is the probability of recurrence times the cost associated with this outcome, just found to be $139,724, plus the probability of no recurrence times its cost; that is,

EC(no enforcement) = P(recurrence) × $139,724 + P(no recurrence) × (cost of no recurrence)

with the probabilities taken from Figure 12-10.
Similarly, the EC of recurrence with enforcement of the policy is
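EC(recurrence, enforcement) = Σ [p(BPb outcome) × cost(BPb outcome)] = $119,009

using the probabilities and costs shown for the enforcement branch in Figure 12-10,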
and the EC of enforcing a strict policy is the probability of recurrence times this value ($119,009), plus the probability of no recurrence times the associated cost; that is,
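EC(enforcement) = p(recurrence) × $119,009 + p(no recurrence) × cost(no recurrence) = $56,639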
The expected costs are added to the decision tree in Figure 12-11. The expected cost with no enforcement is $101,998 per child, compared with $56,639 for strict enforcement. Enforcement therefore reduces the expected cost by $45,359 per child, on average. The most effective strategy is called the dominant strategy.
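The fold-back step can be written compactly. The minimal sketch below is ours, not the authors'; it reproduces the $139,724 figure from the probabilities and costs quoted above, and the same function would be applied at every chance node in Figure 12-10.

    def expected_cost(branches):
        """Expected cost of a chance node: the probability-weighted
        sum of the costs on its branches."""
        return sum(p * cost for p, cost in branches)

    # Recurrence node under no enforcement, with the probabilities
    # and costs quoted in the text.
    recurrence_no_enforcement = [
        (0.46, 74_166),   # BPb 10-14
        (0.43, 156_151),  # BPb 15-24
        (0.11, 349_660),  # BPb >= 25
    ]
    print(round(expected_cost(recurrence_no_enforcement)))  # -> 139724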
Conclusions from the Decision Analysis
The investigator concluded that “a strategy of strict enforcement of housing policies … results in savings due to … decreased medical costs for children who are protected from lead exposure.”
Figure 12-11. Decision tree for blood lead level enforcement. Abbreviations: BPb = blood lead level in µg/dL; EC = expected cost. (Adapted and reproduced, with permission, from Figure 1 in Brown MJ: Costs and benefits of enforcing housing policies to prevent childhood lead poisoning. Med Decis Making 2002; 22: 482–492.)
The optimal decision is the one with the lowest cost or largest expected value, and the decision maker's choice is relatively easy. When the expected costs or utilities of two decisions are very close, the situation is called a toss-up, and considerations such as the estimates used in the analysis become more important.
Evaluating the Decision: Sensitivity Analysis
Accurate probabilities for each branch in a decision tree are frequently difficult to obtain from the literature, and investigators often must use estimates made for related situations. For example, in Presenting Problem 4, the author states that the distributions of blood lead levels and the costs of lead hazard reduction were varied in the model, and the strategy of strict enforcement remained less costly over a wide range of possible values. The procedure for evaluating how the decision changes as a function of changing probabilities and utilities is called sensitivity analysis. It is possible to perform an analysis to determine the sensitivity of the final decision to two or more assumptions simultaneously. Most statisticians and researchers in decision analysis recommend that all published reports of a decision analysis include a sensitivity analysis to help readers know the range of situations in which the results are applicable.
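As a toy illustration of a one-way sensitivity analysis (not the authors' model), the sketch below varies the probability of recurrence and reports which decision has the lower expected cost. The $0 no-recurrence cost and the $10,000 baseline enforcement cost are placeholder assumptions chosen only to show how a decision threshold emerges; the two recurrence-node costs are the values computed above.

    def ec(p_recur, cost_recur, cost_no_recur):
        """Expected cost of a decision branch with one chance node."""
        return p_recur * cost_recur + (1 - p_recur) * cost_no_recur

    for p in [i / 10 for i in range(1, 10)]:
        no_enforce = ec(p, 139_724, 0)       # $0 if no recurrence: assumption
        enforce = ec(p, 119_009, 10_000)     # $10,000 baseline cost: assumption
        best = "enforce" if enforce < no_enforce else "do not enforce"
        print(f"p(recurrence) = {p:.1f}: {best}")

Under these assumed inputs the preferred decision flips near p(recurrence) = 0.33; reporting such thresholds is exactly what a sensitivity analysis does.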
USING DECISION ANALYSIS TO COMPARE STRATEGIES
No consensus exists about which diagnostic approach is best for follow-up in a patient with low back pain and suspected spinal malignancy. Among the available diagnostic procedures are erythrocyte sedimentation rate (ESR), lumbosacral radiograph, bone scan, lumbar MRI, and biopsy under fluoroscopic guidance. An ideal diagnostic protocol would combine these tests so that all cancers could be found without undue costs, risks, or discomfort to the patient. The ideal does not exist, however, because none of the tests is perfect; and as more tests are done, the costs and risks increase accordingly.
Several protocols have been recommended in the literature, and the range of procedures used in actual practice varies widely. Joines and colleagues (2001) designed a decision analysis to determine the most effective protocol. They examined several protocols; these are outlined in Figure 12-12A and B. We discuss several of the protocols and illustrate the way in which Joines and colleagues evaluated their effectiveness.
Figure 12-12. Screening strategies for spinal malignancy. (Reproduced, with permission, from Tables 1A and 1B in Joines JD, McNutt RA, Carey TS, Deyo RA, Rouhani R: Finding cancer in primary care outpatients with low back pain. J Gen Intern Med 2001; 16: 14–23.)
Protocols Evaluated
The strategies evaluated are lettered A through F. Figure 12-12A shows strategy A as a rather complex algorithm. If a patient has a history of cancer and the ESR is < 20 mm/h, a radiograph is done, which, if positive, leads to imaging studies. With a history of cancer and an ESR ≥ 20 mm/h, the patient goes directly to imaging studies. If there is no history of cancer, the approach is based on the number of risk factors the patient has (age ≥ 50 years, weight loss, or failure to improve). Several other strategies are illustrated in Figure 12-12B and require either a history of cancer or one of the previously mentioned three risk factors. A summary of several strategies follows; a short sketch after the list shows one way such rules can be encoded:
B. ESR, if < 20, is followed by radiograph. If radiograph is positive, it is followed by imaging studies. If ESR ≥ 20, imaging studies are done.
B2. If a history of cancer, or ESR ≥ 20, or positive radiograph, perform imaging studies.
C. Imaging studies are done on all patients.
D. ESR, if ≥ 20, is followed by imaging studies.
E. Radiograph, if positive, is followed by imaging studies.
E2. If a history of cancer or positive radiograph, perform imaging studies.
F. ESR, if ≥ 20, is followed by radiograph. If radiograph is positive, it is followed by imaging studies.
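Rules of this kind are easy to state as simple predicates. The sketch below is a hypothetical encoding of two of them, using the cutoffs in the list above; the Patient structure and its field names are our illustrative inventions, not from the paper.

    from dataclasses import dataclass

    @dataclass
    class Patient:
        history_of_cancer: bool
        esr: float               # mm/h
        radiograph_positive: bool

    def strategy_b2(p: Patient) -> bool:
        """B2: image if history of cancer, ESR >= 20 mm/h, or positive film."""
        return p.history_of_cancer or p.esr >= 20 or p.radiograph_positive

    def strategy_d(p: Patient) -> bool:
        """D: image only if ESR >= 20 mm/h."""
        return p.esr >= 20

    print(strategy_b2(Patient(False, 25.0, False)))  # -> True
    print(strategy_d(Patient(False, 15.0, True)))    # -> False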
Assumptions Made in the Analysis
A decision problem of this scope required the investigators to estimate the sensitivity and specificity of a number of patient characteristics and diagnostic tests for spinal malignancy. The values of the patient characteristics and costs for the diagnostic tests are given in Table 12-9.
Results of the Decision Analysis
The investigators developed a decision tree for each of the protocols evaluated in their study. Using a computer program, they evaluated each tree to determine the effectiveness of the protocol. The results for the five dominant strategies are given in Table 12-10. In general, the relationship between the sensitivity of a strategy and its cost is positive, but it is not linear. The procedure that detects the most cancers is strategy C, imaging all patients; however, it has the lowest specificity (resulting in more false-positives), and it is the most costly ($241 per patient). Protocol B2 is almost as sensitive, is more specific, and costs only $110 per patient. A number of other interesting criteria with which to judge the strategies are given in Table 12-10.
The authors prepared a graph comparing the numbers of cases of cancers found per 1000 patients and the cost per patient. The graph is reproduced in Figure 12-13 and clearly illustrates the conclusions we just described.
Table 12-9. Clinical findings and diagnostic tests used in the decision model.
Conclusions from the Decision Analysis
From the results of the decision analysis of the different strategies, the investigators recommended strategy B2: imaging patients with a clinical finding (history of cancer, age ≥ 50 years, weight loss, or failure to improve) if there is an elevated ESR (≥ 50 mm/h) or a positive radiograph, or directly imaging patients with a history of cancer.
USING DECISION ANALYSIS TO EVALUATE TIMING & METHODS
In Presenting Problem 5, Brown and Garber (1999) evaluated the cost-effectiveness (CE) of three new technologies intended to improve the sensitivity of cervical cancer screening compared with the Pap smear, the standard of care. The Pap smear is considered a cost-effective tool, but its sensitivity is low for a screening test, only about 75–85%. Several new technologies have been developed that are reported to improve the sensitivity of Pap testing; however, they increase the cost of each test.
Table 12-10. Comparison of the five dominant strategies in the baseline analysis.
The investigators conducted a MEDLINE search for articles published between 1987 and 1997 on the use of AutoPap 300 QC, Papnet, and ThinPrep 2000 in the detection of cervical cytopathologic abnormalities. They also searched three cytopathology journals and obtained data from the manufacturers. They used estimates of the sensitivity, or true-positive rate (TP), and the cost of each test in a mathematical model to calculate the lifetime costs and health effects associated with these screening strategies. The four screening strategies they evaluated were:
Table 12-11. Cost-effectiveness of conventional and ThinPrep-, AutoPap-, and Papnet-enhanced cervical screening strategies, for screening women aged 20–65 years.
1. Pap smear with rescreening of a 10% random sample
2. ThinPrep with rescreening of a 10% random sample
3. Pap smear with AutoPap-assisted rescreening of all results that were within normal limits
4. Pap smear with Papnet-assisted rescreening of all results that were within normal limits
Figure 12-13. Receiver operating curve for testing for spinal malignancy. (Reproduced, with permission, from Figure 2 in Joines JD, McNutt RA, Carey TS, Deyo RA, Rouhani R: Finding cancer in primary care outpatients with low back pain. J Gen Intern Med 2001; 16: 14–23.)
The investigators used a theoretical cohort of women and assumed that screening starts at 20 years of age, life expectancy is 78.27 years, and unscreened women have about a 2.5% lifetime chance of developing cervical cancer and a 1.2% chance of dying from the disease. All CE ratios are expressed as screening costs in U.S. dollars per year of life saved (YLS) by using a given technology. A relatively low CE ratio for a given intervention represents a good value. The CE ratio for each test was calculated for annual, biennial, triennial, and quadrennial screening frequencies.
The result of the CE analysis is reproduced in Table 12-11. The cost of the three new technologies increased the cost per woman screened by $30 to $257. When they were compared with Pap smear alone, life expectancy increased by 5 h to 1.6 days per woman screened, depending on the technology and frequency of screening.
The final column of Table 12-11 gives the incremental cost per year of life saved. The least expensive is Pap smear screening every 4 years. Pap smear with Papnet-assisted rescreen is always the most expensive, although it is associated with the greatest increase in days of life. ThinPrep with 10% random rescreen was always dominated, meaning that this approach produced less health benefit at higher cost—thus, it would never be chosen as the best approach.
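Two simple calculations underlie comparisons of this kind: the incremental cost-effectiveness ratio and a check for dominance. The sketch below uses placeholder numbers chosen only for illustration; they are not entries from Table 12-11.

    def icer(cost_new, cost_old, effect_new, effect_old):
        """Incremental cost per unit of health effect gained, e.g., $/YLS."""
        return (cost_new - cost_old) / (effect_new - effect_old)

    def dominated(cost, effect, alternatives):
        """A strategy is dominated if some alternative is no more costly
        and yields at least as much health benefit."""
        return any(c <= cost and e >= effect for c, e in alternatives)

    # Placeholder example: a new strategy costs $200 more per woman and
    # adds 0.002 year of life (about 17 hours).
    print(icer(1_200, 1_000, 0.010, 0.008))           # -> 100000.0 ($/YLS)
    print(dominated(1_200, 0.008, [(1_000, 0.010)]))  # -> True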
Figure 12-14 reproduces the figure in the article and illustrates the results of the cost-effectiveness analysis. Expressing the outcome in cost per year of life saved provides an interpretation that can be used to compare the cost and benefit of screening for one procedure versus another. This information can be especially useful when available resources require that decisions be made among screening for different diseases.
Another index used in decision analysis studies is the QALY, or quality-adjusted life year; it combines the quantity and the quality of life. The idea behind the QALY is that 1 year of life in perfect health is worth 1, and 1 year of life in less than perfect health is worth less than 1. To illustrate, suppose a patient will live 1 year without treatment, whereas treatment will extend the patient's life to 3 years but at a lower quality of life, say 0.75. Treatment then yields 3 × 0.75 = 2.25 QALYs, compared with 1 × 1.0 = 1 QALY without treatment, so the treatment gains 2.25 − 1 = 1.25 QALYs.
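In code, the arithmetic of this worked example is just two products:

    # 1 year at perfect health versus 3 years at utility 0.75
    qalys_without_treatment = 1 * 1.00
    qalys_with_treatment = 3 * 0.75
    print(qalys_with_treatment - qalys_without_treatment)  # -> 1.25 QALYs gained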
Figure 12-14. Cost-effectiveness of different technologies in average-risk women for screening intervals of 1–4 years. The solid lines apply when it is possible to vary both the technology and the frequency; dashed lines apply when the screening interval cannot be varied. Screening begins at age 20 years and continues to age 65. Numbers adjacent to the solid line are cost-effectiveness ratios in dollars per year of life saved for the two options being compared. Points below and to the right of lines represent dominated alternatives. (Reproduced, with permission, from Brown AD, Garber AM: Cost-effectiveness of three methods to enhance the sensitivity of Papanicolaou testing. JAMA 1999; 281: 347–353. Copyright © 1999, American Medical Association.)
COMPUTER PROGRAMS FOR DECISION ANALYSIS
Several computer programs have been written for researchers who wish to model clinical decision-making problems. These programs are generally single-purpose programs and are not included in the general statistical analysis programs illustrated in earlier chapters of this book.
Decision Analysis by TreeAge (DATA) software is a decision analysis program that lets the researcher build decision trees interactively on the computer screen. Once the tree is developed, TreeAge performs the calculations for the expected utilities to determine the optimal pathway for the decision. It also performs sensitivity analysis. The information from Presenting Problem 4 on blood lead poisoning is used to illustrate the computer output from TreeAge in Figure 12-15. The researcher who uses computers to model decision problems is able to model complex problems, update or alter them as needed, and perform a variety of sensitivity analyses with relative ease.
Figure 12-15. Illustration of a decision tree, using data on blood lead level enforcement. Abbreviations: BPb = blood lead level in µg/dL; EC = expected cost. (Adapted and reproduced, with permission, from Brown MJ: Costs and benefits of enforcing housing policies to prevent childhood lead poisoning. Med Decis Making 2002; 22: 482–492. Decision Analysis by TreeAge (DATA) is a registered trademark of TreeAge Software, Inc.; used with permission.)
SUMMARY
Topics in this chapter are departures from the topics considered in traditional introductory biostatistics textbooks. The increase in medical studies using decision-making methods and the growing emphasis on evidence-based medicine, however, indicate that practitioners should be familiar with these concepts. Equally important, the methods discussed in this chapter for calculating the probability of disease are ones that every clinician must be able to use in everyday patient care. These methods allow clinicians to integrate the results of published studies into their own practice of medicine.
We presented four equivalent methods for determining how likely a disease (or condition) is in a given patient based on the results of a diagnostic procedure. Three pieces of information are needed: (1) the probability of the disease or condition prior to any procedure, that is, the base rate (or prevalence); (2) the accuracy of the procedure in identifying the condition when it is present (sensitivity); and (3) the accuracy of the procedure in identifying the absence of the condition when it is indeed absent (specificity). We can draw an analogy with hypothesis testing: A false-positive is similar to a type I error, falsely declaring a significant difference; and sensitivity is like power, correctly detecting a difference when it is present.
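A small sketch of that calculation follows, using Bayes' theorem with the three inputs named above. The numeric inputs (a 30% prior, 80% sensitivity, 90% specificity) are purely hypothetical values chosen for illustration.

    def post_test_probability(prior, sensitivity, specificity, positive=True):
        """Probability of disease after a test result, by Bayes' theorem."""
        if positive:
            true_pos = prior * sensitivity
            false_pos = (1 - prior) * (1 - specificity)
            return true_pos / (true_pos + false_pos)   # predictive value positive
        false_neg = prior * (1 - sensitivity)
        true_neg = (1 - prior) * specificity
        return false_neg / (false_neg + true_neg)      # disease despite a negative test

    print(round(post_test_probability(0.30, 0.80, 0.90), 2))         # -> 0.77
    print(round(post_test_probability(0.30, 0.80, 0.90, False), 2))  # -> 0.09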
The logic discussed in this chapter is applicable in many situations other than diagnostic testing. For example, the answer to each history question or the finding from each maneuver of the physical examination may be interpreted in a similar manner. When the outcome from a procedure or inquiry is expressed as a numerical value, rather than simply as positive or negative, ROC curves can be used to evaluate the ramifications of decisions, such as selecting a given cutoff value or comparing the efficacy of two or more diagnostic methods.
Articles in the literature sometimes report predictive values using the same subjects that were used to determine sensitivity and specificity, ignoring the fact that predictive values change as the prior probability or prevalence changes. As readers, we can only assume these investigators do not recognize the crucial role that prior probability (or prevalence) plays in interpreting the results of both positive and negative procedures.
We extended these simple applications to more complex situations. A diagnostic procedure is often part of a complex decision, and the methods in this chapter can be used to determine probabilities for branches of the decision tree. These methods allow research findings to be integrated into the decisions physicians must make in diagnosing and managing diseases and conditions. A procedure called sensitivity analysis can be performed to determine which assumptions are important and how changes in the probabilities used in the decision analysis will change the decision.
Decision analysis can also be used to determine the most efficient approach for dealing with a problem. Because increasing attention is being focused on the cost of medical care, increasing numbers of articles dealing with decision analysis now appear in the literature. Decision analysis can help decision makers who must choose between committing resources to one program or another.
We also reviewed a novel but effective application of decision analysis to evaluating protocols recommended by experts but not subjected to clinical trial. The protocols outlined the steps to take in doing workups for patients with low back pain who may be at risk of spinal malignancy.
Finally, we described a study that compares the standard of care, the Pap smear, to three new technologies and varies the frequency of screening. The investigators were hoping to find a cost-effective method to improve the poor sensitivity of the Pap smear. They found that screening every 3 years with AutoPap or Papnet used to rescreen all Pap smear results within normal limits produces more life years at lower costs than biennial Pap smears alone.
Not many introductory texts discuss topics in medical decision making, partly because it is not viewed as mainstream biostatistics. A survey of the biostatistics curriculum in medical schools (Dawson-Saunders et al, 1987), however, found that these topics were taught at 87% of medical schools. In summary, we note that published articles (Pauker and Kassirer, 1987; Raiffa, 1997; Davidoff, 1999; Greenhalgh, 1997a) and texts (Ingelfinger et al, 1993; Eddy, 1996; Locket, 1997; Weinstein et al, 1998; Sackett et al, 2000) discuss the role of decision analysis in medicine, and you may wish to consult these resources for a broader discussion of the issues. One advantage of performing a well-defined analysis of a medical decision problem is that the process itself forces explicit consideration of all factors that affect the decision.
Before leaving this chapter, we want to mention a very useful Web site developed by Alan Schwartz: http://www.araw.mede.uic.edu/~alansz/tools.html
It contains several computational aids for finding predictive values. One routine calculates predictive values for 2 × 2 tables.
Table 12-12. Sensitivity and specificity of different maneuvers for lumbar disk herniation.
EXERCISES
1. Suppose a 70-year-old woman comes to your office because of fatigue, pain in her hands and knees, and intermittent, sharp pains in her chest. Physical examination reveals an otherwise healthy female—the cardiopulmonary examination is normal, and no swelling is present in her joints. A possible diagnosis in this case is systemic lupus erythematosus (SLE). The question is whether to order an ANA (antinuclear antibody) test and, if so, how to interpret the results. Tan and coworkers (1982) reported that the ANA test is very sensitive to SLE, being positive 95% of the time when the disease is present. It is, however, only about 50% specific: Positive results are also obtained with connective tissue diseases other than SLE, and the occurrence of a positive ANA in the normal healthy population also increases with age.
a. Assuming this patient has a baseline 2% chance of SLE, how will the results of an ANA test that is 95% sensitive and 50% specific for SLE change the probabilities of lupus if the test is positive? Negative?
b. Suppose the woman has swelling of the joints in addition to her other symptoms of fatigue, joint pain, and intermittent, sharp chest pain. In this case, the probability of lupus is higher, perhaps 20%. Recalculate the probability of lupus if the test is positive and if it is negative.
2. Use Bayes' theorem and the likelihood ratio method to calculate the probability of no lupus when the ANA test is negative, using a pretest probability of lupus of 2%.
3. Joines and colleagues (2001) examined two cutoff values for ESR.
a. What is the effect on the sensitivity of ESR for spinal malignancy if the threshold for a positive test is increased from ESR ≥ 20 mm/h to ESR ≥ 50 mm/h?
b. What is the effect on the number of false-positives?
4. A 43-year-old white male comes to your office for an insurance physical examination. Routine urinalysis reveals glucosuria. You recently learned of a newly developed test that produced positive results in 138 of 150 known diabetics and in 24 of 150 persons known not to have diabetes.
a. What is the sensitivity of the new test?
b. What is the specificity of the new test?
c. What is the false-positive rate of the new test?
d. Suppose a fasting blood sugar is obtained with known sensitivity and specificity of 0.80 and 0.96, respectively. If this test is applied to the same group that the new test used (150 persons with diabetes and 150 persons without diabetes), what is the predictive validity of a positive test?
e. For the current patient, after the positive urinalysis, you think the chance that he has diabetes is about 90%. If the fasting blood sugar test is positive, what is the revised probability of disease?
5. Consider a 22-year-old woman who comes to your office with palpitations. Physical examination shows a healthy female with no detectable heart murmurs. In this situation, your guess is that this patient has a 25–30% chance of having mitral valve prolapse, from prevalence of the disease and physical findings for this particular patient. Echocardiograms are fairly sensitive for detecting mitral valve prolapse in patients who have it—approximately 90% sensitive. Echocardiograms are also quite specific, showing only about 5% false-positives; in other words, a negative result is correctly obtained in 95% of people who do not have mitral valve prolapse.
a. How does a positive echocardiogram for this woman change your opinion of the 30% chance of mitral valve prolapse? That is, what is your best guess on mitral valve prolapse with a positive test?
b. If the echocardiogram is negative, how sure can you be that this patient does not have mitral valve prolapse?
6. Assume a patient comes to your office complaining of symptoms consistent with a myocardial infarction (MI). Based on your clinical experience with similar patients, your index of suspicion for an MI is 80%. Use the information from Shlipak and associates (1999) and Table 12-1 to answer the following questions.
a. The ECG on this patient shows ST elevation ≥ 5 mm in discordant leads. What is the probability that the patient has an MI?
b. If the ECG does not exhibit ST elevation, what is the probability that the patient has an MI anyway?
c. What do these probabilities tell you?
d. What is the likelihood ratio for ST elevation ≥ 5 mm in discordant leads?
e. What are the pretest odds? The posttest odds?
7. The Journal of the American Medical Association frequently publishes a case study with comments by a discussant. Weinstein (1998) was the discussant on a case involving a 45-year-old man with low back pain and a numb left foot. Table 1 in the article gives the sensitivity and specificity of several physical examination maneuvers for lumbar disk herniation based on a study by Deyo and colleagues (1992). Selected tests are listed in Table 12-12.
a. Which physical examination test is best to rule in lumbar disk hernia?
b. Which physical examination test is best to rule out lumbar disk hernia?
8. No consensus exists regarding the management of incidental intracranial saccular aneurysms. Some experts advocate surgery; others point out that the prognosis is relatively benign even without surgery, especially for aneurysms smaller than 10 mm. The decision is complicated because rupture of an incidental aneurysm is a long-term risk, spread out over many years, whereas surgery represents an immediate risk. Some patients may prefer to avoid surgery, even at the cost of later excess risk; others may not.
Van Crevel and colleagues (1986) approached this problem by considering a fictitious 45-year-old woman with migraine (but otherwise healthy) who had been having attacks for the past 2 years. Her attacks were right-sided and did not respond to medication. She had no family history of migraine. The neurologist suspected an arteriovenous malformation and ordered four-vessel angiography, which showed an aneurysm of 7 mm on the left middle cerebral artery. Should the neurologist advise the patient to have preventive surgery? The decision is diagrammed in Figure 12-16.
The investigators developed a scale for the utility of each outcome, ranging from 0 for death to 100 for perfect health. They decided that disability following surgery should be valued at 75.
If no surgery is performed, the possibility of a rupture is considered. A utility of 100 is still used for no rupture and for recovery following a rupture. Disability following a rupture at some time in the future is valued more highly, however, than disability following immediate surgery and is given a utility of 90.1. Similarly, death following a future rupture is preferred to death immediately following surgery and is given a utility of 60.2. These utilities are given in Figure 12-16.
a. What is the expected utility of the three outcomes if there is a rupture?
Figure 12-16. Decision tree for aneurysms with probabilities and utilities included. Abbreviation: EU = expected utility. (Adapted and reproduced, with permission, from van Crevel H, Habbema JDF, Braakman R: Decision analysis of the management of incidental intracranial saccular aneurysms. Neurology 1986; 36: 1335–1339.)
b. What is the expected utility of the decision not to operate?
c. What is the expected utility of the decision to operate?
d. Which approach has the highest utility?
9. A decision analysis for managing ulcerative colitis points out that patients with this condition are at high risk of developing colon cancer (Gage, 1986). The analysis compares the decisions of colectomy versus colonoscopy versus no test or therapy. The decision tree developed for this problem is shown in Figure 12-17. The author used information from the literature for the data listed on the tree. The utilities are 5-year survival probabilities (multiplied by 100).
a. What is the probability of colon cancer used in the analysis?
b. The author gave a range of published values for sensitivity and specificity of colonoscopy with biopsy but did not state the precise values used in the analysis. Can you tell what sensitivity and specificity were used in the analysis?
c.
Figure 12-17. Decision tree for ulcerative colitis. (Adapted and reproduced, with permission, from Gage TP: Managing the cancer risk in chronic ulcerative colitis. A decision-analytic approach. J Clin Gastroenterol 1986; 8: 50–57.)
d. The expected utility of the colonoscopy arm is calculated as 94.6. Calculate the expected utility of the no-test-or-therapy arm and the colectomy arm. What is the procedure with the highest expected utility, that is, what is the recommended decision?
10. Group Exercise. Select a diagnostic problem of interest and perform a literature search to find published articles on the sensitivity and specificity of diagnostic procedures used with the problem.
a. What terminology is used for sensitivity and specificity? Are these terms used, or are results discussed in terms of false-positives and false-negatives?
b. Are actual numbers given or simply the values for sensitivity and specificity?
c. Does a well-accepted gold standard exist for the diagnosis? If not, did investigators provide information on the validity of the assessment used for the gold standard?
d. Did investigators discuss the reasons for false-positives? Were the selection criteria used for selecting persons with and without the disease appropriate?
e. Did authors cite predictive values for positive and negative tests? If so, why is this inappropriate?
11. Group Exercise. The CAGE questionnaire was designed as a screening device for alcoholism (Bush et al, 1987). The questionnaire consists of four questions:
a. Have you ever felt you should Cut down on your drinking?
b. Have people Annoyed you by criticizing your drinking?
c. Have you ever felt bad or Guilty about your drinking?
d. Have you ever had a drink first thing in the morning to steady your nerves or get rid of a hangover (Eye-opener)?
The questionnaire is scored by counting the number of questions to which the patient says yes. Buchsbaum and coworkers (1991) studied the predictive validity of CAGE scores. Obtain a copy of the article and calculate the predictive validity for 0, 1, 2, 3, and 4 positive answers. Draw an ROC curve for the questionnaire. Do you think this questionnaire is a good screening tool?
e. Discuss the pros and cons of using this questionnaire with women with node-negative breast cancer and of using similar instruments for patients with other diseases. What are some possible advantages? Disadvantages? Would you use this instrument or a similar one with your patients?