Steven Piantadosi,
Jeanne Kowalski
SUMMARY OF KEY POINTS

INTRODUCTION
The purpose of this chapter is to review biostatistical concepts that are helpful to clinicians in planning, conducting, and assessing clinical trials in oncology, and to provide some insight into the direction that future trials might take in light of the genomics era. For review, we describe areas where a statistical perspective can lead to improved design, execution, analysis, and interpretation of clinical studies. Several good texts and expository articles provide additional details regarding these concepts. These include history and policy,^{ [1] [2]} general discussions,^{ [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]} cancer trials,^{ [13] [14] [15] [16]} ethics,^{ [17] [18] [19]} prognostic factor analyses,^{ [20] [21]} and reporting.^{ [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33]}
Much of this chapter pertains to study design, because a well-designed and well-executed trial addressing an important therapeutic or management question will usually provide cogent evidence without an elaborate analysis. In fact, statistical analysis can do little to make the results of a poorly designed or executed trial compelling. This is not to trivialize analysis; an improper one can distort the findings of a well-designed and well-executed trial. To this end, our first topic is an outline of some of the sources of uncertainty in inferences from clinical trials, including dose-finding (phase I) studies, safety and activity (phase II) studies, and comparative (phase III) studies. We then discuss five areas of statistical activity that are important to the success of a program of studies in the management of oncologic disease: (1) formulation and refinement of an important therapeutic question through the use of developmental trials, (2) designing comparative trials, (3) implementing a trial and assuring quality control, (4) data analysis, and (5) describing results and preparing publications.
Later in this chapter, we discuss the future direction of cancer trials considering the combined roles of biostatistics and bioinformatics in cancer research, using vaccine trials as an example. Gene expression microarrays and gene expression databases provide new opportunities for the discovery of drug targets and for determining a drug's mode of action. Bioinformatics provides the computational tools by which to extract this information. Biostatistics provides the analytic tools to incorporate this information into clinical trial design and analysis. We later revisit concepts introduced at the beginning of this chapter within the context of cancer vaccine trials, because such trials represent a premier application of both fields, with the development of immuno-based therapies designed to target the function of discovered genes.
Many medical advances have been, and continue to be, made without conducting formal clinical trials. For example, when treatment effects are large, they usually become evident despite the variability and bias present in less formal methods of evaluation. However, many important treatment effects are characterized by natural variability of about the same magnitude as the treatment effect itself. In these circumstances, only careful design and conduct of clinical trials will separate treatment effect from bias and error reliably. Well-designed studies will estimate the magnitude of important clinical effects, quantify errors resulting from chance, reduce or eliminate bias, provide a high degree of credibility and reproducibility in results, and influence future clinical practice.
To meet these goals, investigators often conduct a variety of types of clinical studies. It is helpful to distinguish clinical trials from other types of medical investigations on the basis of who controls three essential components of design: treatment or exposure of the subjects, endpoint ascertainment (i.e., collection of outcome data), and analysis. True experiments place all three components under the control of the investigator; other types of studies cede some of this control. For example, a case report is relatively weak evidence, because it is a demonstration only that some event of clinical interest is possible. A case series is a demonstration of possibly related clinical events but is subject to large selection biases. In a database analysis, treatment is not determined by design but by patient or physician preference, permitting large biases. In an observational study, the investigator takes advantage of natural exposures or treatment selection and chooses an appropriate comparison group by design. However, confounders may not be controlled adequately. In a clinical trial, treatment assignment is by design and endpoint ascertainment is actively performed on all subjects.
Using this type of hierarchy, one can see that the strength of evidence in medical studies is directly related to the amount of prospective design in the investigation. The more control exerted by the investigators over the essential components, the stronger will be the design and the more credible will be the results. See Byar^{[34]} for a general discussion of this topic.
SOURCES OF UNCERTAINTY IN CLINICAL TRIALS
The most convincing clinical trials make use of methods to control and minimize relevant sources of uncertainty. Two types of uncertainty or errors can result when making inferences about treatment effects in medical studies: bias (systematic error) and random error. Both types of error can be controlled by using proper design. However, neither type of error can be reliably controlled by analysis alone.
Bias
There are numerous sources and types of bias in clinical trials; for example, see Sackett^{[35]} or Chalmers.^{[36]} All biases produce systematically high or low misestimates of the true treatment effect. Most of the time we do not know the direction of a particular bias, which means that it can either mimic a treatment effect or obscure it. Because, in human studies, we are often interested in treatment effects that are about the same magnitude as potential biases (and natural variation), control of systematic errors is critical. Consequently, we routinely attempt to eliminate bias through the appropriate use of design features such as eligibility criteria, randomization, objective endpoints, active ascertainment of outcomes, and treatment masking.
Patients who agree to participate in a clinical trial are usually not perfectly representative of the population with the disease (selection bias). Although this is often said to affect the external validity of the study, it is unlikely to affect the estimates of treatment differences. When the comparison group is subject to the same selection effect, as in randomized studies, relative treatment effects are estimated essentially without bias and are likely to generalize to individuals who do not meet the eligibility criteria. More clinically consequential biases arise from exclusion of patients after study entry, loss of data for reasons associated with outcome or prognosis, differential assessment of outcomes in treatment groups, and retrospective definitions or analyses. For example, it often seems clinically appropriate to exclude patients because of nonevaluability or nonadherence with the study protocol. However, such definitions are applied after registration or randomization and are therefore outcomes as well as potential predictors. One cannot reliably make exclusions based on such outcomes without the potential for bias.
Some data can be missing for reasons associated with outcome. For example, a recurrence (or death) event may not be observed because the patient has not returned to clinic for follow-up visits. Commonly used life table methods assume that such study subjects are censored at the time of last follow-up. Yet informative censoring results in an underreporting of events and can only be corrected by actively ascertaining the status of all patients.
Random Errors
In later sections of this chapter we emphasize estimation of effects and confidence intervals as being the most clinically useful summary of data. However, because statistical hypothesis tests have had a prominent role in the design and analysis of trials and still provide a useful perspective on errors of inference, we discuss errors attributable to chance in these traditional terms. The two types of random error that can result from a formal hypothesis test are shown in Table 22-1. The type I error is a false-positive result and occurs if there is no treatment effect or difference but the investigators wrongly conclude that there is. The chance of making a type I error is frequently under the control of the investigator, even into the analysis stage of a clinical trial. This is true because the type I error can usually be controlled through the level of significance chosen for statistical tests.
Table 22-1  Random Errors from Hypothesis Tests

| Result of Hypothesis Test | True State of Nature: H_{0} True | True State of Nature: H_{0} False |
| --- | --- | --- |
| Reject H_{0} | Type I error | No error |
| Do not reject H_{0} | No error | Type II error |
The type I error must be carefully considered during the design of a clinical trial when multiple statistical tests are to be performed, a process that inflates the overall type I error. This happens, for example, when investigators intend to examine accumulating data and repeatedly perform statistical tests, as is done in sequential or group sequential interim monitoring of clinical trials. Failing to account properly for the effect of such repeated hypothesis tests can greatly increase the type I error rate. This point will be expanded later in a discussion of sequential methods.
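The inflation can be illustrated with a short calculation. The sketch below assumes k statistically independent tests, each performed at the same significance level; the function name is chosen only for illustration. Repeated looks at accumulating data are positively correlated, so the inflation in interim monitoring is somewhat smaller than this bound, but the qualitative message is the same.

```python
def familywise_error(alpha: float, k: int) -> float:
    """Chance of at least one false-positive result among k independent
    tests, each performed at significance level alpha (an assumption:
    interim analyses of the same trial are NOT independent)."""
    return 1.0 - (1.0 - alpha) ** k


for k in (1, 2, 5, 10):
    print(f"{k:2d} tests at 0.05: overall type I error = {familywise_error(0.05, k):.3f}")
```

Under this simplifying assumption, five tests at the 0.05 level already carry an overall type I error above 20%.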
The type II error is a false-negative result and occurs when we fail to detect a treatment effect or difference that is actually present. The power of a clinical trial is the chance of declaring a treatment effect of a specified size to be statistically significant (i.e., not making a type II error). The type II error can only be controlled by proper design, specifically a sufficiently large sample size, and not by procedures used in the analysis of the study.
A small study can yield a high power to detect a large treatment difference. However, as indicated previously, clinicians are usually genuinely interested in modest-sized or small treatment effects. A small study will have low power to detect small differences. There is little sense in undertaking a complex, expensive trial when the chance of missing a clinically important effect is larger than the chance of finding it.
A power calculation is hypothetical, being based on a specified, but as yet unobserved, treatment effect. Some observers of trials consider the power of a completed study against the observed difference, so-called post hoc power. When the observed treatment difference is smaller than anticipated, the post hoc power of the trial against that difference will be low, which seems to provide a cogent criticism. But after the study the treatment effect is no longer hypothetical, making a power calculation uninteresting unless someone intends to perform a new study in exactly the same circumstance. When a trial is finished, all of the information about the treatment effect is contained in the estimated value and its confidence interval; a post hoc power calculation adds nothing.
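The dependence of power on effect size can be made concrete with a rough calculation. The sketch below uses the usual normal approximation for a two-sided comparison of two response proportions with equal group sizes; the response rates and the sample size of 30 per arm are hypothetical values chosen for illustration, not figures from this chapter.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions
    (normal approximation; the negligible opposite tail is ignored)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_treated) / 2
    # Standard error under the null hypothesis (pooled proportion).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)
    # Standard error under the alternative (separate proportions).
    se_alt = sqrt(p_control * (1 - p_control) / n_per_arm
                  + p_treated * (1 - p_treated) / n_per_arm)
    diff = abs(p_treated - p_control)
    return z.cdf((diff - z_crit * se_null) / se_alt)


# Hypothetical small trial: 30 patients per arm.
print(f"large effect (20% vs 60%): power = {power_two_proportions(0.20, 0.60, 30):.2f}")
print(f"small effect (20% vs 30%): power = {power_two_proportions(0.20, 0.30, 30):.2f}")
```

With these assumed values the same small trial has high power against the large difference and very low power against the modest one.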
TYPES OF CLINICAL TRIALS
The use of clinical trials in oncology is similar to that in medical disciplines studying prevention, drugs, and devices. During the early developmental stages of new therapies, physicians evaluate evidence regarding related treatments and perform noncomparative clinical trials on the therapy under investigation. Statistical thinking is of great benefit in areas such as critical review of relevant literature, overviews of previous trial results, translational research studies, designing dose finding and toxicity studies, and designing studies to estimate treatment effects and feasibility.
Translational Trials
Only a small fraction of therapeutic ideas progress to developmental clinical trials. The transition from laboratory to clinic is guided by small targeted studies rather than large clinical trials. These small experiments, translational trials, may be the most common types of clinical trials performed. The methodology of translational clinical trials has not been fully formulated in the literature. This discussion is based on earlier work.^{[37]} Nearly every new therapy depends on a transition from laboratory to clinic. Translational studies can sometimes be part of later developmental trials, provided that the subjects and the questions are compatible with it. The traditional dividing line between laboratory and clinical development is often said to be the phase I study (discussed later). In fact the interface between laboratory research and clinical development is formed by translational clinical trials.
The outcome in a translational trial is a biologic marker (target) that may require validation as part of the study. This is not a surrogate outcome, because it is not used to assess clinical benefit, although it might anticipate later questions of clinical benefit. The action of the treatment on the target defines the next experimental steps to be taken. In particular, the absence of a positive change is evidence of inactivity of the treatment.
A biologic outcome provides definitive evidence, an irrefutable signal, within the accepted paradigm of disease and treatment. The signal must be positive to support further clinical development. Good biologic signals might be based on a change in levels of a protein or gene expression, or the activity of some enzyme. As an example we consider a new drug for secondary prevention. The goal might require reducing the presence of biomarkers in the tissue. The absence of these effects would necessitate discarding or modifying the drug (or choosing a different target). Positive biomarker changes might suggest the need for preclinical improvements. Neither result would establish clinical efficacy. The basic characteristics of a translational trial are described in Table 22-2.
Table 22-2  Translational Trial: Basic Characteristics

“A clinical trial where the primary outcome: (1) is a biological measurement (target) derived from a well established paradigm of disease, and (2) represents an irrefutable signal regarding the intended therapeutic effect. The design and purposes of the trial are to guide further experiments in the laboratory or clinic, inform treatment modifications, and validate the target, but not necessarily to provide reliable evidence regarding clinical outcomes.”^{[37]}
Translational trials have continued experimentation as the primary context. The trial may provide evidence about the utility of more than one outcome. Many therapeutic ideas will prove useless during the laboratory-clinic iteration.
There are noteworthy deficiencies in these designs. One is the lack of proven clinical validity for the outcome. A second is the poor statistical properties of estimates. Not so obvious is that the translational trial paradigm confounds three effects: (1) the correctness of the disease paradigm, (2) the relevance of the biologic outcome, and (3) the action of the therapy. Errors in any of these can masquerade as either a positive or a negative treatment effect. However, the correctness of the disease paradigm and the selection of a relevant outcome are usually based on strong evidence from earlier studies.
Dose Finding
Clinical trials that focus primarily on the relationship between dose and safety of new drugs or biologics are often termed phase I trials, particularly in oncology. Their purpose is to study drug distribution, metabolism, excretion, and toxicity and, in the case of cytotoxic drugs, to determine the dose associated with tolerable and reversible side effects. Until the last decade or so, statistical thinking contributed relatively little to the design of these studies. However, the relatively informal methods for dose finding have been steadily improved in recent years by statistical approaches. In oncology, dose-finding studies are often done in patients who have been previously treated with standard therapies. The features of classic phase I designs include (1) prior selection of a small set of drug doses to be tried, (2) treatment of a small number of patients (e.g., three) at each dose with toxicity monitoring, (3) decision rules for stopping the trial based on clinical outcomes, and (4) decision rules for escalating or de-escalating the dose in a subsequent cohort. Often a few additional patients are studied at the final dose, with the total number of patients treated usually being fewer than 25 to 30. This type of design alleviates certain practical and ethical problems in administering agents with unknown properties to humans. For example, it tends to minimize the number of patients treated at high (toxic) doses of the drug. However, it also tends to treat relatively large numbers of patients at lower (ineffective) doses, which makes such designs inefficient. These properties tend to select conservative doses for later developmental testing.
Improved dose-finding designs have been suggested that correct such problems.^{ [38] [39]} In some of these designs, doses are not prespecified but are determined from the current results and a mathematical model of the dose-toxicity curve. The final sample size of the trial is not fixed in advance but depends on the toxicities observed. The continual reassessment method is a prototypical design with these features. Some appropriate variant of the continual reassessment method has to be considered the design of choice for dose-finding trials of cytotoxic agents.
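The behavior of a classic cohort-based design can be explored by simulation. The sketch below assumes one common rule set, often called "3+3": escalate after 0 of 3 toxicities, expand to 6 patients after 1 of 3, and stop when 2 or more patients at a dose have dose-limiting toxicity. The chapter does not specify these exact rules, the variant shown omits de-escalation and confirmation cohorts, and the true toxicity probabilities are hypothetical.

```python
import random


def simulate_three_plus_three(tox_probs, rng):
    """Simulate one trial under a simplified 3+3 rule (an assumed variant).

    Returns (index of the selected dose, or None if even the lowest dose
    is too toxic; number of patients treated at each dose)."""
    treated = [0] * len(tox_probs)
    dose = 0
    while True:
        # Treat a cohort of three and count dose-limiting toxicities (DLTs).
        dlt = sum(rng.random() < tox_probs[dose] for _ in range(3))
        treated[dose] += 3
        if dlt == 1:
            # Ambiguous result: expand to six patients at the same dose.
            dlt += sum(rng.random() < tox_probs[dose] for _ in range(3))
            treated[dose] += 3
        if dlt >= 2:
            # Too toxic: stop and select the next lower dose, if any.
            return (dose - 1 if dose > 0 else None), treated
        if dose == len(tox_probs) - 1:
            # Highest planned dose was tolerated.
            return dose, treated
        dose += 1


rng = random.Random(1)                   # fixed seed for reproducibility
tox = [0.05, 0.10, 0.20, 0.35, 0.50]     # hypothetical true DLT rates
picks = [simulate_three_plus_three(tox, rng)[0] for _ in range(2000)]
for d in range(len(tox)):
    print(f"dose {d}: selected in {picks.count(d) / len(picks):.0%} of simulated trials")
```

Tabulating the selected dose and the patients treated per dose over many simulated trials is one way to examine how conservative such a rule is for a given toxicity curve.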
Safety and Activity
After determining the pharmacologic properties of a new drug and a clinically useful dose, trials focus on obtaining evidence of treatment safety and activity. These are conventionally termed phase II trials or safety and activity trials. The principal question to be addressed in this stage of development is whether or not the new treatment has enough promise to warrant testing against standard therapy in a large comparative trial—that is, a rigorous study with an internal control group. In this middle developmental step, the study design usually chosen to answer this question is a single cohort trial with an external control group. The control group comparison is usually based on the literature, prior investigator experience, or consensus opinion as to what constitutes a worthwhile level of activity in a given disease.
Such studies usually make use of surrogate clinical outcomes instead of definitive outcomes such as survival. Surrogate outcomes are chosen because ideally they are known soon after treatment, are easily and accurately measured, and are thought to be informative with respect to later definitive outcomes. Tumor shrinkage (response rate) is a classic surrogate outcome for activity in this setting, based on the cytotoxic model where it would imply tumor cell killing.
Unfortunately, tumor shrinkage is a poor surrogate for survival. Furthermore, some therapies would not be expected to produce tumor shrinkage in the usual cytotoxic model. An example might be cytostatic agents. Some caution has to be exercised when making developmental decisions on the basis of surrogate outcomes. For certain agents, safety and activity trials should make use of definitive outcomes such as survival (overall failure rate), making them somewhat larger and lengthier than conventional designs. For other agents, such as those based on vaccines, there has been much attention given to the potential use of molecular biomarkers based on gene discovery associations with cancer prognosis as surrogate outcomes; this is discussed further in the latter part of this chapter.
Two types of designs are commonly used in middle development: fixed sample size and staged. In fixed sample size trials, the number of study subjects is chosen in advance, for example to yield a specified precision in the estimated response rate. Staged designs use a treatment evaluation after groups of subjects have been entered, permitting early termination of accrual if high or low response rates are observed. Excellent working designs can be obtained from only two stages.^{[40]} Numerous other statistical issues arise in the design and evaluation of phase II trials. Questions include patient selection, how to quantitatively evaluate response, patient exclusions, and the role of randomization. Space does not permit discussing these issues here. Reviews can be found in Buyse and colleagues.^{[13]}
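The operating characteristics of a two-stage design follow directly from binomial probabilities. The sketch below considers only early stopping for a low response rate, and the stopping thresholds (at most 1 response in the first 10 patients; at most 5 responses in 29 patients overall) and the response rates of 10% and 30% are illustrative values, not taken from this chapter.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k responses among n patients."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Probability of k or fewer responses among n patients."""
    if k < 0:
        return 0.0
    return sum(binom_pmf(i, n, p) for i in range(min(k, n) + 1))


def two_stage_characteristics(p, r1, n1, r, n):
    """Stop after stage 1 if responses <= r1 of n1; otherwise enroll up to
    n patients and call the treatment promising only if total responses
    exceed r. Returns (P(early stop), P(promising), expected sample size)."""
    n2 = n - n1
    p_early_stop = binom_cdf(r1, n1, p)
    p_promising = sum(
        binom_pmf(x1, n1, p) * (1 - binom_cdf(r - x1, n2, p))
        for x1 in range(r1 + 1, n1 + 1)
    )
    expected_n = n1 + (1 - p_early_stop) * n2
    return p_early_stop, p_promising, expected_n


for rate in (0.10, 0.30):  # hypothetical uninteresting and interesting rates
    pet, promising, en = two_stage_characteristics(rate, r1=1, n1=10, r=5, n=29)
    print(f"true rate {rate:.2f}: P(stop early) = {pet:.2f}, "
          f"P(declared promising) = {promising:.3f}, expected n = {en:.1f}")
```

The attraction of staging is visible in the expected sample size: when the true response rate is low, most simulated trials stop after the first stage.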
Sample Size for Middle Developmental Trials
There is a large literature concerning the quantitative design of middle developmental (phase II) trials. Here we consider some simple concepts to illustrate the connection between biologic outcomes and study size. Consider a phase II trial in which patients with esophageal cancer are treated with chemotherapy before surgical resection. A complete response is defined as the absence of macroscopic and microscopic tumor at the time of surgery. We suspect that this might occur 35% of the time and would like the 95% confidence interval of our estimate to be ±15%. Approximate 95% confidence intervals for a proportion, P, are $P \pm 1.96\sqrt{P(1-P)/n}$, where n is the number of patients tested and 1.96 is the quantile from the normal distribution corresponding to a two-sided probability of 5%. Substituting into this formula yields $0.15 = 1.96\sqrt{0.35(1-0.35)/n}$, or n = 39 patients required to meet the requirements for precision. Because 35% is just an estimate of the proportion and some patients may not complete the study, the actual sample size might be increased slightly. Expected accrual rates may be used to estimate the required duration of this study in a straightforward fashion.
A useful, but rough, rule of thumb for estimating sample sizes needed for proportions may be derived in the same way. Because P(1 − P) is maximal for P = 0.5, an approximate and conservative relation between n, the sample size, and w, the half-width (± precision) of the 95% confidence interval, is n = 1/w^{2}. Thus, to achieve a precision of ±10% (.10) requires 100 patients, and a precision of ±20% (.20) requires 25 patients. This inverse-square relation demands large sample sizes for high precision. This rule of thumb is not valid for proportions that deviate greatly from .5. For example, for proportions less than about .2 or greater than about .8, exact binomial methods should be used to estimate precision and sample size.
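Both calculations are easy to reproduce. The following sketch applies the normal-approximation interval for a proportion and the conservative rule of thumb; the function names are illustrative, and w is taken to be the half-width (the ± precision) of the interval.

```python
from math import ceil
from statistics import NormalDist


def n_for_proportion(p, half_width, confidence=0.95):
    """Patients needed so that the normal-approximation confidence
    interval for a proportion p has the requested half-width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return ceil(z**2 * p * (1 - p) / half_width**2)


def n_rule_of_thumb(half_width):
    """Conservative approximation n = 1 / w^2, obtained by setting
    p = 0.5 and rounding 1.96 up to 2."""
    return round(1 / half_width**2)


print(n_for_proportion(0.35, 0.15))                   # esophageal cancer example
print(n_rule_of_thumb(0.10), n_rule_of_thumb(0.20))   # +/-10% and +/-20%
```

Because the rule of thumb fixes P at 0.5, it gives a larger sample size than the direct formula whenever the anticipated proportion is farther from one half.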
Similarly, consider a middle developmental trial in which a definitive outcome such as reduction in the overall failure rate is required. On a log scale, the confidence interval for the hazard ratio is $\log D \pm Z_{\alpha}/\sqrt{d}$, where D is the hazard ratio, d is the total number of failures, and $Z_{\alpha} = 1.96$ for a two-sided 95% interval. Like that for a response rate, this confidence interval can be made as small (precise) as necessary by observing more events.
Compared with a reference failure rate on standard therapy, a reduction of 33% on a new treatment (hazard = 0.2; hazard ratio = 0.67) might be considered a useful improvement. If the reference failure rate is 0.3 per person-year (corresponding to a median failure time of 2.3 years) and accrual proceeds at 75 subjects per year for 2 years with 1 additional year of follow-up, then we would expect to observe about 48 failures. This number of events would yield a precision of about ±0.06 (95% confidence interval) in the observed failure rate. Thus, such a study has to be larger and longer than a conventional safety and activity trial with a surrogate outcome.
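These figures can be checked approximately. The sketch below assumes exponentially distributed failure times, uniform accrual, and no loss to follow-up; the chapter does not state which assumptions were used, but with the reduced hazard of 0.2 per person-year these assumptions give a number of failures close to the one quoted. The interval for the hazard is computed on the log scale with standard error 1/√d.

```python
from math import exp, log, sqrt


def expected_failures(hazard, accrual_per_year, accrual_years, followup_years):
    """Expected number of failures, assuming exponential failure times,
    uniform accrual, and complete follow-up (all simplifying assumptions)."""
    n = accrual_per_year * accrual_years
    # Average survival probability at analysis over uniformly spread entry times.
    mean_survival = (exp(-hazard * followup_years)
                     - exp(-hazard * (accrual_years + followup_years))) / (hazard * accrual_years)
    return n * (1 - mean_survival)


def hazard_interval(hazard, failures, z=1.96):
    """Approximate confidence limits for a hazard, using a standard
    error of 1/sqrt(failures) on the log scale."""
    half = z / sqrt(failures)
    return hazard * exp(-half), hazard * exp(half)


print(f"median failure time at 0.3/year: {log(2) / 0.3:.1f} years")
print(f"expected failures at hazard 0.2: {expected_failures(0.2, 75, 2, 1):.1f}")
low, high = hazard_interval(0.2, 48)
print(f"95% interval for hazard 0.2 with 48 failures: {low:.3f} to {high:.3f}")
```

The resulting limits are slightly asymmetric around 0.2 because the interval is symmetric only on the log scale.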
COMPARATIVE STUDIES
Helping to design a comparative trial is a major responsibility for the biostatistician. The process involves detailed discussions with other investigators to resolve issues such as (1) what population should be studied, (2) are the treatment methods unambiguously defined, (3) how will patients be assigned to treatment groups, (4) how will outcomes be measured, and what can be done to assure that measurements will be obtained uniformly on all patients regardless of treatment assignment, and (5) what can be done to minimize loss to follow-up and to promote compliance with the treatment protocol.
In what follows, we outline some points of good design (Table 22-3) that are not intended to be taken chronologically. In fact, many of them must proceed simultaneously. However, attention to each of these items will probably result in a stronger trial.
Table 22-3  Ten Concepts in Comparative Trial Design

Dual Roles of the Physician
Physicians who develop new treatments have two roles that are sometimes dissonant with each other. The first is as an advocate for the care and interests of the individual patient. The second is as a scientist representing the needs of others. The conduct of clinical trials is one of many areas where these roles can, but do not necessarily, conflict. From a clinical trials perspective, physician advocacy for the individual patient is an ideal and not an exclusive standard of conduct. This ideal is not met in circumstances of triage, allocation of scarce and expensive technologies such as organ transplantation, training of new physicians, and vaccination. All of these circumstances knowingly place some patients at risk for the benefit of others.
Even if we accept the individual advocacy ideal, some patients will always receive an inferior treatment as a result of physician error. Failure to learn from such mistakes can hardly be considered ethical. It is incumbent upon the physician to learn quickly and convincingly from the inevitable use of less effective treatments so that their scope of application is minimized.
Controlled experiments in the proper clinical setting are the most reliable way to accomplish this. Conversely, we must learn about efficacious new therapies as quickly as possible so that they can be used broadly.
Although clinical trials are conducted worldwide, there are concerns voiced occasionally about the ethics of randomization. Although any medical technology can be used inappropriately in specific instances, there is nothing inherently unethical in the use of randomization when it is used because physicians lack knowledge about the superiority of treatments and to eliminate bias so that the best possible evidence can be gathered. There will always be circumstances of collective uncertainty in which randomized treatment comparisons are the most ethical course of action.
In some circumstances, evidence becomes available during the conduct of the trial that one treatment is superior. This can happen, for example, if one treatment is unexpectedly better or worse or has unacceptable side effects. Investigators are ethically bound to learn of such circumstances as early as possible by monitoring the accumulating data and closing the inferior treatment arm if necessary. The administrative and statistical plans to meet this contingency require planning during the design phase of the trial.
Quantification of Objectives
An important task in designing a clinical trial and drafting the study protocol is to convert clinical objectives into quantitative measurements of outcome variables. For example, we might be interested to know if a certain therapy “results in lower morbidity.” However, the measurement of morbidity is not automatically well defined, particularly if the study involves more than one investigator. At least three aspects of morbidity must be defined. The first is a window of time during which adverse events can plausibly be attributed to the therapy. The second is a list of specific diagnoses or complications to be included. The third is a list of procedures required to establish each diagnosis definitively.
Definition of the Study Population Using Eligibility and Exclusion Criteria
Differences in eligibility criteria probably explain many of the discrepant results in the clinical literature from seemingly identical clinical trials. Even when several institutions use the same protocol, differences in interpretation of eligibility criteria and type of patients referred contribute to differences in outcomes. As a consequence, trials from different institutions and/or periods of time may not be comparable even if the eligibility criteria are the same. This is one argument in favor of randomized concurrent controls.
To some extent, study results can be shaped by the eligibility and exclusion criteria. Consider how drug toxicity or operative morbidity can be reduced by the careful selection of patients. Age restrictions can reduce the number and severity of many chemotherapy toxicities, although such restrictions are seldom made explicit. Eligibility criteria can be used to define a more homogeneous study population, reducing the interpatient variability in outcomes. However, this will not necessarily reduce the size of a trial. For example, patients with poor prognosis may respond to treatment in the same way as those with good prognosis. If so, a trial excluding poor-prognosis patients would be needlessly prolonged.
In many instances endpoints may be evaluated more easily if certain complicating factors are prevented by patient exclusion. For example, if patients with recent nonpulmonary malignancies are excluded from a lung cancer trial, evaluation of tumor recurrences and second primaries might be made simpler.
Ethical considerations also suggest that patients who are unlikely to benefit from the treatment (e.g., because of organ system dysfunction) not be allowed to participate in the trial. Whenever possible, quantitative parameters such as laboratory values should be used to make these definitions rather than qualitative clinical assessments. Some studies in patients with advanced cancer call for a “life expectancy of at least 6 months.” A more useful and reproducible criterion might be Karnofsky performance status greater than, say, 8.
Assessment of Accrual Resources
One unfortunate and preventable mistake made in clinical trials is to plan and initiate a study, only to have it terminate early because of low accrual. This situation can be avoided with some advance planning. First, investigators should be aware of the accrual rate required to complete a study in a certain fixed period of time. This is a best-case projection. Most researchers would like to see comparative treatment trials completed within 5 years and pilot or feasibility studies finished within a year or two. Disease prevention trials may take longer. In any case, the accrual rate required to complete a study within the time targeted can be estimated easily from the total sample size required.
Second, investigators must obtain realistic estimates of accrual rates. The raw number of patients with a specific diagnosis can often be determined easily from hospital or clinic records but is a large overestimate of potential study accrual. It must be reduced by the proportion of subjects likely to meet the eligibility criteria, and again by the proportion of those willing to participate in the trial (e.g., consenting to randomization). This latter proportion is usually less than half. Study duration can then be projected based on this potential accrual rate, which might be one fourth to half of the patient population.
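The arithmetic of this projection is simple enough to sketch. All of the numbers below (200 patients seen per year, 60% eligible, 50% consenting, 150 subjects required) are hypothetical and serve only to illustrate the calculation.

```python
def projected_accrual_rate(patients_per_year, frac_eligible, frac_consenting):
    """Realistic yearly accrual: the raw number of patients with the
    diagnosis, reduced by eligibility and again by willingness to participate."""
    return patients_per_year * frac_eligible * frac_consenting


def projected_duration_years(sample_size, accrual_rate):
    """Years of accrual needed to reach the required sample size."""
    return sample_size / accrual_rate


rate = projected_accrual_rate(200, frac_eligible=0.60, frac_consenting=0.50)
print(f"potential accrual: {rate:.0f} subjects per year")
print(f"accrual period for 150 subjects: {projected_duration_years(150, rate):.1f} years")
```

Repeating the calculation with less favorable proportions gives a pessimistic projection against which the feasibility of the study can be judged.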
Third, investigators can project trial duration based on a worst-case accrual. The study may still be feasible under such plans. If not, plans for terminating the trial because of low accrual are needed so as not to waste resources. In particular, accrual estimates from participating institutions other than the investigators’ own are suspect. How long will the study take as a single-institution trial?
To estimate accrual more accurately, participants can be formally surveyed before accrual starts. As patients are seen over a period of time, a record can be kept to see if they match the eligibility criteria. To estimate the proportion willing to give consent, one could briefly explain the proposed study and ask if they would hypothetically be willing to participate.
Treatment Specification
Control of treatments and their allocation is a defining characteristic of true experimental designs. In practical situations, explicit plans are needed for modifications in the treatment of individual patients. To satisfy scientific objectives, essential components of the therapy should be guided by the protocol, but modifications that are unlikely to affect the outcome should be left to the treating physician.
Physicians participating in trials are always obligated to replace protocol treatments with others when they feel that it is in the best interests of the patient. However, sufficient flexibility in the treatment specification, especially concerning complications, toxicity, or side effects, may permit most patients to continue following the protocol. This could contribute more information to the trial results and enhance the credibility of the study report.
Definition of Endpoints and Methods of Assessment
Selection of endpoints and the use of prospective methods of assessment greatly affect the strength of a trial. Important characteristics of the endpoint are that it corresponds to the scientific objectives of the trial and that it is biologically meaningful within the context of the therapy or intervention under investigation; see further discussion on this point in the Design section. The method of assessing endpoints should be accurate and free of bias. This is helpful for both subjective endpoints and objective ones such as survival and recurrence time. Even when using well-defined event times, incomplete follow-up can create bias.
From a biostatistical perspective, there are three types of endpoints that are likely to be used widely in oncology trials: (1) continuously varying measurements, (2) dichotomous outcomes, and (3) event times. We will briefly discuss each of these.
Continuously Varying Measurements
Measurements that can theoretically vary continuously over some range are common and useful types of assessments. Examples include many laboratory values, blood or tissue levels, functional disability measures, or physical dimensions. In a study population these measurements have a distribution, often characterized by a mean or other location parameter, and variance or other dispersion parameter. Consequently, these outcomes will be most useful when the primary effect of a treatment is to raise or lower the average measure in a population. Typical statistical tests that can detect differences such as these include the t-test or a nonparametric analog and analyses of variance (for more than two groups). To control the effect of confounders or prognostic factors on these outcomes, linear regression models might be used.
Dichotomous Measures
Some assessments have only two possible values, for example, present or absent. Examples include some imprecise measurements such as tumor size, which might only be described as responding or not, and outcomes like infection, which is either present or not. Inaccuracy in measurement can make a continuous value ordinal or dichotomous. In the study population these outcomes will be frequently summarized as a proportion. Comparing proportions might lead to tests such as the chi-square or exact conditional tests. Another useful population summary is the odds or log-odds. The effect of prognostic factors or confounders on this outcome can often be modeled using logistic regression.
Event Times
Event times are common and useful outcome measurements in clinical trials. Survival time and disease-free or recurrence time are well-known examples. However, many other intervals might be of clinical importance, such as time to hospital discharge or time spent on a ventilator. The distinguishing feature of event time outcomes is the possibility of censoring. This means that some subjects under observation may not experience the event by the end of the study. Using the information in the censored observation time requires some special statistical procedures. In the study population, event time or “survival” distributions (e.g., life tables) might be used to summarize the data. Clinicians are also accustomed to seeing medians or fixed time proportions used to summarize these outcomes. Perhaps the most useful summary is the hazard, which can be thought of as a proportion adjusted for follow-up time. The effect of prognostic factors or confounders on hazard rates can often be modeled using survival regression models.
Other Endpoints
There are several other types of endpoints that are important for some medical studies but are not used frequently in oncology. These include counts, multiple-category outcomes, ordered categories, disease intensity measures, and repeated measurements. For example, units of blood used might be described as a count and chemotherapy toxicities are often described in ordered categories. As another example, cytogenetic responses are often recorded as categorical outcomes, and have received much attention for their use in assessing minimal residual disease (MRD) and in vaccine trials of chronic myeloid leukemia (CML). There has also been much discussion among clinical trial designers recently concerning the use of intermediate endpoints that become known early after treatment but are very reliably associated with definitive outcomes. Examples include premalignant lesions in cancer prevention studies and CD4 lymphocyte counts in acquired immunodeficiency syndrome (AIDS). Intermediate endpoints are probably not as relevant to oncologic studies as to these other areas of study. One notable exception is the use of prostate-specific antigen to monitor prostate cancer recurrence.
Control Treatment Allocation and Bias
Randomization
Randomization is one of the most effective means for reducing bias, because it guarantees that treatment assignment will not be based on patients’ prognostic factors. The benefits derived from randomized treatment assignment are well known.^{ [36] [41]} Following randomization, treatment differences can be attributed to the true treatment effect plus random variability.
One argument against randomization is that it is unnecessary because confounders can be controlled in the analysis by using statistical adjustment procedures. The extent to which this can be done relies upon two additional assumptions: (1) the investigators have measured the confounders in the experimental subjects, and (2) the assumptions of the statistical models or other adjustment procedures are known to be correct. Randomization is a more reliable method than adjustment, because it controls bias without these assumptions. Moreover, it controls the effects of confounders whether they are known to the investigator or not. Critics of randomization often overlook this last point, which provides randomized studies with their high degree of credibility.
Blinding or Masking
Masking (blinding) is another bias-reducing technique in which the patient (single-blind), physician (double-blind), and perhaps the monitors (triple-blind) in a clinical trial are unaware of the individual patient treatment assignments. As a result of blinding, treatment assessments can be made without prejudice, increasing the utility of both objective and subjective outcomes. Masking of drugs is often simple to implement, particularly with the assistance of a hospital pharmacy or pharmaceutical company. In oncology, treatment masking is frequently possible though sometimes logistically impractical.
Calculation of Quantitative Properties of the Design (Precision, Power, Duration)
Questions regarding the quantitative properties of clinical trial designs are among those most frequently asked by clinicians. It is not possible to specify a universally valid approach to answering such questions. Instead, we provide some basic ideas and examples. For a more statistically oriented review, see Donner.^{[42]} Although computer software is available to perform many power and sample size calculations (e.g., Hintze^{[43]}), most programs are written for a statistical user.
Two basic considerations in estimating precision and power are the purpose of the trial and its primary endpoint. For noncomparative designs, the goals of the study are often to estimate some useful clinical quantity with a specified precision. Examples of endpoints with clinical interest are average blood or tissue levels of a drug, the proportion of patients responding or meeting other predefined criteria, or population failure rates. A useful measure of precision is the confidence interval of the estimate. For example, narrow 95% confidence intervals indicate a higher degree of certainty about the location of a true effect than wide 95% confidence intervals do. Because confidence intervals depend on the number of subjects studied, targets for precision can often be translated into requirements for sample size.
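As a rough illustration of how a precision target translates into a sample size, the sketch below assumes the usual normal-approximation (Wald) interval for a proportion, whose half-width is $z\sqrt{p(1-p)/n}$. The function name and the example inputs are hypothetical, and other interval methods would give somewhat different answers.

```python
import math
from statistics import NormalDist


def n_for_proportion(p_expected, half_width, confidence=0.95):
    """Patients needed so the approximate CI for a proportion has the given half-width.

    Assumes the normal-approximation interval p +/- z * sqrt(p * (1 - p) / n).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p_expected * (1 - p_expected) / half_width ** 2)


# Hypothetical example: anticipated response rate of 30%.
n_wide = n_for_proportion(0.30, 0.10)     # estimate to within +/- 10 percentage points
n_narrow = n_for_proportion(0.30, 0.05)   # estimate to within +/- 5 percentage points
print(n_wide, n_narrow)
```

Halving the interval width roughly quadruples the required number of patients, which is why precision targets should be set with accrual in mind.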
Comparative clinical trials require more complicated methods to estimate sample size and power. Often, comparative studies are designed to yield statistical hypothesis tests with desirable properties, such as a high power to detect important clinical differences reliably. In trials with survival time as the primary endpoint, the power of the study depends on the number of events (e.g., recurrences or deaths). Confusion can arise over the number of patients placed on study versus the number of events required for the trial to have the intended statistical properties. As a test of equality between treatment groups, it is common to compare the ratio of hazard (or failure) rates (see following definition) versus 1.0. Under fairly flexible assumptions, the size of such a study should satisfy $d = (Z_{\alpha} + Z_{\beta})^{2} \left( \frac{\Delta + 1}{\Delta - 1} \right)^{2}$, where $d$ is the total number of events needed on the study, $\Delta$ is the ratio of hazards in the two treatment groups, and $Z_{\alpha}$ and $Z_{\beta}$ are the normal quantiles for the type I and II error rates.^{[43]}
For example, with this formula, to detect a hazard ratio of 1.75 as being statistically significantly different from 1.0 by using a two-sided 0.05 α-level test with 90% power requires $d = (1.96 + 1.282)^{2} \left( \frac{1.75 + 1}{1.75 - 1} \right)^{2} \approx 141$ events.
This is not the final sample size, as suggested by the safety and activity trial example discussed earlier. A sufficient number of patients must be placed in the study to yield 141 events in an interval of time appropriate for the trial. For example, if 50% of patients remain event-free (censored) at the end of the trial, 282 subjects are required. In general, the sample size, $n$, is $n = d/(1 - p)$, where $p$ is the proportion censored.
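The event and sample size arithmetic can be checked with a short script. This is a sketch, assuming the event-count formula $d = (Z_{\alpha} + Z_{\beta})^{2}\,[(\Delta + 1)/(\Delta - 1)]^{2}$, which reproduces the 141 events quoted in the example; the function names are hypothetical.

```python
from statistics import NormalDist


def events_required(hazard_ratio, alpha=0.05, power=0.90):
    """Total events needed to detect the given hazard ratio (two-sided test).

    Assumes d = (Z_alpha + Z_beta)^2 * ((HR + 1) / (HR - 1))^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided type I error quantile
    z_beta = NormalDist().inv_cdf(power)            # type II error quantile
    return (z_alpha + z_beta) ** 2 * ((hazard_ratio + 1) / (hazard_ratio - 1)) ** 2


def patients_required(events, prop_censored):
    """Patients needed to observe `events` when a proportion remain censored."""
    return events / (1 - prop_censored)


d = events_required(1.75)                  # about 141.3 events
n = patients_required(round(d), 0.50)      # 141 events with 50% censoring
print(f"events: {d:.1f}, patients: {n:.0f}")
```

Because the hazard ratio enters only through $(\Delta + 1)/(\Delta - 1)$, the required number of events rises steeply as the ratio to be detected approaches 1.0.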
TRIAL IMPLEMENTATION
Establishment of Procedures for Managing Data
All trials require certain minimal standards for collecting, quality controlling, and reporting data. Although many investigators use their own staff to perform such duties, sufficient skill is required to suggest that these activities be housed in groups dedicated to the purpose. Resources for this might exist on a departmental or institutional level. In other circumstances, an external or privately run coordinating center might be used. In no case should this reduce access to the data or substitute for skilled translation of data elements from clinical and laboratory sources to the study database.
There are at least five conceptual components to processing information from patients on a clinical trial: (1) eligibility check and registration/randomization, (2) data acquisition from the clinical record, (3) editing, error checking, building a database, and quality control, (4) interim reporting, and (5) analysis. Each of these, when properly performed, will reduce the frequency and severity of certain types of errors. Although one could write extensively about this subject, we summarize only a few important points about each component.
The eligibility check and registration is a simple but important quality control point. Even the knowledge that eligibility will be impartially checked causes many investigators to take entry criteria more seriously. Usually, a phone call requiring only a few moments is all that is needed. Using this opportunity, an identifying number can be assigned, a database record can be started, the pharmacy can be notified (if necessary), and other study bookkeeping can be initiated.
Data acquisition from the clinical record must be performed by an individual with sufficient clinical, protocol, and medical record knowledge. In some cases this requires the investigator's expertise, whereas in other circumstances a research nurse or data specialist can succeed. Unless studies are subject to external auditing, it is uncommon to catch errors made at this stage. Thus the principal investigator can have a major beneficial impact on the quality of data by being active here.
A simple and straightforward system for building and managing a database might begin with paper records or data forms that contain the information of clinical importance to the study. It is not necessary to record all the information needed for the care of the patient but rather only those items that correspond to the outcomes and objectives of the study. Ideally, one would not collect any items that do not need analysis. Information from these forms can be transcribed onto an appropriate computer database. Numerous quality control checks and edits are necessary to be certain that the database produced from paper records accurately represents the clinical record. For example, audits may compare the database with the chart. Within a single patient's record, computerized checks of bounds and internal consistency can be performed. When reviewed by a knowledgeable person, lists and summaries of the data can trap many errors.
Computer software has made some of these tasks simpler and more reliable. Many investigators use spreadsheets to assist with these tasks in small studies, although database software is more powerful. When existing databases and human resources are available, investigators should attempt to use them rather than building a system for each study independently. In any case, the investigator should understand the flow of data from the clinical record through the final analysis. This flow of data, however, may involve metadata when genomic information is incorporated into clinical trials, in which case the need for new technologies for storing, retrieving, and analyzing such information creates a challenge for medical informatics. Medical informatics has long been an established field that pioneered the development and introduction of informatics methods in clinical medicine. With the increasing desire to transfer genomic results into medicine, bioinformatics has challenged the field of medical informatics to develop novel, clinically oriented methods to ensure success in such a transfer.^{ [45] [46]} Indeed, the new era of genomics-based approaches to medicine, such as defining and evaluating genome-related risk factors for various diseases, developing diagnostic tests, creating updated cancer cell classifications, or integrating genetic and medical data in clinical practice, will require support from both bioinformatics and medical informatics to address the collection, quality control, and reporting of data as part of clinical trial design, and prompts the need for new standards in these areas.^{[47]}
Interim reporting serves several purposes. It provides an opportunity for the investigators to review accumulating data related to administrative aspects of the study such as accrual rates and delinquent observations. Complication and toxicity rates can be reviewed to be certain that the type, frequency, and severity of such events are reasonable. Also, efficacy endpoints can be reviewed, following appropriate statistical guidelines, to satisfy ethical concerns. Much has been written about these subjects, and we discuss more details in the following sections.
Establishment of Procedures for Monitoring
Plans for monitoring and early stopping of accrual are another element of good trial design that can greatly alleviate problems in conducting studies. Researchers have an ethical obligation to learn about treatment differences as quickly as possible and to minimize the number of patients who are placed on a convincingly inferior treatment. By planning for early termination of accrual when unexpectedly large differences are observed, investigators can make a clinical trial more acceptable to other researchers and patients. A full discussion of sequential and group sequential methods for use in this context is beyond the scope of this chapter. These methods are now commonly implemented using well-described techniques.^{ [48] [49]} Investigators can ensure that monitoring methods are effectively implemented by having the accumulating data reviewed formally at intervals by a monitoring committee.
Repeatedly performing statistical significance tests on accumulating data increases the overall type I error. If 10 equally spaced interim analyses were conducted, each at the conventional significance level of 0.05, the resulting overall type I error would be roughly 19% rather than 5% (about 14% with 5 analyses). This inflation of the type I error grows further as more interim looks at the data are performed. To compensate for this and control the type I error, investigators must prospectively plan the analysis points and the significance level for each analysis.
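A small simulation can make the inflation concrete. The sketch below is illustrative only and rests on stated assumptions: the null hypothesis is true, the looks are equally spaced, the test statistic is a standardized cumulative sum of independent normal increments, and every look uses the same unadjusted two-sided 0.05 level. The function name and simulation settings are hypothetical.

```python
import random
from statistics import NormalDist


def overall_type1_error(n_looks, alpha=0.05, n_trials=20000, seed=1):
    """Estimate the chance of at least one false-positive result over n_looks analyses."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Add one more block of data; under the null its contribution is N(0, 1).
            cumulative += rng.gauss(0.0, 1.0)
            # Standardized statistic at this look, tested at the unadjusted level.
            if abs(cumulative) / look ** 0.5 > z_crit:
                false_positives += 1
                break   # the trial would be declared "significant" and stopped
    return false_positives / n_trials


inflation = {k: overall_type1_error(k) for k in (1, 5, 10)}
for looks, rate in inflation.items():
    print(f"{looks:2d} look(s): overall type I error about {rate:.3f}")
```

With a single analysis the estimate stays near the nominal 0.05, and it climbs steadily as unadjusted looks are added, which is the motivation for the adjusted significance levels used in group sequential designs.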
To control the overall type I error at 5%, each interim look should use a significance level smaller than 0.05. For example, in a clinical trial comparing response rates and two-sided testing for significance five times, a frequently used group sequential method^{[50]} indicates that the analyses should be conducted with significance levels of approximately 0.0000075, 0.0013, 0.0085, 0.023, and 0.041, to control the overall type I error at the conventional 5%. Using this method, note that the final analysis is conducted using a significance level near, but less than, the usual 5%. Early in the trial, achieving statistical significance is more difficult.
Unplanned interim analyses may have undesirable properties and can pose serious problems in interpretation for researchers and regulators. Attempts have been made to alleviate this problem by retrospectively applying group sequential methods, although this has difficulties of its own. Another alternative for monitoring is the use of Bayesian statistical methods, which have much appeal to clinicians but have not gained as widespread use as other methods.^{[51]} For a general discussion of monitoring alternatives, see Gail,^{[52]} and for a practical discussion see DeMets^{[53]} or O'Fallon.^{[54]} In any case, the time to plan properly for trial monitoring is during the design phase of the study.
A second reason for terminating a trial early is when interim analyses demonstrate the near equivalence of the treatments and continuing the trial would be unlikely to demonstrate clinically significant differences. In this circumstance, early stopping has been based on conditional power calculations.^{[55]} Using this technique reduces the size and length of trials that show no effect or treatment difference but still yields clinically useful information.
ANALYSIS
The exact procedures necessary for analyzing a clinical trial depend on the design and purposes of the study. For example, pharmacologic studies might require modeling and estimation of physiologic parameters in each patient to meet their objectives, whereas comparative trials usually require summaries of relative treatment effects and confidence intervals. Analyses for these differing types of studies seem to have very little in common. However, when we consider that all trials should inform us about the population being studied, the need for unbiased statistical estimation of clinical effects, and the need to summarize data in the most clinically useful form, much common ground is evident.
The approach we recommend to analysis and reporting emphasizes estimation rather than hypothesis testing.^{ [29] [30] [31]} Measuring and reporting clinical effects and associated estimates of variability (or confidence intervals) is more informative and useful than focusing attention on formal tests of statistical hypotheses and P values. A simple example should make the difference clear. Suppose a clinical trial is performed comparing two treatments, A and B, and the major outcome is survival. Investigators might perform a statistical hypothesis test comparing treatments A and B and report “survival on treatment A is significantly longer than on B (P < 0.05).” Alternatively, when emphasizing estimation, investigators might report the estimated hazard ratio (B vs. A) for death was 2.0 (95% confidence limits 1.5–2.3). In the first case, the reader is left only with a P value to summarize the data, whereas in the second case the treatment difference is described more completely. Some journals have adopted guidelines for reporting.^{[56]} Our recommendations are similar in spirit to those. Although we have suggested some specific statistical methods and summaries for certain kinds of data, there are additional or alternative analytic procedures that should be adopted in special cases and we do not seek to limit analyses or reports. However, the basic concepts and approaches outlined here should prove to be helpful both clinically and statistically for correctness, lack of bias, completeness, and consistency. In this spirit we offer the following basic steps in the analysis of clinical trials ( Table 22-4 ). We emphasize that these steps are conceptual and do not necessarily occur in the order listed. Also, some steps are relevant only to randomized or comparative trials.
Table 22-4  Basic Steps in the Analysis of Clinical Trials

Intention to Treat
It is unfortunate that investigators conducting clinical trials cannot guarantee that the patients who participate will definitely complete (or even receive) the treatment assigned. Thus, a clinical trial can be viewed as a test of treatment policy, not a test of treatment received. Many factors contribute to patients failing to complete the intended therapy including severe side effects, disease progression, strong preference for a different treatment, and a change of mind. Many such factors are strongly correlated with outcome, which can render a strong bias if such patients are removed from the analysis.
From a clinical perspective, post-entry exclusion of eligible patients is essentially an attempt to use information from the future. When selecting a therapy for a new patient, the physician is primarily interested in the unconditional probability that the treatment will benefit the patient. Because the physician has no knowledge of whether or not the patient will complete the treatment intended, inferences that depend on events in the patient's future (i.e., adherence to therapy) are not helpful to that goal. In other words, adherence is both an outcome of the trial and a potential predictor. These two roles of adherence cannot be disentangled by removing patients from consideration. Consequently, the physician will be most interested in clinical trial results that include all patients who were assigned to the therapy.
To be certain that the trial results closely reflect the effect of the treatment, the eligibility criteria should exclude patients with characteristics that might prevent them from completing the therapy. For example, if the therapy is lengthy, perhaps only patients with good performance status should be eligible. If the treatment is highly toxic, only patients with normal function in major organ systems will be likely to complete the therapy.
Following on these considerations, the most important analysis includes all patients registered or randomized on the trial regardless of post-entry events. This analysis is the intention-to-treat analysis. It is possible to exclude patients who were retrospectively found not to meet the eligibility criteria (that is, those who were mistakenly placed on study) without creating bias. Ideally, such patients would not have gone on study because they would have been found to be ineligible. However, only eligibility or pre-entry criteria should be used to make such exclusions. If patients are excluded on the basis of “evaluability” or other post-entry criteria, the possibility of bias increases. Evaluability criteria are outcomes, no matter how well defined clinically. If we exclude subjects based on outcomes, the potential for bias is great.
Examination of the Data
The first practical step in any analysis is to look at the data. This includes examining lists and other simple tabulations that might highlight incorrect data values. Many problems in analyzing clinical trials can be prevented by correcting errors that become apparent in this way. This is also a step that knowledgeable investigators can perform quickly and very efficiently. With the widespread use of computers and automated analysis procedures to manage clinical information, it is possible to produce results from clinical studies without carefully examining the data. This is unfortunate, because even a cursory examination of raw data by a technically knowledgeable person can detect many errors of importance to the analysis.
Some of the errors that are amenable to detection by inspection include (1) incorrectly missing data (patient had level measured but not recorded in the database), (2) incorrect decimal points (80 recorded instead of 8.0), (3) failure to convert numerical codes for special values (calcium becomes 99 instead of “missing”), (4) out-of-range or impermissible values (0.0 recorded instead of 8.0), (5) mislabeled variables (age is mistaken for calcium and vice versa), and (6) coding and recoding errors (0 should mean normal and 1 should mean abnormal, but values are reversed).
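Several of these errors can also be trapped by simple automated checks that complement inspection. The sketch below is a minimal illustration using the calcium examples from the list above; the sentinel codes, the plausible range, and the patient identifiers are all hypothetical and would need to be defined for each study variable.

```python
# Hypothetical conventions, for illustration only.
MISSING_CODES = {99, 999, -9}     # numeric codes that should have been converted to "missing"
CALCIUM_RANGE = (5.0, 15.0)       # plausible bounds for serum calcium, mg/dL


def check_calcium(value):
    """Return a list of problems found in one recorded calcium value."""
    if value is None:
        return ["missing: confirm the level was not measured"]
    if value in MISSING_CODES:
        return ["special-value code was not converted to missing"]
    low, high = CALCIUM_RANGE
    if not low <= value <= high:
        return ["out of range: possible decimal or transcription error"]
    return []


# Hypothetical records: patient identifier -> recorded calcium.
records = {"P001": 8.0, "P002": 80, "P003": 99, "P004": 0.0, "P005": None}

problems = {}
for patient_id, value in records.items():
    found = check_calcium(value)
    if found:
        problems[patient_id] = found

for patient_id, found in problems.items():
    print(patient_id, records[patient_id], found)
```

Checks of this kind flag values for review by a knowledgeable person; they do not replace comparing the database with the clinical record.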
Inspection of the data is particularly important for small or single-investigator studies in which the data management techniques are not subject to regular quality control procedures as might be the case in multi-institutional cooperative group studies. Errors in small studies can be particularly influential. Many times, small studies are recorded entirely on paper with transcription to a computer at a later time, creating another opportunity for errors. Other times, data are stored on computers using convenient but unsophisticated software such as spreadsheets rather than database management programs that permit validation and checking. Fortunately, the quantity of data from such small studies is often very amenable to checking by inspection. It is embarrassing, frustrating, and bad for morale to have to ask that analyses be repeated because data errors were discovered late.
Description of the Study Population
Clinical trials are studies of particularly well-defined and often relatively small cohorts. Although the eligibility criteria define a target population of particular interest, the patients actually accrued on a trial may differ because of chance or subtle institutional characteristics. Investigators will want to describe the observed cohort, particularly with regard to important prognostic factors. Simple population measures and summary statistics usually suffice for this purpose. This process is both a by-product of error checking and a valuable aid to it.
Verification of the Comparability of Treatment Groups
In reports of many randomized studies, the first table presented is often intended to show the comparability of treatment groups. Actually, a lack of statistically significant differences between the treatment groups does not guarantee the absence of influential imbalances but only demonstrates the effectiveness of randomization. Even so, this is important, because readers will have increased confidence in the validity of the findings if imbalances are either absent or detected and controlled in the analyses. Two cautions apply. First, although we will take note of any statistically significant differences between groups, nonsignificant imbalances in strong prognostic factors can still influence treatment comparisons. This is discussed more completely in the later section on deciding when to adjust. Second, and conversely, statistically significant imbalances are not necessarily influential; the imbalance may occur in an inconsequential factor. For the clinician comparing groups, the magnitude of the difference is more important than the P value.
Estimation of Treatment and Prognostic Effects
As mentioned previously, some outcomes that are likely to be useful in oncology trials are group averages (or differences between group averages), probability of response (or odds ratios), and hazards (or hazard ratios). We omit discussion of methods for group averages, because they are well known and focus on dichotomous outcomes and event times. These outcomes have similarities with respect to their summary statistics and presentation. Odds ratios are useful summaries of data to describe the effects of dichotomous variables. For example, differences in the probability of response might be described by an odds ratio. Similarly, hazard ratios are useful for describing differences in risk of failure over time. For example, differences in recurrence or survival curves might be described by a hazard ratio.
To illustrate these and other aspects of the estimation of clinical effects, we consider simulated data from a hypothetical randomized trial comparing two treatments (A and B) for solid tumors. Simple randomization was used in this study, with 101 patients on treatment A and 99 patients on treatment B. Data on response to treatment were collected as an example of a dichotomous outcome, and patients were followed for survival as an example of an event time endpoint. Differences in response and survival attributable to sex are also thought to be important. Nonparametric estimates of survival for subgroups defined by treatment-sex combinations are shown in Figure 22-1 .
Figure 22-1 Survival by treatment group and sex on a hypothetical clinical trial.
The advantage to using simulated data, aside from convenience, is that the “true” treatment and covariate effects are known. In this case, the true treatment effects were a 4-fold odds of response and a 2-fold risk of death in favor of treatment A. For sex, the odds of response was 2-fold and the risk of death was 1.5-fold, both in favor of females. Response and survival were independent of one another. In what follows, the estimated effects will differ from these values because of random variation.
Odds and Hazard Ratios
The simulated response data for the two treatment groups are shown in Table 22-5 . The estimated response rate (probability) on treatment A was 0.376 compared with 0.182 on treatment B. The odds of response for treatment A is $38/63 = 0.376/(1 - 0.376) = 0.603$, and for treatment B it is $18/81 = 0.222$.
Table 22-5  Responses on Treatments for Solid Tumors
Treatment    Response    No Response
Group A      38          63
Group B      18          81
A more useful quantity for judging the relative effect of treatment on response is the odds ratio. The estimate of the overall odds ratio for group A versus group B, $\widehat{OR}_{AB}$, is $\widehat{OR}_{AB} = \frac{38 \times 81}{63 \times 18} = 2.71$.
Because the odds of response on treatment A are almost threefold higher than on treatment B, this might be a clinically important difference. The decision to use the odds ratio of A versus B (or vice versa) is purely a matter of convenience. The data relating sex and response are shown in Table 22-6 .
Table 22-6  Example of Odds Summary Data
Group    Sex        No. of Patients    No. of Responses    Response Odds
A        Males      53                 15                  .395
A        Females    48                 23                  .920
A        Overall    101                38                  .603
B        Males      51                 8                   .186
B        Females    48                 10                  .263
B        Overall    99                 18                  .222
In treatment group A, the odds ratio for males versus females is 0.395/0.920 = 0.429. In group B the corresponding ratio is 0.707. Thus, it seems that male subjects are less likely to respond than female subjects, and this is explored further later in the analysis.
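The odds and odds ratios quoted above can be reproduced directly from the counts in the odds summary table. The sketch below is illustrative; the dictionary layout and function name are hypothetical conveniences, and only the patient and response counts come from the table.

```python
def odds(responses, patients):
    """Odds of response: responders divided by nonresponders."""
    return responses / (patients - responses)


# (number of patients, number of responses) by treatment group and sex.
counts = {
    ("A", "Males"): (53, 15),
    ("A", "Females"): (48, 23),
    ("A", "Overall"): (101, 38),
    ("B", "Males"): (51, 8),
    ("B", "Females"): (48, 10),
    ("B", "Overall"): (99, 18),
}

response_odds = {key: odds(r, n) for key, (n, r) in counts.items()}

# Overall treatment effect: odds ratio for group A versus group B.
or_treatment = response_odds[("A", "Overall")] / response_odds[("B", "Overall")]

# Sex effect within each treatment group: males versus females.
or_sex_a = response_odds[("A", "Males")] / response_odds[("A", "Females")]
or_sex_b = response_odds[("B", "Males")] / response_odds[("B", "Females")]

print(f"OR, A vs. B: {or_treatment:.2f}")
print(f"OR, males vs. females: {or_sex_a:.3f} (group A), {or_sex_b:.3f} (group B)")
```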
For event time endpoints, the quantities of interest are the number of events in the groups and the total follow-up or exposure time (in years; Table 22-7 ).
Table 22-7  Example of Hazard Summary Data
Group    Sex        No. of Patients    Exposure Time    No. of Deaths    Hazard
A        Males      53                 2768             40               .014
A        Females    48                 2846             30               .011
A        Overall    101                5614             70               .012
B        Males      51                 1632             42               .026
B        Females    48                 1802             35               .019
B        Overall    99                 3434             77               .022
The total exposure time is obtained by summing all follow-up times without regard to censoring. This represents the aggregate time at risk for the group. Thus, the estimated overall hazard of death in group A, λ_{A}, is

$$\hat{\lambda}_A = \frac{70}{5614} = 0.012$$

and for group B, is

$$\hat{\lambda}_B = \frac{77}{3434} = 0.022$$
From these data we would conclude that the risk of death following treatment B is higher and that this difference might be of clinical importance. Here also, sex seems to influence the risk of death. The estimated hazard ratios for males versus females are 1.27 and 1.37 on treatments A and B. The effect of sex will be explored later in more detail.
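The hazard estimates are simply deaths divided by total exposure time, so they can be recomputed from Table 22-7 in a few lines. This is an illustrative sketch with hypothetical names, not the original analysis code.

```python
# (exposure time in years, deaths) from Table 22-7, by group and sex.
hazard_data = {
    ("A", "male"): (2768, 40),
    ("A", "female"): (2846, 30),
    ("B", "male"): (1632, 42),
    ("B", "female"): (1802, 35),
}

def hazard(exposure, deaths):
    """Estimated hazard = events per unit of time at risk."""
    return deaths / exposure

def overall_hazard(group):
    """Pool exposure and deaths over both sexes within one treatment group."""
    exposure = sum(v[0] for k, v in hazard_data.items() if k[0] == group)
    deaths = sum(v[1] for k, v in hazard_data.items() if k[0] == group)
    return hazard(exposure, deaths)

haz_a = overall_hazard("A")   # 70 / 5614
haz_b = overall_hazard("B")   # 77 / 3434
hr_ba = haz_b / haz_a         # hazard ratio, B versus A

for key, (exposure, deaths) in hazard_data.items():
    print(key, f"hazard = {hazard(exposure, deaths):.3f}")
print(f"overall: A = {haz_a:.4f}, B = {haz_b:.4f}, HR(B vs A) = {hr_ba:.2f}")
```

Note that pooling is done on the raw exposure and death totals rather than by averaging the subgroup hazards.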
This example has shown the utility of odds and hazard ratios in summarizing clinical effects. The next section will illustrate additional utility for confidence intervals.
Confidence Intervals
Informally, a confidence interval is a region in which we are confident that a true parameter or effect lies. Although this notion is not too misleading, confidence intervals are really probability statements about an estimate and not about the true parameter value. A 95% confidence interval indicates the region that would contain the true parameter value 95% of the time if we repeated the experiment. In other words, given the estimates resulting from a series of experiments, the true value will fall within the 95% confidence regions 95% of the time.
The value of confidence intervals is that they convey both the magnitude of the estimated clinical effect and a sense of its precision. In many cases, simple hypothesis tests are analogous to the confidence interval. Also, when summarizing results from several studies, estimates and confidence intervals are more useful than P values.
Continuing with the preceding example of the randomized clinical trial, we first consider confidence intervals for the probability of response. Using the methods already discussed, an approximate 95% confidence interval for the probability of response on treatment A is

$$0.376 \pm 1.96\sqrt{\frac{0.376 \times (1 - 0.376)}{101}} = 0.376 \pm 0.094 = [0.282,\ 0.470]$$
Similarly, an approximate 95% confidence interval for response on treatment B is 0.182 ± 0.076 = [0.106–0.258]. Because of the large sample size and the intermediate size of the probabilities, these intervals are close to those that would be obtained by using exact binomial methods, which are [0.282–0.478] and [0.  0.272] for groups A and B, respectively.
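The approximate intervals for the two response probabilities can be reproduced with the usual normal approximation to the binomial. The sketch below assumes that this is the method referred to above, since it matches the quoted interval for treatment B; the function name is illustrative.

```python
import math

def wald_interval(responses, n, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    p = responses / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width

# Response counts from Table 22-5.
p_a, lo_a, hi_a = wald_interval(38, 101)
p_b, lo_b, hi_b = wald_interval(18, 99)

print(f"A: {p_a:.3f} [{lo_a:.3f}, {hi_a:.3f}]")
print(f"B: {p_b:.3f} [{lo_b:.3f}, {hi_b:.3f}]")
```

The approximation is least reliable for small samples or probabilities near 0 or 1, which is when exact binomial limits are preferable.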
For odds and hazard ratios, calculating confidence intervals on a log scale is relatively simple. An approximate confidence interval for the log odds ratio for A versus B is

$$\log \widehat{OR}_{AB} \pm Z_{\alpha}\sqrt{\frac{1}{38} + \frac{1}{63} + \frac{1}{18} + \frac{1}{81}}$$

where the terms under the square root are the reciprocals of the four cell counts in Table 22-5 and Z_{α} is the point on the normal distribution exceeded with probability α/2 (e.g., for α = 0.05, Z_{α} = 1.96). This yields a confidence interval of [0.35–1.65] for the log odds ratio or [1.41–5.21] for the odds ratio. Because the 95% confidence interval for ÔR_{AB} excludes 1.0, the difference is “statistically significant.” The null hypothesis H_{0}: OR_{AB} = 1.0 is rejected with significance level P = 0.003.
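As a check on the arithmetic, the log-scale interval and the associated P value can be computed from the four cell counts. This is a minimal sketch using the standard large-sample standard error of the log odds ratio; the function names are illustrative.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """a, b = responders / non-responders on A; c, d = the same on B."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_limits = (log_or - z * se, log_or + z * se)
    or_limits = (math.exp(log_limits[0]), math.exp(log_limits[1]))
    return log_or, se, log_limits, or_limits

def two_sided_p(z_stat):
    """Two-sided P value from a standard normal test statistic."""
    return math.erfc(abs(z_stat) / math.sqrt(2))

# Cell counts from Table 22-5.
log_or, se, log_limits, or_limits = odds_ratio_ci(38, 63, 18, 81)
p_value = two_sided_p(log_or / se)

print(f"log OR = {log_or:.3f}, SE = {se:.3f}")
print(f"95% CI (log scale) = [{log_limits[0]:.2f}, {log_limits[1]:.2f}]")
print(f"95% CI (OR scale)  = [{or_limits[0]:.2f}, {or_limits[1]:.2f}]")
print(f"P = {p_value:.3f}")
```

Small differences in the last digit relative to the quoted limits can arise from rounding the log-scale limits before exponentiating.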
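A similar method can be used for the hazard ratio. An approximate confidence interval for the log hazard ratio of group B versus group A is

$$\log\frac{\hat{\lambda}_B}{\hat{\lambda}_A} \pm Z_{\alpha}\sqrt{\frac{1}{d_A} + \frac{1}{d_B}}$$

where d_{A} = 70 and d_{B} = 77 are the numbers of deaths in the two groups (Table 22-7).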
This yields a confidence interval of [0.264–0.911] for the log hazard ratio or [1.30–2.49] for the hazard ratio. Again, the confidence interval excludes 1.0, indicating a statistically significant difference in the death rates between the groups (P < 0.001).
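The hazard ratio interval can be checked in the same way from the death counts and exposure times in Table 22-7. The sketch below uses the large-sample standard error based on the number of deaths in each group; names are illustrative.

```python
import math

def hazard_ratio_ci(deaths_a, time_a, deaths_b, time_b, z=1.96):
    """Log-scale confidence interval for the hazard ratio of B versus A."""
    log_hr = math.log((deaths_b / time_b) / (deaths_a / time_a))
    se = math.sqrt(1 / deaths_a + 1 / deaths_b)
    log_limits = (log_hr - z * se, log_hr + z * se)
    hr_limits = (math.exp(log_limits[0]), math.exp(log_limits[1]))
    return log_hr, se, log_limits, hr_limits

# Overall deaths and exposure times from Table 22-7.
log_hr, se_hr, log_hr_limits, hr_limits = hazard_ratio_ci(70, 5614, 77, 3434)
# Two-sided P value from the standard normal distribution.
p_hr = math.erfc(abs(log_hr / se_hr) / math.sqrt(2))

print(f"HR = {math.exp(log_hr):.2f}")
print(f"95% CI (log scale) = [{log_hr_limits[0]:.3f}, {log_hr_limits[1]:.3f}]")
print(f"95% CI (HR scale)  = [{hr_limits[0]:.2f}, {hr_limits[1]:.2f}]")
print(f"P = {p_hr:.4f}")
```

Only the death counts enter the standard error, which is why the precision of a survival comparison depends on the number of events rather than the number of patients.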
Problems with P Values
There are many circumstances in which P values are useful, particularly in well-designed hypothesis tests. However, P values have properties that make them poor summaries of clinical effects. In particular, P values do not convey the magnitude of a clinical effect. The size of the P value is a consequence of two things: the magnitude of the estimated treatment difference and its estimated variability (which is itself a consequence of sample size). Thus, the P value partially reflects the size of the experiment, which has no biologic importance. The P value also hides the size of the treatment difference, which does have major biologic importance.
Some investigators conclude things like “the effect might be statistically significant in a larger sample.” This, of course, misses the point, because any effect other than zero will be statistically significant in a large enough sample. What the investigators should really be talking about is the size and clinical significance of an estimated treatment effect rather than its P value. In summary, P values only quantify the type I error and incompletely characterize the biologically important effects in the data.
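The dependence of the P value on sample size can be illustrated numerically. In the sketch below the response rates (0.32 versus 0.30) and the arm sizes are hypothetical values chosen only for illustration; the estimated difference is held fixed while the sample size grows.

```python
import math

def two_sided_p(z_stat):
    """Two-sided P value from a standard normal test statistic."""
    return math.erfc(abs(z_stat) / math.sqrt(2))

def compare_rates(p1, p2, n_per_arm):
    """Z test for a difference between two response rates, equal arm sizes."""
    se = math.sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    return two_sided_p((p1 - p2) / se)

# Hypothetical response rates: a 2-percentage-point difference throughout.
rate_new, rate_old = 0.32, 0.30
arm_sizes = [100, 1000, 10000, 100000]
p_values = [compare_rates(rate_new, rate_old, n) for n in arm_sizes]

for n, p in zip(arm_sizes, p_values):
    print(f"n per arm = {n:>6}: difference = {rate_new - rate_old:.2f}, P = {p:.4g}")
```

The estimated effect is identical in every row; only the P value changes, which is why the effect estimate and its confidence interval are the more informative summary.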
To illustrate the advantage of estimation and confidence intervals over P values, consider a discussion over the prognostic effect of perioperative blood transfusion in lung cancer.^{ [57] [58] [59] [60] [61] [62] [63]} Several studies (not clinical trials) of this phenomenon have been performed because of firm evidence in other malignancies and diseases that blood transfusion has a clinically important immunosuppressive effect. Disagreement over the study results has stemmed, in part, from too strong an emphasis on hypothesis tests instead of accepting the estimated risk ratios and confidence limits. Some study results are shown in Table 22-8. Although the authors of the various reports came to different conclusions about the risk of blood transfusion because of differing P values, the estimated risk ratios adjusted for extent of disease seem to be consistent across studies. Based on these results, one might be justified in concluding that perioperative blood transfusion has a modest adverse effect on lung cancer patients.
Table 22-8  Summary of Studies Examining the Perioperative Effect of Blood Transfusion in Lung Cancer
| Study | Endpoint | Hazard Ratio* | 95% Confidence Limits |
|---|---|---|---|
| Tartter et al.^{[58]} | Survival | 1.99 | 1.09–3.64 |
| Hyman et al.^{[59]} | Survival | 1.25 | 1.04–1.49 |
| Pena et al.^{[60]} | Survival | 1.30 | 0.80–2.20 |
| Keller et al.^{[61]} | Recurrence, stage I | 1.24 | 0.67–1.81 |
| Keller et al.^{[61]} | Recurrence, stage II | 1.92 | 0.28–3.57 |
| Moores et al.^{[62]} | Survival | 1.57 | 1.14–2.16 |
| Moores et al.^{[62]} | Recurrence | 1.40 | 1.01–1.94 |

* All hazard ratios are transfused versus untransfused patients and are adjusted for extent of disease.
Adjustments
Not all clinical trial statisticians agree on the need for adjusted analyses in clinical trials. However, many investigators believe that the difference in estimated treatment effects before and after adjustment often conveys useful knowledge. Furthermore, nonrandomized studies, such as cohort studies, are invariably analyzed with adjustment for confounders or prognostic factors. The same kinds of systematic errors that can arise in observational studies can arise in clinical trials by chance. This seems to provide a firm rationale for examining the results of adjusted analyses.
One of the principal advantages of using statistical models to help analyze trial results is the straightforward generalization to multiple regressions suitable for adjusted analyses. Using these methods, investigators can estimate the treatment effect while adjusting for prognostic factors. One should consider adjusting for variables that meet any of three criteria: (1) prognostic factors that are statistically significantly imbalanced between the treatment groups, (2) strong or influential prognostic factors, whether imbalanced or not, and (3) prognostic factors that must be shown not to artificially create the treatment effect.
The philosophy underlying adjusting in these circumstances is to be certain that the observed treatment effect is not due to confounding. The effects of clinical interest with adjustment are changes in relative risk parameters rather than changes in P values.
Regression Methods
Regression is an unfortunate historically anomalous name for a very important statistical method. A more descriptive name might be statistical modeling of multiple effects. In any case, the essential idea is to relate an outcome of interest to one or more predictor variables using a statistical model. The theoretical components of the model are its deterministic form (structural equation), probabilistic form (how it models errors), and parameters (biologic constants), whereas the empirical components are the observed data. If the model is approximately correct, it should predict the observed data well provided we choose appropriate parameter values. Conversely, we can choose those parameter values that make the predictions and observations (data) close in some well-defined way. This latter sense is the way in which most statistical models are used. Trustworthy fitting methods exist, such as maximum likelihood, to estimate the best parameter values.
If the model has been constructed so that the parameters also correspond to clinically interesting effects, it yields a way of estimating the influence of several factors simultaneously on the outcome. In practice, models also provide a means for obtaining confidence intervals, testing hypotheses, and even revising the model itself.
Statistical models such as logistic regression for dichotomous outcomes and survival regression for event times are likely to be useful both for estimating odds and hazard ratios and performing multiple regression adjustments. Provided the assumptions of these models are met, they can provide estimates of the appropriate relative risk parameter(s), confidence limits, and P values.
We return to the hypothetical randomized clinical trial outlined previously, in which sex seemed to influence response rate and survival. For response, an appropriate statistical model is the logistic regression model, the results of which are shown in Table 22-9.
Table 22-9  Logistic Regression Models Illustrating Adjusted Treatment Effects
| Model | Variable | Odds Ratio | 95% Confidence Limits | P Value |
|---|---|---|---|---|
| 1 | B vs. A | 0.368 | 0.19–0.71 | 0.003 |
| 2 | Male vs. female | 0.542 | 0.29–1.01 | 0.055 |
| 3 | B vs. A | 0.358 | 0.19–0.69 | 0.002 |
| 3 | Male vs. female | 0.520 | 0.27–0.99 | 0.046 |
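Because both predictors are binary, the adjusted model (model 3) can be fitted from the four group-by-sex cells of Table 22-6. The following is a minimal sketch of a maximum likelihood fit by Newton-Raphson iteration in plain Python; it is not the software used for the original table, and all function names are illustrative. In practice a statistical package would be used.

```python
import math

# (treatment_is_B, sex_is_male, patients, responders) from Table 22-6.
cells = [
    (0, 1, 53, 15),
    (0, 0, 48, 23),
    (1, 1, 51, 8),
    (1, 0, 48, 10),
]

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def fit_logistic(cells, iterations=25):
    """Fit logit(p) = b0 + b1*treatment_B + b2*male by Newton-Raphson."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        score = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for trt, male, n, r in cells:
            x = (1.0, float(trt), float(male))
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(3):
                score[i] += x[i] * (r - n * p)          # gradient of log-likelihood
                for j in range(3):
                    info[i][j] += x[i] * x[j] * n * p * (1.0 - p)  # information
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta, info

beta, info = fit_logistic(cells)
# Standard errors from the diagonal of the inverse information matrix.
se = [math.sqrt(solve(info, [float(i == j) for i in range(3)])[j]) for j in range(3)]

for name, b, s in zip(("B vs. A", "Male vs. female"), beta[1:], se[1:]):
    print(f"{name}: OR = {math.exp(b):.3f}, "
          f"95% CI = [{math.exp(b - 1.96 * s):.2f}, {math.exp(b + 1.96 * s):.2f}]")
```

Dropping the sex column (or the treatment column) from the cells and refitting gives the unadjusted models 1 and 2, which coincide with the simple odds ratios computed from the 2 × 2 tables.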
Models 1 and 2 show the overall odds ratios, confidence limits, and P values for treatment and sex considered individually. Model 3 shows the joint effects of sex and treatment group on response. When the effect of treatment is taken into account, females are seen to have a higher response odds. Because the estimated odds ratios do not change very much after adjustment, this suggests that sex and treatment have nearly independent effects on response. For the survival endpoint, the results of proportional hazards regression models are shown in Table 22-10.
Table 22-10  Proportional Hazards Regression Models Illustrating Adjusted Treatment Effects
| Model | Variable | Hazard Ratio | 95% Confidence Limits | P Value |
|---|---|---|---|---|
| 1 | B vs. A | 1.91 | 1.36–2.66 | <0.001 |
| 2 | Male vs. female | 1.32 | 0.95–1.82 | 0.100 |
| 3 | B vs. A | 1.92 | 1.37–2.68 | <0.001 |
| 3 | Male vs. female | 1.34 | 0.96–1.85 | 0.083 |
The estimated hazard ratios are quantitatively similar to those determined previously and show a higher risk for males. Differences are due to different methods of calculation. The effect of treatment controlling for sex is significant. The adjusted hazard ratios (model 3) also suggest independent effects for sex and treatment on survival time.
Special Methods
Frequently, special analyses are needed to address specific clinical questions or secondary goals of the trial. In some prognostic factor studies, special regression models may be needed, such as time-dependent covariate models, to account correctly for the effects of predictors that change over time. Other examples of situations that may require sophisticated analytic methods are repeated longitudinal measurements, Bayesian methods, nonindependent observations, and accounting for restricted randomization schemes. Aside from the lack of software for many needs, extra care is required to be certain that the assumptions of the analytic methods are met.
Repeat Analyses
Although we have tried to be firm about the value of the intention-to-treat analysis, there are circumstances in which one would like to know if the exclusion of some patients on the basis of clinical criteria affects the results. One such situation is the exclusion of ineligible patients in a randomized trial. Actually, exclusions based on eligibility criteria do not violate the intention-to-treat principle. However, investigators may sometimes feel the need to exclude eligible patients. One can consider repeating steps 1–6 after doing so. Provided the fraction of patients excluded is small, say 5%, and the exclusions affect both treatment groups (if the trial is comparative), it is likely that the results will agree with the intention-to-treat analysis. This is as much an argument not to exclude patients as it is to allow exclusions.
Data Exploration
Clinicians generally need very little encouragement to conduct exploratory analyses of their data. By exploratory analyses, we mean those that do not follow directly from the design of the experiment. Such analyses are neither automatically inappropriate nor wrong. However, the conclusions derived from these analyses can be unreliable. Therefore, they should serve only to generate hypotheses to be tested more rigorously in the future. Reasons why exploratory analyses may be unreliable include the following:

• A comparison suggested by the data and not by prior hypothesis is likely to have a type I error larger than the nominal P value. This occurs because investigators have a tendency to only test those differences that are large, most of which are probably due to chance.
• An analysis that excludes patients based on post-entry criteria (responses) will probably produce biased results.
• Subset analyses are likely to be influenced by uncontrolled prognostic factors.
• Investigating large numbers of subsets can lead to “significant” differences purely by chance (i.e., inflated type I error).
By relying on estimation of clinical effects rather than unplanned tests of statistical hypotheses, the utility of these exploratory analyses might be increased. Investigators might be less likely to misinterpret the results or to exaggerate their clinical utility. In any case, these types of exploratory analyses should never be the primary analysis of a clinical trial.
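The inflation of type I error from examining many subsets can be quantified under a simplifying assumption. The sketch below assumes that the subset tests are independent and that no true effects exist, which is an idealization; real subsets usually overlap, so the figures are only indicative.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false-positive result among independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 10, 20):
    print(f"{k:>2} subset tests: P(at least one 'significant') = "
          f"{familywise_error(k):.3f}")
```

With 10 such tests the chance of at least one nominally significant finding is about 40%, which is why unplanned subset findings are best treated as hypotheses for future study.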
PUBLICATION AND INTERPRETATION
As with analysis, the most informative summaries and amount of detail to report from a clinical trial will depend largely on the nature of the clinical hypotheses being studied. This section will outline basic reporting guidelines (Table 22-11) that follow an estimation and confidence interval approach and that should be helpful for reporting many types of clinical trials. These guidelines should also be useful for reviewing and interpreting published reports of trials and prognostic factor analyses. Reports of clinical trial results may be subject to constraints that analyses are not. For example, reports often require a consensus among investigators and must undergo an imperfect editorial process before publication. We can offer little help here in navigating these difficulties except to suggest a certain minimal content and structure.
Table 22-11  Ten Concepts in Reporting

Description of the Study Population
Clinically relevant descriptions of both the study and target population should be reported. It may also be important to describe patients who met the eligibility criteria but chose not to participate in the trial, when this information is available. The need for this might arise when patients from a large group are asked to participate, but many refuse. As pointed out previously, it may be difficult to generalize from these situations. For nonrandomized designs, even detailed descriptions of the study group may not provide a convincing basis on which to make comparisons with other studies. Thus, comparison is not the motivation, but thoroughness is.
Treatment and Eligibility Failures
As mentioned previously, it is acceptable to perform statistical analyses on only the subset of eligible patients, even when eligibility is corrected in retrospect. This does not create bias in the estimate of relative effects within the trial. Investigators should report those patients who were retrospectively found to have failed the eligibility criteria as well as those patients who failed to complete the assigned treatment.
There are situations in which a large fraction of patients complete the assigned therapy but may receive additional therapy not specified by the protocol or design of the trial. For example, patients with esophageal cancer may undergo resection and chemotherapy, and have a variety of second-line treatments if signs of disease progression or recurrence are observed. If some of these latter treatments are active, the results of an initial treatment comparison based on recurrence or survival may be skewed. In fact, in general, it is difficult or impossible to use the statistical information in studies that permit “crossovers” either to new treatments or to the other treatment arm. One cannot exclude these patients but can use only the information up to the time of stopping the assigned treatment.
Statistical Methods and Assumptions
Readers should be made aware of any assumptions made in both the design and analysis of a clinical trial. For a discussion of some practical issues, see DerSimonian and associates.^{[27]} The assumptions and limitations of many common statistical procedures are well understood by clinicians. However, the readers of clinical trial reports should be convinced that the data analyst has verified all important assumptions and reported the methods in detail for less well-known statistical procedures. Examples of assumptions that are often made in analysis, often violated by the data, and also likely to be consequential are distributional assumptions underlying the t-test or other statistical hypothesis tests, error distributions in linear regression analyses, and proportionality of hazards in life table regressions. For example, the t-test assumes that the distributions being compared are normal with equal variances. It can yield incorrect results when either of these assumptions is false, particularly if distributions are not symmetric. Proportional hazards regression models most often assume that the effect of each predictor is to multiply a baseline risk and that the multiplicative factor is constant over time. Although the model is robust to departures from this assumption (that is, it will often yield the correct estimates of relative risk and significance levels anyway), it is helpful to validate the assumptions.
Univariate Analyses
It is likely that the data analyst will test the effect of all potentially important prognostic variables on the major outcomes. For these univariate analyses investigators should report estimated treatment effects (odds ratios or hazard ratios), confidence intervals, and significance levels of tests of no treatment effect (P values). This does not preclude presenting other displays of univariate analyses (e.g., survival curves or 2 × 2 tables) if these analyses are especially relevant. However, the investigators should keep in mind that univariate analyses, particularly in uncontrolled studies, are subject to confounding. Consequently, these analyses should probably not be emphasized or presented in excessive detail.
Adjusted Analyses
In a randomized trial, the univariate comparison of treatment groups is a simple and valid summary. However, many investigators attempt to show that the treatment effect is not due to any measured confounders by using adjusted analyses. The best style of reporting multivariate analyses is the same or similar to that for univariate effects. However, the adjusted analyses reported are usually selected from a larger set of less informative or preliminary results. As an example, consider a life table regression model attempting to predict time to cancer recurrence. The “best” (most predictive but parsimonious) model might be built using a step-down procedure from a large set of potential prognostic factors. Each step in the analysis need not be reported, but the final model is a major objective of the analysis.
For multiple regression analyses, investigators usually report adjusted estimates of treatment effects, confidence intervals, and P values. Not all prognostic factors retained in multiple regression models must be “statistically significant.” It is often useful to keep nonsignificant effects in a multiple regression model to demonstrate convincingly that the treatment effect persists in their presence.
Negative Findings
When no statistically significant treatment effect or difference is found, the power of the study is sometimes called into question. However, the absence of a significant difference is not the same as evidence of no effect. Because clinical effects are measured by risk ratios rather than P values, the preceding guidelines emphasizing estimated treatment differences rather than hypothesis tests are important. Helpful advice regarding negative clinical trials is provided by Detsky and Sackett.^{ [64] [65]} Power calculations performed after the study is completed are rarely, if ever, helpful.
Effect of Patient Exclusions
Although we have emphasized the value of the intention-to-treat principle and related analyses, in practice, many exploratory analyses will be done. Investigators should report any differences between intention-to-treat analyses and eligible patient analyses. If subset analyses are performed, discrepancies between these and the major analyses of the clinical trial should be reported.
What Is Significant?
The P value should not be the only criterion for “significance.” When biologic or clinical justification is strong, effect estimates are large, and confidence intervals or P values indicate significance near conventional levels (e.g., P values near 0.05), it seems appropriate to label these results as “statistically significant.” Conversely, results with no biologic or clinical justification or those which seem paradoxical should be reported and interpreted with caution, even when P values are smaller than 0.05. There is no way to separate type I errors from truly significant results, except to rely on additional evidence and biologic rationale. It is wise to report cautiously results that seem not to make sense.
Exploratory Analyses
Exploratory or hypothesis-generating analyses should be only informally reported. They should not be emphasized as the primary findings of a clinical trial unless supported by the design and an a priori hypothesis.
BIOINFORMATICS AND CLINICAL TRIALS
Bioinformatics, a still-developing field, focuses on the interaction between computer science and biology. Medical informatics, an established field, focuses on the interaction between computer science and clinical medicine. Although these are separate fields, the emerging need to correlate genotypic with phenotypic information has created a need to bring both disciplines together to support genomic medicine. The large amounts of genetic, genomic, and proteomic data offer opportunities for new research targets, with novel therapies for established diseases being developed. These innovative approaches, however, cannot be sustained without effectively dealing with the vast amounts of data generated in the laboratory. Equally important is the integration of clinical data generated by medical records with genomic information. Biomedical informatics is an emerging discipline that aims to create a common conceptual information space in which to further the discovery of novel diagnostic and therapeutic methods in support of genomic medicine.
Genomic technologies will revolutionize drug discovery and development; that much is universally agreed upon. The high dimension of data from such technologies has challenged available data-analytic methods; that much is apparent. Researchers have been hard at work marrying the latest technologies with drug design to more closely examine whether a drug is having any biologic impact (clinical trial), what the effect is (measure of endpoint), when it stops working (onset of resistance), and what can be done about it (alternate therapies). Gene therapy, cancer-killing viruses, and new drugs highlight a few of the novel approaches to cancer treatment. The emergence of promising new molecular targeted agents and new technologies for screening and early detection through the application of bioinformatics has prompted new opportunities in biostatistics for clinical trial designs and analysis that integrate therapy and prevention endpoints.
Cancer has long been understood to be a genetic disorder, and the growing sophistication of genetic resources and tools is contributing greatly to the fight against it. Well before the Human Genome Project, classic cytogenetics revealed gross deletions, amplifications, and rearrangements in cancer cells; these changes can now be analyzed at base-pair resolution with technologies including single-nucleotide polymorphism arrays and comparative genomic hybridization-based genomic arrays. Expression studies that were once done one gene at a time are now conducted across the entire transcriptome with 1.5-million-feature expression arrays, and can be used to evaluate the changes wrought by epigenetic therapeutics such as histone deacetylase inhibitors. Genes that are involved in cancer and carcinogenesis include tumor suppressor genes and oncogenes, often found in regions that are respectively deleted or amplified. All are important to our understanding of cancers, and many may be targets for prevention or cure. A therapy target is a gene whose expression can be changed by treatment. Tumor suppressor genes make good targets because some that have been turned off by epigenetic silencing can be turned on with treatment.
All of these advances provide a wealth of data and potential therapeutic targets that will require evaluation through clinical trials; this, in turn, has prompted a reevaluation of set standards and procedures, such as those discussed in this chapter. What is needed is not so much ways to analyze expression per se but rather effective ways to channel the new technologies’ flows of copy-number (deletion and amplification regions) and expression (treated/untreated fold-change) data into a manageable stream of candidate treatments and genes and, at the same time, new methods to appropriately evaluate the treatments’ outcomes for the targets for which they are intended. In this section we highlight two previously discussed basic concepts, endpoint definition and monitoring, in the context of the impact of the genomics era on them, with particular attention to cancer vaccines, which represent a premier example of the translation of genomic information from bench to bedside.
DESIGN
The conventional oncology drug development paradigm comprises four phases, discussed previously in this chapter: phase I to determine safety and dose, phase II to evaluate effectiveness and look for side effects, phase III to verify effectiveness, and phase IV for postmarketing surveillance. With cancer vaccines, for several reasons, there is no clear delineation of such phases, calling for the development of a different paradigm. First, there is typically little to no serious toxicity risk and no proof of a linear dose-response relationship, and thus no need for conventional dose escalation to establish the maximum tolerable dose. Dose and schedule are not determined through escalation based on toxicity. Cancer vaccines are not metabolized, and thus, there is no need for conventional pharmacokinetics. Many cancer vaccines are designed to address one tumor type, and thus, there is no need for mixed tumor trials for target selection. Conventional short-term response criteria (e.g., Response Evaluation Criteria in Solid Tumors, RECIST) are not efficiently applicable to cancer vaccines (discussed in detail later), and historical control comparisons on response rate are not useful, because proof-of-principle endpoints should reflect biologic activity, including immunogenicity. Discussions have been held and working groups formed to propose instead the development of a proof-of-principle trial followed by an efficacy trial, with proposed endpoints based on evidence of “signal of activity” of the vaccine.
Clinical Endpoint
In general, an endpoint is a measure to determine whether a therapy is working or not. In cancer vaccines the general goal is to develop a therapy that targets cancer cells, and thus, an examination of tumor response (as opposed to patient response) to therapy seems quite reasonable. Unlike other vaccines and passive therapeutic modalities (e.g., chemotherapeutic agents and radiation therapy), targeted therapies, including cancer vaccines and other immuno-based treatments, initiate a dynamic process of activating the host's (patient's) own immune system; as such, there is potential for patient response to the therapy under consideration, as well as to post-vaccination therapies.
Ongoing therapeutic cancer vaccine trials have yet to show evidence of vaccines initiating a patient's immune system to shrink tumors, yet patients who receive these vaccines tend to live longer and respond better to subsequent treatment, prompting a question of whether we are looking at cancer vaccine trials the wrong way. Are we appropriately measuring “response to therapy”? To address this question, it is important to understand the process underlying therapeutic cancer vaccines.
Unlike preventative vaccines, such as those designed to protect against the flu, cancer vaccines are administered to treat an existing condition or disease. Such vaccines fall under one of two general types: (1) cell-based, created using cells from the patient's own immune system that have been activated to the presence of cancer antigens and delivered back to the patient along with additional proteins that facilitate immune activation, or (2) vector-based, wherein an engineered virus (vector) is used to introduce cancer proteins and other molecules to stimulate the immune system. In either case, the approach is designed to prepare the patient's immune system to mount an attack on existing tumor cells.
In their review, Schlom and coworkers^{[66]} examine two cell-based vaccines, Sipuleucel-T (Provenge) and GVAX, in addition to three trials using an engineered poxvirus vector. Although their review article focuses on prostate cancer vaccines, the researchers consider these trials as examples of ongoing progress in similar vaccine therapies for lymphoma, melanoma, pancreatic, lung, and other types of cancer. According to their review of five prostate cancer vaccine trials, Schlom and colleagues^{[66]} offer evidence that patients who receive vaccines may respond better to subsequent chemotherapy or hormone treatment, leading to improved patient survival. However, the endpoints of these trials were not long-term survival but a reduction in tumor size. With this endpoint, such vaccines may be deemed ineffective and abandoned, because the primary endpoint (tumor size reduction) was not achieved, despite their real therapeutic value in prolonging patient survival. The data prompt the rethinking of clinical vaccine trial design and in particular, the current approach to measuring cancer vaccine effectiveness. In this case, it may seem more reasonable to think of the effectiveness of a therapeutic vaccine in terms of the response of the patient, rather than the response of the tumor. Although RECIST standards work well to evaluate therapies that are toxic to tumors, such as radiation therapy or chemotherapy, they are less capable of measuring more subtle systemic effects of immune response. With this in mind, patient response to therapy may be long-term, in which case other markers as surrogates for patient response to a vaccine's “signal of activity” may be considered. In particular, the pursuit of molecular biomarkers and their appropriate use as surrogate endpoints in clinical trials is emerging with advances in cell biology, genetics, microbiology, and other fields.
Surrogate Endpoint
The demand for new and improved biomarkers reflects a changing approach to drug development, due in part to genomic advances that have led to a better understanding of disease processes. One example of such progress in measuring biomarkers to guide therapeutic development is the potential use of human immunodeficiency virus (HIV) plasma RNA load (viral load) in AIDS. Gene target discoveries have opened many new ways to discover drugs. The use of molecular targets to design new chemical compounds has provided many new candidates for testing, resulting in a pressing need for more efficient ways to design trials. The identification of highly precise and accurate biomarkers could allow the testing of more candidate drugs, reduce the number of patients required to conduct trials, expand our capacity to predict adverse events, and potentially improve regulatory decision making. In oncology in particular, we are entering an era of more precise diagnoses and more informed choices about therapy. Biomarkers of immune system function and viral replication, such as CD4 cell counts and HIV viral load, have helped in the initial evaluation of new AIDS therapies, which have subsequently been demonstrated to extend life and improve its quality.
To understand the link between a biomarker, a clinical endpoint, and a surrogate endpoint, we must first clarify their definitions. A biomarker (or biologic marker) is a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention. A clinical endpoint is a characteristic that reflects patients’ responses (e.g., how they feel, function, or survive). These endpoints are distinct measures of disease characteristics that reflect the effect of a therapeutic intervention. A surrogate endpoint, in brief, is a biomarker intended to substitute for a clinical endpoint. Although all surrogate endpoints are biomarkers, not all biomarkers are surrogate endpoints; in fact, only a very few biomarkers may be considered for use as surrogate endpoints. To be considered as a surrogate endpoint, a biomarker must predict clinical benefit based on epidemiologic, therapeutic, pathophysiologic, or other scientific evidence. Additionally, the utility of a biomarker as a surrogate endpoint requires demonstration of its accuracy (the correlation of the measure with the clinical endpoint) and precision (the reproducibility of the measure).
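The two requirements just named, accuracy and precision, can be illustrated with a minimal sketch. It assumes a continuous clinical measure, uses the Pearson correlation as a stand-in for accuracy, and uses the coefficient of variation of replicate assays as a stand-in for precision. The function names and all numbers are hypothetical, and other measures may be more appropriate for binary or time-to-event endpoints.

```python
import numpy as np


def accuracy_correlation(marker, clinical):
    """Pearson correlation between marker values and a continuous clinical measure."""
    return float(np.corrcoef(marker, clinical)[0, 1])


def precision_cv(replicates):
    """Coefficient of variation (sample SD / mean) of repeated assays of one sample."""
    values = np.asarray(replicates, dtype=float)
    return float(values.std(ddof=1) / values.mean())


# Hypothetical data: marker levels and a clinical measure for five patients,
# plus three replicate assays of a single specimen.
marker_levels = [1.2, 2.5, 3.1, 4.8, 5.0]
clinical_measure = [10.0, 14.0, 15.5, 21.0, 22.5]
replicate_assays = [9.0, 10.0, 11.0]

print("accuracy (correlation):", accuracy_correlation(marker_levels, clinical_measure))
print("precision (CV):", precision_cv(replicate_assays))
```

A correlation near 1 with a small coefficient of variation would be encouraging, but neither quantity by itself establishes that the marker captures the treatment effect on the clinical endpoint.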
One approach to establishing the link between a biomarker and a clinical endpoint is to estimate the proportion of the treatment effect that is accounted for by the surrogate endpoint; there are several ways to make this determination.^{[67]} Strictly speaking, if a surrogate endpoint is to be a valid substitute for a clinical endpoint, the biomarker must be able to account for all of the effects of the intervention on the clinical outcome (endpoint). In practice, however, it may be too much to ask of a single biomarker to fully capture all of a treatment's effect. To this end, the use of multiple biomarkers representing various components of complex disease pathways may yield surrogate endpoints that are more comprehensive in their ability to assess the effects of therapeutic interventions.
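As an illustration of one such determination, the sketch below compares the treatment coefficient from a regression of the outcome on treatment alone with the coefficient obtained after adjusting for the marker, and reports one minus their ratio. This difference-in-coefficients estimate is only one of the several approaches mentioned above and is not necessarily the one used in the cited reference. The linear model, the function names, and the data are assumptions made for illustration.

```python
import numpy as np


def treatment_coefficient(outcome, treatment, covariates=()):
    """Least-squares coefficient of treatment in outcome ~ 1 + treatment + covariates."""
    columns = [np.ones(len(outcome)), np.asarray(treatment, dtype=float)]
    columns += [np.asarray(c, dtype=float) for c in covariates]
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, np.asarray(outcome, dtype=float), rcond=None)
    return float(beta[1])


def proportion_explained(treatment, surrogate, outcome):
    """1 - (adjusted treatment effect) / (unadjusted treatment effect)."""
    unadjusted = treatment_coefficient(outcome, treatment)
    adjusted = treatment_coefficient(outcome, treatment, covariates=[surrogate])
    return 1.0 - adjusted / unadjusted


# Hypothetical noise-free data: outcome = 2 * surrogate + 0.5 * treatment,
# so most, but not all, of the treatment effect passes through the surrogate.
treatment = np.array([0.0, 0.0, 1.0, 1.0])
surrogate = np.array([0.0, 1.0, 1.0, 2.0])
outcome = 2.0 * surrogate + 0.5 * treatment

print("proportion of treatment effect explained:",
      proportion_explained(treatment, surrogate, outcome))
```

A value of 1 would correspond to a marker that accounts for the entire treatment effect. In real data the estimate is a ratio of noisy quantities, can fall outside the interval from 0 to 1, and should be reported with an interval estimate.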
A timely example of such a need for multiple biomarkers lies in cancer vaccine trials, where the “signal of activity” of a vaccine may be defined in terms of three characteristics: clinical response, biologic activity, or immune response. For biologic activity, potential measures include regulatory T-cell activity, immune response against target cells, or molecular response (MRD). An immune profile may be assessed from sequential samples collected at several time points (e.g., baseline, follow-up visits) to assess the reproducibility of assay results. As for clinical activity, there is no current mandate to demonstrate clinical activity with conventional oncology endpoints in proof-of-principle trials; thus, end-stage patients are typically not enrolled, and a homogeneous population is selected. As for surrogate endpoints in trials with cancer vaccines, molecular response is being considered. Cancer vaccines are expected to work best in MRD populations. Molecular markers that allow uniform assessment of MRD and of the impact of a vaccine on the target disease may function as a measure of biologic and/or clinical activity. Examples include CML, with a well-defined chromosomal abnormality (BCR-ABL) that is detectable by reverse transcriptase–polymerase chain reaction (RT-PCR), and acute myeloid leukemia, in which the multiple heterogeneous chromosomal abnormalities are not present in all patients, so that an array of markers is required to determine biologic activity.
Monitoring
To control patients’ responses to therapies and maintain proper doses, monitoring is necessary. The advent of the genomics era has prompted interest in the potential use of molecular endpoints for monitoring patients’ responses to therapy. To gain a better perspective on this issue, this section revisits monitoring concepts introduced earlier in this chapter within the context of CML.
Historically, studies of CML therapies have used cytogenetic responses, the different phases of CML, remission, and even death to monitor treatment response and ascertain treatment efficacy. These endpoints carry the greatest weight in clinical practice, because observed differences in them signify tangible differences in treatment benefit. The effectiveness of molecular targeted therapies, such as tyrosine kinase inhibitors (TKIs), in helping patients with CML achieve MRD without eradicating the disease has prompted consideration of combination therapies in treating CML. As therapy for CML has improved, the low rate of disease progression has made it impractical to use clinical events as primary endpoints in trials of short duration. On the other hand, the ability to measure the amount (number of transcripts) of the hallmark gene in CML patients, bcr-abl, has made it possible to compare expression with the treatment received; this measure is a more sensitive marker of treatment effect than cytogenetic response and is able to detect MRD. Almost all CML therapy trials now incorporate this molecular assay as part of monitoring treatment response, yet bcr-abl as a marker for some clinical event is not accepted by the US Food and Drug Administration (FDA) for product registration for agents other than TKIs, and there are no formal guidelines on its use.
Although quantitative PCR testing of bcr-abl in patients with CML has become the predominant molecular monitoring technique for CML therapy, with correlations established with the probability of relapse, it is regarded as a risky endpoint from a regulatory perspective, with no absolute guidelines for monitoring. The difficulty with using changes in bcr-abl transcripts as a marker to evaluate treatment response is that the assay itself has a lower limit of detection and large variability. These issues are particularly evident in CML patients with MRD, an ever-increasing patient pool owing to the effectiveness of TKIs. Devising a monitoring strategy for CML patients is thus a challenge ideally suited for a comprehensive statistical approach, requiring the development of models fully informed by current biomedical knowledge, efficient inference that extracts maximal information, and a design that combines population data with individual data.
A marker endpoint is considered a good surrogate for a clinical endpoint if treatment effects on the marker reliably predict treatment effects on the clinical endpoint. For this condition to hold, the marker (1) must be correlated with the clinical outcome and (2) must fully capture the effect of the treatment on the clinical endpoint.
Whereas most candidate markers satisfy the first condition, the second is more stringent, so that many markers are only partial mediators of treatment response. Data now exist to support the importance of achieving molecular milestones when treating patients with TKIs. Newly diagnosed CML patients treated with the TKI Gleevec who achieved a 3 log reduction in bcr-abl transcripts at 12 months after its initiation have been shown to have better progression-free survival than patients who did not reach a 3 log decrease. This and related observations have led many investigators to use endpoints based on bcr-abl testing, although their surrogacy for a clinical endpoint has not been fully examined and validated. In seeking to improve on Gleevec's clinical success with new combinations, two critical issues must be addressed. The first is to determine whether additional reductions in PCR levels are of clinical importance (e.g., is a 3 log reduction as informative as a 4 log reduction or undetectable values?) and to translate such reductions into additional progression-free survival. The second is the need for ongoing modifications and enhancements of the molecular assay measuring bcr-abl. Disease burden measures at or near the assay's limit of detection seem to have large sample-to-sample variability. A criterion of “undetectable” as an endpoint can be confounded by sample handling, because undetectable values can be due to low-quality RNA or poor RNA yield; further, the limit of detection is sample- and run-specific.
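The arithmetic behind a log reduction, and the ambiguity of an “undetectable” result, can be shown with a short sketch. It assumes that transcript levels are expressed as a ratio to a control gene and that each run reports its own limit of detection. The function name, the units, and the numbers are hypothetical. An undetectable follow-up value is treated as censored, so the result is reported only as a lower bound on the reduction.

```python
import math
from typing import Optional, Tuple


def log_reduction(baseline: float,
                  follow_up: Optional[float],
                  limit_of_detection: float) -> Tuple[float, bool]:
    """Return (log10 reduction from baseline, is_lower_bound).

    follow_up is None, or below the run-specific limit of detection, when the
    transcript is undetectable. In that case the true reduction is only known
    to be at least log10(baseline / limit_of_detection).
    """
    if baseline <= 0 or limit_of_detection <= 0:
        raise ValueError("baseline and limit_of_detection must be positive")
    if follow_up is None or follow_up < limit_of_detection:
        return math.log10(baseline / limit_of_detection), True
    return math.log10(baseline / follow_up), False


# Hypothetical patient: baseline ratio of 100, follow-up ratio of 0.1.
reduction, censored = log_reduction(100.0, 0.1, limit_of_detection=0.001)
print(f"log reduction: {reduction:.2f} (lower bound only: {censored})")

# The same patient with an undetectable follow-up in a less sensitive run.
reduction, censored = log_reduction(100.0, None, limit_of_detection=0.01)
print(f"log reduction: {reduction:.2f} (lower bound only: {censored})")
```

Because the bound depends on the limit of detection of the particular run, two “undetectable” results from runs of different sensitivity need not represent the same depth of response.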
The ability of a trial to answer the posed clinical question depends on whether the marker endpoint is indeed a surrogate for the clinical outcome of interest. There is a clear need for guidelines on the use of bcr-abl transcript changes as a molecular endpoint to monitor and assess treatment response. Molecular techniques will play a large role in monitoring the progress of CML and in reassessing therapeutic strategies. Many groups are at work to create standards that can be used to cross-validate results from individual laboratories, with a focus on creating a standardized assay that is universally accepted and used. An alternative strategy has been to create a series of statistical models whose results will be used to devise a monitoring strategy that standardizes results. The potential use of molecular endpoints for monitoring patients’ response to therapy will have tremendous impact, especially with the advent of new efficient drugs and combination therapies, because results from clinical trials will be obtained faster than with a clinical endpoint and, in turn, clinicians will be able to adopt the new treatment faster.
Future Directions
The advances made in cancer biology have provided us with a better fundamental understanding of cancer, but progress in translating such knowledge into medical practice has been slow. As a result, the American Association for Cancer Research (AACR), together with the FDA and the National Cancer Institute (NCI), has formed the AACR-FDA-NCI cancer biomarkers collaborative to facilitate the use of valid biomarkers in clinical trials and ultimately in evidence-based oncology and cancer medicine. Research is under way to find new ways of using biomarkers in cancer detection and treatment. This will, in turn, require a new generation of clinical trials that modernize the processes and methods used to evaluate the safety and efficacy of novel, gene-based therapies without sacrificing high standards.
SUMMARY AND CONCLUSIONS
In developing new cancer treatments, investigators are often interested in treatment effects and differences that are about the same size as the variability or the bias that is a part of all clinical studies. The only solution for making valid inferences in the face of these potential errors is to properly design, conduct, and analyze clinical trials. A small number of important design considerations help control bias and random errors, including the use of randomization, blinding, stratification, minimizing post-entry exclusions, adequate sample size, and planned interim monitoring.
Clinical trials have limitations, partly because of the rigor required to implement them. Investigators contemplating the use of these important scientific tools should devote at least as much effort to the design aspects of the study as to its analysis. This is because most of the serious errors that can be made when performing clinical trials can be prevented or minimized by correct design. In this regard, consultation with an experienced clinical trial methodologist early in the design stage of an investigation will be of enormous benefit.
When analyzing and reporting the results of clinical trials, investigators should follow a simple approach. The purpose of a trial is to estimate an effect or treatment difference, which if present would have clinical utility when treating new patients. Procedures or methods that do not facilitate estimating and reporting the treatment effect with precision and without bias are likely to mislead investigators. Often in clinical trials, investigators are interested in estimates of odds or hazard ratios between treatment groups.
These ideas suggest that the most useful results from clinical trials will be estimated risk ratios and their confidence limits. Especially in oncology studies, where disease progression, recurrence, and death are of interest, estimates of risk ratios are very relevant. Hypothesis tests and associated P values, though often (or exclusively) reported, are of lesser utility because they do not fully summarize the data. These recommendations are similar to those in many journals.
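To make the recommendation concrete, the sketch below estimates a risk ratio and approximate 95% confidence limits from the event counts in two treatment groups, using a large-sample interval computed on the log scale. The function name and the counts are hypothetical. For time-to-event endpoints with censoring, a hazard ratio from an appropriate survival model would be estimated instead.

```python
import math
from typing import Tuple


def risk_ratio_ci(events_treated: int, n_treated: int,
                  events_control: int, n_control: int,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Risk ratio (treated / control) with a log-scale Wald confidence interval."""
    if events_treated <= 0 or events_control <= 0:
        raise ValueError("both groups need at least one event for this interval")
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    rr = risk_treated / risk_control
    # Large-sample standard error of log(RR).
    se_log_rr = math.sqrt(1 / events_treated - 1 / n_treated
                          + 1 / events_control - 1 / n_control)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical trial: 30 of 100 treated patients and 40 of 100 controls progress.
rr, lower, upper = risk_ratio_ci(30, 100, 40, 100)
print(f"risk ratio = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Reporting the estimate together with its interval conveys both the size of the effect and its precision, which a P value alone does not.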
Despite some technical disagreement among statisticians regarding the need for adjusted analyses for imbalanced prognostic factors, we believe that it is wise to see if treatment effects change after accounting for imbalances. When this occurs, it seems likely that it will be of clinical interest. Although we discourage analyses that exclude any patients who meet the eligibility criteria, some circumstances will require that this be done (e.g., when a patient refuses to participate after randomization). Investigators should report, and emphasize as primary, those analyses that include all eligible patients.
REFERENCES