KEY CONCEPTS

PRESENTING PROBLEMS
Presenting Problem 1
In Chapter 8 we examined the study by Jackson and colleagues (2002), who evaluated the relationship between BMI and percent body fat. Please refer to that chapter for more details on the study. We found a significant relationship between these two measures and calculated a correlation coefficient of r = 0.73. These investigators knew, however, that variables other than BMI may also affect the relationship between BMI and percent body fat, and they developed separate models for men and women. We use their data in this chapter to illustrate two important procedures: multiple regression to control possible confounding variables, and polynomial regression to model the nonlinear relationship we noted in Chapter 8. Data are on the CD-ROM in a file entitled "Jackson."
Presenting Problem 2
Soderstrom and coinvestigators (1997) wanted to develop a model to identify trauma patients who are likely to have a blood alcohol concentration (BAC) in excess of 50 mg/dL. They evaluated data from a clinical trauma registry and toxicology database at a level I trauma center. Such patients might be candidates for alcohol and drug abuse and dependence treatment and intervention programs.
Data, including BAC, were available on 11,062 patients, of whom approximately 71% were male and 65% were white. The mean age was 35 years with a standard deviation of 17 years. Type of injury was classified as unintentional, typically accidental (78.2%), or intentional, including suicide attempts (21.8%). Of these patients, 3180 (28.7%) had alcohol detected in the blood, and 91.2% of those patients had a BAC in excess of 50 mg/dL. Among the patients with a BAC > 50 mg/dL, percentages of men and whites did not differ appreciably from the entire sample; however, the percentage of intentional injuries in this group was higher (28.9%). We use a random sample of data provided by the investigators to illustrate the calculation and interpretation of the logistic model, the statistical method they used to develop their predictive model. Data are in a file called "Soderstrom" on the CD-ROM.
Presenting Problem 3
In the previous chapter we used data from a study by Crook and colleagues (1997) to illustrate the Kaplan–Meier survival analysis method. These investigators studied both the pretreatment prostate-specific antigen (PSA) level and the posttreatment nadir PSA level in men with localized prostate cancer who were treated using external beam radiation therapy. The Gleason histologic scoring system was used to classify tumors on a scale of 2 to 10. Please refer to Chapter 9 for more details. The investigators wanted to examine factors other than tumor stage that might be associated with treatment failure, and we use observations from their study to describe an application of the Cox proportional hazards model. Data on the patients are given in the file entitled "Crook" on the CD-ROM.
Presenting Problem 4
The use of central venous catheters to administer parenteral nutrition, fluids, or drugs is a common medical practice. Catheter-related bloodstream infections (CRBSI) are a serious complication, estimated to occur in about 200,000 patients each year. Many studies have suggested that impregnation of the catheter with the antiseptic chlorhexidine/silver sulfadiazine reduces bacterial colonization, but only one study has shown a significant reduction in the incidence of bloodstream infections.
It is difficult for physicians to interpret the literature when studies report conflicting results about the benefits of a clinical intervention or practice. As you now know, studies frequently fail to find significance because of the low power associated with small sample sizes. Traditionally, conflicting results in medicine are dealt with by reviewing many studies published in the literature and summarizing their strengths and weaknesses in what are commonly called review articles. Veenstra and colleagues (1999) used a more structured method to combine the results of several studies in a statistical manner. They applied meta-analysis to 11 randomized controlled clinical trials comparing the incidence of bloodstream infection with impregnated versus nonimpregnated catheters, so that overall conclusions regarding efficacy of the practice could be drawn. The section titled "Meta-Analysis" summarizes the results.
PURPOSE OF THE CHAPTER
The purpose of this chapter is to present a conceptual framework that applies to almost all the statistical procedures discussed so far in this text. We also describe some of the more advanced techniques used in medicine.
A Conceptual Framework
The previous chapters illustrated statistical techniques that are appropriate when the number of observations on each subject in a study is limited. For example, a t test is used when two groups of subjects are studied and the measure of interest is a single numerical variable, such as in Presenting Problem 1 in Chapter 6, which discussed differences in pulse oximetry in patients who did and did not have a pulmonary embolism (Kline et al, 2002). When the outcome of interest is nominal, the chi-square test can be used, such as in the Lapidus et al (2002) study of screening for domestic violence in the emergency department (Chapter 6, Presenting Problem 3). Regression analysis is used to predict one numerical measure from another, such as in the study predicting insulin sensitivity in hyperthyroid women (Gonzalo et al, 1996; Chapter 7, Presenting Problem 2).
Alternatively, each of these examples can be viewed conceptually as involving a set of subjects with two observations on each subject: (1) for the t test, one numerical variable, pulse oximetry, and one nominal (or group membership) variable, development of pulmonary embolism; (2) for the chi-square test, two nominal variables, training in domestic violence and screening in the emergency department; (3) for regression, two numerical variables, insulin sensitivity and body mass index. It is advantageous to look at research questions from this perspective because the ideas are analogous to situations in which many variables are included in a study.
To practice viewing research questions from a conceptual perspective, let us reconsider Presenting Problem 1 in Chapter 7 by Woeber (2002). The objective was to determine whether serum free T_{4} concentrations differ among patients who had thyroiditis with normal serum TSH values and were not taking LT_{4} replacement, patients who had thyroiditis with normal TSH values and were taking LT_{4} replacement therapy, and patients with normal thyroid function and normal serum TSH levels. The research question in this study may be viewed as involving a set of subjects with two observations per subject: one numerical variable, serum free T_{4} concentration, and one nominal (or group membership) variable, thyroid status, with three categories. If only two categories were included for thyroid status, the t test would be used. With more than two groups, however, one-way analysis of variance (ANOVA) is appropriate.
Many problems in medicine have more than two observations per subject because of the complexity involved in studying disease in humans. In fact, many of the presenting problems used in this text have multiple observations, although we chose to simplify the problems by examining only selected variables. One method involving more than two observations per subject has already been discussed: two-way ANOVA. Recall that in Presenting Problem 2 in Chapter 7, insulin sensitivity was examined in overweight and normal-weight women with and without hyperthyroid disease (Gonzalo et al, 1996). For this analysis, the investigators classified women according to two nominal variables (weight status and thyroid status, both measured as normal or higher than normal) and one numerical variable, insulin sensitivity. (Although both weight and thyroid level are actually numerical measures, the investigators transformed them into nominal variables by dividing the values into two categories.)
If the term independent variable is used to designate the group membership variables (eg, development of pulmonary embolism or not), or the X variable (eg, blood pressure measured by a finger device), and the term dependent is used to designate the variables whose means are compared (eg, pulse oximetry), or the Y variable (eg, blood pressure measured by the cuff device), the observations can be summarized as in Table 10-1. (For the sake of simplicity, this summary omits ordinal variables; variables measured on an ordinal scale are often treated as if they are nominal.) Data from several of the presenting problems are available on the CD-ROM, and we invite you to replicate the analyses as you go through this chapter.
Table 10-1. Summary of conceptual framework^{a} for questions involving two variables. 


Introduction to Methods for Multiple Variables
Statistical techniques involving multiple variables are used increasingly in medical research, and several of them are illustrated in this chapter. The multiple-regression model, in which several independent variables are used to explain or predict the values of a single numerical response, is presented first, partly because it is a natural extension of the regression model for one independent variable illustrated in Chapter 8. More importantly, however, all the other advanced methods except meta-analysis can be viewed as modifications or extensions of the multiple-regression model. All except meta-analysis involve more than two observations per subject and are concerned with explanation or prediction.
The goal in this chapter is to present the logic of the different methods listed in Table 10-2 and to illustrate how they are used and interpreted in medical research. These methods are generally not mentioned in traditional introductory texts, and most people who take statistics courses do not learn about them until their third or fourth course. These methods are being used more frequently in medicine, however, partly because of the increased involvement of statisticians in medical research and partly because of the availability of complex statistical computer programs. In truth, few of these methods would be used very much in any field were it not for computers, because of the time-consuming and complicated computations involved. To read the literature with confidence, especially studies designed to identify prognostic or risk factors, a reasonable acquaintance with the methods described in this chapter is required. Few of the available elementary books discuss multivariate methods. One that is directed toward statisticians is nevertheless quite readable (Chatfield, 1995); Katz (1999) is intended for readers of the medical literature and contains explanations of many of the topics we discuss in this chapter (Dawson, 2000), as does Norman and Streiner (1996).
Before we examine the advanced methods, however, a comment on terminology is necessary. Some statisticians reserve the term “multivariate” to refer to situations that involve more than one dependent (or response) variable. By this strict definition, multiple regression and most of the other methods discussed in this chapter would not be classified as multivariate techniques. Other statisticians, ourselves included, use the term to refer to methods that examine the simultaneous effect of multiple independent variables. By this definition, all the techniques discussed in this chapter (with the possible exception of some metaanalyses) are classified as multivariate.
MULTIPLE REGRESSION
Review of Regression
Simple linear regression (Chapter 8) is the method of choice when the research question is to predict the value of a response (dependent) variable, denoted Y, from an explanatory (independent) variable X. The regression model is

Y = a + bX
For simplicity of notation in this chapter we use Y to denote the dependent variable, even though Y′, the predicted value, is actually given by this equation. We also use a and b, the sample estimates, instead of the population parameters β_{0} and β_{1}, where a is the intercept and b the regression coefficient. Please refer to Chapter 8 if you'd like to review simple linear regression.
Multiple Regression
The extension of simple regression to two or more independent variables is straightforward. For example, if four independent variables are being studied, the multiple regression model is

Y = a + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4}
where X_{1} is the first independent variable and b_{1} is the regression coefficient associated with it, X_{2} is the second independent variable and b_{2} is the regression coefficient associated with it, and so on. This arithmetic equation is called a linear combination; thus, the response variable Y can be expressed as a (linear) combination of the explanatory variables. Note that a linear combination is really just a weighted average that gives a single number (or index) after the X's are multiplied by their associated b's and the bX products are added. The formulas for a and b were given in Chapter 8; we do not give the formulas in multiple regression because they become more complex as the number of independent variables increases, and no one calculates them by hand in any case.
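As a concrete illustration, the linear combination above can be computed directly. The intercept, coefficients, and subject values below are hypothetical numbers invented for this sketch, not values from any study in this chapter:

```python
# Hypothetical intercept, coefficients, and subject values -- for
# illustration only, not taken from any study in this chapter.
a = 2.0                          # intercept
b = [0.5, -1.2, 0.3, 2.0]        # regression coefficients b1..b4
x = [10.0, 3.0, 7.0, 1.0]        # one subject's values of X1..X4

# The predicted Y is the intercept plus the weighted sum of the X's:
# each X is multiplied by its b, and the bX products are added.
y_pred = a + sum(bi * xi for bi, xi in zip(b, x))
print(y_pred)   # → 7.5
```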
Table 10-2. Summary of conceptual framework^{a} for questions involving two or more independent (explanatory) variables. 


The dependent variable Y must be a numerical measure. The traditional multiple-regression model calls for the independent variables to be numerical measures as well; however, nominal independent variables may be used, as discussed in the next section. To summarize, the appropriate technique for numerical independent variables and a single numerical dependent variable is the multiple regression model, as indicated in Table 10-2.
Multiple regression can be difficult to interpret, and the results may not be replicable if the independent variables are highly correlated with each other. In the extreme situation, two variables that are perfectly correlated are said to be collinear. When multicollinearity occurs, the variances of the regression coefficients are large so the observed value may be far from the true value. Ridge regression is a technique for analyzing multiple regression data that suffer from multicollinearity by reducing the size of standard errors. It is hoped that the net effect will be to give more reliable estimates. Another regression technique, principal components regression, is also available, but ridge regression is the more popular of the two methods.
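To make the idea concrete, here is a minimal sketch of ridge regression in Python with NumPy, using simulated data with two nearly collinear predictors. The penalty value lam = 1.0 is an arbitrary choice for illustration; real analyses choose the penalty more carefully and usually leave the intercept unpenalized:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: two nearly collinear predictors (hypothetical values).
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)   # almost an exact copy of x1
X = np.column_stack([np.ones(50), x1, x2])  # column of 1s for the intercept
y = 3.0 + 2.0 * x1 + rng.normal(scale=0.5, size=50)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^(-1) X'y.
    lam = 0 gives ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = ridge(X, y, 0.0)     # OLS: the individual coefficients of x1 and
                             # x2 are poorly determined (near-singular X'X)
b_ridge = ridge(X, y, 1.0)   # the penalty shrinks them toward stable values
print(b_ols[1:], b_ridge[1:])
```

Because x1 and x2 carry essentially the same information, only their combined effect (about 2.0 here) is well determined; ridge trades a small bias for much smaller standard errors on the individual coefficients.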
Interpreting the Multiple Regression Equation
Jackson and colleagues (2002) (Presenting Problem 1) wanted to study the way in which sex, age, and race affect the relationship between BMI and percent body fat. We provide some basic information on these variables in Table 10-3 and see that the study included 121 black women, 238 white women, 81 black men, and 215 white men.
Table 10-4 shows the regression equation to predict percent body fat (see the shaded values). Focusing initially on the Regression Equation Section, we see that all the variables are significantly related to percent body fat.
Table 10-3. Means and standard deviations broken down by gender and race. 


The first variable is a numerical variable, age, with regression coefficient, b, of 0.1603, indicating that greater age is associated with higher percent body fat. The second variable, BMI, is also numerical; the regression coefficient of 1.3710 indicates that patients with higher BMI also have higher percent body fat, which certainly makes sense.
The third variable, sex, is a binary variable having two values. For regression models it is convenient to code binary variables as 0 and 1; in the Jackson example, females have a 0 code for sex, and males have a 1. This procedure, called dummy or indicator coding, allows investigators to include nominal variables in a regression equation in a straightforward manner. The dummy variables are interpreted as follows: A subject who is male has the code for males, 1, multiplied by the regression coefficient for sex, 1.3710, resulting in an additional 1.3710 points being added to his percent body fat. The decision of which value is assigned 1 and which is assigned 0 is an arbitrary decision made by the researcher but can be chosen to facilitate interpretations of interest to the researcher.
The final variable is race, also dummy coded, with 0 for black and 1 for white. The regression coefficient is negative, indicating that white patients have 0.9161 subtracted from their predicted percent body fat. The intercept itself is −8.3748; this amount is subtracted in every prediction, regardless of the values of the other variables. The regression coefficients can be used to predict percent body fat by multiplying a given patient's value for each independent variable X by the corresponding regression coefficient b and then summing to obtain the predicted percent body fat.
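The prediction just described can be sketched in Python. The coefficients are those quoted above from Table 10-4; the intercept is written as negative because the text states that it reduces the prediction, and the example subject is hypothetical:

```python
# Coefficients as quoted in the text (Table 10-4). The intercept is
# written as negative because the text says it reduces the prediction.
intercept = -8.3748
b_age, b_bmi, b_sex, b_race = 0.1603, 1.3710, 1.3710, -0.9161

def predict_pbf(age, bmi, male, white):
    """Predicted percent body fat; male and white are dummy-coded 0/1."""
    return intercept + b_age * age + b_bmi * bmi + b_sex * male + b_race * white

# Hypothetical subject: a 30-year-old white man with a BMI of 25.
print(round(predict_pbf(30, 25, male=1, white=1), 4))   # → 31.1641
```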
Table 10-4. Multiple regression predicting percent body fat. 


Regression coefficients are interpreted differently in multiple regression than in simple regression. In simple regression, the regression coefficient b indicates the amount the predicted value of Y changes each time X increases by 1 unit. In multiple regression, a given regression coefficient indicates how much the predicted value of Y changes each time X increases by 1 unit, holding the values of all other variables in the regression equation constant, as though all subjects had the same values on the other variables. For example, predicted percent body fat increases by 0.1603 for each 1-year increase in patient age, assuming all other variables are held constant. This feature of multiple regression makes it an ideal method to control for baseline differences and confounding variables, as we discuss in the section titled "Controlling for Confounding."
It bears repeating that multiple regression measures only the linear relationship between the independent variables and the dependent variable, just as in simple regression. In the Jackson study, the authors examined the scatterplot between BMI and percent body fat, which we have reproduced in Figure 10-1. The figure indicates a curvilinear relationship, and the investigators decided to transform BMI by taking its natural logarithm. They developed four models for females and males separately to examine the cumulative effect of including variables in the regression equation; the results are reproduced in Table 10-5. Model I includes only ln BMI and the intercept; model II adds age, model III adds race, and model IV adds the interactions of ln BMI with race and age. The rationale for including interactions is the same as discussed in Chapter 7: the investigators wanted to know whether the relationship between ln BMI and percent body fat was the same for all levels of race and age.
Figure 10-1. Plot illustrating the nonlinear relationship between BMI and percent body fat. (Data, used with permission, from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC, et al: The effect of sex, age, and race on estimating percentage body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002; 26: 789–796. Analysis produced using NCSS; used with permission.) 

Table 10-5. Results from the regression analyses predicting percent body fat. 


Statistical Tests for the Regression Coefficient
Table 10-6 shows the output from NCSS for model III for female subjects; it contains a number of features to discuss. In the upper half of the table, note the columns headed t value and probability level. Either the t test or the F test can be used to determine whether a regression coefficient is different from zero, and the t distribution can be used to form confidence intervals for each regression coefficient. Remember that even though P values are sometimes reported as 0.000, there is always some probability, even if it is very small. Many statisticians believe, and we agree, that it is more accurate to report P < 0.001.
Standardized Regression Coefficients
Most authors present regression coefficients that can be used with individual subjects to obtain predicted Y values. The size of these regression coefficients cannot be used to decide which independent variables are the most important, however, because their size is also related to the scale on which the variables are measured, just as in simple regression. For example, in Jackson and colleagues' study, the variable race was coded 1 if white and 0 if black, and the variable age was coded as the number of years of age at the time of the first data collection. If race and age were equally important in predicting percent body fat, the regression coefficient for race would need to be much larger than the regression coefficient for age so that the same amount would be added to the predicted percent body fat. These regression coefficients are sometimes called unstandardized; they cannot be used to draw conclusions about the importance of a variable, but only about whether its relationship with the dependent variable Y is positive or negative.^{a} One way to eliminate the effect of scale is to standardize the regression coefficients. Standardization can be done by subtracting the mean value of each X and dividing by its standard deviation before the analysis, so that all variables have a mean of 0 and a standard deviation of 1. Then it is possible to compare the magnitudes of the regression coefficients and draw conclusions about which explanatory variables play an important role. It is also possible to calculate the standardized regression coefficients after the regression model has been developed.^{b} The larger the standardized coefficient, the larger the value of the t statistic. Standardized regression coefficients are often referred to as beta (β) coefficients. The major disadvantage of standardized regression coefficients is that they cannot readily be used to predict outcome values. 
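A brief simulation can illustrate the relationship between unstandardized and standardized coefficients. The data below are simulated, not the Jackson data; the variable roles (an age in years and a 0/1 dummy) are chosen only to mirror the discussion:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated data: age in years and a 0/1 dummy variable (hypothetical).
age = rng.normal(40, 15, n)
race = (rng.random(n) < 0.5).astype(float)
y = 0.1 * age + 4.0 * race + rng.normal(scale=2.0, size=n)

def fit(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]                      # drop the intercept

b = fit(np.column_stack([age, race]), y)              # unstandardized
z = lambda v: (v - v.mean()) / v.std()                # mean 0, SD 1
beta = fit(np.column_stack([z(age), z(race)]), z(y))  # standardized (beta)

# The unstandardized b's reflect each X's scale; the betas are directly
# comparable, since beta_j = b_j * SD(X_j) / SD(Y).
print(b, beta)
```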
The lower half of Table 10-6 contains the standardized regression coefficients in the far right column for the variables used to predict percent body fat in Jackson and colleagues' study. Using the standardized coefficients in Table 10-6, can you determine which variable, age or race, has more influence in predicting percent body fat? If you chose age, you are correct, because the absolute value of its standardized coefficient is larger: 0.1981, compared with 0.0777 for race.
Table 10-6. Regression analysis of females, model III. 


Multiple R
Multiple R is the multiple-regression analogue of the Pearson product moment correlation coefficient r. It is also called the coefficient of multiple determination, but most authors use the shorter term. As an example, suppose predicted percent body fat is calculated for each person in the study by Jackson and colleagues; then the correlation between predicted percent body fat and actual percent body fat is calculated. This correlation is the multiple R. If the multiple R is squared (R^{2}), it measures how much of the variation in actual percent body fat is accounted for by knowing the information included in the regression equation. The term R^{2} is interpreted in exactly the same way as r^{2} in simple correlation and regression, with 0 indicating no variance accounted for and 1.00 indicating 100% of the variance accounted for. Recall that in simple regression, the correlation between the actual value Y of the dependent variable and the predicted value, denoted Y′, is the same as the correlation between the dependent variable and the independent variable; that is, r_{YY′} = r_{XY}. Thus, R and R^{2} in multiple regression play the same role as r and r^{2} in simple regression. The statistical test for R and R^{2}, however, uses the F distribution instead of the t distribution.
The computations are time-consuming, but fortunately, computers do them for us. Jackson and colleagues included R^{2} in Table 10-5 (although they used lowercase r^{2}); it was 0.81 for model III (and is also shown in the NCSS output in Table 10-4). After ln BMI, age, and race are entered into the regression equation, R^{2} = 0.81 indicates that more than 80% of the variability in percent body fat is accounted for by knowing patients' BMI, age, and race. Because R^{2} is less than 1, we know that factors other than those included in the study also play a role in determining a person's percent body fat.
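The equivalence between the squared correlation of predicted with actual values and the proportion of variance explained can be checked numerically; this sketch uses simulated data rather than the Jackson data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Simulated data with two predictors (illustration only).
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ coef                       # predicted values Y'

# Multiple R is the correlation between actual and predicted Y ...
R = np.corrcoef(y, y_hat)[0, 1]

# ... and R^2 equals the proportion of variance accounted for.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(R ** 2, r_squared)   # the two quantities agree
```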
Selecting Variables for Regression Models
The primary purpose of Jackson and colleagues in their study of BMI and percent body fat was explanation; they used multiple regression analysis to learn how specific characteristics confounded the relationship between BMI and percent body fat. They also wanted to know how the characteristics, such as gender and race, interacted with one another. Some research questions, however, focus on prediction of the outcome, such as using the regression equation to predict percent body fat in future subjects.
Deciding on the variables that provide the best prediction is a process sometimes referred to as model building and is exemplified in Table 105. Selecting the variables for regression models can be accomplished in several ways. In one approach, all variables are introduced into the regression equation, called the “enter” method in SPSS and used in the multiple regression procedure in NCSS. Then, especially if the purpose is prediction, the variables that do not have significant regression coefficients are eliminated from the equation. The regression equation may be recalculated using only the variables retained because the regression coefficients have different values when some variables are removed from the analysis.
Computer programs also contain routines to select an optimal set of explanatory variables. One such procedure is called forward selection. Forward selection begins with one variable in the regression equation; then, additional variables are added one at a time until all statistically significant variables are included in the equation. The first variable in the regression equation is the X variable that has the highest correlation with the response variable Y. The next X variable considered for the regression equation is the one that increases R^{2} by the largest amount. If the increment in R^{2} is statistically significant by the F test, it is included in the regression equation. This step-by-step procedure continues until no X variables remain that produce a significant increase in R^{2}. The values for the regression coefficients are calculated, and the regression equation resulting from this forward selection procedure can be used to predict outcomes for future subjects. The increment in R^{2} was calculated by Jackson and colleagues; it is shown as r^{2}Δ in Table 10-5.
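The forward selection loop can be sketched as follows. This toy version uses a fixed R^{2} increment as the stopping rule instead of the F test described above, and the data are simulated:

```python
import numpy as np

def r2(Xcols, y):
    """R^2 of a least-squares fit on the given columns (plus an intercept)."""
    X = np.column_stack([np.ones(len(y))] + Xcols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, min_gain=0.01):
    """Greedy forward selection: repeatedly add the column that most
    increases R^2, stopping when no candidate improves it by at least
    min_gain. (A real analysis would use an F test, not a fixed cutoff.)"""
    chosen, best = [], 0.0
    while True:
        gains = {j: r2([X[:, k] for k in chosen + [j]], y) - best
                 for j in range(X.shape[1]) if j not in chosen}
        if not gains:
            break
        j = max(gains, key=gains.get)
        if gains[j] < min_gain:
            break
        chosen.append(j)
        best += gains[j]
    return chosen

# Simulated data: only columns 0 and 2 actually predict y.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
print(forward_select(X, y))   # selects the two informative columns
```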
A similar backward elimination procedure can also be used; in it, all variables are initially included in the regression equation. The X variable that would reduce R^{2} by the smallest increment is removed from the equation. If the resulting decrease is not statistically significant, that variable is permanently removed from the equation. Next, the remaining X variables are examined to see which produces the next smallest decrease in R^{2}. This procedure continues until the removal of an X variable from the regression equation causes a significant reduction in R^{2}. That X variable is retained in the equation, and the regression coefficients are calculated.
When features of both the forward selection and the backward elimination procedures are used together, the method is called stepwise regression (stepwise selection). Stepwise selection is commonly used in the medical literature; it begins in the same manner as forward selection. After each addition of a new X variable to the equation, however, all previously entered X variables are checked to see whether they maintain their level of significance. Previously entered X variables are retained in the regression equation only if their removal would cause a significant reduction in R^{2}. The forward versus backward versus stepwise procedures have subtle advantages related to the correlations among the independent variables that cannot be covered in this text. They do not generally produce identical regression equations, but conceptually, all approaches determine a “parsimonious” equation using a subset of explanatory variables.
Some statistical programs examine all possible combinations of predictor values and determine the one that produces the overall highest R^{2}, such as All Possible Regression in NCSS. We do not recommend this procedure, however, and suggest that a more appealing approach is to build a model in a logical way. Variables are sometimes grouped according to their function, such as all demographic characteristics, and added to the regression equation as a group or block; this process is often called hierarchical regression; see exercise 7 for an example. The advantage of a logical approach to building a regression model is that, in general, the results tend to be more stable and reliable and are more likely to be replicated in similar studies.
Polynomial Regression
Polynomial regression is a special case of multiple regression in which each term in the equation is a power of X. Polynomial regression provides a way to fit a regression model to curvilinear relationships and is an alternative to transforming the data to a linear scale. For example, the following equation can be used to predict a quadratic relationship:
If linear and quadratic terms do not provide an adequate fit, a cubic term, a fourth-power term, and so on, can also be included until an adequate fit is obtained.
Jackson and colleagues (2002) used polynomial regression to fit separate curves for men and women, illustrated in Figure 10-1. Two approaches to polynomial regression can be used. The first method calculates squared terms, cubic terms, and so on; these terms are then entered one at a time using multiple regression. Another approach is to use a program that permits curve fitting, such as the regression curve estimation procedure in SPSS. We used the SPSS procedure to fit a quadratic curve of BMI to percent body fat for women. The resulting quadratic regression equation and plot produced by SPSS are given in Figure 10-2.
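The first approach, entering BMI and BMI squared as two predictors in an ordinary multiple regression, can be sketched with simulated data. The coefficients below are invented for illustration and are not the values fitted by Jackson and colleagues:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated curvilinear data (not the Jackson values): percent body fat
# rises with BMI but levels off at higher BMI.
bmi = rng.uniform(18, 40, 150)
pbf = -30 + 4.0 * bmi - 0.05 * bmi ** 2 + rng.normal(scale=2.0, size=150)

# Fit Y' = a + b1*BMI + b2*BMI^2 by treating BMI and BMI^2 as two
# predictors in an ordinary multiple regression.
X = np.column_stack([np.ones(bmi.size), bmi, bmi ** 2])
(a, b1, b2), *_ = np.linalg.lstsq(X, pbf, rcond=None)

print(a, b1, b2)   # b2 comes out negative: the curve bends downward
```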
Missing Observations
When studies involve several variables, some observations on some subjects may be missing. Controlling the problem of missing data is easier in studies in which information is collected prospectively; it is much more difficult when information is obtained from already existing records, such as patient charts. Two important factors are the percentage of observations that are missing and whether missing observations are randomly missing or missing because of some causal factor.
Figure 10-2. Linear and quadratic curves for the relationship between BMI and percent body fat in females. (Data, used with permission, from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC, et al: The effect of sex, age, and race on estimating percentage body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002; 26: 789–796. Plot produced with SPSS Inc.; used with permission.) 
For example, suppose a researcher designs a case–control study to examine the effect of leg length inequality on the incidence of loosening of the femoral component after total hip replacement. Cases are patients who developed loosening of the femoral component, and controls are patients who did not. In reviewing the records of routine follow-up, the researcher found that leg length inequality was measured in some patients by using weight-bearing anterior–posterior (AP) hip and lower extremity films, whereas other patients had measurements taken using non-weight-bearing films. The type of film ordered during follow-up may well be related to whether the patient complained of hip pain; patients with symptoms were more likely to have received the weight-bearing films, and patients without symptoms were more likely to have had the routine non-weight-bearing films. A researcher investigating this question must not base the leg length inequality measures on weight-bearing films only, because controls are less likely than cases to have weight-bearing film measures in their records. In this situation, the missing leg length information occurred because of symptoms and not randomly.
The potential for missing observations increases in studies involving multiple variables. Depending on the cause of the missing observations, solutions include dropping subjects who have missing observations from the study, deleting variables that have missing values from the study, or substituting some value for the missing data, such as the mean or a predicted value, a process called imputation. SPSS has an option to estimate missing data with the mean for that variable, calculated from the subjects who have the data. The Data Screening procedure (in Descriptive Statistics) in NCSS provides the option of substituting either the mean or a predicted score. Investigators in this situation should seek advice from a statistician on the best way to handle the problem.
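The mean-substitution option described above can be sketched in a few lines of Python; this is a hypothetical illustration of the idea, not the SPSS or NCSS implementation:

```python
import numpy as np

def impute_mean(values):
    """Replace missing observations (NaN) with the mean of the
    observed values, mirroring the mean-substitution option in
    the text. Statistical packages offer more sophisticated
    choices, such as substituting a predicted score."""
    arr = np.asarray(values, dtype=float)
    observed = arr[~np.isnan(arr)]
    filled = arr.copy()
    filled[np.isnan(filled)] = observed.mean()
    return filled

# Hypothetical BMI values for five subjects, one missing
bmi = [22.0, 25.0, float("nan"), 28.0, 31.0]
print(impute_mean(bmi))  # missing value replaced by (22+25+28+31)/4 = 26.5
```

Mean substitution preserves the sample size but shrinks the variable's variance, one reason the text advises consulting a statistician.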
Cross Validation
The statistical procedures for all regression models are based on correlations among the variables, which, in turn, are related to the amount of variation in the variables included in the study. Some of the observed variation in any variable, however, occurs simply by chance; and the same degree of variation does not occur if another sample is selected and the study is replicated. The mathematical procedures for determining the regression equation cannot distinguish between real and chance variation. If the equation is to be used to predict outcomes for future subjects, it should therefore be validated on a second sample, a process called cross-validation. The regression equation is used to predict the outcome in the second sample, and the predicted outcomes are compared with the actual outcomes; the correlation between the predicted and actual values indicates how well the model fits. Cross-validating the regression equation gives a realistic evaluation of the usefulness of the prediction it provides.
In medical research we rarely have the luxury of cross-validating the findings on another sample of the same size. Several alternative methods exist. First, researchers can hold out a proportion of the subjects for cross-validation, perhaps 20% or 25%. The holdout sample should be randomly selected from the entire sample prior to the original analysis. The predicted outcomes in the holdout sample are compared with the actual outcomes, often using R^{2} to judge how well the findings cross-validate.
Another method is the jackknife, in which one observation, call it x_{1}, is left out of the sample; the regression is performed using the remaining n – 1 observations, and the resulting equation is applied to x_{1}. Then this observation is returned to the sample, and another, x_{2}, is held out. This process continues until there is a predicted outcome for each observation in the sample; the predicted and actual outcomes are then compared.
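The jackknife (leave-one-out) procedure can be sketched for simple linear regression; the data here are hypothetical:

```python
import numpy as np

def jackknife_predictions(x, y):
    """Leave-one-out ('jackknife') cross-validation for simple
    linear regression: each observation is predicted from a model
    fitted to the remaining n - 1 observations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # hold out observation i
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        preds[i] = intercept + slope * x[i]     # predict the held-out point
    return preds

x = np.array([1, 2, 3, 4, 5, 6], float)
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
p = jackknife_predictions(x, y)
print(np.corrcoef(p, y)[0, 1])  # agreement between predicted and actual
```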
The bootstrap method works in a similar manner, although the goal is different: the bootstrap can be used with small samples to estimate the standard error and confidence intervals. A small holdout sample is randomly selected and the statistic of interest calculated. Then the holdout sample is returned to the original sample, and another holdout sample is selected. After a fairly large number of samples is analyzed, generally a minimum of 200, standard errors and confidence intervals can be estimated. In essence, the bootstrap method uses the data themselves to determine the sampling distribution rather than the central limit theorem discussed in Chapter 4.
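A minimal sketch of the usual resampling-with-replacement implementation of the bootstrap follows; the data (ages of ten hypothetical patients) and the choice of the mean as the statistic are illustrative:

```python
import numpy as np

def bootstrap_ci(sample, stat=np.mean, n_boot=1000, seed=0):
    """Bootstrap standard error and 95% confidence interval:
    resample the data with replacement many times (well above the
    minimum of 200 noted in the text) and use the spread of the
    resampled statistics as the sampling distribution."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, float)
    boots = np.array([stat(rng.choice(sample, size=sample.size, replace=True))
                      for _ in range(n_boot)])
    se = boots.std(ddof=1)                       # bootstrap standard error
    lo, hi = np.percentile(boots, [2.5, 97.5])   # percentile 95% CI
    return se, (lo, hi)

ages = [35, 22, 41, 30, 28, 55, 19, 33, 47, 26]  # hypothetical patient ages
se, ci = bootstrap_ci(ages)
print(se, ci)
```

No normality assumption is needed; the resampled statistics stand in for the theoretical sampling distribution.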
Both the jackknife and the bootstrap are called resampling methods; they are very computer-intensive and require special software. Kline and colleagues (2002) used a bootstrap method to develop confidence intervals for odds ratios in their study of the use of the D-dimer test in the emergency department.
It is possible to estimate the magnitude of R or R^{2} in another sample without actually performing the cross-validation. This R^{2} is smaller than the R^{2} for the original sample, because the mathematical formula used to obtain the estimate removes the chance variation. For this reason, the formula is called a formula for shrinkage. Many computer programs, including NCSS, SPSS, and SAS, provide both R^{2} for the sample used in the analysis as well as R^{2} adjusted for shrinkage, often referred to as the adjusted R^{2}. Refer to Table 10-4, where NCSS gives the "Adj R2" in the fifth row of the first column of the computer analysis.
Sample Size Requirements
The only easy way to determine how large a sample is needed in multiple regression or any multivariate technique is to use a computer program. Some rules of thumb, however, may be used for guidance. A common recommendation by statisticians calls for ten times as many subjects as the number of independent variables. For example, this rule of thumb prescribes a minimum of 60 subjects for a study predicting the outcome from six independent variables. Having a large ratio of subjects to variables decreases the problems that may arise because assumptions are not met.
Assumptions about normality in multiple regression are complicated, depending on whether the independent variables are viewed as fixed or random (as in the fixed-effects and random-effects models in ANOVA), and they are beyond the scope of this text. To ensure that estimates of regression coefficients and multiple R and R^{2} are accurate representations of actual population values, we suggest that investigators never perform regression without at least five times as many subjects as variables.
A more accurate estimate is found by using a computer power program. We used the PASS power program to find the power of a study using five predictor variables, as in the Jackson study (Table 10-5). We posed the question: How many subjects are needed to test whether a given variable increases R^{2} by 0.05, given that four variables are already in the regression equation and they collectively provide an R^{2} of 0.50? The output from the program is shown in Box 10-1. The power table indicates that a sample of 80 gives power of 0.84, assuming an α or P value of 0.05. The accompanying graph shows the power curve for different sample sizes and different values of α. As you can see, the sample of 359 females and 296 males in the study by Jackson and colleagues was more than adequate for the regression model.
CONTROLLING FOR CONFOUNDING
Analysis of Covariance
Analysis of covariance (ANCOVA) is the statistical technique used to control for the influence of a confounding variable. Confounding variables occur most often when subjects cannot be assigned at random to different groups, that is, when the groups of interest already exist. Gonzalo and colleagues (1996) (Chapters 7 and 8) predicted insulin sensitivity from body mass index (BMI); they wanted to control for age of the women and did so by adding age to the regression equation. When BMI alone is used to predict insulin sensitivity (IS) in hyperthyroid women, the regression equation is
where IS is the insulin sensitivity level. Using this equation, a hyperthyroid woman's insulin sensitivity level is predicted to decrease by 0.077 for each increase of 1 in BMI. For instance, a woman with a BMI of 25 has a predicted insulin sensitivity of 0.411. What would happen, however, if age were also related to insulin sensitivity? A way to control for the possible confounding effect of age is to include that variable in the regression equation. The equation with age included is
Using this equation, a hyperthyroid woman's insulin sensitivity level is predicted to decrease by 0.068 for each increase of 1 in BMI, holding age constant, that is, independent of age. A 30-year-old woman with a BMI of 25 has a predicted insulin sensitivity of 0.456, whereas a 60-year-old woman with the same BMI of 25 has a predicted insulin sensitivity of 0.321.
A more traditional use of ANCOVA is illustrated by a study of the negative influence of smoking on the cardiovascular system. Investigators wanted to know whether smokers have more ventricular wall motion abnormalities than nonsmokers (Hartz et al, 1984). They might use a t test to determine whether the mean number of wall motion abnormalities differs in these two groups. The investigators know, however, that wall motion abnormalities are also related to the degree of coronary stenosis, and smokers generally have a greater degree of coronary stenosis. Thus, any difference observed in the mean number of wall abnormalities between smokers and nonsmokers may really be a difference in the amount of coronary stenosis between these two groups of patients.
This situation is illustrated in the graph of hypothetical data in Figure 10-3; in the figure, the relationship between occlusion scores and wall motion abnormalities appears to be the same for smokers and nonsmokers. Nonsmokers, however, have both lower occlusion scores and lower numbers of wall motion abnormalities; smokers have higher occlusion scores and higher numbers of wall motion abnormalities. The question is whether the difference in wall motion abnormalities is due to smoking, to occlusion, or to both.
In this study, the investigators must control for the degree of coronary stenosis so that it does not confound (or confuse) the relationship between smoking and wall motion abnormalities. Useful methods to control for confounding variables are analysis of covariance (ANCOVA) and the Mantel–Haenszel chi-square procedure. Table 10-2 indicates that ANCOVA is appropriate when the dependent variable is numerical (eg, wall motion), the independent measures are grouping variables on a nominal scale (eg, smoking versus nonsmoking), and confounding variables occur (eg, degree of coronary occlusion). If the dependent measure is also nominal, such as whether a patient has survived to a given time, the Mantel–Haenszel chi-square discussed in Chapter 9 can be used to control for the effect of a confounding (nuisance) variable. ANCOVA can be performed by using the methods of ANOVA; however, most medical studies use one of the regression methods discussed in this chapter.
Box 10-1. LINEAR AND QUADRATIC CURVES FOR THE RELATIONSHIP BETWEEN BMI AND PERCENT BODY FAT IN FEMALES.


Source: Data, used with permission, from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC: The effect of sex, age and race on estimating percentage body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002;26:789–796. Output produced with PASS; used with permission.
If ANCOVA is used in this example, the occlusion score is called the covariate, and the mean number of wall motion abnormalities in smokers and nonsmokers is said to be adjusted for the occlusion score (or degree of coronary stenosis). Put another way, ANCOVA simulates the Y outcome observed if the value of X is held constant, that is, if all the patients had the same degree of coronary stenosis. This adjustment is achieved by calculating a regression equation to predict mean number of wall motion abnormalities from the covariate, degree of coronary stenosis, and from a dummy variable coded 1 if the subject is a member of the group (ie, a smoker) and 0 otherwise. For example, the regression equation determined for the hypothetical observations in Figure 10-3 is
The equation illustrates that smokers have a larger number of predicted wall motion abnormalities, because 1.28 is added to the equation if the subject is a smoker. The equation can be used to obtain the mean number of wall motion abnormalities in each group, adjusted for degree of coronary stenosis.
Figure 10-3. Relationship between degree of coronary stenosis and ventricular wall motion abnormalities in smokers and nonsmokers (hypothetical data).
If the relationship between coronary stenosis and ventricular motion is ignored, the mean number of wall motion abnormalities, calculated from the observations in Figure 10-3, is 3.33 for smokers and 1.00 for nonsmokers. If, however, ANCOVA is used to control for degree of coronary stenosis, the adjusted mean wall motion is 2.81 for smokers and 1.53 for nonsmokers, a difference of 1.28, represented by the regression coefficient for the dummy variable for smoking. In ANCOVA, the adjusted Y mean for a given group is obtained by (1) finding the difference between the group's mean on the covariate variable X, denoted X̄_j, and the grand mean X̄; (2) multiplying the difference by the regression coefficient b; and (3) subtracting this product from the unadjusted mean. Thus, for group j, the adjusted mean is Adj. Ȳ_j = Ȳ_j – b(X̄_j – X̄).
(See Exercise 1.)
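The three-step adjustment can be written directly. In the sketch below, the slope and covariate means are hypothetical values chosen so that the result reproduces the adjusted smoker mean quoted in the text; they are not the actual Hartz figures:

```python
def adjusted_mean(group_mean_y, group_mean_x, grand_mean_x, slope):
    """ANCOVA-adjusted group mean: subtract from the unadjusted mean
    the regression slope times the group's departure from the grand
    mean on the covariate, i.e. Adj. Y-bar = Y-bar - b(X-bar_j - X-bar)."""
    return group_mean_y - slope * (group_mean_x - grand_mean_x)

# Hypothetical: smokers average 3.33 abnormalities, with a mean
# occlusion score 4 points above the grand mean; slope b = 0.13
print(round(adjusted_mean(3.33, 9.0, 5.0, 0.13), 2))  # → 2.81
```

A group above the grand mean on the covariate is adjusted downward, and a group below it is adjusted upward, which is exactly the pattern in the smoker/nonsmoker example.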
This result is consistent with our knowledge that coronary stenosis alone has some effect on abnormality of wall motion; the unadjusted means contain this effect as well as any effect from smoking. Controlling for the effect of coronary stenosis therefore results in a smaller difference in number of wall motion abnormalities, a difference related only to smoking.
Using hypothetical data, Figure 10-4 illustrates schematically the way ANCOVA adjusts the mean of the dependent variable if the covariate is important. Using unadjusted means is analogous to using a separate regression line for each group. For example, the mean value of Y for group 1 is found by using the regression line drawn through the group 1 observations to project the mean value X̄_1 onto the Y-axis, denoted Ȳ_1 in Figure 10-4. Similarly, the mean of group 2 is found at Ȳ_2 by using the regression line to project the mean X̄_2 in that group. The Y means in each group adjusted for the covariate (stenosis) are analogous to the projections based on the overall mean value of the covariate; that is, as though the two groups had the same mean value for the covariate. The adjusted means for groups 1 and 2, Adj. Ȳ_1 and Adj. Ȳ_2, are illustrated by the dotted-line projections of X̄ from each separate regression line in Figure 10-4.
Figure 10-4. Illustration of means adjusted using analysis of covariance.
ANCOVA assumes that the relationship between the covariate (X variable) and the dependent variable (Y) is the same in both groups, that is, that any relationship between coronary stenosis and wall motion abnormality is the same for smokers and nonsmokers. This assumption is equivalent to requiring that the regression slopes be the same in both groups; geometrically, ANCOVA asks whether a difference exists between the intercepts, assuming the slopes are equal.
ANCOVA is an appropriate statistical method in many situations that occur in medical research. For example, age is a variable that affects almost everything studied in medicine; if preexisting groups in a study have different age distributions, investigators must adjust for age before comparing the groups on other variables, just as Gonzalo and colleagues recognized. The methods illustrated in Chapter 3 to adjust mortality rates for characteristics such as age and birth weight are used when information is available on groups of individuals; when information is available on individuals themselves, ANCOVA is used.
Before leaving this section, we point out some important aspects of ANCOVA. First, although only two groups were included in the example, ANCOVA can be used to adjust for the effect of a confounding variable in more than two groups. In addition, it is possible to adjust for more than one confounding variable in the same study, and the confounding variables may be either nominal or numerical. Thus, it is easy to see why the multiple regression model for analysis of covariance provides an ideal method to incorporate confounding variables.
Finally, ANCOVA can be considered as a special case of the more general question of comparing two regression lines (discussed in Chapter 8). In ANCOVA, we assume that the slopes are equal, and attention is focused on the intercept. We can also perform the more global test of both slope and intercept, however, by using multiple regression. In Presenting Problem 4 in Chapter 8 on insulin sensitivity (Gonzalo et al, 1996), interest focused on comparing the regression lines predicting insulin activity from body mass index (BMI) in women who had normal versus elevated thyroid levels. ANCOVA can be used for this comparison using dummy coding. If we let X be BMI, Y be insulin sensitivity level, and Z be a dummy variable, where Z = 1 if the woman is hyperthyroid and Z = 0 for controls, then the multiple-regression model for testing whether the two regression lines are the same (coincident) is

Y = a + b_1 X + b_2 Z + b_3 (X × Z)
The regression lines have equal slopes and are parallel when b_{3} is 0, that is, when there is no interaction between the independent variable X and the group membership variable Z. The regression lines have equal intercepts and equal slopes (are coincident) if both b_{2} and b_{3} are 0; thus, the model becomes the simple regression equation Y = a + bX. The statistical test for b_{2} and b_{3} is the t test discussed in the section titled "Statistical Tests for the Regression Coefficient."
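A least-squares sketch of this dummy-coding model follows, using simulated data in which the two groups truly have parallel slopes; all coefficients are illustrative, not those of Gonzalo and colleagues:

```python
import numpy as np

# Fit Y = a + b1*X + b2*Z + b3*(X*Z) by ordinary least squares, where
# Z is a 0/1 dummy for group membership. b3 tests equality of slopes;
# b2 (with b3 = 0) tests equality of intercepts.
rng = np.random.default_rng(1)
n = 50
x = np.concatenate([rng.uniform(20, 35, n), rng.uniform(20, 35, n)])  # BMI-like
z = np.concatenate([np.zeros(n), np.ones(n)])   # 0 = control, 1 = hyperthyroid
# True model: parallel lines (interaction coefficient is exactly 0)
y = 2.0 - 0.05 * x - 0.4 * z + rng.normal(scale=0.1, size=2 * n)

design = np.column_stack([np.ones(2 * n), x, z, x * z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b1, b2, b3 = coef
print(round(b2, 2), round(b3, 3))  # b3 should be near 0: slopes parallel
```

In practice the t tests on b2 and b3 (available from any regression package) decide whether the lines are parallel, coincident, or neither.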
Generalized Estimating Equations (GEE)
Many research designs, including both observational studies and clinical trials, concern observations that are clustered or hierarchical. A group of methods has been developed for these special situations. To illustrate, a study to examine the effect of different factors on complication rates following total knee arthroplasty was undertaken in a province of Canada (Kreder et al, 2003). Outcomes included length of hospital stay, inpatient complications, and mortality. Can the researchers examine the outcomes for patients and conclude that any differences are due to the risk factors? The statistical methods we have examined thus far assume that one observation is independent of another. The problem with this study design, however, is that the outcome for patients operated on by the same surgeon may be related to factors other than the surgical method, such as the skill level of the surgeon. In this situation, patients are said to be nested within physicians.
Many other examples come to mind. Comparing the efficacy of medical education curricula is difficult because students are nested within medical schools. Comparing health outcomes for children within a community is complicated by the fact that children are nested within families. Many clinical trials create nested situations, such as when trials are carried out in several medical centers. The issue arises of how to define the unit of analysis—should it be the students or the school? the children or the families? the patients or the medical center?
The group of methods that accommodates these types of research questions includes generalized estimating equations (GEE), multilevel modeling, and the analysis of hierarchically structured data. Most of these methods have been developed within the last decade, and statistical software is just now becoming widely available. In addition to some specialized statistical packages, SAS, Stata, and SPSS contain procedures to accommodate hierarchical data. Using these models is more complex than some of the other methods we have discussed, and it is relatively easy to develop a model that is meaningless or misleading. Investigators who have research designs that involve nested subjects should consult a biostatistician for assistance.
PREDICTING NOMINAL OR CATEGORICAL OUTCOMES
In the regression model discussed in the previous section, the outcome or dependent Y variable is measured on a numerical scale. When the outcome is measured on a nominal scale, other approaches must be used. Table 10-2 indicates that several methods can be used to analyze problems with several independent variables when the dependent variable is nominal. First we discuss logistic regression, a method that is frequently used in the health field. One reason for the popularity of logistic regression is that many outcomes in health are nominal, actually binary, variables—they either occur or do not occur. The second reason is that the regression coefficients obtained in logistic regression can be transformed into odds ratios. So, in essence, logistic regression provides a way to obtain an odds ratio for a given risk factor that controls for, or is adjusted for, confounding variables; in other words, we can do analysis of covariance with logistic regression as well as with multiple linear regression.
Other methods are loglinear analysis and several methods that attempt to classify subjects into groups. These methods appear occasionally in the medical literature, and we provide a brief illustration, primarily so that readers can have an intuitive understanding of their purpose. The classification methods are discussed in the section titled “Methods for Classification.”
Logistic Regression
Logistic regression is commonly used when the independent variables include both numerical and nominal measures and the outcome variable is binary (dichotomous). Logistic regression can also be used when the outcome has more than two values (Hosmer and Lemeshow, 2000), but its most frequent use is with binary outcomes, as in Presenting Problem 2, which illustrates the use of logistic regression to identify trauma patients who are alcohol-positive, a yes-or-no outcome. Soderstrom and his coinvestigators (1997) wanted to develop a model to help emergency department staff identify the patients most likely to have blood alcohol concentrations (BAC) in excess of 50 mg/dL at the time of admission. The logistic model gives the probability that the outcome, such as high BAC, occurs as an exponential function of the independent variables. For example, with three independent variables, the model is

P(Y) = 1/[1 + exp(–(b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3))]
where b_{0} is the intercept; b_{1}, b_{2}, and b_{3} are the regression coefficients; and exp indicates that the base of the natural logarithm (2.718) is taken to the power shown in parentheses (ie, the antilog). The equation can be derived by specifying the variables to be included in the equation or by using a variable selection method similar to the ones for multiple regression. A chi-square test (instead of the t or F test) is used to determine whether a variable adds significantly to the prediction.
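The functional form can be sketched directly; the intercept and coefficients below are illustrative values, not those estimated by Soderstrom and coworkers:

```python
import math

def logistic_probability(intercept, coefs, xs):
    """Probability from the logistic model:
    P = 1 / (1 + exp(-(b0 + b1*X1 + ... + bk*Xk)))."""
    linear = intercept + sum(b * x for b, x in zip(coefs, xs))
    return 1.0 / (1.0 + math.exp(-linear))

# Three hypothetical 0/1 predictors (eg, night arrival, weekend, race)
p = logistic_probability(-2.0, [1.5, 0.8, -0.6], [1, 0, 1])
print(round(p, 3))  # → 0.25
```

Because the linear combination sits inside the antilog, the predicted probability always stays between 0 and 1, which is why the logistic form suits binary outcomes.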
In the study described in Presenting Problem 2, the variables used by the investigators to predict blood alcohol concentrations included the variables listed in Table 10-7. The investigators coded the values of the independent variables as 0 and 1, a method useful both for dummy variables in multiple regression and for variables in logistic regression. This practice makes it easy to interpret the odds ratio. In addition, if a goal is to develop a score, as is the case in the study by Soderstrom and coinvestigators, the coefficient associated with a given variable needs to be included in the score only if the patient has a 1 on that variable. For instance, if patients are more likely to have BAC ≥ 50 mg/dL on weekends, the score associated with day of week is not included if the injury occurs on a weekday.
Table 10-7. Variables, codes, and frequencies for variables.^{a}


The investigators calculated logistic regression equations for each of four groups: males with intentional injury, males with unintentional injury, females with intentional injury, and females with unintentional injury. The results of the analysis on males who were injured unintentionally are given in Table 10-8.
We need to know which value is coded 1 and which 0 in order to interpret the results. For example, time of day has a negative regression coefficient. The hours of 6 AM to 6 PM are coded as 1, so a male coming to the emergency department with unintentional injuries in the daytime is less likely to have BAC ≥ 50 mg/dL than a male with unintentional injuries at night. The age variable is not significant (P > 0.268). Interpreting the equation for the other variables indicates that males with unintentional injuries who come to the emergency department at night and on weekends and are Caucasian are more likely to have elevated blood alcohol levels.
Table 10-8. Logistic regression report for men with unintentional injury.^{a}


The logistic equation can be used to find the probability for any given individual. For instance, let us find the probability that a 27-year-old Caucasian man who comes to the emergency department at 2 PM on Thursday has BAC ≥ 50 mg/dL. The regression coefficients from Table 10-8 are
and we evaluate it as follows:
Substituting 2.36 in the equation for the probability:
Therefore, the chance that this man has a high BAC is less than 1 in 10. See Exercise 3 to determine the likelihood of a high BAC if the same man came to the emergency department on a Saturday night.
One advantage of logistic regression is that it requires no assumptions about the distribution of the independent variables. Another is that the regression coefficient can be interpreted in terms of relative risks in cohort studies or odds ratios in case–control studies. In other words, the relative risk of an elevated BAC in males with unintentional trauma during the day is exp(–1.845) = 0.158. The relative risk for night is the reciprocal, 1/0.158 = 6.33; therefore, males with unintentional injuries who come to the ER at night are more than six times as likely to have BAC ≥ 50 mg/dL as males coming during the day.
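The conversion from coefficient to odds ratio, and to its reciprocal, is a one-line antilog, shown here for the time-of-day coefficient of –1.845 quoted above:

```python
import math

# Converting a logistic regression coefficient to an odds ratio:
# the coefficient for daytime presentation is -1.845.
coef = -1.845
or_day = math.exp(coef)      # odds ratio, daytime vs night
or_night = 1.0 / or_day      # reciprocal: night vs daytime
print(round(or_day, 3), round(or_night, 2))  # → 0.158 6.33
```

Reversing the coding of a 0/1 variable simply flips the sign of the coefficient, so the odds ratio becomes its reciprocal; neither direction is more "correct."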
How can readers easily tell which odds ratios are statistically significant? Recall from Chapter 8 that if the 95% confidence interval does not include 1, we can be 95% sure that the factor associated with the odds ratio either is a significant risk or provides a significant level of protection. Do any of the independent variables in Table 108 have a 95% confidence interval for the odds ratio that contains 1? Did you already know without looking that it would be age because the age variable is not statistically significant?
The overall results from a logistic regression may be tested with Hosmer and Lemeshow's goodness of fit test. The test is based on the chisquare distribution. A P value ≥ 0.05 means that the model's estimates fit the data at an acceptable level.
There is no straightforward statistic to judge the overall logistic model as R^{2} is used in multiple regression. Some statistical programs give R^{2}, but it cannot be interpreted as in multiple regression because the predicted and observed outcomes are nominal. Several other statistics are available as well, including Cox and Snell's R^{2} and a modification called Nagelkerke's R^{2}, which is generally larger than Cox and Snell's R^{2}.
Before leaving the topic of logistic regression, it is worthwhile to inspect the classification table in Table 10-8. This table gives the actual and the predicted number of males with unintentional injuries who had normal versus elevated BAC. The logistic equation tends to underpredict those with elevated concentrations: 470 males are predicted versus the 682 who actually had BAC ≥ 50 mg/dL. Overall, the prediction using the logistic equation correctly classified 76.62% of these males. Although this sounds rather impressive, it is important to compare this percentage with the baseline: 74.12% of the time we would be correct if we simply predicted a male to have normal BAC. Can you recall an appropriate way to compensate for or take the baseline into consideration? Although computer programs typically do not provide the kappa statistic, discussed in Chapter 5, it provides a way to evaluate the percentage correctly classified (see Exercise 4). Other measures of association are used so rarely in medicine that we did not discuss them in Chapter 8. SPSS provides two nonparametric correlations, the lambda correlation and the tau correlation, that can be interpreted as measures of strength of the relationship between observed and predicted outcomes.
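The baseline compensation the text asks about can be made with the kappa statistic. A minimal sketch follows; the cell counts are hypothetical, since the full observed-by-predicted table is not reproduced here:

```python
def kappa(table):
    """Cohen's kappa for a 2x2 classification table,
    table = [[TN, FP], [FN, TP]] of actual (rows) vs predicted
    (columns). Kappa discounts the percentage correctly classified
    by the agreement expected from the marginal totals alone."""
    (tn, fp), (fn, tp) = table
    n = tn + fp + fn + tp
    p_obs = (tn + tp) / n
    # chance agreement from row (actual) and column (predicted) totals
    p_exp = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical counts, not the actual Soderstrom classification table
print(round(kappa([[1750, 200], [420, 265]]), 3))  # → 0.317
```

A kappa well below the raw percentage correct, as here, shows how much of the apparent accuracy is simply the baseline rate of normal BAC.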
LogLinear Analysis
Psoriasis, a chronic, inflammatory skin disorder characterized by scaling erythematous patches and plaques of skin, has a strong genetic influence—about one third of patients have a positive family history. Stuart and colleagues (2002) conducted a study to determine differences in clinical manifestation between patients with positive and negative family histories of psoriasis and with early-onset versus late-onset disease. This study was used in Exercise 7 in Chapter 7.
The hypothesis was that the variables age at onset (in 10year categories), onset (early or late), and familial status (sporadic or familial) had no effect on the occurrence of joint complaints. Results from the analysis of age, familial status, and frequency of joint complaints are given in Table 109.
Each independent variable in this research problem is measured on a categorical or nominal scale (age, onset, and familial status), as is the outcome variable (occurrence of joint complaints). If only two variables are being analyzed, the chi-square method introduced in Chapter 6 can be used to determine whether a relationship exists between them; with three or more nominal or categorical variables, a statistical method called log-linear analysis is appropriate. Log-linear analysis is analogous to a regression model in which all the variables, both independent and dependent, are measured on a nominal scale. The technique is called log-linear because it involves using the logarithm of the observed frequencies in the contingency table.
Table 10-9. Frequency of joint complaints by familial status, stratified by age at examination.


Stuart and colleagues (2002) concluded that joint complaints and familial psoriasis were conditionally independent given age at examination, but that age at examination was not independent of either joint complaints or a family history.
Log-linear analysis may also be used to analyze multidimensional contingency tables in situations in which no distinction exists between independent and dependent variables, that is, when investigators simply want to examine the relationship among a set of nominal measures. The fact that log-linear analysis does not require a distinction between independent and dependent variables points to a major difference between it and other regression models—namely, that the regression coefficients are not interpreted in log-linear analysis.
PREDICTING A CENSORED OUTCOME: COX PROPORTIONAL HAZARD MODEL
In Chapter 9, we found that special methods must be used when an outcome has not yet been observed for all subjects in the study sample. Studies of time-limited outcomes in which there are censored observations, such as survival, naturally fall into this category; investigators usually cannot wait until all patients in the study experience the event before presenting information.
Many times in clinical trials or cohort studies, investigators wish to look at the simultaneous effect of several variables on length of survival. For example, in the study described in Presenting Problem 3, Crook and her colleagues (1997) wanted to evaluate the relationship of pretreatment prostate-specific antigen (PSA) and posttreatment nadir PSA on the failure pattern of radiotherapy for treating localized prostate carcinoma. They categorized failures as biochemical, local, and distant. They analyzed data from a cohort study of 207 patients, but only 68 had a failure due to any cause in the 70 months during which the study was underway; the observations on the remaining patients were therefore censored. The independent variables they examined included the Gleason score, the T classification, whether the patient had received hormonal treatment, the PSA before treatment, and the lowest PSA following treatment.
Table 10-2 indicates that the regression technique developed by Cox (1972) is appropriate when time-dependent censored observations are included. This technique is called the Cox regression, or proportional hazard, model. In essence this model allows the covariates (independent variables) in the regression equation to vary with time. The dependent variable is the survival time of the jth patient, denoted Y_{j}. Both numerical and nominal independent variables may be used in the model.
The Cox regression coefficients can be used to determine the relative risk or odds ratio (introduced in Chapter 3) associated with each independent variable and the outcome variable, adjusted for the effect of all other variables in the equation. Thus, instead of giving adjusted means, as ANCOVA does in regression, the Cox model gives adjusted relative risks. We can also use a variety of methods to select the independent variables that add significantly to the prediction of the outcome, as in multiple regression; however, a chisquare test (instead of the F test) is used to test for significance.
The Cox proportional hazard model involves a complicated exponential equation (Cox, 1972). Although we will not go into detail about the mathematics involved in this model, its use is so common in medicine that an understanding of the process is needed by readers of the literature. Our primary focus is on the application and interpretation of the Cox model.
Understanding the Cox Model
Recall from Chapter 9 that the survival function gives the probability that a person will survive the next interval of time, given that he or she has survived up until that time. The hazard function, also defined in Chapter 9, is in some ways the opposite: it is the probability that a person will die (or that there will be a failure) in the next interval of time, given that he or she has survived until the beginning of the interval. The hazard function plays a key role in the Cox model.
The Cox model examines two pieces of information: the amount of time since the event first happened to a person and the person's observations on the independent variables. Using the Crook example, the amount of time might be 3 years, and the observations would be the patient's Gleason score, T classification, whether he had been treated with hormones, and the two PSA scores (pretreatment and lowest posttreatment). In the Cox model, the length of time is evaluated using the hazard function, and the linear combination of the independent variables (like the linear combination we obtain when we use multiple regression) is the exponent of e, the base of the natural logarithms. For example, for the Crook study, the model is written as

h(t) = h_0(t) × exp(b_1 × Gleason score + b_2 × T classification + b_3 × hormone treatment + b_4 × pretreatment PSA + b_5 × nadir PSA)

where h_0(t) is the baseline hazard function.
In words, the model is saying that the probability of dying in the next time interval, given that the patient has lived until this time and has the given values for Gleason score, T classification, and so on, can be found by multiplying the baseline hazard (h_0) by e, the base of the natural logarithms, raised to the power of the linear combination of the independent variables. In other words, a given person's probability of dying is influenced both by how commonly patients die overall and by the person's individual characteristics. Because we take the antilog of the linear combination, the effects of the covariates are multiplied rather than added. In this model, the covariates have a multiplicative, or proportional, effect on the probability of dying; hence the term "proportional hazard" model.
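This multiplicative structure can be sketched in a few lines of Python. The baseline hazard and coefficients below are purely hypothetical illustrative values, not estimates from the Crook study.

```python
import math

def cox_hazard(h0, coefs, x):
    """Hazard for one patient: baseline hazard times e raised to the
    linear combination of the covariates."""
    return h0 * math.exp(sum(b * xi for b, xi in zip(coefs, x)))

# Hypothetical baseline hazard and coefficients for two binary covariates,
# e.g., recoded Gleason score and hormone therapy (illustrative values only)
h0 = 0.01
coefs = [0.5, 1.5]

patient_a = [1, 0]   # high Gleason score, no hormone therapy
patient_b = [1, 1]   # high Gleason score, hormone therapy
ratio = cox_hazard(h0, coefs, patient_b) / cox_hazard(h0, coefs, patient_a)
# The ratio depends only on the covariate that differs: exp(1.5)
```

Because the baseline hazard cancels in the ratio, each coefficient has a proportional effect on the hazard, which is why the hazard ratio exp(b) for a covariate does not depend on time.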
An Example of the Cox Model
In the study described in Presenting Problem 3, Crook and her colleagues (1997) used the Cox proportional hazard model to examine the relationship between pretreatment PSA and posttreatment PSA nadir and treatment failure in men with prostate carcinoma following treatment with radiotherapy. Failure was categorized as biochemical, local, or distant. The investigators wanted to control for possible confounding variables, including the Gleason score and the T classification, both measures of severity, and whether the patient received hormones prior to the radiotherapy. The outcome is a censored variable, the amount of time before the treatment fails, so the Cox proportional hazard model is the appropriate statistical method. We use the results of analysis using SPSS, given in Table 10-10, to point out some salient features of the method.
Both numerical and nominal variables can be used as independent variables in the Cox model. If the variables are nominal, it is necessary to tell the computer program so they can be properly analyzed. SPSS prints this information. PRERTHOR, pretreatment hormone therapy, is recoded so that 0 = no and 1 = yes. Prior to doing the analysis, we recoded the Gleason score into a variable called GSCORE with two values: 0 for Gleason scores 2–6 and 1 for Gleason scores 7–10. The T classification variable, TUMSTAGE, was recoded by the computer program using dummy variable coding. Note that for four values of TUMSTAGE, only three dummy variables are needed, with the three more advanced stages compared with the lowest stage, T1b2.
Among the 207 men in the study, 68 had experienced a failure by the time the data were analyzed. The authors reported a median follow-up of 36 months with a range of 12 to 70 months. The log likelihood statistic (LL) is used to evaluate the significance of the overall model; smaller values indicate that the data fit the model better. The change in the log likelihood between the initial model, in which no independent variables are included in the equation, and the model after the variables are entered is calculated. In this example, the change is 72.706 (highlighted in Table 10-10), and it is the basis of the chi-square statistic used to determine the significance of the model. The significance is reported, as often occurs with computer programs, as 0.0000.
In addition to testing the overall model, it is possible to test each independent variable to see if it adds significantly to the prediction of failure. Were any of the potentially confounding variables significant? The significance of TUMSTAGE requires some explanation. The variable itself is significant, with P = 0.0025 (shaded in the table). The TUMSTAGE(3) variable (which indicates the patient has T3–4 stage tumor), however, is the one that really matters because it is the only significant stage (P = 0.0066). Note that Gleason score and hormone therapy were not significant. Was either of the PSA values important in predicting failure? It appears that the pretreatment PSA is not significant, but the lowest PSA (NADIRPSA) reached following treatment has a very low P value.
As in logistic regression, the regression coefficients in the Cox model can be interpreted in terms of relative risks or odds ratios (by finding the antilog) if they are based on independent binary variables, such as hormone therapy. For this reason, many researchers divide independent variables into two categories, as we did with Gleason score, even though this practice can be risky if the correct cutpoint is not selected. The T classification variable was recoded as three dummy variables to facilitate interpretation in terms of odds ratios for each stage. The odds ratios are listed under the column titled "Exp (B)" in Table 10-10. Using the T3–4 stage (TUMSTAGE(3)) as an illustration, the antilog of the regression coefficient, 1.5075, is exp (1.5075) = 4.5156. Note that the 95% confidence interval goes from approximately 1.52 to 13.40; because this interval does not contain 1, the odds ratio is statistically significant (consistent with the P value).
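The conversion from a Cox coefficient to an odds (hazard) ratio and its confidence interval is simple enough to sketch. The standard error below is back-calculated from the confidence interval reported above, since the table's standard-error column is not reproduced here, so treat it as approximate.

```python
import math

def hazard_ratio_ci(b, se, z=1.96):
    """Antilog of a Cox regression coefficient and its 95% confidence interval."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# TUMSTAGE(3) coefficient; SE approximated from the reported 1.52-13.40 interval
b, se = 1.5075, 0.555
hr, lo, hi = hazard_ratio_ci(b, se)
# hr is about 4.52; the interval (about 1.52 to 13.40) excludes 1,
# so the effect is statistically significant
```

The same back-transformation, exp(b) with exp(b ± 1.96 × SE) for the interval, applies to any binary covariate in the model.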
Table 10-10. Results from the Cox proportional hazard model using both pretreatment and posttreatment variables.


Crook and colleagues (1997) also computed the Cox model using only the variables known prior to treatment (see Exercise 8).
Importance of the Cox Model
The Cox model is very useful in medicine, and it is easy to see why it is being used with increasing frequency. It provides the only valid method of predicting a time-dependent outcome, and many health-related outcomes are related to time. If the independent variables are divided into two categories (dichotomized), the exponential of the regression coefficient, exp (b), is the odds ratio, a useful way to interpret the risk associated with any specific factor. In addition, the Cox model provides a method for producing survival curves that are adjusted for confounding variables. The Cox model can be extended to the case of multiple events for a subject, but that topic is beyond our scope. Investigators who have repeated measures in a time-to-event study are encouraged to consult a statistician.
META-ANALYSIS
Meta-analysis is a way to combine the results of several independent studies on a specific topic. Meta-analysis differs from the methods discussed in the preceding sections in that its purpose is not to identify risk factors or to predict outcomes for individual patients; the technique is applicable to any research question. We briefly introduced meta-analysis in Chapter 2. Because we could not discuss it in detail until the basics of statistical tests (confidence limits, P values, etc) had been explained, we include it in this chapter. It is an important technique, increasingly used for studies in health, and it can be looked on as an extension of multivariate analysis.
The idea of summarizing a set of studies in the medical literature is not new; review articles have long had an important role in helping practicing physicians keep up to date and make sense of the many studies on any given topic. Meta-analysis takes the review article a step further by using statistical procedures to combine the results from different studies. Glass (1977) developed the technique because many research projects are designed to answer similar questions, but they do not always come to similar conclusions. The problem for the practitioner is to determine which study to believe, a problem unfortunately too familiar to readers of medical research reports.
Sacks and colleagues (1987) reviewed meta-analyses of clinical trials and concluded that meta-analysis has four purposes: (1) to increase statistical power by increasing the sample size, (2) to resolve uncertainty when reports do not agree, (3) to improve estimates of effect size, and (4) to answer questions not posed at the beginning of the study. Purpose 3 requires some expansion because the concept of effect size is central to meta-analysis. Cohen (1988) developed this concept and defined effect size as the degree to which the phenomenon is present in the population. An effect size may be thought of as an index of how much difference exists between two groups—generally, a treatment group and a control group. The effect size is based on means if the outcome is numerical, on proportions or odds ratios if the outcome is nominal, or on correlations if the outcome is an association. The effect sizes themselves are statistically combined in meta-analysis.
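For a numerical outcome, the usual effect size is Cohen's d, the difference in group means divided by the pooled standard deviation. A minimal sketch with invented group summaries:

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference: (m1 - m2) / pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical treatment and control groups with equal SDs of 4
d = cohens_d(12.0, 4.0, 50, 10.0, 4.0, 50)
# A 2-point difference against an SD of 4 gives d = 0.5, a "medium" effect
# by Cohen's conventional benchmarks
```

Because d is unit-free, effect sizes computed this way can be compared and combined across studies that measured the outcome on different scales.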
Veenstra and colleagues (1999) used meta-analysis to evaluate the efficacy of impregnating central venous catheters with an antiseptic. They examined the literature, using manual and computerized searches, for publications containing the words chlorhexidine, antiseptic, and catheter and found 215 studies. Of these, 24 were comparative studies in humans. Nine studies were eliminated because they were not randomized, and another two were excluded based on the criteria for defining catheter colonization and catheter-related bloodstream infection. Ten studies examined both outcomes, two examined only catheter colonization, and one reported only catheter-related bloodstream infection.
Two authors independently read and evaluated each article. They reviewed the sample size, patient population, type of catheter, catheterization site, other interventions, duration of catheterization, reports of adverse events, and several other variables describing the incidence of colonization and catheter-related bloodstream infection. The authors also evaluated the appropriateness of randomization, the extent of blinding, and the description of eligible subjects. Discrepancies between the reviewers were resolved by a third author. Some basic information about the studies evaluated in this meta-analysis is given in Table 10-11.
The authors of the meta-analysis article calculated the odds ratios and 95% confidence intervals for each study and used a statistical method to determine summary odds ratios over all the studies. These odds ratios and intervals for the outcome of catheter-related bloodstream infection are illustrated in Figure 10-5. This figure illustrates the typical way findings from meta-analysis studies are presented. Generally the results from each study are shown, and the summary or combined results are given at the bottom of the figure. When the summary statistic is the odds ratio, a line representing the value of 1 is drawn to make it easy to see which of the studies have a significant outcome.
From the data in Table 10-11 and Figure 10-5, it appears that only one study (of the 11) reported a statistically significant outcome because only one has a confidence interval that does not contain 1. The entire confidence interval in Maki and associates' study (1997) is less than 1, indicating that these investigators found a protective effect when using the treated catheters. Of interest is the summary odds ratio, which illustrates that by pooling the results from 11 studies, treating the catheters appears to be beneficial. Several of the studies had relatively small sample sizes, however, and the failure to find a significant difference may be due to low power. Using meta-analysis to combine the results from these studies can provide insight on this issue.
A meta-analysis does not simply add the means or proportions across studies to determine an "average" mean or proportion. Although several different methods can be used to combine results, they all use the same principle of determining an effect size in each study and then combining the effect sizes in some manner. The methods for combining the effect sizes include the z approximation for comparing two proportions (Chapter 6); the t test for comparing two means (Chapter 6); the P values for the comparisons; and the odds ratio, as shown in Veenstra and colleagues' study (1999). The values corresponding to the effect size in each study are the numbers combined in the meta-analysis to provide a pooled (overall) P value or confidence interval for the combined studies. The most commonly used method for reporting meta-analyses in the medical literature is the odds ratio with confidence intervals.
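One common way to combine odds ratios is the fixed-effect (inverse-variance) method: the log odds ratios are averaged, each weighted by the inverse of its variance, so larger and more precise studies count for more. A sketch with hypothetical study results (not the Veenstra data):

```python
import math

def pool_odds_ratios(odds_ratios, std_errors):
    """Fixed-effect (inverse-variance) pooling of log odds ratios.
    std_errors are the standard errors of the log odds ratios."""
    logs = [math.log(or_) for or_ in odds_ratios]
    weights = [1 / se**2 for se in std_errors]
    pooled_log = sum(w * l for w, l in zip(weights, logs)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return (math.exp(pooled_log),
            math.exp(pooled_log - 1.96 * pooled_se),
            math.exp(pooled_log + 1.96 * pooled_se))

# Two hypothetical studies, neither significant on its own
or_pooled, ci_low, ci_high = pool_odds_ratios([0.5, 0.8], [0.3, 0.3])
# Pooling narrows the interval; with enough studies the combined
# interval can exclude 1 even when no single study's interval does
```

This is how pooling can reveal a significant overall effect when the individual studies were underpowered.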
In addition to being potentially useful when published studies reach conflicting conclusions, metaanalysis can help raise issues to be addressed in future clinical trials. The procedure is not, however, without its critics, and readers should be aware of some of the potential problems in its use. To evaluate metaanalysis, LeLorier and associates (1997) compared the results of a series of large randomized, controlled trials with relevant previously published metaanalyses. Their results were mixed: They found that metaanalysis accurately predicted the outcome in only 65% of the studies; however, the difference between the trial results and the metaanalysis results was statistically significant in only 12% of the comparisons. Ioannidis and colleagues (1998) determined that the discrepancies in the conclusions were attributable to different disease risks, different study protocols, varying quality of the studies, and possible publication bias (discussed in a following section). These reports serve as a useful reminder that welldesigned clinical trials remain a critical source of information.
Studies designed in dissimilar ways should not be combined. In performing a meta-analysis, investigators should use clear and well-accepted criteria for deciding whether studies should be included in the analysis, and these criteria should be stated in the published meta-analysis.
Most meta-analyses are based on the published literature, and some people believe it is easier to publish studies with significant results than studies that show no difference. This potential problem is called publication bias. Researchers can take at least three important steps to reduce publication bias. First, they can search for unpublished data, typically by contacting the authors of published articles. Veenstra and his colleagues (1999) did this and contacted the manufacturer of the treated catheters as well but were unable to identify any unpublished data. Second, researchers can perform an analysis to see how sensitive the conclusions are to certain characteristics of the studies. For instance, Veenstra and colleagues assessed sources of heterogeneity, or variation, among the studies and reported that excluding the studies involved had no substantive effect on the conclusions. Third, investigators can estimate how many studies showing no difference would have to be done but not published to raise the pooled P value above the 0.05 level or produce a confidence interval that includes 1, so that the combined results would no longer be significant. The reader can have more confidence in the conclusions from a meta-analysis that finds a significant effect if a large number of unpublished negative studies would be required to repudiate the overall significance. The increasing use of computerized patient databases may lessen the effect of publication bias in future meta-analyses. Montori and colleagues (2000) provide a review of publication bias for clinicians.
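The third step is often computed as Rosenthal's "fail-safe N." A sketch, assuming each study is summarized by a one-tailed z score (the z values below are invented):

```python
def fail_safe_n(z_scores, z_alpha=1.645):
    """Rosenthal's fail-safe N: how many unpublished studies averaging z = 0
    would be needed to raise the combined one-tailed P value above 0.05."""
    k = len(z_scores)
    total = sum(z_scores)
    return max(0.0, (total / z_alpha) ** 2 - k)

# Three hypothetical studies, each with z = 2 (one-tailed P about 0.023)
n_unpublished = fail_safe_n([2.0, 2.0, 2.0])
# About 10 unpublished null studies would be needed to overturn significance
```

A large fail-safe N relative to the number of included studies suggests the pooled result is robust to publication bias; a small one suggests caution.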
Table 10-11. Characteristics of studies comparing antiseptic-impregnated with control catheters.



Figure 10-5. Analysis of catheter-related bloodstream infection in trials comparing chlorhexidine/silver sulfadiazine-impregnated central venous catheters with nonimpregnated catheters. The diamond indicates odds ratio (OR) and 95% confidence interval (CI). Studies are ordered by increasing mean duration of catheterization in the treatment group. The size of the squares is inversely proportional to the variance of the studies. (Reproduced, with permission, from Veenstra DL, Saint S, Saha S, Lumley T, Sullivan SD: Efficacy of antiseptic-impregnated central venous catheters in preventing catheter-related bloodstream infection. JAMA 1999; 281: 261–267. Copyright © 1999, American Medical Association.)
The Cochrane Collection is a large and growing database of meta-analyses that were done according to specific guidelines. Each meta-analysis contains a description and an assessment of the methods used in the articles that constitute the meta-analysis. Graphs such as Figure 10-5 are produced, and, if appropriate, graphs for subanalyses are presented. For instance, if both cohort studies and clinical trials have been done on a given topic, the Cochrane Collection presents a separate figure for each. The Cochrane Collection is available on CD-ROM or via the Internet for an annual fee. The Cochrane Web site states that: "Cochrane reviews (the principal output of the Collaboration) are published electronically in successive issues of The Cochrane Database of Systematic Reviews. Preparation and maintenance of Cochrane reviews is the responsibility of international collaborative review groups."
No one has argued that meta-analyses should replace clinical trials. Veenstra and his colleagues (1999) conclude that a large trial may be warranted to confirm their findings. Despite their shortcomings, meta-analyses can provide guidance to clinicians when the literature contains several studies with conflicting results, especially when the studies have relatively small sample sizes. Furthermore, based on the increasingly large number of published meta-analyses, it appears that this method is here to stay. As with all types of studies, however, the methods used in a meta-analysis need to be carefully assessed before the results are accepted.
METHODS FOR CLASSIFICATION
Several multivariate methods can be used when the research question is related to classification. When the goal is to classify subjects into groups, discriminant analysis, cluster analysis, and propensity score analysis are appropriate. These methods all involve multiple measurements on each subject, but they have different purposes and are used to answer different research questions.
Discriminant Analysis
Logistic regression is used extensively in the biologic sciences. A related technique, discriminant analysis, although used less frequently in medicine, is common in the social sciences. It is similar to logistic regression in that it is used to predict a nominal or categorical outcome. It differs from logistic regression, however, in that it assumes that the independent variables follow a multivariate normal distribution, so it must be used with caution if some X variables are nominal.
The procedure involves determining several discriminant functions, which are simply linear combinations of the independent variables that separate or discriminate among the groups defined by the outcome measure as much as possible. The number of discriminant functions needed is determined by a multivariate test statistic called Wilks' lambda. The discriminant functions' coefficients can be standardized and then interpreted in the same manner as in multiple regression to draw conclusions about which variables are important in discriminating among the groups.
Leone and coworkers (2002) wanted to identify characteristics that differentiate among expert adolescent female athletes in four different sports. Body mass, height, girth of the biceps and calf, skinfold measures, measures of aerobic power, and flexibility were among the measures they examined. Sports included were tennis with 15 girls, skating with 46, swimming with 23, and volleyball with 16. Discriminant analysis is useful when investigators want to evaluate several explanatory variables and the goal is to classify subjects into two or more categories or groups, such as that defined by the four sports.
Their analysis revealed three significant discriminant functions. The first function discriminated between skaters and the other three groups; the second reflected differences between volleyball players and swimmers, and the third between swimmers and tennis players. They concluded that adolescent female athletes show physical and biomotor differences that distinguish among them according to their sport.
Although discriminant analysis is most often employed to explain or describe factors that distinguish among groups of interest, the procedure can also be used to classify future subjects. Classification involves determining a separate prediction equation corresponding to each group that gives the probability of belonging to that group, based on the explanatory variables. For classification of a future subject, a prediction is calculated for each group, and the individual is classified as belonging to the group he or she most closely resembles.
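For the simplest case, two groups and two explanatory variables, the discriminant-function weights can be computed directly: the inverse of the pooled covariance matrix times the difference in group means (the two-group Fisher discriminant). The data below are invented; real analyses handle more groups and variables and rely on the multivariate normality assumption noted above.

```python
def mean2(rows):
    """Mean of a list of (x1, x2) observations."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def fisher_weights(group1, group2):
    """Two-group Fisher discriminant: w = S_pooled^-1 (mean1 - mean2)."""
    m1, m2 = mean2(group1), mean2(group2)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((group1, m1), (group2, m2)):
        for r in rows:
            d = (r[0] - m[0], r[1] - m[1])
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(group1) + len(group2) - 2
    s = [[v / dof for v in row] for row in s]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    d = (m1[0] - m2[0], m1[1] - m2[1])
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]

# Hypothetical measurements (say, two girth measures) for two sports
skaters = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
swimmers = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
w = fisher_weights(skaters, swimmers)
scores_skaters = [w[0] * x + w[1] * y for x, y in skaters]
scores_swimmers = [w[0] * x + w[1] * y for x, y in swimmers]
# Every skater's discriminant score exceeds every swimmer's
```

To classify a future athlete, her score on the discriminant function would be compared with the midpoint of the two group mean scores.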
Factor Analysis
Andrewes and colleagues (2003) wanted to know how scores on the Emotional and Social Dysfunction Questionnaire (ESDQ) can be used to help decide the level of support needed following brain surgery. Similarly, the Medical Outcomes Study Short Form 36 (MOS SF-36) is a questionnaire commonly used to measure patient outcomes (Stewart et al, 1988). In examples such as these, tests with a large number of items are developed, patients or other subjects take the test, and scores on various items are combined to produce scores on the relevant factors.
The MOS SF-36 is probably used more frequently than any other questionnaire to measure functional outcomes and quality of life; it has been used all over the world and in patients with a variety of medical conditions. The questionnaire contains 36 items that are combined to produce a patient profile on eight concepts: physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional, and mental health. The first four concepts are combined to give a measure of physical health, and the last four concepts are combined to give a measure of mental health. The developers used factor analysis to decide how to combine the questions to develop these concepts.
In a research problem in which factor analysis is appropriate, all variables are considered to be independent; in other words, there is no desire to predict one on the basis of others. Conceptually, factor analysis works as follows: First, a large number of people are measured on a set of items; a rule of thumb calls for at least ten times as many subjects as items. The second step involves calculating correlations. To illustrate, suppose 500 patients answered the 36 questions on the MOS SF-36. Factor analysis answers the question of whether some of the items group together in a logical way, such as items that measure the same underlying component of physical activity. If two items measure the same component, they can be expected to have higher correlations with each other than with other items.
In the third step, factor analysis manipulates the correlations among the items to produce linear combinations, similar to a regression equation without the dependent variable. The difference is that each linear combination, called a factor, is determined so that the first one accounts for the most variation among the items, the second factor accounts for the most residual variation after the first factor is taken into consideration, and so forth. Typically, a small number of factors account for enough of the variation among subjects that it is possible to draw inferences about a patient's score on a given factor. For example, it is much more convenient to refer to scores for physical functioning, rolephysical, bodily pain, and so on, than to refer to scores on the original 36 items. Thus, the fourth step involves determining how many factors are needed and how they should be interpreted.
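The third step can be illustrated by extracting the first principal component of a correlation matrix, found here by power iteration. Real factor analysis programs extract several factors and usually rotate them for interpretability, so this is only a sketch of the core idea; the correlations are invented.

```python
def first_factor(corr, iters=500):
    """Leading eigenvector of a correlation matrix: the weights of the
    linear combination accounting for the most variation among the items."""
    n = len(corr)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Items 1 and 2 correlate highly with each other but not with item 3,
# so they should load together on the first factor (hypothetical values)
corr = [[1.0, 0.8, 0.0],
        [0.8, 1.0, 0.0],
        [0.0, 0.0, 1.0]]
loadings = first_factor(corr)
# The first two items receive equal, large weights; item 3 gets a weight near 0
```

The second factor would be extracted the same way from the residual correlation after the first factor's contribution is removed, and so on.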
Andrewes and colleagues analyzed the ESDQ, a questionnaire designed for brain-damaged populations.
They performed a factor analysis of the ratings by the partner or caretaker of 211 patients. They found that the relationships among the questions could be summarized by eight factors, including anger, helplessness, emotional dyscontrol, indifference, inappropriateness, fatigue, maladaptive behavior, and insight. The researchers subsequently used the scores on the factors for a discriminant analysis to differentiate between the brain-damaged patients and a control group with no cerebral dysfunction and found significant discrimination.
Investigators who use factor analysis usually have an idea of what the important factors are, and they design the items accordingly. Many other issues are of concern in factor analysis, such as how to derive the linear combinations, how many factors to retain for interpretation, and how to interpret the factors. Using factor analysis, as well as the other multivariate techniques, requires considerable statistical skill.
Cluster Analysis
A statistical technique similar conceptually to factor analysis is cluster analysis. The difference is that cluster analysis attempts to find similarities among the subjects that were measured instead of among the measures that were made. The object in cluster analysis is to determine a classification or taxonomic scheme that accounts for variance among the subjects. Cluster analysis can also be thought of as similar to discriminant analysis, except that the investigator does not know to which group the subjects belong. As in factor analysis, all variables are considered to be independent variables.
Cluster analysis is frequently used in archeology and paleontology to determine if the existence of similarities in objects implies that they belong to the same taxon. Biologists use this technique to help determine classification keys, such as using leaves or flowers to determine appropriate species. A study by Penzel and colleagues (2003) used cluster analysis to examine the relationships among chromosomal imbalances in thymic epithelial tumors. Journalists and marketing analysts also use cluster analysis, referred to in these fields as Q-type factor analysis, as a way to classify readers and consumers into groups with common characteristics.
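The idea can be sketched with k-means, one of the most widely used clustering algorithms, on a single measurement. Real applications cluster subjects on many variables at once; the data here are invented.

```python
def kmeans_1d(data, centers, iters=20):
    """Minimal one-dimensional k-means: repeatedly assign each point to its
    nearest center, then move each center to the mean of its points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups of hypothetical measurements
centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0.0, 5.0])
# The centers converge to the two group means, 2 and 11
```

Note that, unlike discriminant analysis, no group labels are supplied; the groups emerge from the data themselves.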
Propensity Scores
The propensity score method is an alternative to multiple regression and analysis of covariance. It provides a creative method to control for an entire group of confounding variables. Conceptually, a propensity score is found by using the confounding variables as predictors of the group to which a subject belongs; this step is generally accomplished by using logistic regression. For example, many cohort studies are handicapped by the problem of many confounding variables, such as age, gender, race, comorbidities, and so forth. The confounding variables are used to develop a logistic regression equation to predict the group to which each subject belongs (e.g., exposed or unexposed). This prediction, based on a combination of the confounding variables, is calculated for all subjects and then used as a single confounding variable in subsequent analyses. Developers of the technique maintain it does a better job of controlling for confounding variables (Rubin, 1997). See Katzan and colleagues (2003) for an example of the application of propensity score analysis in a clinical study to determine the effect of pneumonia on mortality in patients with acute stroke.
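A propensity score, then, is just the predicted probability of group membership from a logistic regression on the confounders. A self-contained sketch with one confounder and invented data; real analyses use many confounders and standard statistical software rather than the bare-bones gradient fit shown here.

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Logistic regression of a binary group indicator on one confounder,
    fitted by plain gradient ascent on the log likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def propensity(x, b0, b1):
    """Predicted probability of group membership: the propensity score."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# Hypothetical confounder (coded, say, age groups 1-6) and exposure indicator
confounder = [1, 2, 3, 4, 5, 6]
exposed = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(confounder, exposed)
# Subjects with larger confounder values receive higher propensity scores
```

In the subsequent analysis, subjects with similar propensity scores are compared (by matching, stratification, or adjustment), which balances the whole set of confounders at once.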
Classification and Regression Tree (CART) Analysis
Classification and regression tree (CART) analysis is an approach to analyzing large databases to find significant patterns and relationships among variables. The patterns are then used to develop predictive models for classifying future subjects. As an example, CART was used in a study of 105 patients with stage IV colon or rectal cancer (Dixon et al, 2003). CART identified optimal cut points for carcinoembryonic antigen (CEA) and albumin (ALB) to form four groups of patients: low CEA with high ALB, low CEA with low ALB, high CEA with high ALB, and high CEA with low ALB. A survival analysis (Kaplan–Meier) was then used to compare survival times in these four groups. In another application of CART analysis, researchers were successful in determining the values of semen measurements that discriminate between fertile and infertile men (Guzick et al, 2001). The method requires special software and extensive computing power.
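The core step, finding an optimal cutpoint on one variable, can be sketched for a binary outcome using the Gini impurity criterion. This is a bare-bones version of what CART software does (which also grows, cross-validates, and prunes a full tree); the values below are invented, not the Dixon data.

```python
def gini(labels):
    """Gini impurity of a set of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_cutpoint(values, labels):
    """Cutpoint on one variable minimizing the weighted Gini impurity
    of the two resulting groups."""
    best_cut, best_score = None, float("inf")
    for cut in sorted(set(values))[:-1]:  # splitting at the max leaves one side empty
        left = [l for v, l in zip(values, labels) if v <= cut]
        right = [l for v, l in zip(values, labels) if v > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score

# Hypothetical CEA values and a 0/1 outcome indicator
cea = [2, 3, 4, 20, 25, 30]
outcome = [0, 0, 0, 1, 1, 1]
cut, impurity = best_cutpoint(cea, outcome)
# The split at CEA <= 4 separates the outcomes perfectly (impurity 0)
```

A full tree repeats this search recursively within each resulting group, over all candidate variables, which is why the method demands substantial computing power.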
MULTIPLE DEPENDENT VARIABLES
Multivariate analysis of variance and canonical correlation are similar to each other in that they both involve multiple dependent variables as well as multiple independent variables.
Multivariate Analysis of Variance
Multivariate analysis of variance (MANOVA) conceptually (although not computationally) is a simple extension of the ANOVA designs discussed in Chapter 7 to situations in which two or more dependent variables are included. As with ANOVA, MANOVA is appropriate when the independent variables are nominal or categorical and the outcomes are numerical. If the results from the MANOVA are statistically significant, using the multivariate statistic called Wilks' lambda, follow-up ANOVAs may be done to investigate the individual outcomes.
Weiner and Rudy (2002) wanted to identify nursing home resident and staff attitudes that are barriers to effective pain management. They collected information from nurses, nursing assistants, and residents in seven long-term care facilities. They designed questionnaires to collect beliefs about 12 components of chronic pain management and administered them to these three groups. They wanted to know if there were attitudinal differences among the three groups on the 12 components. If analysis of variance were used in this study, the investigators would need to do 12 different ANOVAs, and the probability that any one component would be significant by chance alone would be increased. With these multiple dependent variables, they correctly chose to use MANOVA. Results indicated that residents believed that chronic pain does not change, and they were fearful of addiction. The nursing staff believed that many complaints went unheard by busy staff. Note that this study used a nested design (patients and staff within nursing homes) and would be a candidate for GEE or multilevel model analysis.
The motivation for doing MANOVA prior to univariate ANOVA is similar to the reason for performing univariate ANOVA prior to t tests: to eliminate doing many significance tests and increasing the likelihood that a chance difference is declared significant. In addition, MANOVA permits the statistician to look at complex relationships among the dependent variables. The results from MANOVA are often difficult to interpret, however, and it is used sparingly in the medical literature.
Canonical Correlation Analysis
Canonical correlation analysis also involves both multiple independent and multiple dependent variables. This method is appropriate when both the independent variables and the outcomes are numerical, and the research question focuses on the relationship between the set of independent variables and the set of dependent variables. For example, suppose researchers wish to examine the overall relationship between indicators of health outcome (physical functioning, mental health, health perceptions, age, gender, etc) measured at the beginning of a study and the set of outcomes (physical functioning, mental health, social contacts, serious symptoms, etc) measured at the end of the study. Canonical correlation analysis forms a linear combination of the independent variables to predict not just a single outcome measure, but a linear combination of outcome measures. The two linear combinations of independent variables and dependent variables, each resulting in a single number (or index), are determined so the correlation between them is as large as possible. The correlation between the pair of linear combinations (or numbers or indices) is called the canonical correlation. Then, as in factor analysis, a second pair of linear combinations is derived from the residual variation after the effect of the first pair is removed, and the third pair from those remaining, and so on. The canonical coefficients in the linear combinations are interpreted in the same manner as regression coefficients in a multiple regression equation, and the canonical correlations as multiple R. Generally, the first two or three pairs of linear combinations account for sufficient variation, and they can be interpreted to gain insights about related factors or dimensions.
The relationship between personality and symptoms of depression was studied in a community-based sample of 804 individuals. Grucza and colleagues (2003) used the Temperament and Character Inventory (TCI) to assess personality and the Center for Epidemiologic Studies Depression scale (CES-D) to measure symptoms of depression. Both of these questionnaires contain multiple scales or factors, and the authors used canonical correlation analysis to learn how the factors on the TCI are related to the factors on the CES-D. They discovered several relationships and concluded that depression symptom severity and patterns are partially explained by personality traits.
SUMMARY OF ADVANCED METHODS
The advanced methods presented in this chapter are used in approximately 10–15% of the articles in medical and surgical journals. Unfortunately for readers of the medical literature, these methods are complex and not easy to understand, and they are not always described adequately. As with other complex statistical techniques, investigators should consult with a statistician if an advanced statistical method is planned. Table 10-2 gives a guide to selecting the appropriate method(s), depending on the number of independent variables and the scale on which they are measured.
EXERCISES
1. Using the following formula, verify the adjusted mean number of ventricular wall motion abnormalities in smokers and nonsmokers from the hypothetical data in the section titled, "Controlling for Confounding." That is, each group mean is adjusted to the grand mean of the covariate using the common (pooled within-group) regression slope b:

Adjusted mean_{j} = Y̅_{j} − b(X̅_{j} − X̅)

where Y̅_{j} and X̅_{j} are the outcome and covariate means in group j, and X̅ is the grand mean of the covariate.
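A minimal sketch of the standard ANCOVA adjustment, using small hypothetical data (the group labels, ages, and counts below are invented for illustration, not the chapter's values):

```python
import numpy as np

def adjusted_means(y, x, groups):
    """ANCOVA-style adjusted means: each group's mean outcome is
    shifted to what it would be if that group had the grand mean of
    the covariate, using a common within-group slope b:
        adj_mean_j = ybar_j - b * (xbar_j - xbar_grand)"""
    grand_x = x.mean()
    sxy = sxx = 0.0
    for g in np.unique(groups):
        xg, yg = x[groups == g], y[groups == g]
        sxy += np.sum((xg - xg.mean()) * (yg - yg.mean()))
        sxx += np.sum((xg - xg.mean()) ** 2)
    b = sxy / sxx                       # pooled within-group slope
    return {int(g): y[groups == g].mean() - b * (x[groups == g].mean() - grand_x)
            for g in np.unique(groups)}

# Hypothetical nonsmokers (0) vs smokers (1), confounded by age
groups = np.array([0] * 5 + [1] * 5)
age = np.array([40, 45, 50, 55, 60, 50, 55, 60, 65, 70], dtype=float)
abnormalities = np.array([1, 1, 2, 2, 3, 2, 3, 3, 4, 4], dtype=float)
adj = adjusted_means(abnormalities, age, groups)
```

Because the smokers in this toy data set are older, adjustment pulls the two group means toward each other, which is exactly the confounding-control effect the exercise illustrates.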
2. Blood flow through an artery measured as peak systolic velocity (PSV) increases with narrowing of the artery. The well-known relationship between area of the arterial vessels and velocity of blood flow is important in the use of carotid Doppler measurements for grading stenosis of the artery. Alexandrov and collaborators (1997) examined 80 bifurcations in 40 patients and compared the findings from the Doppler technique with two angiographic methods of measuring carotid stenosis (the North American or NASCET [N] method and the common carotid [C or CSI] method). They investigated the fit provided by a linear equation, a quadratic equation, and a cubic equation.
a. Using data in the file "Alexandrov" on the CD-ROM, produce a scatterplot with PSV on the y-axis and CSI on the x-axis. How do you interpret the scatterplot?
b. Calculate the correlation of both the N and C methods with PSV. Which is more highly related to PSV?
c. Perform a multiple regression to predict PSV from CSI using linear and quadratic terms.
d. Using the regression equation, what is the predicted PSV if the measurement of angiographic stenosis using the CSI method is 60%?
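The computation in parts c and d can be sketched as an ordinary least-squares fit on a design matrix containing linear and quadratic terms. The data below are synthetic stand-ins (the Alexandrov file is not reproduced here, and the coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
stenosis = rng.uniform(10, 90, size=80)                 # percent stenosis
# Hypothetical quadratic relationship plus noise, in cm/s
psv = 40 + 0.5 * stenosis + 0.03 * stenosis ** 2 \
      + rng.normal(scale=5, size=80)

# Design matrix with intercept, linear, and quadratic columns
X = np.column_stack([np.ones_like(stenosis), stenosis, stenosis ** 2])
coef, *_ = np.linalg.lstsq(X, psv, rcond=None)

def predict(x):
    """Predicted PSV from the fitted quadratic equation."""
    return coef[0] + coef[1] * x + coef[2] * x ** 2

pred_60 = predict(60.0)   # predicted PSV at 60% stenosis
```

With the real data the same steps apply: build the polynomial design matrix, fit, and substitute the stenosis value of interest into the fitted equation.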
3. Refer to the study by Soderstrom and coinvestigators (1997). Find the probability that a 27-year-old Caucasian man who comes to the emergency department on Saturday night has a BAC ≥ 50 mg/dL.
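Once the fitted logistic equation is in hand, the predicted probability follows from the logistic transformation of the linear combination. A sketch with placeholder coefficients (the values below are hypothetical, not Soderstrom's fitted model):

```python
import math

def logistic_probability(coefs, values):
    """P(BAC >= 50 mg/dL) from a fitted logistic model:
    p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = coefs["intercept"] + sum(coefs[k] * v for k, v in values.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration only
coefs = {"intercept": -1.2, "male": 0.8, "white": 0.3,
         "age": -0.01, "weekend": 0.9}
p = logistic_probability(coefs,
                         {"male": 1, "white": 1, "age": 27, "weekend": 1})
```

For the exercise, substitute the actual coefficients from the text in place of the hypothetical ones.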
4. Refer to the study by Soderstrom and coinvestigators (1997). From Table 10-8, find the value of the kappa statistic for the agreement between the predicted and actual number of males with unintentional injuries who have a BAC ≥ 50 mg/dL when they come to the emergency department.
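The kappa statistic corrects observed agreement for the agreement expected by chance. A sketch of the computation on a 2 × 2 agreement table (the counts below are hypothetical, not the Table 10-8 values):

```python
def kappa(table):
    """Cohen's kappa for a 2x2 agreement table:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement (diagonal proportion) and p_e is the agreement
    expected by chance from the marginal totals."""
    n = sum(sum(row) for row in table)
    p_o = (table[0][0] + table[1][1]) / n
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    p_e = sum(row_totals[i] * col_totals[i] for i in range(2)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical predicted-vs-actual counts for illustration
k = kappa([[20, 5], [10, 15]])   # p_o = 0.70, p_e = 0.50, kappa = 0.40
```

Substituting the Table 10-8 counts into the same function answers the exercise.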
5. Bale and associates (1986) performed a study to consider the physique and anthropometric variables of athletes in relation to their type and amount of training and to examine these variables as potential predictors of distance running performance. Sixty runners were divided into three groups: (1) elite runners with 10-km times of less than 30 min; (2) good runners with 10-km times between 30 and 35 min; and (3) average runners with 10-km times between 35 and 45 min. Anthropometric data included body density, percentage fat, percentage absolute fat, lean body mass, ponderal index, biceps and calf circumferences, humerus and femur widths, and various skinfold measures. The authors wanted to determine whether the anthropometric variables were able to differentiate among the groups of runners. What is the best method to use for this research question?
Table 10-12. Regression coefficients and t test values for predicting bed-days in RAND study. 



Table 10-13. Values for prediction equation. 


Table 10-14. Regression results for predicting depression at wave 2. 


6. Ware and collaborators (1987) reported a study of the effects on health for patients in health maintenance organizations (HMO) and for patients in fee-for-service (FFS) plans. Within the FFS group, some patients were randomly assigned to receive free medical care and others shared in the costs. The health status of the adults was evaluated at the beginning and again at the end of the study. In addition, the number of days spent in bed because of poor health was determined periodically throughout the study. These measures, recorded at the beginning of the study, along with information on the participant's age, gender, income, and the system of health care to which he or she was assigned (HMO, free FFS, or pay FFS), were the independent variables used in the study. The dependent variables were the values of these same 13 measures at the end of the study. The results from a multiple-regression analysis to predict number of bed-days are given in Table 10-12.
Use the regression equation to predict the number of bed-days during a 30-day period for a 70-year-old woman in the FFS pay plan who has the values on the independent variables shown in Table 10-13 (asterisks [*] designate dummy variables given a value of 1 if yes and 0 if no).
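Prediction from a multiple regression equation is a weighted sum of the independent variables, with dummy variables entering as 1 or 0. A sketch with placeholder coefficients (the variable names and values below are hypothetical stand-ins, not the Table 10-12 estimates):

```python
def predict_bed_days(coefs, values):
    """Predicted bed-days = b0 + sum of b_i * x_i, where dummy
    variables take the value 1 (yes) or 0 (no)."""
    return coefs["intercept"] + sum(coefs[k] * v for k, v in values.items())

# Hypothetical coefficients for illustration only
coefs = {"intercept": 0.50, "age": 0.02, "female": 0.30,
         "pay_ffs": 0.15, "baseline_bed_days": 0.40}
x = {"age": 70, "female": 1, "pay_ffs": 1, "baseline_bed_days": 2}
yhat = predict_bed_days(coefs, x)
```

For the exercise, use the actual coefficients from Table 10-12 and the values from Table 10-13 in place of the hypothetical ones.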
7. Symptoms of depression in the elderly may be more subtle than in younger patients, but recognizing depression in the elderly is important because it can be treated. Henderson and colleagues in Australia (1997) studied a group of more than 1000 elderly people, all aged 70 years or older. They examined the outcome of depressive states 3–4 years after initial diagnosis to identify factors associated with persistence of depressive symptoms and to test the hypothesis that depressive symptoms in the elderly are a risk factor for dementia or cognitive decline. They used the Canberra Interview for the Elderly (CIE), which measures depressive symptoms and cognitive performance, and referred to the initial measurement as "wave 1" and the follow-up as "wave 2." The regression equation predicting depression at wave 2 for 595 people who completed the interview on both occasions is given in Table 10-14, and data are in the file on the CD-ROM entitled, "Henderson." The variables have been entered into the regression equation in blocks, an example of hierarchical regression.
Table 10-15. Cox proportional hazard model using only pretreatment variables. 


9.
a. Based on the regression equation in Table 10-14, what is the relationship between depression score initially and at follow-up?
b. The regression coefficient for age is 0.014. Is it significant? How would you interpret it?
c. Once a person's depression score at wave 1 is known, which group of variables accounts for more of the variation in depression at wave 2?
d. Use the data on the CDROM to replicate the analysis.
10. Table 10-15 contains the results from an analysis of the data from Crook and colleagues (1997) using only information known before treatment was given.
a. Is the overall Cox model significant when based on pretreatment variables only? What level of significance is reported?
b. Were any of the potentially confounding variables significant?
c. Confirm the value of the hazard ratio associated with the TUMSTAGE(3) variable of the T classifications, and interpret the confidence interval.
Table 10-16. Case summaries.^{a} 


d. What are the major differences in this analysis compared with the one that included posttreatment variables as well?
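In Cox regression output, the ratio reported for a variable is the exponential of its coefficient, and the confidence interval is obtained by exponentiating the interval for the coefficient. A sketch with hypothetical numbers (the coefficient and standard error below are not the Table 10-15 values):

```python
import math

def hazard_ratio_ci(b, se, z=1.96):
    """Hazard ratio exp(b) from a Cox coefficient b, with the 95%
    confidence interval exp(b - z*se) to exp(b + z*se)."""
    return math.exp(b), (math.exp(b - z * se), math.exp(b + z * se))

# Hypothetical coefficient and standard error for illustration
hr, (lo, hi) = hazard_ratio_ci(b=0.9, se=0.3)
```

A confidence interval that excludes 1 (as in this illustration) corresponds to a statistically significant variable, which is the check asked for in part c.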
11. Hindmarsh and Brook (1996) examined the final height of 16 short children who were treated with growth hormone. They studied several variables they thought might predict height in these children, such as the mother's height, the father's height, the child's chronologic and bone age, dose of the growth hormone during the first year, age at the start of therapy, and the peak response to an insulin-induced hypoglycemia test. All anthropometric indices were expressed as standard deviation scores; these scores express height in terms of standard deviations from the mean in a norm group. For example, a height score of −2.00 indicates the child is 2 standard deviations below the mean height for his or her age group.
Data are given in Table 10-16 and in a file entitled "Hindmarsh" on the CD-ROM.
a. Use the data to perform a stepwise regression and interpret the results. We reproduced a portion of the output in Table 10-17.
b.
Table 10-17. Results from stepwise multiple regression to predict final height in standard deviation scores.^{a} 


c. What variable entered the equation on the first iteration (model 1)? Why do you think it entered first?
d. What variables are in the equation at the final model? Which of these variables makes the greatest contribution to the prediction of final height?
e. Why do you think the variable that entered the equation first is not in the final model?
f. Using the regression equation, what is the predicted height of the first child? How close is this to the child's actual final height (in SDS scores)?
Footnotes
^{a}Technically it is possible for the regression coefficient and the correlation to have different signs. If so, the variable is called a suppressor variable; it affects the relationship between the dependent variable and another independent variable.
^{b}The standardized coefficient = the unstandardized coefficient multiplied by the standard deviation of the X variable and divided by the standard deviation of the Y variable: β_{j} = b_{j} (SD_{X}/SD_{Y}).
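In the simple (one-predictor) case, the standardized coefficient β = b(SD_X/SD_Y) reduces exactly to the correlation coefficient r, which is a useful check. A small numerical verification with synthetic data:

```python
import numpy as np

# Verify that b * (SD_X / SD_Y) equals r in simple regression
rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)   # linear relationship plus noise

b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # least-squares slope
beta = b * x.std(ddof=1) / y.std(ddof=1)     # standardized coefficient
r = np.corrcoef(x, y)[0, 1]                  # correlation coefficient
```

With multiple predictors the standardized coefficients no longer equal the simple correlations, but the same rescaling formula applies to each b_{j}.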