KEY CONCEPTS

PRESENTING PROBLEMS
Presenting Problem 1
In the United States, according to World Health Organization (WHO) standards, 42% of men and 28% of women are overweight, and an additional 21% of men and 28% of women are obese. Body mass index (BMI) has become the measure to define standards of overweight and obesity. The WHO defines overweight as a BMI between 25 and 29.9 kg/m^{2} and obesity as a BMI greater than or equal to 30 kg/m^{2}. Jackson and colleagues (2002) point out that the use of BMI as a single standard for obesity for all adults has been recommended because it is assumed to be independent of variables such as age, sex, ethnicity, and physical activity. Their goal was to examine this assumption by evaluating the effects of sex, age, and race on the relation between BMI and measured percent fat. They studied 665 black and white men and women who ranged in age from 17 to 65 years. Each participant was carefully measured for height and weight to calculate BMI and body density. Relative body fat (%fat) was estimated from body density using previously published equations. The independent variables examined were BMI, sex, age, and race. We examine these data to learn whether a relationship exists and, if so, whether it is linear. Data are on the CD-ROM in a folder entitled “Jackson.”
Presenting Problem 2
Hypertension, defined as systolic pressure greater than 140 mm Hg or diastolic pressure greater than 90 mm Hg, is present in 20–30% of the U.S. population. Recognition and treatment of hypertension have significantly reduced the morbidity and mortality associated with the complications of hypertension. A number of finger blood pressure devices are marketed for home use by patients as an easy and convenient way for them to monitor their own blood pressure.
How accurate are these finger blood pressure devices? Nesselroad and colleagues (1996) studied these devices to determine their accuracy. They measured blood pressure in 100 consecutive patients presenting to a family practice office who consented to participate. After being seated for 5 min, blood pressure was measured in each patient using a standard blood pressure cuff of appropriate size and with each of three automated finger blood pressure devices. The data were analyzed by calculating the correlation coefficient between the value obtained with the blood pressure cuff and the three finger devices and by calculating the percentage of measurements with each automated device that fell within the ±4 mm Hg margin of error of the blood pressure cuff.
We use the data to illustrate correlation and scatterplots. We also illustrate a test of hypothesis about two dependent or related correlation coefficients. Data are given in the section titled, “Spearman's Rho,” and on the CD-ROM in a folder called “Nesselroad.”
Presenting Problem 3
Symptoms of forgetfulness and loss of concentration can be a result of natural aging and are often aggravated by fatigue, illness, depression, visual or hearing loss, or certain medications. Hodgson and Cutler (1997) wished to examine the consequences of anticipatory dementia, a phenomenon characterized as the fear that normal and age-associated memory change may be the harbinger of Alzheimer's disease.
They studied 25 men and women having a living parent with a probable diagnosis of Alzheimer's disease, a condition in which genetic factors are known to be important. A control group of 25 men and women who did not have a parent with dementia was selected for comparison. A directed interview and questionnaire were used to measure concern about developing Alzheimer's disease and to assess subjective memory functioning. Four measures of each individual's sense of well-being were used in the areas of depression, psychiatric symptomatology, life satisfaction, and subjective health status. We use this study to illustrate point-biserial correlation and show its concordance with the t test. Observations from the study are given in the section entitled “Predicting with the Regression Equation: Individual and Mean Values,” and the data are in a folder on the CD-ROM entitled “Hodgson.”
Presenting Problem 4
The study of hyperthyroid women by Gonzalo and coinvestigators (1996) was a presenting problem in Chapter 7. Recall that the study reported the effect of excess body weight in hyperthyroid patients on glucose tolerance, insulin secretion, and insulin sensitivity. The study included 14 hyperthyroid women, 6 of whom were overweight, and 19 volunteers with normal thyroid levels of similar ages and weight. The investigators in this study also examined the relationship between insulin sensitivity and body mass index for hyperthyroid and control women. (See Figure 3 in the Gonzalo article.) We revisit this study to calculate and compare two regression lines. Original observations are given in Chapter 7, Table 7-8.
AN OVERVIEW OF CORRELATION & REGRESSION
In Chapter 3 we introduced methods to describe the association or relationship between two variables. In this chapter we review these concepts and extend the idea to predicting the value of one characteristic from the other. We also present the statistical procedures used to test whether a relationship between two characteristics is significant. Two probability distributions introduced previously, the t distribution and the chi-square distribution, can be used for statistical tests in correlation and regression. As a result, you will be pleased to learn that much of the material in this chapter will be familiar to you.
When the goal is merely to establish a relationship (or association) between two measures, as in these studies, the correlation coefficient (introduced in Chapter 3) is the statistic most often used. Recall that correlation is a measure of the linear relationship between two variables measured on a numerical scale.
In addition to establishing a relationship, investigators sometimes want to predict an outcome, dependent, or response, variable from an independent, or explanatory, variable. Generally, the explanatory characteristic is the one that occurs first or is easier or less costly to measure. The statistical method of linear regression is used; this technique involves determining an equation for predicting the value of the outcome from values of the explanatory variable. One of the major differences between correlation and regression is the purpose of the analysis—whether it is merely to describe a relationship or to predict a value. Several important similarities exist as well, including the direct relationship between the correlation coefficient and the regression coefficient. Many of the same assumptions are required for correlation and regression, and both measure the extent of a linear relationship between the two characteristics.
CORRELATION
Figure 8-1 illustrates several hypothetical scatterplots of data to demonstrate the relationship between the size of the correlation coefficient r and the shape of the scatterplot. When the correlation is near zero, as in Figure 8-1E, the pattern of plotted points is somewhat circular. When the degree of relationship is small, the pattern is more like an oval, as in Figures 8-1D and 8-1B. As the value of the correlation gets closer to either +1 or –1, as in Figure 8-1C, the plot has a long, narrow shape; at +1 and –1, the observations fall directly on a line, as for r = +1.0 in Figure 8-1A.
The scatterplot in Figure 8-1F illustrates a situation in which a strong but nonlinear relationship exists. For example, with temperatures less than 10–15°C, a cold nerve fiber discharges few impulses; as the temperature increases, so do numbers of impulses per second until the temperature reaches about 25°C. As the temperature increases beyond 25°C, the numbers of impulses per second decrease once again, until they cease at 40–45°C. The correlation coefficient, however, measures only a linear relationship, and it has a value close to zero in this situation.
Figure 8-1. Scatterplots and correlations. A: r = +1.0; B: r = 0.7; C: r = 0.9; D: r = 0.4; E: r = 0.0; F: r = 0.0.
One of the reasons for producing scatterplots of data as part of the initial analysis is to identify nonlinear relationships when they occur. Otherwise, if researchers calculate the correlation coefficient without examining the data, they can miss a strong, but nonlinear, relationship, such as the one between temperature and number of cold nerve fiber impulses.
Calculating the Correlation Coefficient
We use the study by Jackson and colleagues (2002) to extend our understanding of correlation. We assume that anyone interested in actually calculating the correlation coefficient will use a computer program, as we do in this chapter. If you are interested in a detailed illustration of the calculations, refer to Chapter 3, in the section titled, “Describing the Relationship between Two Characteristics,” and the study by Hébert and colleagues (1997).
Recall that the formula for the Pearson product moment correlation coefficient, symbolized by r, is

r = Σ(X – X̄)(Y – Ȳ) / √[Σ(X – X̄)^{2} Σ(Y – Ȳ)^{2}]

where X stands for the independent variable, X̄ for its mean, Y for the outcome variable, and Ȳ for its mean.
A highly recommended first step in looking at the relationship between two numerical characteristics is to examine the relationship graphically. Figure 8-2 is a scatterplot of the data, with body mass index (BMI) on the X-axis and percent body fat on the Y-axis. We see from Figure 8-2 that a positive relationship exists between these two characteristics: Small values for BMI are associated with small values for percent body fat. The question of interest is whether the observed relationship is statistically significant. (A large number of duplicate or overlapping data points occur in this plot because the sample size is so large.)
The extent of the relationship can be found by calculating the correlation coefficient. Using a statistical program, the correlation between BMI and percent body fat is 0.73, indicating a strong relationship between these two measures. Use the CD-ROM to confirm our calculations. Also, see Chapter 3, in the section titled, “Describing the Relationship between Two Characteristics,” for a review of the properties of the correlation coefficient.
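As a quick check of the formula, Pearson's r can be computed in a few lines of Python. The BMI and percent-fat values below are hypothetical stand-ins for illustration only, not the actual observations in the “Jackson” folder:

```python
import numpy as np

# Hypothetical BMI (kg/m^2) and percent body fat values -- not the Jackson data.
bmi = np.array([19.2, 21.5, 23.8, 25.1, 27.4, 29.0, 31.6, 33.9, 36.2, 38.5])
pct_fat = np.array([12.0, 15.5, 18.2, 22.4, 24.1, 27.8, 30.3, 33.0, 36.9, 38.1])

def pearson_r(x, y):
    # Sum of cross-products of deviations from the means, divided by the
    # square root of the product of the sums of squared deviations.
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r = pearson_r(bmi, pct_fat)
print(round(r, 2))
```

The same value is returned by `np.corrcoef(bmi, pct_fat)[0, 1]`, which is the routine a statistical package would use.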
Figure 8-2. Scatterplot of body mass index and percent body fat. (Data, used with permission, from Jackson A, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC, et al: The effect of sex, age and race on estimating percentage body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002; 26: 789–796. Plot produced with NCSS; used with permission.)
Interpreting the Size of r
The size of the correlation required for statistical significance is, of course, related to the sample size. With a very large sample of subjects, such as 2000, even small correlations, such as 0.06, are significant. A better way to interpret the size of the correlation is to consider what it tells us about the strength of the relationship.
The Coefficient of Determination
The correlation coefficient can be squared to form the statistic called the coefficient of determination. For the subjects in the study by Jackson, the coefficient of determination is (0.73)^{2}, or 0.53. This means that 53% of the variation in the values for one of the measures, such as percent body fat, may be accounted for by knowing the BMI. This concept is demonstrated by the Venn diagrams in Figure 8-3. For the left diagram, r^{2} = 0.25; so 25% of the variation in A is accounted for by knowing B (or vice versa). The middle diagram illustrates r^{2} = 0.50, similar to the value we observed, and the diagram on the right represents r^{2} = 0.80.
The coefficient of determination tells us how strong the relationship really is. In the health literature, confidence limits or results of a statistical test for significance of the correlation coefficient are also commonly presented.
The t Test for Correlation
The symbol for the correlation coefficient in the population (the population parameter) is ρ (lower case Greek letter rho). In a random sample, ρ is estimated by r. If several random samples of the same size are selected from a given population and the correlation coefficient r is calculated for each, we expect r to vary from one sample to another but to follow some sort of distribution about the population value ρ. Unfortunately, the sampling distribution of the correlation does not behave as nicely as the sampling distribution of the mean, which is normally distributed for large samples.
Figure 8-3. Illustration of r^{2}, proportion of explained variance.
Part of the problem is a ceiling effect when the correlation approaches either –1 or +1. If the value of the population parameter is, say, 0.8, the sample values can exceed 0.8 only up to 1.0, but they can be less than 0.8 all the way to –1.0. The maximum value of 1.0 acts like a ceiling, keeping the sample values from varying as much above 0.8 as below it, and the result is a skewed distribution. When the population parameter is hypothesized to be zero, however, the ceiling effects are equal, and the sample values are approximately distributed according to the t distribution, which can be used to test the hypothesis that the true value of the population parameter ρ is equal to zero. The following mathematical expression involving the correlation coefficient, often called the t ratio, has been found to have a t distribution with n – 2 degrees of freedom:

t = r√(n – 2) / √(1 – r^{2})
Let us use this t ratio to test whether the observed value of r = 0.73 is sufficient evidence with 655 observations to conclude that the true population value of the correlation ρ is different from zero.
Step 1: H_{0}: No relationship exists between BMI and percent body fat; or, the true correlation is zero: ρ = 0
H_{1}: A relationship does exist between BMI and percent body fat; or, the true correlation is not zero: ρ ≠ 0.
Step 2: Because the null hypothesis is a test of whether ρ is zero, the t ratio may be used when the assumptions for correlation (see the section titled, “Assumptions in Correlation”) are met.
Step 3: Let us use α = 0.01 for this example.
Step 4: The degrees of freedom are n – 2 = 655 – 2 = 653. The value of a t distribution with 653 degrees of freedom that divides the area into the central 99% and the upper and lower 1% is approximately 2.617 (using the value for 120 df in Table A–3). We therefore reject the null hypothesis of zero correlation if (the absolute value of) the observed value of t is greater than 2.617.
Step 5: The calculation is

t = 0.73√(655 – 2) / √(1 – (0.73)^{2}) = 27.29
Step 6: The observed value of the t ratio with 653 degrees of freedom is 27.29, far greater than 2.617. The null hypothesis of zero correlation is therefore rejected, and we conclude that the relationship between BMI and percent body fat is large enough to conclude that these two variables are associated.
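The arithmetic in Steps 5 and 6 is easy to verify with a short Python sketch, using the values of r and n given in the text:

```python
import math

r, n = 0.73, 655          # correlation and sample size from the text
df = n - 2                # degrees of freedom for the t ratio

# t ratio for testing H0: rho = 0
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
print(round(t, 2))        # approximately 27.29, as in Step 6
```

Comparing t with the critical value 2.617 reproduces the decision to reject the null hypothesis.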
Fisher's z Transformation to Test the Correlation
Investigators generally want to know whether ρ = 0, and this test can easily be done with computer programs. Occasionally, however, interest lies in whether the correlation is equal to a specific value other than zero. For example, consider a diagnostic test that gives accurate numerical values but is invasive and somewhat risky for the patient. If someone develops an alternative testing procedure, it is important to show that the new procedure is as accurate as the test in current use. The approach is to select a sample of patients and perform both the current test and the new procedure on each patient and then calculate the correlation coefficient between the two testing procedures.
Either a test of hypothesis can be performed to show that the correlation is greater than a given value, or a confidence interval about the observed correlation can be calculated. In either case, we use a procedure called Fisher's z transformation to test any null hypothesis about the correlation as well as to form confidence intervals.
To use Fisher's z transformation, we first transform the correlation and then use the standard normal (z) distribution. We need to transform the correlation because, as we mentioned earlier, the distribution of sample values of the correlation is skewed when ρ ≠ 0. Although this method is a bit complicated, it is actually more flexible than the t test, because it permits us to test any null hypothesis, not simply that the correlation is zero. Fisher's z transformation was proposed by the same statistician (Ronald Fisher) who developed Fisher's exact test for 2 × 2 contingency tables (discussed in Chapter 6).
Fisher's z transformation is

z(r) = (1/2) ln[(1 + r)/(1 – r)]

where ln represents the natural logarithm. Table A–6 gives the z transformation for different values of r, so we do not actually need to use the formula. With moderate-sized samples, this transformation follows a normal distribution, and the following expression for the z test can be used:

z = [z(r) – z(ρ)] / [1/√(n – 3)]
To illustrate Fisher's z transformation for testing the significance of ρ, we evaluate the relationship between BMI and percent body fat (Jackson et al, 2002). The observed correlation between these two measures was 0.73. Jackson and his colleagues may have expected a sizable correlation between these two measures; let us suppose they want to know whether the correlation is significantly greater than 0.65. A one-tailed test of the null hypothesis that ρ ≤ 0.65, which they hope to reject, may be carried out as follows.
Step 1: H_{0}: The relationship between BMI and percent body fat is ≤0.65; or, the true correlation ρ ≤ 0.65
H_{1}: The relationship between BMI and percent body fat is >0.65; or, the true correlation ρ > 0.65.
Step 2: Fisher's z transformation may be used with the correlation coefficient to test any hypothesis.
Step 3: Let us again use α = 0.01 for this example.
Step 4: The alternative hypothesis specifies a one-tailed test. The value of the z distribution that divides the area into the lower 99% and the upper 1% is approximately 2.326 (from Table A–2). We therefore reject the null hypothesis that the correlation is ≤ 0.65 if the observed value of z is > 2.326.
Step 5: The first step is to find the transformed values for r = 0.73 and ρ = 0.65 from Table A–6; these values are 0.929 and 0.775, respectively. Then, the calculation for the z test is

z = (0.929 – 0.775) / (1/√(655 – 3)) = (0.154)(25.53) = 3.93
Step 6: The observed value of the z statistic, 3.93, exceeds 2.326. The null hypothesis that the correlation is 0.65 or less is rejected, and the investigators can be assured that the relationship between BMI and body fat is greater than 0.65.
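This z test can be sketched in Python. The library function `math.atanh` computes exactly Fisher's z transformation, (1/2) ln[(1 + r)/(1 – r)], so the result differs from the table-based 3.93 only by rounding:

```python
import math

r, rho0, n = 0.73, 0.65, 655   # observed r, hypothesized rho, sample size

z_r = math.atanh(r)            # Fisher's z transformation of r (about 0.929)
z_rho0 = math.atanh(rho0)      # transformation of the hypothesized value (about 0.775)

# z test statistic: difference of transformed values divided by 1/sqrt(n - 3)
z = (z_r - z_rho0) * math.sqrt(n - 3)
print(round(z, 2))
```

Because z exceeds the one-tailed critical value of 2.326, the conclusion matches Step 6.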
Confidence Interval for the Correlation
A major advantage of Fisher's z transformation is that confidence intervals can be formed. The transformed value of the correlation is used to calculate confidence limits in the usual manner, and then they are transformed back to values corresponding to the correlation coefficient.
To illustrate, we calculate a 95% confidence interval for the correlation coefficient 0.73 in Jackson and colleagues (2002). We use Fisher's z transformation of 0.73, which is 0.929, and the z distribution in Table A–2 to find the critical value for 95%. The confidence interval is

z(r) ± 1.96 (1/√(n – 3)) = 0.929 ± 1.96/√652 = 0.929 ± 0.077, or 0.852 to 1.006
Transforming the limits 0.852 and 1.006 back to correlations using Table A–6 in reverse gives approximately r = 0.69 and r = 0.77 (using conservative values). Therefore, we are 95% confident that the true value of the correlation in the population is contained within this interval. Note that 0.65 is not in this interval, which is consistent with our conclusion that the observed correlation of 0.73 is different from 0.65.
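The same interval can be computed directly; `math.tanh` inverts Fisher's transformation, playing the role of Table A–6 read in reverse. Exact arithmetic gives an upper limit near 0.76 rather than the conservative table value 0.77:

```python
import math

r, n, z_crit = 0.73, 655, 1.96   # observed correlation, sample size, 95% critical value

center = math.atanh(r)           # transformed correlation, about 0.929
half = z_crit / math.sqrt(n - 3) # half-width of the interval on the z scale

# Transform the limits back to the correlation scale
lo = math.tanh(center - half)
hi = math.tanh(center + half)
print(round(lo, 2), round(hi, 2))
```

As in the text, the hypothesized value 0.65 falls below the lower limit, consistent with rejecting ρ ≤ 0.65.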
Surprisingly, computer programs do not always contain routines for finding confidence limits for a correlation. We have included a Microsoft Excel program in the Calculations folder that calculates the 95% CI for a correlation.
Assumptions in Correlation
The assumptions needed to draw valid conclusions about the correlation coefficient are that the sample was randomly selected and the two variables, X and Y, vary together in a joint distribution that is normally distributed, called the bivariate normal distribution. Just because each variable is normally distributed when examined separately, however, does not guarantee that, jointly, they have a bivariate normal distribution. Some guidance is available: If either of the two variables is not normally distributed, Pearson's product moment correlation coefficient is not the most appropriate method. Instead, either one or both of the variables may be transformed so that they more closely follow a normal distribution, as discussed in Chapter 5, or the Spearman rank correlation may be calculated. This topic is discussed in the section titled, “Other Measures of Correlation.”
COMPARING TWO CORRELATION COEFFICIENTS
On occasion, investigators want to know if a difference exists between two correlation coefficients. Here are two specific instances: (1) comparing the correlations between the same two variables that have been measured in two independent groups of subjects and (2) comparing two correlations that involve a variable in common in the same group of individuals. These situations are not extremely common, and routines for them are not always included in statistical programs. We designed Microsoft Excel programs; see the folder “Calculations” on the CD-ROM.
Comparing Correlations in Two Independent Groups
Fisher's z transformation can be used to test hypotheses or form confidence intervals about the difference between the correlations between the same two variables in two independent groups; such correlations are called independent correlations. For example, Gonzalo and colleagues (1996) in Presenting Problem 4 wanted to compare the correlation between BMI and insulin sensitivity in the 14 hyperthyroid women (r = 0.775) with the correlation between BMI and insulin sensitivity in the 19 control women (r = 0.456). See Figure 8-4.
In this situation, the value for the second group replaces z(ρ) in the numerator for the z test shown in the previous section, and 1/(n – 3) is found for each group and added before taking the square root in the denominator. The test statistic is

z = [z(r_{1}) – z(r_{2})] / √[1/(n_{1} – 3) + 1/(n_{2} – 3)]
To illustrate, the values of z from Fisher's z transformation tables (A–6) for 0.775 and 0.456 are approximately 1.033 and 0.492 (with interpolation), respectively. Note that Fisher's z transformation is the same, regardless of whether the correlation is positive or negative. Using these values, we obtain

z = (1.033 – 0.492) / √(1/11 + 1/16) = 0.541/0.392 = 1.38
Figure 8-4. Scatterplot of BMI and insulin sensitivity. (Data, used with permission, from Gonzalo MA, Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et al: Glucose tolerance, insulin secretion, insulin sensitivity and glucose effectiveness in normal and overweight hyperthyroid women. Clin Endocrinol (Oxf) 1996; 45: 689–697. Output produced using NCSS; used with permission.)
Assuming we choose the traditional significance level of 0.05, the value of the test statistic, 1.38, is less than the critical value, 1.96, so we do not reject the null hypothesis of equal correlations. We decide that the evidence is insufficient to conclude that the relationship between BMI and insulin sensitivity is different for hyperthyroid women from that for controls. What is a possible explanation for the lack of statistical significance? It is possible that there is no difference in the relationships between these two variables in the population. When sample sizes are small, however, as they are in this study, it is always advisable to keep in mind that the study may have low power.
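A short Python sketch of this comparison, using exact values of Fisher's z rather than the interpolated table values, gives the same test statistic:

```python
import math

r1, n1 = 0.775, 14   # hyperthyroid women
r2, n2 = 0.456, 19   # control women

# Difference of transformed correlations over the pooled standard error
z = (math.atanh(r1) - math.atanh(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
print(round(z, 2))   # about 1.38, below the two-tailed critical value 1.96
```

With samples this small, the pooled standard error is large, which is why a difference in correlations of about 0.32 is still not statistically significant.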
Comparing Correlations with Variables in Common in the Same Group
The second situation occurs when the research question involves correlations that contain the same variable (also called dependent correlations). For example, a very natural question for Nesselroad and colleagues (1996) was whether one of the finger devices was more highly correlated with the blood pressure cuff—considered to be the gold standard—than the other two. If so, this would be a product they might wish to recommend for patients to use at home. To illustrate, we compare the diastolic reading with device 1 and the cuff (r_{XY} = 0.32) to the diastolic reading with device 2 and the cuff (r_{XZ} = 0.45).
There are several formulas for testing the difference between two dependent correlations. We present the simplest one, developed by Hotelling (1940) and described by Glass and Stanley (1970) on pages 310–311 of their book. We will show the calculations for this example but, as always, suggest that you use a computer program. The formula follows the t distribution with n – 3 degrees of freedom; it looks rather forbidding and requires the calculation of several correlations:

t = (r_{XY} – r_{XZ}) √[(n – 3)(1 + r_{YZ})] / √[2(1 – r_{XY}^{2} – r_{XZ}^{2} – r_{YZ}^{2} + 2 r_{XY} r_{XZ} r_{YZ})]
We designate the cuff reading as X, device 1 as Y, and device 2 as Z. We therefore want to compare r_{XY} with r_{XZ}. Both correlations involve the X, or cuff, reading, so these correlations are dependent. To use the formula, we also need to calculate the correlation between device 1 and device 2, which is r_{YZ} = 0.54. Table 81 shows the correlations needed for this formula.
The calculations are

t = (0.45 – 0.32) √[(100 – 3)(1 + 0.54)] / √[2(1 – 0.32^{2} – 0.45^{2} – 0.54^{2} + 2(0.32)(0.45)(0.54))] = 1.59/1.06 = 1.50
Table 8-1. Correlation matrix of diastolic blood pressures in all 100 subjects.


You know by now that the difference between these two correlations is not statistically significant: the observed value of t, 1.50, is less than the critical value of t with 97 degrees of freedom, 1.99. This conclusion corresponds to that by Nesselroad and his colleagues in which they recommended that patients be cautioned that the finger blood pressure devices may not perform as marketed.
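Hotelling's t for two dependent correlations can be verified directly from the three correlations in the correlation matrix:

```python
import math

# Correlations from the text: cuff (X) with device 1 (Y), cuff with
# device 2 (Z), and device 1 with device 2; n subjects.
r_xy, r_xz, r_yz, n = 0.32, 0.45, 0.54, 100

# Hotelling's test for two correlations that share the variable X
num = abs(r_xy - r_xz) * math.sqrt((n - 3) * (1 + r_yz))
det = 1 - r_xy**2 - r_xz**2 - r_yz**2 + 2 * r_xy * r_xz * r_yz
t = num / math.sqrt(2 * det)
print(round(t, 2))   # about 1.50, with n - 3 = 97 degrees of freedom
```

The quantity `det` is the determinant of the 3 × 3 correlation matrix, which is why all three pairwise correlations are needed.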
We designed a Microsoft Excel program for these calculations as well. It is included on the CD-ROM in a folder called “Calculations” and is entitled “z for 2 dept r's.”
OTHER MEASURES OF CORRELATION
Several other measures of correlation are often found in the medical literature. Spearman's rho, the rank correlation introduced in Chapter 3, is used with ordinal data or in situations in which the numerical variables are not normally distributed. When a research question involves one numerical and one nominal variable, a correlation called the point-biserial correlation is used. With nominal data, the risk ratio, or kappa (κ), discussed in Chapter 5, can be used.
Spearman's Rho
Recall that the value of the correlation coefficient is markedly influenced by extreme values and thus does not provide a good description of the relationship between two variables when their distributions are skewed or contain outlying values. For example, consider the relationships among the various finger devices and the standard cuff device for measuring blood pressure from Presenting Problem 2. To illustrate, we use the first 25 subjects from this study, listed in Table 8-2 (see the file entitled “Nesselroad25”).
It is difficult to tell if the observations are normally distributed without looking at graphs of the data. Some statistical programs have routines to plot values against a normal distribution to help researchers decide whether a nonparametric procedure should be used. A normal probability plot for the cuff diastolic measurement is given in Figure 8-5. Use the CD-ROM to produce similar plots for the finger device measurements.
When the observations are plotted on a graph, as in Figure 8-5, it appears that the data are not unduly skewed. This conclusion is consistent with the tests for normality of a distribution given by NCSS. In the normal probability plot, if observations fall within the curved lines, the data can be assumed to be normally distributed.
Table 8-2. Data on diastolic blood pressure for the first 25 subjects.


As we indicated in Chapter 3, a simple method for dealing with the problem of extreme observations in correlation is to rank order the data and then recalculate the correlation on ranks to obtain the nonparametric correlation called Spearman's rho, or rank correlation. To illustrate this procedure, we continue to use data on the first 25 subjects in the study on blood pressure devices (Presenting Problem 2), even though the distribution of the values does not require this procedure. Let us focus on the correlation between the cuff and device 2, which we learned was 0.45 in the section titled, “Comparing Correlations with Variables in Common in the Same Group.”
Table 8-3 illustrates the ranks of the diastolic readings on the first 25 subjects. Note that each variable is ranked separately; when ties occur, the average of the ranks of the tied values is used.
Figure 8-5. Diastolic blood pressure using cuff readings in 25 subjects. (Data, used with permission, from Nesselroad JM, Flacco VA, Phillips DM, Kruse J: Accuracy of automated finger blood pressure devices. Fam Med 1996; 28: 189–192. Output produced using NCSS; used with permission.)
The ranks of the variables are used in the equation for the correlation coefficient, and the resulting calculation gives Spearman's rank correlation (r_{S}), also called Spearman's rho:

r_{S} = Σ(R_{X} – R̄_{X})(R_{Y} – R̄_{Y}) / √[Σ(R_{X} – R̄_{X})^{2} Σ(R_{Y} – R̄_{Y})^{2}]

where R_{X} is the rank of the X variable, R_{Y} is the rank of the Y variable, and R̄_{X} and R̄_{Y} are the mean ranks for the X and Y variables, respectively. The rank correlation r_{S} may also be calculated by using other formulas, but this approximate procedure is quite good (Conover and Iman, 1981).
Calculating r_{S} for the ranked observations in Table 8-3 gives r_{S} = 0.33.
The value of r_{S} is smaller than the value of Pearson's correlation; this may occur when the bivariate distribution of the two variables is not normal. The t test, as illustrated for the Pearson correlation, can be used to determine whether the Spearman rank correlation is significantly different from zero. For example, the following procedure tests whether the value of Spearman's rho in the population, symbolized ρ_{S} (Greek letter rho with subscript S denoting Spearman) differs from zero.
Table 8-3. Rank order of the diastolic blood pressure for the first 25 subjects.


Step 1: H_{0}: The population value of Spearman's rho is zero; that is, ρ_{S} = 0
H_{1}: The population value of Spearman's rho is not zero; that is, ρ_{S} ≠ 0.
Step 2: Because the null hypothesis is a test of whether ρ_{S} is zero, the t ratio may be used.
Step 3: Let us use α = 0.05 for this example.
Step 4: The degrees of freedom are n – 2 = 25 – 2 = 23. The value of the t distribution with 23 degrees of freedom that divides the area into the central 95% and the upper and lower 2½% is 2.069 (Table A–3), so we will reject the null hypothesis if (the absolute value of) the observed value of t is greater than 2.069.
Step 5: The calculation is

t = 0.33√(25 – 2) / √(1 – (0.33)^{2}) = 1.677
Step 6: The observed value of the t ratio with 23 degrees of freedom is 1.677, less than 2.069, so we do not reject the null hypothesis and conclude there is insufficient evidence that a nonparametric correlation exists between the diastolic pressure measurements made by the cuff and finger device 2.
Of course, if investigators want to test only whether Spearman's rho is greater than zero—that there is a significantly positive relationship—they can use a one-tailed test. For a one-tailed test with α = 0.05 and 23 degrees of freedom, the critical value is 1.714, and the conclusion is the same.
It is easy to demonstrate that performing the abovementioned test on ranked data gives approximately the same results as Spearman's rho calculated the traditional way. We just used the Pearson formula on ranks and found that Spearman's rho for the sample of 25 subjects was 0.33 between the cuff measurement of diastolic pressure and finger device 2. Use the CD-ROM, and calculate Spearman's rho on the original data. You should also find 0.33 using the traditional methods of calculation.
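The equivalence is easy to demonstrate with any data set: Spearman's rho equals Pearson's r computed on the (tie-averaged) ranks. The paired readings below are hypothetical, not the Nesselroad data:

```python
import numpy as np
from scipy import stats

# Hypothetical paired diastolic readings (mm Hg) -- illustration only.
cuff   = np.array([78, 85, 92, 70, 88, 95, 80, 74, 90, 83])
device = np.array([80, 82, 95, 72, 85, 99, 78, 77, 93, 80])

# Spearman's rho computed directly...
rho, _ = stats.spearmanr(cuff, device)

# ...equals Pearson's r applied to the ranks, with ties given averaged ranks.
r_on_ranks, _ = stats.pearsonr(stats.rankdata(cuff), stats.rankdata(device))
print(round(rho, 3), round(r_on_ranks, 3))
```

The two printed values are identical, which is exactly the point made in the text.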
To summarize, Spearman's rho is appropriate when investigators want to measure the relationship between: (1) two ordinal variables, or (2) two numerical variables when one or both are not normally distributed and investigators choose not to use a data transformation (such as taking the logarithm). Spearman's rank correlation is especially appropriate when outlying values occur among the observations.
Confidence Interval for the Odds Ratio & the Relative Risk
Chapter 3 introduced the relative risk (or risk ratio) and the odds ratio as measures of relationship between two nominal characteristics. Developed by epidemiologists, these statistics are used for studies examining risks that may result in disease. To discuss the odds ratio, recall the study discussed in Chapter 3 by Ballard and colleagues (1998) that examined the use of antenatal thyrotropin-releasing hormone (TRH). Data from this study were given in Chapter 3, Table 3–21. We calculated the odds ratio as 1.1, meaning that an infant in the TRH group is 1.1 times more likely to develop respiratory distress syndrome than an infant in the placebo group. This finding is the opposite of what the investigators expected to find, and it is important to learn whether the increased risk is statistically significant.
Significance can be determined in several ways. For instance, to test the significance of the relationship between treatment (TRH versus placebo) and the development of respiratory distress syndrome, investigators may use the chi-square test discussed in Chapter 6. The chi-square test for this example is left as an exercise (see Exercise 2). An alternative chi-square test, based on the natural logarithm of the odds ratio, is also available, and it results in values close to the chi-square test illustrated in Chapter 6 (Fleiss, 1999).
More often, articles in the medical literature use confidence intervals for risk ratios or odds ratios. Ballard and colleagues reported a 95% confidence interval for the odds ratio as (0.8 to 1.5). Let us see how they found this confidence interval.
Finding confidence intervals for odds ratios is a bit more complicated than usual because these ratios are not normally distributed, so calculations require finding natural logarithms and antilogarithms. The formula for a 95% confidence interval for the odds ratio is

exp[ ln(OR) ± 1.96 √( 1/a + 1/b + 1/c + 1/d ) ]
where exp denotes the exponential function, or antilogarithm, of the natural logarithm, ln, and a, b, c, d are the cells in a 2 × 2 table (see Table 6–9 in Chapter 6). The confidence interval for the odds ratio for risk of respiratory distress syndrome in infants who were given TRH from Table 3–21 is
This interval contains the value of the true odds ratio with 95% confidence. If the odds are the same in each group, the value of the odds ratio is approximately 1, indicating similar risks in the two groups. Because the interval contains 1, we cannot rule out the possibility that the true odds ratio is 1; that is, insufficient evidence exists to conclude that the risk of respiratory distress increases in infants who received TRH. By the same logic, there is no evidence that the treatment has a protective effect. Of course, 90% or 99% confidence intervals can be formed by using 1.645 or 2.575 instead of 1.96 in the preceding equation.
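The confidence interval formula above translates directly into code. The Python sketch below uses hypothetical cell counts for illustration (the actual counts are in Table 3–21, not reproduced here); a, b, c, d follow the 2 × 2 layout of Table 6–9.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) and its confidence interval, computed on the
    log scale: exp[ln(OR) +/- z * sqrt(1/a + 1/b + 1/c + 1/d)]."""
    oddsratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(oddsratio) - z * se)
    upper = math.exp(math.log(oddsratio) + z * se)
    return oddsratio, lower, upper

# Hypothetical balanced 2 x 2 table: OR = 1 and the interval contains 1,
# so such data would be consistent with equal odds in the two groups.
example = odds_ratio_ci(10, 10, 10, 10)
```

Substituting 1.645 or 2.575 for the `z` argument gives the 90% or 99% interval, as noted in the text.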
To illustrate the confidence interval for the relative risk, we refer to the Physicians' Health Study (Steering Committee of the Physicians' Health Study Research Group, 1989) summarized in Chapter 3 and Table 3–19. Recall that the relative risk for an MI in physicians taking aspirin was 0.581. The 95% confidence interval for the true value of the relative risk also involves logarithms:
Again, the values for a, b, c, d are the cells in the 2 × 2 table illustrated in Table 6–9. Although it is possible to include a continuity correction for the relative risk or odds ratio, it is not commonly done. Substituting values from Table 3–19, the 95% confidence interval for a relative risk of 0.581 is
The 95% confidence interval does not contain 1, so the evidence indicates that the use of aspirin resulted in a reduced risk for MI. For a detailed and insightful discussion of the odds ratio and its advantages and disadvantages, see Feinstein (1985, Chapter 20) and Fleiss (1999, Chapter 5); for a discussion of the odds ratio and the risk ratio, see Greenberg and coworkers (2002, Chapters 8 and 9).
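A similar sketch works for the relative risk. The cell counts used below (139 MIs among 11,037 physicians taking aspirin, 239 among 11,034 taking placebo) are the figures widely quoted for this study; treat them as an assumption to be checked against Table 3–19.

```python
import math

def relative_risk_ci(a, b, c, d, z=1.96):
    """Relative risk and its CI from a 2 x 2 table (a = exposed cases,
    b = exposed non-cases, c = unexposed cases, d = unexposed non-cases).
    CI is exp[ln(RR) +/- z * sqrt(1/a - 1/(a+b) + 1/c - 1/(c+d))]."""
    rr = (a / (a + b)) / (c / (c + d))
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return rr, math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se)

# Physicians' Health Study counts (assumed; compare with Table 3-19)
rr, lower, upper = relative_risk_ci(139, 11037 - 139, 239, 11034 - 239)
```

With these counts the whole interval lies below 1, matching the conclusion that aspirin reduced the risk of MI.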
The folder containing Microsoft Excel equations on the CD-ROM describes two routines for finding the 95% confidence limits; they are called “CI for OR” and “CI for RR.” You may find these routines helpful if you wish to find 95% confidence limits for odds ratios or relative risks for published studies that contain the summary data for these statistics.
Measuring Relationships in Other Situations
We have discussed how to measure and test the significance of relationships by using Pearson's product moment correlation coefficient, Spearman's nonparametric procedure based on ranks, and risk or odds ratios. Not all situations are covered by these procedures, however, such as when one variable is measured on a nominal scale and the other is numerical but has been classified into categories, when one variable is nominal and the other is ordinal, or when both are ordinal but only a few categories occur. In these cases, a contingency table is formed and the chi-square test is used, as illustrated in Chapters 6 and 7.
On other occasions, the numerical variable is not collapsed into categories. For example, Hodgson and Cutler (1997) studied 25 subjects who had a living parent with Alzheimer's disease and a matched group who had no family history of dementia. Subjects answered questions about their concern about developing Alzheimer's disease and completed a questionnaire designed to evaluate their concerns about memory, the Memory Assessment Index (MAI). Data are given in Table 8–4.
The investigators were interested in the relationship between life satisfaction and performance on the MAI. Life satisfaction was measured as yes or no, and the MAI was measured on a scale from 0 = no memory problems to 12 = negative perceptions of memory and concern about developing dementia. When one variable is binary and the other is numerical, it is possible to evaluate the relationship using a special correlation, called the point–biserial correlation. If the binary variable is coded as 0 and 1, the Pearson correlation procedure can be used to find the point–biserial correlation. Box 8–1A gives the results of the correlation procedure using life satisfaction and MAI. The correlation is 0.37, and the P value is 0.008633.
Did you wonder why a t test was not used to see if a difference existed in mean MAI for those who were satisfied with their life versus those who were not satisfied? If so, you are right on target because a t test is another way to look at the research question. It simply depends on whether interest focuses on a relationship or a difference. What do you think the results of a t test would show? The output from the NCSS t test procedure is given in Box 8–1B. Of special interest is the P value (0.008633); it is the same as for the correlation. This illustrates an important principle: The point–biserial correlation between a binary variable and a numerical variable has the same level of significance as does a t test in which the groups are defined by the binary variable.
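This principle can be verified numerically. In the sketch below, using made-up data rather than the Hodgson and Cutler observations, the t statistic computed from the point–biserial correlation equals the pooled two-sample t statistic, so the P values are necessarily identical as well.

```python
def pearson(x, y):
    """Pearson correlation; with x coded 0/1 this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def t_from_r(r, n):
    """t statistic for a correlation coefficient (n - 2 df)."""
    return r * ((n - 2) / (1 - r ** 2)) ** 0.5

def t_pooled(y0, y1):
    """Two-sample t statistic with pooled variance."""
    n0, n1 = len(y0), len(y1)
    m0, m1 = sum(y0) / n0, sum(y1) / n1
    ss0 = sum((v - m0) ** 2 for v in y0)
    ss1 = sum((v - m1) ** 2 for v in y1)
    sp = ((ss0 + ss1) / (n0 + n1 - 2)) ** 0.5
    return (m1 - m0) / (sp * (1 / n0 + 1 / n1) ** 0.5)

# Made-up data: binary group membership and a numerical score;
# t_from_r(pearson(group, score), n) equals t_pooled on the two groups.
group = [0, 0, 0, 0, 1, 1, 1, 1]
score = [1, 2, 3, 4, 5, 6, 7, 8]
r = pearson(group, score)
```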
The point–biserial correlation is often used by test developers to help evaluate the questions on the test.
For example, the National Board of Medical Examiners determines the point–biserial correlation between whether examinees get an item correct (a binary variable) and the examinee's score on the entire exam (a numerical variable). A positive point–biserial indicates that examinees who answer the question correctly tend to score high on the exam as a whole, whereas examinees missing the question tend to score low generally. Similarly, a negative point–biserial correlation indicates that examinees who answer the question correctly tend to score low on the exam—certainly not a desirable situation. It may be that the question is tricky or poorly worded because the better examinees are more likely to miss the question; you can see why this statistic is useful for test developers.
Table 8–4. Data on 50 subjects in the study on anticipatory dementia.


LINEAR REGRESSION
Remember that when the goal is to predict the value of one characteristic from knowledge of another, the statistical method used is regression analysis. This method is also called linear regression, simple linear regression, or least squares regression. A brief review of the history of these terms is interesting and sheds some light on the nature of regression analysis.
The concepts of correlation and regression were developed by Sir Francis Galton, a cousin of Charles Darwin, who studied both mathematics and medicine in the mid-19th century (Walker, 1931). Galton was interested in heredity and wanted to understand why a population remains more or less the same over many generations, with the “average” offspring resembling their parents; that is, why do successive generations not become more diverse? By growing sweet peas and observing the average size of seeds from parent plants of different sizes, he discovered regression, which he termed the “tendency of the ideal mean filial type to depart from the parental type, reverting to what may be roughly and perhaps fairly described as the average ancestral type.” This phenomenon is more typically known as regression toward the mean. The term “correlation” was used by Galton in his work on inheritance in terms of the “co-relation” between such characteristics as heights of fathers and sons. The mathematician Karl Pearson went on to work out the theory of correlation and regression, and the correlation coefficient is named after him for this reason.
Box 8–1. CORRELATION AND T TEST FOR LIFE SATISFACTION AND ANTICIPATORY DEMENTIA AS MEASURED BY MAI.
A. Correlation Matrix 



B. t Test 


C. Box Plot
Source: Data, used with permission, from Hodgson LG, Cutler SJ: Anticipatory dementia and wellbeing. Am J Alzheimer's Dis 1997; 12: 62–66. Output produced using NCSS; used with permission.
The term linear regression refers to the fact that correlation and regression measure only a straight-line, or linear, relationship between two variables. The term “simple regression” means that only one explanatory (independent) variable is used to predict an outcome. In multiple regression, more than one independent variable is included in the prediction equation.
Least squares regression describes the mathematical method for obtaining the regression equation. The important thing to remember is that when the term “regression” is used alone, it generally means linear regression based on the least squares method. The concept behind least squares regression is described in the next section and its application is discussed in the section after that.
Least Squares Method
Several times previously in this text, we mentioned the linear nature of the pattern of points in a scatterplot. For example, in Figure 8–2, a straight line can be drawn through the points representing the values of BMI and percent body fat to indicate the direction of the relationship. The least squares method is a way to determine the equation of the line that provides a good fit to the points.
To illustrate the method, consider the straight line in Figure 8–6. Elementary geometry can be used to determine the equation for any straight line. If the point where the line crosses, or intercepts, the Y-axis is denoted by a and the slope of the line by b, then the equation is

Y = a + bX
The slope of the line measures the amount Y changes each time X changes by 1 unit. If the slope is positive, Y increases as X increases; if the slope is negative, Y decreases as X increases. In the regression model, the slope in the population is generally symbolized by β_{1}, called the regression coefficient; and β_{0} denotes the intercept of the regression line; that is, β_{1} and β_{0} are the population parameters in regression. In most applications, the points do not fall exactly along a straight line. For this reason, the regression model contains an error term, e, which is the distance the actual values of Y depart from the regression line. Putting all this together, the regression equation is given by

Y = β_{0} + β_{1}X + e
When the regression equation is used to describe the relationship in the sample, it is often written as

Y′ = a + bX
Figure 8–6. Geometric interpretation of a regression line.
For a given value of X, say X*, the predicted value of Y, denoted Y*′, is found by extending a horizontal line from the regression line to the Y-axis, as in Figure 8–7. The difference between the actual value Y* and the predicted value, e* = Y* – Y*′, can be used to judge how well the line fits the data points. The least squares method determines the line that minimizes the sum of the squared vertical differences between the actual and predicted values of the Y variable; that is, a and b are determined so that Σ (Y – Y′)^{2} is minimized. In terms of the sample statistics, the formulas are

b = Σ (X – X̅)(Y – Ȳ) / Σ (X – X̅)^{2} and a = Ȳ – bX̅
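The least squares estimates are simple enough to compute by hand or in a short script. The helper below is an illustrative sketch of the standard formulas, not the NCSS routine used elsewhere in this chapter.

```python
def least_squares(x, y):
    """Least squares estimates for Y' = a + bX:
    b = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2),  a = Ybar - b * Xbar."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b
```

For data lying exactly on the line Y = 1 + 2X, the function recovers a = 1 and b = 2.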
Figure 8–7. Least squares regression line.
Calculating the Regression Equation
In the study described in Presenting Problem 4, the investigators wanted to predict insulin sensitivity from BMI in a group of women. Original observations were given in Chapter 7, Table 7–8. For now we ignore the different groups of women and examine the entire sample regardless of thyroid and weight levels.
Figure 8–8. Scatterplot of observations on body mass index and insulin sensitivity. (Data, used with permission, from Gonzalo MA, Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et al: Glucose tolerance, insulin secretion, insulin sensitivity and glucose effectiveness in normal and overweight hyperthyroid women. Clin Endocrinol (Oxf) 1996; 45: 689–697. Output produced using NCSS; used with permission.)
Before calculating the regression equation for these data, let us create a scatterplot and practice “guesstimating” the value of the correlation coefficient from the plot (although it is difficult to estimate the size of r accurately when the sample size is small). Figure 8–8 is a scatterplot with BMI score as the explanatory X variable and insulin sensitivity as the response Y variable. How large do you think the correlation is?
If we knew the correlation between BMI and insulin sensitivity, we could use it to calculate the regression equation. Because we do not, we assume the needed terms have been calculated; they are
Then,
In this example, the insulin sensitivity scores are said to be regressed on BMI scores, and the regression equation is written as Y′ = 1.5817 – 0.0433X, where Y′ is the predicted insulin sensitivity score, and X is the BMI.
Figure 8–9 illustrates the regression line drawn through the observations. The regression equation has a positive intercept of +1.58, so that theoretically a patient with a BMI of zero would have an insulin sensitivity of 1.58, even though, in the present example, a zero BMI is not possible. The slope of –0.043 indicates that each time a woman's BMI increases by 1, her predicted insulin sensitivity decreases by approximately 0.043. For example, as the BMI increases from 20 to 30, predicted insulin sensitivity decreases from about 0.72 to about 0.28. Whether the relationship between BMI and insulin sensitivity is significant is discussed in the next section.
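Prediction from the fitted equation is a one-line calculation; the sketch below simply encodes the equation reported in the text.

```python
def predicted_insulin_sensitivity(bmi):
    """Predicted insulin sensitivity from the fitted line Y' = 1.5817 - 0.0433 * BMI."""
    return 1.5817 - 0.0433 * bmi
```

For example, `predicted_insulin_sensitivity(20)` is about 0.72 and `predicted_insulin_sensitivity(30)` is about 0.28.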
Assumptions & Inferences in Regression
In the previous section, we worked with a sample of observations instead of the population of observations. Just as the sample mean X̅ is an estimate of the population mean μ, the regression line determined from the formulas for a and b in the previous section is an estimate of the regression equation for the underlying population.
Figure 8–9. Regression of observations on body mass index and insulin sensitivity. (Data, used with permission, from Gonzalo MA, Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et al: Glucose tolerance, insulin secretion, insulin sensitivity and glucose effectiveness in normal and overweight hyperthyroid women. Clin Endocrinol (Oxf) 1996; 45: 689–697. Output produced using NCSS; used with permission.)
As in Chapters 6 and 7, in which we used statistical tests to determine how likely it was that the observed differences between two means occurred by chance, in regression analysis we must perform statistical tests to determine the likelihood of any observed relationship between X and Y variables. Again, the question can be approached in two ways: using hypothesis tests or forming confidence intervals. Before discussing these approaches, however, we briefly discuss the assumptions required in regression analysis.
If we are to use a regression equation, the observations must have certain properties. Thus, for each value of the X variable, the Y variable is assumed to have a normal distribution, and the mean of the distribution is assumed to be the predicted value, Y′. In addition, no matter the value of the X variable, the standard deviation of Y is assumed to be the same. These assumptions are rather like imagining a large number of individual normal distributions of the Y variable, all of the same size, one for each value of X. The assumption of this equal variation in the Y values across the entire range of the X values is called homogeneity, or homoscedasticity. It is analogous to the assumption of equal variances (homogeneous variances) in the t test for independent groups, as discussed in Chapter 6.
The straight-line, or linear, assumption requires that the mean values of Y corresponding to various values of X fall on a straight line. The values of Y are assumed to be independent of one another. This assumption is not met when repeated measurements are made on the same subjects; that is, a subject's measure at one time is not independent from the measure of that same subject at another time. Finally, as with other statistical procedures, we assume the observations constitute a random sample from the population of interest.
Regression is a robust procedure and may be used in many situations in which the assumptions are not met, as long as the measurements are fairly reliable and the correct regression model is used. (Other regression models are discussed in Chapter 10.) Meeting the regression assumptions generally causes fewer problems in experiments or clinical trials than in observational studies because reliability of the measurements tends to be greater in experimental studies. Special procedures can be used when the assumptions are seriously violated, however; and as in ANOVA, researchers should seek a statistician's advice before using regression if questions arise about its applicability.
The Standard Error of the Estimate
Regression lines, like other statistics, can vary. After all, the regression equation computed for any one sample of observations is only an estimate of the true population regression equation. If other samples are chosen from the population and a regression equation is calculated for each sample, these equations will vary from one sample to another with respect to both their slopes and their intercepts. An estimate of this variation is symbolized S_{Y·X} (read s of y given x) and is called the standard error of regression, or the standard error of the estimate. It is based on the squared deviations of the actual values of Y from the predicted values Y′ and is found as follows:

S_{Y·X} = √[ Σ (Y – Y′)^{2} / (n – 2) ]
The computation of this formula is quite tedious; and although more userfriendly computational forms exist, we assume that you will use a computer program to calculate the standard error of the estimate. In testing both the slope and the intercept, a t test can be used, and the standard error of the estimate is part of the formula. It is also used in determining confidence limits. To present these formulas and the logic involved in testing the slope and the intercept, we illustrate the test of hypothesis for the intercept and the calculation of a confidence interval for the slope, using the BMI–insulin sensitivity regression equation.
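Although, as noted above, a computer program is the practical choice, the formula is short enough to sketch. This illustrative version fits the least squares line itself and then computes S_{Y·X} from the residuals.

```python
import math

def std_error_of_estimate(x, y):
    """S_{Y.X} = sqrt(sum((Y - Y')^2) / (n - 2)), with Y' from the least squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))
```

For perfectly linear data the standard error of the estimate is zero; the more the points scatter about the line, the larger it becomes.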
Inference about the Intercept
To test the hypothesis that the intercept departs significantly from zero, we use the following procedure:
Step 1: H_{0}: β_{0} = 0 (The intercept is zero)
H_{1}: β_{0} ≠ 0 (The intercept is not zero)
Step 2: Because the null hypothesis is a test of whether the intercept is zero, the t ratio may be used if the assumptions are met. The t ratio uses the standard error of the estimate to calculate the standard error of the intercept (the denominator of the t ratio):

t = a / [ S_{Y·X} √( 1/n + X̅^{2}/Σ (X – X̅)^{2} ) ]
Step 3: Let us use α equal to 0.05.
Step 4: The degrees of freedom are n – 2 = 33 – 2 = 31. The value of the t distribution with 31 degrees of freedom that divides the area into the central 95% and the combined upper and lower 5% is approximately 2.040 (from Table A–3). We therefore reject the null hypothesis of a zero intercept if (the absolute value of) the observed value of t is greater than 2.040.
Step 5: The calculation follows; we used a spreadsheet (Microsoft Excel) to calculate S_{Y·X} = 0.256 and Σ(X – X̅)^{2} = 468.015:

t = 1.5817 / [ 0.256 √( 1/33 + (24.921)^{2}/468.015 ) ] = 1.5817/0.298 = 5.30
Step 6: The absolute value of the observed t ratio is 5.30, which is greater than 2.040. The null hypothesis of a zero intercept is therefore rejected. We conclude that the evidence is sufficient to show that the intercept is significantly different from zero for the regression of insulin sensitivity on BMI.
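The arithmetic of Steps 5 and 6 can be reproduced from the quantities reported in the text; small differences in the last digit simply reflect rounding of the intermediate values.

```python
import math

# Quantities reported in the text for insulin sensitivity regressed on BMI
a = 1.5817        # intercept
s_yx = 0.256      # standard error of the estimate
n = 33
mean_bmi = 24.921
ss_x = 468.015    # sum of squared deviations of BMI from its mean

se_intercept = s_yx * math.sqrt(1 / n + mean_bmi ** 2 / ss_x)
t = a / se_intercept  # about 5.3, well beyond the critical value 2.040
```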
As you know by now, it is also possible to form confidence limits for the intercept using the observed value and adding or subtracting the critical value from the t distribution multiplied by the standard error of the intercept.
Inferences about the Regression Coefficient
Instead of illustrating the hypothesis test for the population regression coefficient, let us find a 95% confidence interval for β_{1}. The interval is given by

b ± t × S_{Y·X}/√[ Σ (X – X̅)^{2} ] = –0.0433 ± 2.040 × 0.256/√468.015 = –0.0433 ± 0.0241
Because the interval excludes zero, we can be 95% confident that the regression coefficient is not zero but that it is between –0.0674 and –0.0192, or between about –0.07 and –0.02. Because the regression coefficient is significantly less than zero, can the correlation coefficient be equal to zero? (see Exercise 3.) The relationship between b and r illustrated earlier and Exercise 3 should convince you of the equivalence of the results obtained with testing the significance of correlation and the regression coefficient. In fact, authors in the medical literature often perform a regression analysis and then report the P values to indicate a significant correlation coefficient.
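The interval can be recomputed directly from the quantities reported in the text:

```python
import math

b = -0.0433       # regression coefficient (slope)
s_yx = 0.256      # standard error of the estimate
ss_x = 468.015    # sum of squared deviations of BMI from its mean
t_crit = 2.040    # t with 31 df for a 95% interval

se_b = s_yx / math.sqrt(ss_x)            # standard error of the slope
lower, upper = b - t_crit * se_b, b + t_crit * se_b
# lower and upper round to -0.0674 and -0.0192
```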
The output from the SPSS regression program is given in Table 8–5. The program produces the value of t and the associated P value, as well as 95% confidence limits. Do the results agree with those we found earlier? To become familiar with using regression, we suggest you replicate these results using the CD-ROM.
Predicting with the Regression Equation: Individual and Mean Values
One of the important reasons for obtaining a regression equation is to predict future values for a group of subjects (or for individual subjects). For example, a clinician may want to predict insulin sensitivity from BMI for a group of women with newly diagnosed diabetes. Or the clinician may wish to predict the sensitivity for a particular woman. In either case, the variability associated with the regression line must be reflected in the prediction. The 95% confidence interval for a predicted mean Y in a group of subjects is

Y′ ± t × S_{Y·X} √( 1/n + (X – X̅)^{2}/Σ (X – X̅)^{2} )
The 95% confidence interval for predicting a single observation is

Y′ ± t × S_{Y·X} √( 1 + 1/n + (X – X̅)^{2}/Σ (X – X̅)^{2} )
Table 8–5. Computer output of regression of insulin sensitivity on body mass index.


Comparing these two formulas, we see that the confidence interval predicting a single observation is wider than the interval for the mean of a group of individuals; 1 is added to the standard error term for the individual case. This result makes sense, because for a given value of X, the variation in the scores of individuals is greater than that in the mean scores of groups of individuals. Note also that the numerator of the third term in the standard error is the squared deviation of X from X̅. The size of the standard error therefore depends on how close the observation is to the mean; the closer X is to its mean, the more accurate is the prediction of Y. For values of X quite far from the mean, the variability in predicting the Y score is considerable. You can appreciate why it is difficult for economists and others who wish to predict future events to be very accurate!
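Both standard error terms fit in one sketch; the flag switches between predicting a group mean and an individual value (the extra 1 under the square root).

```python
import math

def prediction_se(x_star, s_yx, n, mean_x, ss_x, individual=False):
    """Standard error of the predicted Y at X = x_star.
    Group mean: s * sqrt(1/n + (x* - Xbar)^2 / SSx); individual: add 1 inside."""
    inside = (1.0 if individual else 0.0) + 1 / n + (x_star - mean_x) ** 2 / ss_x
    return s_yx * math.sqrt(inside)
```

Using the quantities reported in the text (S_{Y·X} = 0.256, n = 33, mean BMI 24.921, Σ(X – X̅)^{2} = 468.015), the individual standard error always exceeds the group-mean standard error, and both grow as x_star moves away from the mean.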
Table 8–6 gives 95% confidence intervals associated with predicted mean insulin sensitivity levels and predicted insulin sensitivity levels for an individual corresponding to several different BMI values (and for the mean BMI in this sample of 33 women). Several insights about regression analysis can be gained by examining this table. First, note the differences in magnitude between the standard errors associated with the predicted mean insulin sensitivity and those associated with individual insulin sensitivity levels: The standard errors are much larger when we predict individual values than when we predict the mean value. In fact, the standard error for individuals is always larger than the standard error for means because of the additional 1 in the formula. Also note that the standard errors take on their smallest values when the observation of interest is the mean (BMI of 24.921 in our example). As the observation departs in either direction from the mean, the standard errors and confidence intervals become increasingly larger, reflecting the squared difference between the observation and the mean. If the confidence intervals are plotted as confidence bands about the regression line, they are closest to the line at the mean of X and curve away from it in both directions on each side of X̅. Figure 8–10 shows the graph of the confidence bands.
Table 8–6. 95% Confidence intervals for predicted mean insulin sensitivity levels and predicted individual insulin sensitivity levels.


Table 8–6 illustrates another interesting feature of the regression equation. When the mean of X is used in the regression equation, the predicted Y′ is the mean of Y. The regression line therefore goes through the mean of X and the mean of Y.
Now we can see why confidence bands about the regression line are curved. The error in the intercept means that the true regression line can be either above or below the line calculated for the sample observations, although it maintains the same orientation (slope). The error in measuring the slope means that the true regression line can rotate about the point (X̅, Ȳ) to a certain degree. The combination of these two errors results in the concave confidence bands illustrated in Figure 8–10. Sometimes journal articles have regression lines with confidence bands that are parallel rather than curved. These confidence bands are incorrect, although they may correspond to standard errors or to confidence intervals at their narrowest distance from the regression line.
Comparing Two Regression Lines
Sometimes investigators wish to compare two regression lines to see whether they are the same. For example, the investigators in Presenting Problem 4 were particularly interested in the relationship between BMI and insulin sensitivity in women who were hyperthyroid versus those whose thyroid levels were normal. The investigators determined separate regression lines for these two groups of women and reported them in Figure 3 of their article. We reproduced their regression lines in Figure 8–11.
As you might guess, researchers are often interested in comparing regression lines to learn whether the relationships are the same in different groups of subjects. When we compare two regression lines, four situations can occur, as illustrated in Figure 8–12. In Figure 8–12A, the slopes of the regression lines are the same, but the intercepts differ. This situation occurs, for instance, in blood pressure measurements regressed on age in men and women; that is, the relationship between blood pressure and age is similar for men and women (equal slopes), but men tend to have higher blood pressure levels at all ages than women (higher intercept for men).
Figure 8–10. Regression of observations on body mass index and insulin sensitivity with confidence bands (heavy lines for means, light lines for individuals). (Data, used with permission, from Gonzalo MA, Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et al: Glucose tolerance, insulin secretion, insulin sensitivity and glucose effectiveness in normal and overweight hyperthyroid women. Clin Endocrinol (Oxf) 1996; 45: 689–697. Output produced using NCSS; used with permission.)
Figure 8–11. Separate regression lines for hyperthyroid (squares) and control (circles) women. (Data, used with permission, from Gonzalo MA, Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et al: Glucose tolerance, insulin secretion, insulin sensitivity and glucose effectiveness in normal and overweight hyperthyroid women. Clin Endocrinol (Oxf) 1996; 45: 689–697. Output produced using NCSS; used with permission.)
Figure 8–12. Illustration of ways regression lines can differ. A: Equal slopes and different intercepts. B: Equal intercepts and different slopes. C: Different slopes and different intercepts. D: Equal slopes and equal intercepts.
In Figure 8–12B, the intercepts are equal, but the slopes differ. This pattern may describe, say, the regression of platelet count on number of days following bone marrow transplantation in two groups of patients: those for whom adjuvant therapy results in remission of the underlying disease and those for whom the disease remains active. In other words, prior to and immediately after transplantation, the platelet count is similar for both groups (equal intercepts), but at some time after transplantation, the platelet count remains steady for patients in remission and begins to decrease for patients not in remission (more negative slope for patients with active disease).
In Figure 8–12C, both the intercepts and the slopes of the regression lines differ. The investigators in Presenting Problem 4 reported a steeper decline in insulin sensitivity as BMI increased in the hyperthyroid women than in the control group. Although they did not specifically address any difference in intercepts, the relationship between BMI and insulin sensitivity resembles the situation in Figure 8–12C.
If no differences exist in the relationships between the predictor and outcome variables, the regression lines are similar to Figure 8–12D, in which the lines are coincident: Both intercepts and slopes are equal. This pattern occurs often in medicine and is considered the expected one (the null hypothesis) until it is shown not to apply by testing hypotheses or forming confidence limits for the intercept or slope (or both).
From the four situations illustrated in Figure 8–12, we can see that three statistical questions need to be asked:
1. Are the slopes equal?
2. Are the intercepts equal?
3. Are both the slopes and the intercepts equal?
Statistical tests based on the t distribution can be used to answer the first two questions; these tests are illustrated in Kleinbaum and associates (1997). The authors point out, however, that the preferred approach is to use regression models for more than one independent variable—a procedure called multiple regression—to answer these questions. The procedure consists of pooling observations from both samples of subjects (eg, observations on both hyperthyroid and control women) and computing one regression line for the combined data. Other regression coefficients indicate whether it matters to which group the observations belong. The simplest model is then selected. Because the regression lines were statistically different, Gonzalo and colleagues reported two separate regression equations.
USE OF CORRELATION & REGRESSION
Some of the characteristics of correlation and regression have been noted throughout the discussions in this chapter, and we recapitulate them here as well as mention other features. An important point to reemphasize is that correlation and regression describe only linear relationships. If correlation coefficients or regression equations are calculated blindly, without examining plots of the data, investigators can miss very strong, but nonlinear relationships.
Analysis of Residuals
A procedure useful in evaluating the fit of the regression equation is the analysis of residuals (Pedhazur, 1997). We calculated residuals when we found the difference between the actual value of Y and the predicted value Y′, or Y – Y′, although we did not use the term. A residual is the part of Y that is not predicted by X (the part left over, or the residual). The residual values are plotted on the Y-axis against the X values on the X-axis. The mean of the residuals is zero, and, because the slope has been subtracted in the process of calculating the residuals, the correlation between the residuals and the X values is zero.
Stated another way, if the regression model provides a good fit to the data, as in Figure 8–13A, the values of the residuals are not related to the values of X. A plot of the residuals against the X values in this situation should resemble a scatter of points corresponding to Figure 8–13B, in which no correlation exists between the residuals and the values of X. If, in contrast, a curvilinear relationship occurs between Y and X, such as in Figure 8–13C, the residuals are negative for both small and large values of X, because the corresponding values of Y fall below a regression line drawn through the data. They are positive, however, for midsized values of X, because the corresponding values of Y fall above the regression line. In this case, instead of obtaining a random scatter, we get a plot like the curve in Figure 8–13D, with the values of the residuals related to the values of X. Statisticians use other patterns to help diagnose problems such as unequal variances or various types of nonlinearity.
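The algebraic properties of least-squares residuals can be checked numerically. The Python sketch below, using simulated data rather than any of the presenting problems, fits a regression line and confirms that the residuals average to zero and are uncorrelated with X:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fit a least-squares line and examine the residuals Y - Y'.
x = rng.uniform(0, 10, 50)
y = 3 + 2 * x + rng.normal(0, 1, 50)

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope
a = y.mean() - b * x.mean()                           # intercept
resid = y - (a + b * x)

# Two algebraic properties of least-squares residuals:
print(resid.mean())                 # ~0: residuals average to zero
print(np.corrcoef(x, resid)[0, 1])  # ~0: residuals uncorrelated with X
```

Plotting `resid` against `x` for curvilinear data instead would reproduce the pattern of Figure 8–13D.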
Use the CD-ROM and the regression program to produce a graph of residuals for the data in Presenting Problem 4. Which of the four situations in Figure 8–13 is most likely? See Exercise 8.
Dealing with Nonlinear Observations
Several alternative actions can be taken if serious problems arise with nonlinearity of data. As we discussed previously, a transformation may make the relationship linear, and regular regression methods can then be used on the transformed data. Another possibility, especially for a curve, is to fit a straight line to one part of the curve and a second straight line to another part of the curve, a procedure called piecewise linear regression. In this situation, one regression equation is used for all values of X less than a given value, and the second equation is used for all values of X greater than the given value. A third strategy, also useful for curves, is to perform polynomial regression; this technique is discussed in Chapter 10. Finally, more complex approaches called nonlinear regression may be used (Snedecor and Cochran, 1989).
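A piecewise linear regression can be sketched as follows. The data and the breakpoint at X = 5 are hypothetical; in practice the breakpoint would be chosen from subject-matter knowledge or estimated from the data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical curved data with a known breakpoint at x = 5:
# slope 1 below the break, slope 3 above it.
x = np.sort(rng.uniform(0, 10, 60))
y = np.where(x < 5, x, 5 + 3 * (x - 5)) + rng.normal(0, 0.2, 60)

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope."""
    b = np.cov(xs, ys, ddof=1)[0, 1] / np.var(xs, ddof=1)
    return ys.mean() - b * xs.mean(), b

# Piecewise regression: one line for x < 5, another for x >= 5
lo = x < 5
a1, b1 = fit_line(x[lo], y[lo])
a2, b2 = fit_line(x[~lo], y[~lo])
print(b1, b2)
```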
Regression toward the Mean
The phenomenon called regression toward the mean often occurs in applied research and may go unrecognized. A good illustration of regression toward the mean occurred in the MRFIT study (Multiple Risk Factor Intervention Trial Research Group, 1982; Gotto, 1997), which was designed to evaluate the effect of diet and exercise on blood pressure in men with mild hypertension. To be eligible to participate in the study, men had to have a diastolic blood pressure of ≥90 mm Hg. The eligible subjects were then assigned to either the treatment arm of the study, consisting of programs to encourage appropriate diet and exercise, or the control arm, consisting of typical care. This study has been called a landmark trial and was reprinted in 1997 in the Journal of the American Medical Association. See Exercise 16.
To illustrate the concept of regression toward the mean, we consider the hypothetical data in Table 8–7 for diastolic blood pressure in 12 men. If these men were being screened for the MRFIT study, only subjects 7 through 12 would be accepted; subjects 1 through 6 would not be eligible because their baseline diastolic pressure is <90 mm Hg. Suppose all subjects had another blood pressure measurement some time later. Because a person's blood pressure varies considerably from one reading to another, about half the men can be expected to have higher blood pressures and about half to have lower blood pressures, owing to random variation. Regression toward the mean tells us that the men who had lower pressures on the first reading are more likely to have higher pressures on the second reading. Similarly, men who had a diastolic blood pressure ≥90 mm Hg on the first reading are more likely to have lower pressures on the second reading. If the entire sample of men is remeasured, the increases and decreases tend to cancel each other. If, however, only a subset of the subjects is examined again (for example, the men with initial diastolic pressures ≥90 mm Hg), the blood pressures will appear to have dropped, when in fact they have not.
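The effect is easy to reproduce by simulation. The Python sketch below (hypothetical pressures, not the Table 8–7 data) screens a large sample on the first reading only and shows an apparent drop on remeasurement even though no intervention occurred:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate two diastolic readings per man: a stable "true" pressure plus
# independent measurement/biological variation on each occasion.
n = 10_000
true_bp = rng.normal(85, 8, n)
first = true_bp + rng.normal(0, 6, n)
second = true_bp + rng.normal(0, 6, n)

# Screen on the first reading only, as in the MRFIT eligibility rule
eligible = first >= 90
drop = first[eligible].mean() - second[eligible].mean()
print(round(drop, 2))  # apparent "drop" with no treatment at all
```

Remeasuring the entire sample instead of only the eligible subset makes the apparent drop vanish, which is the point of including a control group.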
Figure 8–13. Illustration of analysis of residuals. A: Linear relationship between X and Y. B: Residuals versus values of X for relation in part A. C: Curvilinear relationship between X and Y. D: Residuals versus values of X for relation in part C.

Table 8–7. Hypothetical data on diastolic blood pressure to illustrate regression toward the mean.


Regression toward the mean can result in a treatment or procedure appearing to be of value when it has had no actual effect; the use of a control group helps to guard against this effect. The investigators in the MRFIT study were aware of the problem of regression toward the mean and discussed precautions they took to reduce its effect.
Common Errors in Regression
One error in regression analysis occurs when multiple observations on the same subject are treated as though they were independent. For example, consider ten patients who have their weight and skinfold measurements recorded prior to beginning a low-calorie diet. We may reasonably expect a moderately positive relationship between weight and skinfold thickness. Now suppose that the same ten patients are
weighed and measured again after 6 weeks on the diet. If all 20 pairs of weight and skinfold measurements are treated as though they were independent, several problems occur. First, the sample size will appear to be 20 instead of 10, making it more likely that the result will be declared significant. Second, because the relationship between weight and skinfold thickness in the same person is somewhat stable across minor shifts in weight, using both before-diet and after-diet observations has the same effect as using duplicate measures, and this results in a correlation larger than it should be.
The magnitude of the correlation can also be erroneously increased by combining two different groups. For example, consider the relationship between height and weight. Suppose the heights and weights of ten men and ten women are recorded, and the correlation between height and weight is calculated for the combined samples. Figure 8–14 illustrates how the scatterplot might look and indicates the problem that results from combining men and women in one sample. The relationship between height and weight appears stronger in the combined sample than it is when measured in men and women separately. Much of the apparent strength of the relationship results because men tend both to weigh more and to be taller than women. Inappropriate conclusions may result from mixing two different populations, a rather common error to watch for in the medical literature.
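This mixing effect can be illustrated with simulated heights and weights; the means, spreads, and within-sex slope below are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical heights (cm) and weights (kg): within each sex the true
# correlation is modest, but men are both taller and heavier on average.
def sample(n, mean_h, mean_w):
    h = rng.normal(mean_h, 6, n)
    w = mean_w + 0.4 * (h - mean_h) + rng.normal(0, 7, n)
    return h, w

h_w, w_w = sample(200, 163, 60)   # women
h_m, w_m = sample(200, 177, 80)   # men

r_women = np.corrcoef(h_w, w_w)[0, 1]
r_men = np.corrcoef(h_m, w_m)[0, 1]
r_pooled = np.corrcoef(np.concatenate([h_w, h_m]),
                       np.concatenate([w_w, w_m]))[0, 1]
print(r_women, r_men, r_pooled)   # pooled r exceeds either within-sex r
```

The between-sex differences in both means inflate the pooled correlation well beyond either within-sex value.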
Comparing Correlation & Regression
Correlation and regression have some similarities and some differences. First, correlation is scale-independent, but regression is not; that is, the correlation between two characteristics, such as height and weight, is the same whether height is measured in centimeters or inches and weight in kilograms or pounds. The regression equation predicting weight from height, however, depends on which scales are used; that is, predicting weight measured in kilograms from height measured in centimeters gives different values for a and b than predicting weight in pounds from height in inches.
Figure 8–14. Hypothetical data illustrating spurious correlation.
An important consequence of scale independence in correlation is that the correlation between X and Y is the same as the correlation between Y′ and Y. They are equal because the regression equation itself, Y′ = a + bX, is a simple rescaling of the X variable; that is, each value of X is multiplied by a constant value b and then the constant a is added. The fact that the correlation between the original variables X and Y is equal to the correlation between Y and Y′ provides a useful alternative for testing the significance of the regression, as we will see in Chapter 10. Finally, the slope of the regression line has the same sign (+ or –) as the correlation coefficient (see Exercise 10). If the correlation is zero, the regression line is horizontal with a slope of zero. Thus, the formulas for the correlation coefficient and the regression coefficient are closely related. If r has already been calculated, it can be multiplied by the ratio of the standard deviation of Y to the standard deviation of X, SD_{Y}/SD_{X}, to obtain b (see Exercise 9). Thus, b = r(SD_{Y}/SD_{X}).
Similarly, if the regression coefficient is known, r can be found by r = b(SD_{X}/SD_{Y}).
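Both identities, and the equality of r with the correlation between Y and Y′, can be verified numerically; the data below are simulated:

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.normal(170, 10, 100)                 # e.g., heights in cm
y = 0.9 * x + rng.normal(0, 8, 100)          # e.g., weights in kg

r = np.corrcoef(x, y)[0, 1]
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# b = r * SD_Y / SD_X, and conversely r = b * SD_X / SD_Y
print(b, r * y.std(ddof=1) / x.std(ddof=1))

# The correlation between Y and the predicted values Y' equals r,
# because Y' = a + bX is just a rescaling of X (for positive b)
a = y.mean() - b * x.mean()
y_pred = a + b * x
print(r, np.corrcoef(y, y_pred)[0, 1])
```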
Multiple Regression
Multiple regression analysis is a straightforward generalization of simple regression for applications in which two or more independent (explanatory) variables are used to predict an outcome. For example, in the study described in Presenting Problem 4, the investigators wanted to predict a woman's insulin sensitivity level based on her BMI. They also wanted to control for the age of the woman, however. The results from two analyses are given in Table 8–8. First, regression was done using BMI to predict insulin sensitivity among hyperthyroid women; the resulting equation was
Table 8–8. Regression equations for hyperthyroid women using BMI versus BMI and age as predictor variables.



Figure 8–15. Illustration of the sample size program nQuery Advisor. (Data, used with permission, from Jackson A, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC, et al: The effect of sex, age and race on estimating percentage body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002; 26: 789–796. Figure produced using nQuery Advisor; used with permission.)
Next, the regression was repeated using both BMI and age as independent variables. The results were
As you can see, the addition of the age variable has relatively little effect; in fact, the P value for age is 0.30, indicating that age is not significantly associated with insulin sensitivity in this group of hyperthyroid women.
As an additional point, note that R^{2} (called R-squared) is 0.601 for the first regression equation in Table 8–8. R^{2} is interpreted in the same manner as the coefficient of determination, r^{2}, discussed in the section titled "Interpreting the Size of r." This topic, along with multiple regression and other statistical methods based on regression, is discussed in detail in Chapter 10.
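The interpretation of R^{2} as the squared correlation between the observed and predicted outcome can be checked directly. The sketch below uses invented predictors standing in for BMI and age:

```python
import numpy as np

rng = np.random.default_rng(6)

# Two hypothetical predictors (stand-ins for BMI and age) and one outcome
n = 40
x1 = rng.uniform(20, 35, n)
x2 = rng.uniform(25, 60, n)
y = 1.5 - 0.04 * x1 + 0.001 * x2 + rng.normal(0, 0.05, n)

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

# R-squared computed two ways: 1 - SSE/SST, and the squared
# correlation between observed and predicted values
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_anova = 1 - ss_res / ss_tot
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
print(r2_anova, r2_corr)   # identical (up to rounding)
```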
Figure 8–16. Illustration of setup for using the PASS sample size program for multiple regression using the data on insulin sensitivity. (Data, used with permission, from Gonzalo MA, Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et al: Glucose tolerance, insulin secretion, insulin sensitivity and glucose effectiveness in normal and overweight hyperthyroid women. Clin Endocrinol (Oxf) 1996; 45: 689–697. Output produced using PASS; used with permission.)
SAMPLE SIZES FOR CORRELATION & REGRESSION
As with other statistical procedures, it is important to have an adequate number of subjects in any study that involves correlation or regression. Complex formulas are required to estimate sample sizes for these procedures, but fortunately we can use statistical power programs to do the calculations.
Suppose that Jackson and colleagues (2002) wanted to know what sample size would be necessary to produce a confidence interval for the correlation of BMI and percent body fat that would be within ± 0.10 of an expected correlation coefficient of 0.75. In other words, how many subjects are needed for a 95% confidence interval from 0.65 to 0.85, assuming they observe a correlation of 0.75 (recall they actually found a correlation of 0.73)? We used the nQuery Advisor program to illustrate the sample size needed in this situation; the output is given in Figure 8–15. A sample of 102 patients would be necessary. nQuery produces only a one-sided interval, so we used 97.5% to obtain a 95% two-sided interval. We could have used the upper limit of 0.85 instead of the lower limit of 0.65 (line 3 of the nQuery table). Do you think the sample size would be the same? Try it and see.
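The nQuery result can be approximated by hand with the Fisher z transformation, under the standard large-sample assumption that z = atanh(r) has standard error 1/√(n – 3):

```python
import math

# Sample size so that the lower 95% confidence limit for r = 0.75 is 0.65,
# using the Fisher z transformation: z = atanh(r), SE(z) = 1/sqrt(n - 3)
r, lower = 0.75, 0.65
half_width_z = math.atanh(r) - math.atanh(lower)
n = math.ceil((1.96 / half_width_z) ** 2 + 3)
print(n)  # 102, matching the nQuery result in Figure 8-15
```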
Figure 8–17. Illustration of the PASS sample size program for multiple regression using the data on insulin sensitivity. (Data, used with permission, from Gonzalo MA, Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et al: Glucose tolerance, insulin secretion, insulin sensitivity and glucose effectiveness in normal and overweight hyperthyroid women. Clin Endocrinol (Oxf) 1996; 45: 689–697. Output produced using PASS; used with permission.)
To illustrate the power analysis for regression, consider the regression equation to predict insulin sensitivity from BMI (Gonzalo et al, 1996). Recall that we found that a 95% confidence interval for the regression coefficient was between –0.0674 and –0.0192 in the entire sample of 33 women. Suppose Gonzalo and colleagues wanted to know how many women would be needed for the regression. The power program PASS finds the sample size by estimating the number needed to obtain a given value for R^{2} (or r^{2} when only one independent variable is used). We assume they want the correlation between the actual insulin sensitivity and the predicted sensitivity to be at least 0.50, producing an r^{2} of 0.25. The setup and output from the PASS program are given in Figures 8–16 and 8–17. From Figure 8–17, we see that a sample size of about 26 is needed in each group for which a regression equation is to be determined.
SUMMARY
Four presenting problems were used in this chapter to illustrate the application of correlation and regression in medical studies. The findings from the study described in Presenting Problem 1 demonstrate the relationship between BMI and percent body fat, a correlation equal to 0.73. The authors reported that the relationship was nonlinear, which can be seen in Figure 8–2. Several factors other than BMI affected the relationship. The authors
concluded that BMI is only a moderate predictor of percent body fat and that it is important to consider age and gender when using BMI to define the prevalence of obesity in populations of American men and women.
In Presenting Problem 2, Nesselroad and colleagues (1996) evaluated three automated finger blood pressure devices marketed as accurate devices for monitoring blood pressure. We examined the relationship between these devices and the standard method using a blood pressure cuff. The observed correlations were quite low, ranging from 0.32 to 0.45. We compared two of these correlation coefficients and concluded that no statistical difference exists between them. Nesselroad and colleagues also reported that the automated finger device measurements were outside the ± 4 mm Hg range obtained with the standard blood pressure cuff 75–81% of the time. These researchers appropriately concluded that people who want to monitor their blood pressure cannot trust these devices to be accurate.
Hodgson and Cutler (1997) reported results from their study of people's fears that normal age-associated memory change is a precursor of dementia. We examined the relationship between memory scores and whether people reported they were satisfied with their life. We demonstrated that the conclusions from computing the biserial correlation (the correlation between a numerical and a binary measure) and performing a t test are the same. Other results showed that the sense of well-being in these individuals is related to anticipatory dementia. Those with higher levels of anticipatory dementia are more depressed, have more psychiatric symptoms, have lower life satisfaction, and describe their health as poorer than individuals not concerned about memory loss and Alzheimer's disease. Furthermore, women in the study demonstrated a relationship between anticipatory dementia and well-being that was not observed in men.
Data from Gonzalo and colleagues (1996) were used to illustrate regression, specifically the relationship between insulin sensitivity and BMI for hyperthyroid and control women. We found separate regression lines for hyperthyroid and for control women and observed that the relationships between insulin sensitivity and BMI differ in these two groups of women. The investigators also reported that overall glucose tolerance was not affected by hyperthyroidism in normal-weight women.
The flowcharts in Appendix C summarize the methods for measuring an association between two characteristics measured on the same subjects. Flowchart C–4 indicates how the methods depend on the scale of measurement for the variables, and flowchart C–5 shows applicable methods for testing differences in correlations and in regression lines.
EXERCISES
1. The extent to which stool energy losses are normalized in cystic fibrosis patients receiving pancreatic enzyme replacement therapy prompted a study by Murphy and colleagues (1991). They determined the amount of energy within the stools of 20 healthy children and 20 patients with cystic fibrosis who were comparatively asymptomatic while taking capsules of pancreatin, an enzyme replacement. Weighed food intake was recorded daily for 7 days for all study participants. Over the final 3 days of the study, all stools were collected. Measures of lipid content, total nitrogen content, bacterial content, and total energy content of the stools were recorded. Data for the cystic fibrosis children are given in Table 8–9 and on the CD-ROM in a folder entitled "Murphy."
a. Find and interpret the correlation between stool lipid and stool energy.
Table 8–9. Observations on stool lipid and stool energy losses in children with cystic fibrosis.


2.
a. Figure 8–18 is from the study by Murphy and colleagues. What is the authors' purpose in displaying this graph? What can be interpreted about the relationship between fecal lipid and fecal energy for control patients? How does that relationship compare with the relationship in patients with cystic fibrosis?
3.
a. Perform a chi-square test of the significance of the relationship between TRH and placebo and the subsequent development of respiratory distress syndrome using the data in Chapter 3, Table 3–21.
b. Determine 95% confidence limits for the relative risk of 2.3 for the risk of death within 28 days of delivery among infants not at risk using the data in Table 3–20. What is your conclusion?
4. Calculate the correlation between BMI and insulin sensitivity for the entire sample of 33 women, using the results in the section titled, “Calculating the Regression Equation,” for b. The standard deviation of BMI is 3.82 and of insulin sensitivity is 0.030.
5. Goldsmith and colleagues (1985) examined 35 patients with hemophilia to determine whether a relationship exists between impaired cellmediated immunity and the amount of factor concentrate used. In one of their studies, the ratio of OKT4 (helper T cells) to OKT8 (suppressor/cytotoxic T cells) was formed, and the logarithm of this ratio was regressed on the logarithm of lifetime concentrate use (Figure 819).
Figure 8–18. Stool lipid versus stool energy losses for the control subjects and cystic fibrosis patients. (Reproduced, with permission, from Figure 3 in Murphy JL, Wootton SA, Bond SA, Jackson AA: Energy content of stools in normal healthy controls and patients with cystic fibrosis. Arch Dis Child 1991; 66: 495–500.)
a. Why is the logarithm scale used for both variables?
b. Interpret the correlation.
c. What do the confidence bands mean?
6. Helmrich and coworkers (1987) conducted a study to assess the risk of deep vein thrombosis and pulmonary embolism in relation to the use of oral contraceptives. They were especially interested in the risk associated with low dosage (<50 µg estrogen) and confined their study to women under the age of 50 years. They administered standard questionnaires to women admitted to the hospital for deep vein thrombosis or pulmonary embolism as well as to a control set of women admitted for trauma and upper respiratory infections to determine their history and current use of oral contraceptives. Twenty of the 61 cases and 121 of the 1278 controls had used oral contraceptives in the previous month.
Figure 8–19. Regression of logarithm of OKT4:OKT8 on logarithm of factor concentrate use. (Reproduced, with permission, from Goldsmith JM, Kalish SB, Green D, Chmiel JS, Wallemark CB, Phair JP: Sequential clinical and immunologic abnormalities in hemophiliacs. Arch Intern Med 1985; 145: 431–434.)
a. What research design was used in this study?
b. Find 95% confidence limits for the odds ratio for these data.
c. The authors reported an ageadjusted odds ratio of 8.1 with 95% confidence limits of 3.7 and 18. Interpret these results.
7. Presenting Problem 2 in Chapter 3 by Hébert and colleagues (1997) measured disability and functional changes in 655 residents of a community in Quebec, Canada. The Functional Autonomy Measurement System (SMAF), a 29-item rating scale measuring functional disability in five areas, was a major instrument used in the study. We used observations on mental ability for women 85 years or older at baseline and 2 years later to illustrate the correlation coefficient in Chapter 3 and found it to be 0.58. Use the data on the CD-ROM and select or filter those subjects with sex = 0 and age ≥85; 51 subjects should remain for the analysis.
a. Form a 95% confidence interval for this correlation.
b. Calculate the sample size needed to produce a confidence interval for the correlation of the mental ability scores at times 1 and 3 that would be within ±0.10 of the observed correlation coefficient. In other words, how many subjects are needed for a 95% confidence interval from 0.48 to 0.68 around the correlation of 0.58 found in their study?
8. The graphs in Figure 8–20 were published in the study by Einarsson and associates (1985).
a. Which graph exhibits the strongest relationship with age?
b. Which variable would be best predicted from a patient's age?
c. Do the relationships between the variables and age appear to be the same for men and women; that is, is it appropriate to combine the observations for men and women in the same figure?
10. Use the CD-ROM regression program to produce a graph of residuals for the data from Gonzalo and coworkers (1996). Which of the four situations in Figure 8–13 is most likely?
11. Explain why the mean of the predicted values, Ȳ′, is equal to Ȳ.
12. Develop an intuitive argument to explain why the sign of the correlation coefficient and the sign of the slope of the regression line are the same.
13. Use the data from the “Bossi” file (Presenting Problem in Chapter 3) to form a 2 × 2 contingency table for the frequencies of hematuria (hematur) and whether patients had RBC units > 5 (gt5rbc). The odds ratio is 1.90. Form 95% confidence limits for the odds ratio and compare them to those calculated by the statistical program. What is the conclusion?
14. Group Exercise. The causes and pathogenesis of steroid-responsive nephrotic syndrome (also known as minimal-change disease) are unknown. Levinsky and colleagues (1978) postulated that this disease might have an immunologic basis because it may be associated with atopy, recent immunizations, or a recent upper respiratory infection. It is also responsive to corticosteroid treatment. They analyzed the serum from children with steroid-responsive nephrotic syndrome for the presence of IgG-containing immune complexes and the complement-binding properties (C1q-binding) of these complexes. For purposes of comparison, they also studied these two variables in patients with systemic lupus erythematosus. You will need to consult the published article for details of the study; a graph from the study is reproduced in Figure 8–21.
a. What were the study's basic research questions?
b. What was the study design? Is it the best one for the study's purposes?
c. What was the rationale in defining the kinds of patients to be studied? How were subjects obtained?
d. Interpret the correlations for the two sets of patients in Figure 8–21. What conclusions do you draw about the relationships between C1q-binding and IgG complexes in patients with systemic lupus erythematosus? In patients with steroid-responsive nephrotic syndrome?
e. Discuss the use of the parallel lines surrounding the regression line; do they refer to means or individuals? (Hint: The standard error of regression is 11.95 and Σ(X – X̅)^{2} is 21,429.37.)
f. Do you think the regression lines for the two sets of patients will differ?
Figure 8–20. Scatterplots and regression lines for the relation between age and hepatic secretion of cholesterol, total bile acid synthesis, and size of cholic acid pool for women (circles) and men (squares). (Reproduced, with permission, from Einarsson K, Nilsell K, Leijd B, Angelin B: Influence of age on secretion of cholesterol and synthesis of bile acids by the liver. N Engl J Med 1985; 313: 277–282.)
Figure 8–21. Scatterplot of C1q-binding complexes and IgG complexes in patients with systemic lupus erythematosus (SLE; circles) and steroid-responsive nephrotic syndrome (SRNS; squares), illustrating the possibility of differences in regression lines for SLE and SRNS patients. (Reproduced, with permission, from Levinsky RJ, Malleson PN, Barratt TM, Soothill JF: Circulating immune complexes in steroid-responsive nephrotic syndrome. N Engl J Med 1978; 298: 126–129.)
15.
a. Would the results from this study generalize? If so, to what patient populations, and what cautions should be taken? If not, what features of the study limit its generalizability?
16. Group Exercise. The MRFIT study (Multiple Risk Factor Intervention Trial Research Group, 1982) has been called a landmark trial; it was the first large-scale clinical trial, and it is rare to have a study that follows the more than 300,000 men screened for the trial over a number of years. The Journal of the American Medical Association reprinted this article in 1997. In addition, the journal published a comment in the Landmark Perspective section by Gotto (1997). Obtain a copy of both articles.
a. What research design was used in the study?
b. Discuss the eligibility criteria. Are these criteria still relevant today?
c. What were the treatment arms? Are these treatments still relevant today?
d. What statistical methods were used? Were they appropriate? One method, the Kaplan–Meier product-limit method, is discussed in Chapter 9.
e. Refer to Figure 1 in the original study. What do the lines in the figure indicate?
f. Examine the distribution of deaths given in the article's Table 4. What statistical method is relevant to analyzing these results?
g. The perspective by Gotto discusses the issue of power in the MRFIT study. How was the power of the study affected by the initial assumptions made in the study design?
Footnotes
^{a} The procedure for finding β_{0} and β_{1} involves the use of differential calculus. The partial derivatives of the preceding equations are found with respect to β_{0} and β_{1}; the two resulting equations are set equal to zero to locate the minimum values; these two equations in two unknowns, β_{0} and β_{1}, are solved simultaneously to obtain the formulas for β_{0} and β_{1}.
^{b} Gonzalo and colleagues presented regression equations after adjusting for age. We briefly discuss this procedure in the next section under Multiple Regression.