KEY CONCEPTS

PRESENTING PROBLEM
Presenting Problem 1
Kline and colleagues (2002) published a study on the safe use of D-dimer for patients seen in the emergency department with suspected pulmonary embolism (PE). We used this study in Chapter 3 to illustrate descriptive measures and graphs useful with numeric data. In this chapter we continue our analysis of some of the information collected by Kline and colleagues. We will illustrate the t test for two independent samples and learn whether there was a significant difference in pulse oximetry in patients who did and those who did not have a PE. The entire data set is in a folder on the CD-ROM entitled “Kline.”
Presenting Problem 2
Cryosurgery is commonly used for treatment of cervical intraepithelial neoplasia (CIN). The procedure is associated with pain and uterine cramping. Symptoms are mediated by the release of prostaglandins and endoperoxides during the thermodestruction of the cervical tissue. The most effective cryosurgical procedure, the so-called 5-min double freeze, produces significantly more pain and cramping than other cryosurgical methods. It is important to make this procedure as tolerable as possible.
A study to compare the perceptions of both pain and cramping in women undergoing the procedure with and without a paracervical block was undertaken by Harper (1997). All participants received naproxen sodium 550 mg prior to surgery. Those receiving the paracervical block were injected with 1% lidocaine with epinephrine at 9 and 3 o'clock at the cervicovaginal junction to infiltrate the paracervical branches of the uterosacral nerve.
Within 10 min of completing the cryosurgical procedure, the intensity of pain and cramping were assessed on a 100-mm visual analog scale (VAS), in which 0 represented no pain or cramping and 100 represented the most severe pain and cramping. Patients were enrolled in a nonrandom fashion (the first 40 women were treated without anesthetic and the next 45 with a paracervical block), and there was no placebo treatment.
We use data on intensity of cramping and pain to illustrate the t test for comparing two groups and the nonparametric Wilcoxon rank sum test. The investigator also wanted to compare the proportion of women who had no pain or cramping at the first and second freezes. We use these observations to illustrate the chi-square test. Data from the study are given in the sections titled “Comparing Means in Two Groups with the t Test” and “Using Chi-Square Tests” and on the CD-ROM.
Presenting Problem 3
In Chapter 3, we briefly looked at the results from a survey to assess domestic violence (DV) education and training and the use of DV screening among pediatricians and family physicians (Lapidus et al, 2002). The survey asked physicians questions about any training they may have had, their use of screening, and their own personal history of DV. Domestic violence was defined as “past or current physical, sexual, emotional or verbal harm to a woman caused by a spouse, partner or family member.” Please see Chapter 3 for more detail. We use the data to illustrate confidence intervals for proportions. Data are given in the section titled “Decisions About Proportions in Two Independent Groups” and on the CD-ROM.
PURPOSE OF THE CHAPTER
In the previous chapter, we looked at statistical methods to use when the research question involves:
1. A single group of subjects and the goal is to compare a proportion or mean to a norm or standard.
2. A single group measured twice and the goal is to estimate how much the proportion or mean changes between measurements.
The procedures in this chapter are used to examine differences between two independent groups (when knowledge of the observations for one group does not provide any information about the observations in the second group). In all instances, we assume the groups represent random samples from the larger population to which researchers want to apply the conclusions.
When the research question asks whether the means of two groups are equal (numerical observations), we can use either the two-sample (independent-groups) t test or the Wilcoxon rank sum test. When the research question asks whether proportions in two independent groups are equal, we can use several methods: the z distribution to form a confidence interval, and the z distribution, chi-square, or Fisher's exact test to test hypotheses.
DECISIONS ABOUT MEANS IN TWO INDEPENDENT GROUPS
Investigators often want to know if the means are equal or if one mean is larger than the other mean. For example, Kline and colleagues (2002) in Presenting Problem 1 wanted to know if information on a patient's status in the emergency department can help indicate risk for PE. We noted in Chapter 5 that the z test can be used to analyze questions involving means if the population standard deviation is known. This, however, is rarely the case in applied research, and researchers typically use the t test to analyze research questions involving two independent means.
Surveys of statistical methods used in the medical literature consistently indicate that t tests and chi-square tests are among the most commonly used. Furthermore, Williams and coworkers (1997) noted a number of problems in using the t test, including no discussion of assumptions, in more than 85% of the articles. Welch and Gabbe (1996) noted errors in using the t test when a nonparametric procedure is called for and in using the chi-square test when Fisher's exact test should be employed. Thus, being able to evaluate the use of tests comparing means and proportions—whether they are used properly and how to interpret the results—is an important skill for medical practitioners.
Comparing Two Means Using Confidence Intervals
The means and standard deviations from selected variables from the Kline study are given in Table 6–1. In this chapter, we analyze the pulse oximetry data for patients who had a PE and those who did not. We want to know the average difference in pulse oximetry for these two groups of patients. Pulse oximetry is a numerical variable, and we know that means provide an appropriate way to describe the average with numerical variables. We can find the mean pulse oximetry for each set of patients and form a confidence interval for the difference.
The form for a confidence interval for the difference between two means is

(difference between the two means) ± (number related to the level of confidence) × (standard error of the difference)
If we use symbols to illustrate a confidence interval for the difference between two means and let X̅_{1} stand for the mean of the first group and X̅_{2} for the mean of the second group, then we can write the difference between the two means as X̅_{1} – X̅_{2}.
Table 6–1. Numbers of patients, means, and standard deviations of continuous variables measured in emergency department patients with suspected pulmonary embolism.^{a} 


As you know from the previous chapter, the number related to the level of confidence is the critical value from the t distribution. For two means, we use the t distribution with (n_{1} – 1) degrees of freedom corresponding to the number of subjects in group 1, plus (n_{2} – 1) degrees of freedom corresponding to the number of subjects in group 2, for a total of (n_{1} + n_{2} – 2) degrees of freedom.
With two groups, we also have two standard deviations. One assumption for the t test, however, is that the standard deviations are equal (the section titled, “Assumptions for the t Distribution”). We achieve a more stable estimate of the true standard deviation in the population if we average the two separate standard deviations to obtain a pooled standard deviation based on a larger sample size. The pooled standard deviation is a weighted average of the two variances (squared standard deviations) with weights based on the sample sizes. Once we have the pooled standard deviation, we use it in the formula for the standard error of the difference between two means, the last term in the preceding equation for a confidence interval.
The standard error of the mean difference tells us how much we can expect the differences between two means to vary if a study is repeated many times. First we discuss the logic behind the standard error and then we illustrate its use with data from the study by Kline and colleagues (2002).
The formula for the pooled standard deviation looks complicated, but remember that the calculations are for illustration only, and we generally use a computer to do the computation. We first square the standard deviation in each group (SD_{1} and SD_{2}) to obtain the variance, multiply each variance by the number in that group minus 1, and add to get (n_{1} – 1) SD_{1}^{2} + (n_{2} – 1) SD_{2}^{2}. The standard deviations are based on the samples because we do not know the true population standard deviations. Next we divide by the sum of the number of subjects in each group minus 2.
Finally, we take the square root to find the pooled standard deviation:

SD_{p} = √[((n_{1} – 1)SD_{1}^{2} + (n_{2} – 1)SD_{2}^{2}) / (n_{1} + n_{2} – 2)]
The pooled standard deviation is used to calculate the standard error of the difference. In words, the standard error of the difference between two means is the pooled standard deviation, SD_{p}, multiplied by the square root of the sum of the reciprocals of the sample sizes. In symbols, the standard error of the mean difference is

SE = SD_{p} √(1/n_{1} + 1/n_{2})
Based on the study by Kline and colleagues (2002), a PE was positive in 181 patients and negative in 740 patients (see Table 6–1). Substituting 181 and 740 for the two sample sizes and 5.9 and 4.0 for the two standard deviations, we have

SD_{p} = √[((180)(5.9)^{2} + (739)(4.0)^{2}) / 919] = √19.7 ≈ 4.4
Does it make sense that the value of the pooled standard deviation is always between the two sample standard deviations? In fact, if the sample sizes are equal, it is the mean of the two standard deviations (see Exercise 4).
Finally, to find the standard error of the difference, we substitute 4.4 for the pooled standard deviation and 181 and 740 for the sample sizes and obtain

SE = 4.4 √(1/181 + 1/740) ≈ 0.37
The standard error of the difference in pulse oximetry measured in the two groups is 0.37. The standard error is simply the standard deviation of the differences in means if we repeated the study many times. It indicates that we can expect the mean differences in a large number of similar studies to have a standard deviation of about 0.37.
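For readers who want to verify this arithmetic, the pooled standard deviation and standard error can be sketched in a few lines of Python (our own illustration, not part of the study; the function and variable names are ours, and the summary values come from Table 6–1):

```python
import math

def pooled_sd(n1, sd1, n2, sd2):
    # Weighted average of the two variances, then the square root
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def se_diff(n1, sd1, n2, sd2):
    # Standard error of the difference between two independent means
    return pooled_sd(n1, sd1, n2, sd2) * math.sqrt(1 / n1 + 1 / n2)

# Pulse oximetry: 181 patients with a PE (SD 5.9), 740 without (SD 4.0)
sd_p = pooled_sd(181, 5.9, 740, 4.0)
se = se_diff(181, 5.9, 740, 4.0)
print(round(sd_p, 1), round(se, 2))  # 4.4 0.37
```

Note that the pooled value, about 4.4, falls between the two sample standard deviations of 4.0 and 5.9, as it must.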
Now we have all the information needed to find a confidence interval for the mean difference in pulse oximetry. From Table 6–1, the mean pulse oximetry levels were 95.8 for patients not having a PE and 93.4 for patients with a PE. To find the 95% confidence limits for the difference between these means (95.8 – 93.4 = 2.4), we use the two-tailed value from the t distribution for 181 + 740 – 2 = 919 degrees of freedom (Appendix A–3) that separates the central 95% of the t distribution from the 5% in the tails. The value is 1.96; note that the value of z is also 1.96, demonstrating once more that the t distribution approaches the shape of the z distribution with large samples.
Using these numbers in the formula for 95% confidence limits, we have 2.4 ± (1.96)(0.37) = 2.4 ± 0.73, or 1.67 to 3.13. Interpreting this confidence interval, we can be 95% confident that the interval from 1.67 to 3.13 contains the true mean difference in pulse oximetry.^{a} Because the interval does not contain the value 0, it is not likely that the mean difference is 0. Table 6–2 illustrates the NCSS procedure for comparing two means and determining a confidence interval (see the shaded line). The confidence interval found by NCSS is 1.66 to 3.11, slightly different from ours due to rounding. Use the data set in the Kline folder and replicate this analysis.
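As a quick check of the arithmetic, a minimal Python sketch of these confidence limits (using the rounded values from the text) might look like:

```python
# 95% CI for the mean difference in pulse oximetry (values from the text)
diff = 95.8 - 93.4   # difference between the two group means
t_crit = 1.96        # critical t value for 919 df (essentially equal to z)
se = 0.37            # standard error of the difference, computed earlier

lower = diff - t_crit * se
upper = diff + t_crit * se
print(round(lower, 2), round(upper, 2))  # 1.67 3.13
```

Because the interval excludes 0, it supports the conclusion that mean pulse oximetry differs between the two groups.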
Recall that in Chapter 3 we used box plots to examine the distribution of shock index for those with and without PE (Figure 3–6). Do you think the difference is statistically significant? Use the data disk and the independent-groups t test. NCSS gives the 95% confidence interval as –9.375404E–02 to –0.0280059. Recall that in scientific notation, we move the decimal as many digits to the left as indicated by the negative exponent following “E,” so the interval is –0.09375404 to –0.0280059, or about –0.094 to –0.028. What can we conclude about the difference in shock index?
An “Eyeball” Test Using Error Bar Graphs
Readers of the literature and those attending presentations of research findings find it helpful if information is presented in graphs and tables, and most researchers use them whenever possible. We introduced error bar plots in Chapter 3 when we talked about different graphs that can be used to display data for two or more groups, and error bar plots can be used for an “eyeball” test of the mean in two (or more) groups. Using error bar charts with 95% confidence limits, one of the following three results always occurs:
1. The top of one error bar does not overlap with the bottom of the other error bar, as illustrated in Figure 6–1A. When this occurs, we can be 95% sure that the means in two groups are significantly different.
2. The top of one 95% error bar overlaps the bottom of the other so much that the mean value for one group is contained within the limits for the other group (see Figure 6–1B). This indicates that the difference between the two means is not statistically significant.
3. If 95% error bars overlap some but not as much as in situation 2, as in Figure 6–1C, we do not know if the difference is significant unless we form a confidence interval or do a statistical test for the difference between the two means.
To use the eyeball method for the mean pulse oximetry, we find the 95% confidence interval for the mean in each individual group. We can refer to Table 6–2, where NCSS reports the confidence interval for the mean in each group in the Descriptive Statistics Section. The 95% confidence interval for pulse oximetry is 95.5 to 96.1 in patients without a PE and 92.5 to 94.3 in patients with a PE.
These two confidence intervals are shown in Figure 6–2. This example illustrates the situation in Figure 6–1A: The graphs do not overlap, so we can conclude that mean pulse oximetry in the two groups is different.
A word of caution is needed. When the sample size in each group is greater than ten, the 95% confidence intervals are approximately equal to the mean ± 2 standard errors (SE), so graphs of the mean ± 2 standard errors can be used for the eyeball test. Some authors, however, instead of using the mean ± 2 standard errors, present a graph of the mean ± 1 standard error or the mean ± 2 standard deviations (SD). Plus or minus 1 standard error gives only a 68% confidence interval for the mean. Plus or minus 2 standard deviations results in the 95% interval in which the individual measurements are found if the observations are normally distributed. Although nothing is inherently wrong with these graphs, they cannot be interpreted as indicating differences between means. Readers need to check graph legends very carefully before using the eyeball test to interpret published graphs.
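To show how the per-group limits used in the eyeball test arise, the following Python sketch approximates each group's 95% confidence interval from the summary values in Table 6–1 (the function name and the use of 1.96 as the large-sample critical value are our assumptions; NCSS reports essentially the same intervals):

```python
import math

def group_ci(mean, sd, n, t_crit=1.96):
    # Approximate 95% CI for a single group mean: mean +/- t * SD / sqrt(n)
    se = sd / math.sqrt(n)
    return mean - t_crit * se, mean + t_crit * se

no_pe = group_ci(95.8, 4.0, 740)  # patients without a PE
pe = group_ci(93.4, 5.9, 181)     # patients with a PE
print([round(x, 1) for x in no_pe])  # [95.5, 96.1]
print([round(x, 1) for x in pe])     # [92.5, 94.3]
```

The two intervals do not overlap, which is the situation in which the eyeball test lets us conclude the means differ.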
Assumptions for the t Distribution
Three assumptions are needed to use the t distribution for either determining confidence intervals or testing hypotheses. We briefly mention them here and outline some options to use if observations do not meet the assumptions.
1. As is true with one group, the t test assumes that the observations in each group follow a normal distribution. Violating the assumption of normality gives P values that are lower than they should be, making it easier to reject the null hypothesis and conclude a difference when none really exists. At the same time, confidence intervals are narrower than they should be, so conclusions based on them may be wrong. What is the solution to the problem? Fortunately, this issue is of less concern if the sample sizes are at least 30 in each group. With smaller samples that are not normally distributed, a nonparametric procedure called the Wilcoxon rank sum test is a better choice (see the section titled, “Comparing Means with the Wilcoxon Rank Sum Test”).
Table 6–2. Confidence interval for pulse oximetry for patients with and without pulmonary embolism and the difference between the groups. 


2. The standard deviations (or variances) in the two samples are assumed to be equal (statisticians call them homogeneous variances). Equal variances are assumed because the null hypothesis states that the two means are equal, which is actually another way of saying that the observations in the two groups are from the same population. In the population from which they are hypothesized to come, there is only one standard deviation; therefore, the standard deviations in the two groups must be equal if the null hypothesis is true. What is the solution when the standard deviations are not equal? Fortunately, this assumption can be ignored when the sample sizes are equal (Box, 1953). This is one of several reasons many researchers try to have fairly equal numbers in each group. (Statisticians say the t test is robust with equal sample sizes.) Statistical tests can be used to decide whether standard deviations are equal before doing a t test (see the section titled, “Comparing Variation in Independent Groups”).
Figure 6–1. Visual assessment of differences between two independent groups, using 95% confidence limits. 
Figure 6–2. Illustration of error bars. (Data, used with permission of the authors and publisher, Kline JA, Nelson RD, Jackson RE, Courtney DM: Criteria for the safe use of D-dimer testing in emergency department patients with suspected pulmonary embolism: A multicenter US study. Ann Emerg Med 2002;39:144–152. Plot produced with NCSS; used with permission.) 
3. The final assumption is one of independence, meaning that knowing the values of the observations in one group tells us nothing about the observations in the other group. In contrast, consider the paired group design discussed in Chapter 5, in which knowledge of the value of an observation at the time of the first measurement does tell us something about the value at the time of the second measurement. For example, we would expect a subject who has a relatively low value at the first measurement to have a relatively low second measurement as well. For that reason, the paired t test is sometimes referred to as the dependent groups t test. No statistical test can determine whether independence has been violated, however, so the best way to ensure two groups are independent is to design and carry out the study properly.
Comparing Means in Two Groups with the t Test
In the study on uterine cryosurgery, Harper (1997) wanted to compare the severity of pain and cramping perceived by women undergoing the usual practice of cryosurgery with that of women who received a paracervical block prior to the cryosurgery. She used a visual analog scale from 0 to 100 to represent the amount of pain or cramping, with higher scores indicating more pain or cramping. Means and standard deviations for various pain and cramping scores are reported in Table 6–3.
The research question is whether women who received a paracervical block prior to the cryosurgery had less severe total cramping than women who did not have a paracervical block. Stating the research question in this way implies that the researcher is interested in a directional or one-tailed test, testing only whether the severity of cramping is less in the group with a paracervical block. From Table 6–3, the mean total cramping score is 35.60 on a scale from 0 to 100 for women who had the paracervical block versus 51.41 for women who did not. This difference could occur by chance, however, and we need to know the probability that a difference this large would occur by chance before we can conclude that these results can generalize to similar populations of women.
The sample sizes are larger than 30 and are fairly similar, so the issues of normality and equal variances are of less concern, and the t test for two independent groups can be used to answer this question. Let us designate women with a paracervical block as group 1 and those without a paracervical block as group 2. The six steps in testing the hypothesis are as follows:
Table 6–3. Means and standard deviations on variables from the study on paracervical block prior to cryosurgery. 


Step 1: H_{0}: Women who had a paracervical block prior to cryosurgery had a mean cramping score at least as high as women who had no block. In symbols, we express it as H_{0}: μ_{1} ≥ μ_{2}
H_{1}: Women who had a paracervical block prior to cryosurgery had a lower mean cramping score than women who had no block. In symbols, we express it as H_{1}: μ_{1} < μ_{2}
Step 2: The t test can be used for this research question (assuming the observations follow a normal distribution, the standard deviations in the population are equal, and the observations are independent). The t statistic for testing the mean difference in two independent groups has the difference between the means in the numerator and the standard error of the mean difference in the denominator; in symbols it is

t = (X̅_{1} – X̅_{2}) / (SD_{p} √(1/n_{1} + 1/n_{2}))
where there are (n_{1} – 1) + (n_{2} – 1) = (n_{1} + n_{2} – 2) degrees of freedom and SD_{p} is the pooled standard deviation. (See section titled, “Comparing Two Means Using Confidence Intervals” for details on how to calculate SD_{p}.)
Step 3: Let us use α = 0.01 so there will be only 1 chance in 100 that we will incorrectly conclude that cramping is less with cryotherapy if it really is not.
Step 4: The degrees of freedom are (n_{1} + n_{2} – 2) = 45 + 39 – 2 = 82. For a one-tailed test, the critical value separating the lower 1% of the t distribution from the upper 99% is approximately –2.39 (using the more conservative value for 60 degrees of freedom in Table A–3). So, the decision is to reject the null hypothesis if the observed value of t is less than –2.39 (Figure 6–3).
Step 5: The calculations for the t statistic follow. First, the pooled standard deviation is 28.27 (see Exercise 2). Then the observed value for t is

t = (35.60 – 51.41) / (28.27 √(1/45 + 1/39)) = –15.81 / 6.18 = –2.56
Please check our calculations using the CD-ROM and the data set in the Harper folder.
Step 6: The observed value of t, –2.56, is less than the critical value of –2.39, so we reject the null hypothesis. In plain words, there is enough evidence in this study to conclude that, on the average, women who had a paracervical block prior to cryosurgery experienced less total cramping than women who did not have the block. Note that our conclusion refers to women on the average and does not mean that every woman with a paracervical block would experience less cramping.
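The arithmetic in Step 5 can be verified with a short Python sketch (the variable names are ours; the summary values come from Table 6–3 and the text):

```python
import math

# Two-sample t statistic for total cramping
n1, mean1 = 45, 35.60   # women with a paracervical block
n2, mean2 = 39, 51.41   # women without a block
sd_p = 28.27            # pooled standard deviation from Step 5

se = sd_p * math.sqrt(1 / n1 + 1 / n2)
t = (mean1 - mean2) / se
print(round(t, 2))  # -2.56
```

Because –2.56 falls below the one-tailed critical value of about –2.39, the script reproduces the decision to reject the null hypothesis.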
Figure 6–3. Areas of acceptance and rejection for testing the hypothesis on mean total cramping in patients with and without paracervical block (α = 0.01, one-tailed). 
Comparing Variation in Independent Groups
The t test for independent groups assumes equal standard deviations or variances, called homogeneous variances, as do the analysis of variance procedures to compare more than two groups discussed in Chapter 7. We can ignore this assumption if the sample sizes are approximately equal. If not, many statisticians recommend testing to see if the standard deviations are equal. If they are not equal, the degrees of freedom for the t test can be adjusted downward, making it more difficult to reject the null hypothesis; otherwise, a nonparametric method, such as the Wilcoxon rank sum test (illustrated in the next section), can be used.
The F Test for Equal Variances
A common statistical test for the equality of two variances is called the F test. This test can be used to determine if two standard deviations are equal, because the standard deviation is the square root of the variance, and if the variances are equal, so are the standard deviations. Many computer programs calculate the F test. This test has some major shortcomings, as we discuss later on; however, an illustration is worthwhile because the F test is the statistic used to compare more than two groups (analysis of variance, the topic of Chapter 7).
To calculate the F test, the larger variance is divided by the smaller variance to obtain a ratio, and this ratio is then compared with the critical value from the F distribution (corresponding to the desired significance level). If two variances are about equal, their ratio will be about 1. If their ratio is significantly greater than 1, we conclude the variances are unequal. Note that we guaranteed the ratio is at least 1 by putting the larger variance in the numerator. How much greater than 1 does F need to be to conclude that the variances are unequal? As you might expect, the significance of F depends partly on the sample sizes, as is true with most statistical tests.
Sometimes common sense indicates no test of homogeneous variances is needed. For example, the standard deviations of the total cramping scores in the study by Harper are approximately 28.1 and 28.5, so the variances are 789.6 and 812.3. The practical significance of this difference is nil, so a statistical test for equal variances is unnecessary, and the t test is an appropriate choice. As another example, consider the standard deviations of pH from the study by Kline and colleagues (2002) given in Table 6–1: 0.05 for the group with a PE and 0.10 for the group without a PE. The relative difference is such that a statistical test will be helpful in deciding the best approach to analysis. The null hypothesis for the test of equal variances is that the variances are equal. Using the pH variances to illustrate the F test, 0.10^{2} = 0.01 and 0.05^{2} = 0.0025, and the F ratio is (putting the larger value in the numerator) 0.01/0.0025 = 4.
Although this ratio is greater than 1, you know by now that we must ask whether a value this large could happen by chance, assuming the variances are equal. The F distribution has two values for degrees of freedom (df): one for the numerator and one for the denominator, each equal to the sample size minus 1. The F distribution for our example has 98 – 1 = 97 df for the numerator and 320 – 1 = 319 df for the denominator. Using α = 0.05, the critical value of the F distribution from Table A–4 is approximately 1.43. (Because of limitations of the table, we used 60 df for the numerator and 120 df for the denominator, resulting in a conservative value.) Because the result of the F test is 4.00, greater than 1.43, we reject the null hypothesis of equal variances. Figure 6–4 shows a graph of the F distribution to illustrate this hypothesis test.
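A small helper function makes the F-ratio calculation explicit (the function is our own sketch and computes only the ratio itself; the critical value still comes from an F table or statistical software):

```python
def f_ratio(sd_a, sd_b):
    # Larger variance divided by the smaller, so the ratio is always >= 1
    v_a, v_b = sd_a ** 2, sd_b ** 2
    return max(v_a, v_b) / min(v_a, v_b)

# pH standard deviations from Table 6-1: 0.05 (PE) and 0.10 (no PE)
f = f_ratio(0.05, 0.10)
print(round(f, 2))  # 4.0
```

Because 4.0 exceeds the critical value of about 1.43, we would reject the hypothesis of equal variances, matching the conclusion in the text.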
If the F test is significant and the hypothesis of equal variances is rejected, the standard deviations from the two samples cannot be pooled for the t test because pooling assumes they are equal. When this happens, one approach is to use separate variances and decrease the degrees of freedom for the t test. Reducing the degrees of freedom requires a larger observed value for t in order to reject the null hypothesis; in other words, a larger difference between the means is required. We can think of this correction as a penalty for violating the assumption of equal standard deviations when we have unequal sample sizes. Alternatively, a nonparametric procedure may be used.
The Levene Test for Equal Variances
The major problem with using the F test is that it is very sensitive to data that are not normal. Statisticians say the F test is not robust to departures from normality—it may appear significant because the data are not normally distributed and not because the variances are unequal.
Figure 6–4. Illustration of F distribution with 60 and 120 degrees of freedom (with α = 0.05 critical area, one-tailed). (Graph produced with Visual Statistics software; used with permission.) 
Several other procedures can be used to test the equality of standard deviations, and most computer programs provide alternatives to the F statistic. A good alternative is the Levene test. For two groups, the Levene test is a t test of the absolute value of the distance each observation is from the mean in that group (not a t test of the original observations). So, in essence, it tests the hypothesis that the average deviations (of observations from the mean in each group) are the same in the two groups. If the value of the Levene test is significant, the conclusion is that, on average, the deviations from the mean in one group exceed those in the other. It is a good approach whether or not the data are normally distributed.
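To make the idea concrete, here is a minimal Python sketch of the two-group Levene test as just described, that is, a t test on absolute deviations from each group's mean (the function and the example data are hypothetical, not from the study):

```python
import math

def levene_two_groups(x, y):
    """Two-group Levene statistic: a t test on absolute deviations from
    each group's mean (using the median instead gives the modified test)."""
    def abs_dev(group):
        m = sum(group) / len(group)
        return [abs(v - m) for v in group]
    dx, dy = abs_dev(x), abs_dev(y)
    n1, n2 = len(dx), len(dy)
    m1, m2 = sum(dx) / n1, sum(dy) / n2
    v1 = sum((d - m1) ** 2 for d in dx) / (n1 - 1)
    v2 = sum((d - m2) ** 2 for d in dy) / (n2 - 1)
    sd_p = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    se = sd_p * math.sqrt(1 / n1 + 1 / n2)
    return (m1 - m2) / se

# Two hypothetical groups with identical spread: the statistic is 0
print(levene_two_groups([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))  # 0.0
```

The resulting statistic is compared against the t distribution with n1 + n2 – 2 degrees of freedom.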
The Levene test can also be used for testing the equality of variances when more than two groups are being compared. Both the Statistical Package for the Social Sciences (SPSS) and the JMP statistical software from the SAS Institute report this statistic, and NCSS reports the modified Levene test, in which the mean in each group is replaced with the median in each group.
Table 6–4 shows computer output when JMP is used to test the equality of variances. Note that JMP provides both the average deviations from the mean and the median and that the probability is similar for all of the tests presented. Because the P value of the Levene test is greater than 0.05, 0.6701 in this example, we do not reject the hypothesis of equal variances and proceed with the t test.
Use the CD-ROM to test the equality of variances for pH. This test is found in the t test procedure in SPSS and NCSS and in the model-fitting procedure in JMP. In NCSS the F test is called the variance-ratio equal-variance test, and the Levene test is called the modified-Levene equal-variance test. Why do you think these two tests came to opposite conclusions about the equality of the variances for pH? See Exercise 12.
Comparing Means with the Wilcoxon Rank Sum Test
Sometimes researchers want to compare two independent groups when one or more of the assumptions for the t test is seriously violated. Several options are available. In Chapter 5 we discussed transformations when observations are not normally distributed; this approach may also be used when two groups are being analyzed. In particular, transformations can be effective when standard deviations are not equal. More often, however, researchers in the health field use a nonparametric test. The test goes by various names: Wilcoxon rank sum test, Mann–Whitney U test, or Mann–Whitney–Wilcoxon rank sum test.^{b} The text by Hollander and Wolfe (1999) provides information on many nonparametric tests, as does the text by Conover (1998).
Table 6–4. Computer listing from JMP on testing the equality of variances. 


In essence, this test tells us whether medians (as opposed to means) are different. The Wilcoxon rank sum test is available in most statistical computer packages, so our goal will be only to acquaint you with the procedure.
To illustrate the Wilcoxon rank sum test, we use the total cramping scores from Presenting Problem 2 (Harper, 1997). The first step is to rank all the scores from lowest to highest (or vice versa), ignoring the group they are in. Table 6–5 lists the scores and their rankings. In this example, the lowest score is 0, given by subjects 5, 21, 46, 52, 55, 77, and 81. Ordinarily, these seven subjects would be assigned ranks 1, 2, 3, 4, 5, 6, and 7. When subjects have the same or tied scores, however, the practice is to assign the average rank, so each of these seven women is given the rank of 4. This process continues until all scores have been ranked, with the highest score, 100 by subject 13, receiving the rank of 84 because there are 84 subjects.
After the observations are ranked, the ranks are analyzed just as though they were the original observations. The mean and standard deviation of the ranks are calculated for each group, and these are used to calculate the pooled standard deviation of the ranks and the t test.
The Wilcoxon rank sum method tests the hypothesis that the means of the ranks are equal. Conceptually, the test proceeds as follows: If there is no significant difference between the two groups, some low ranks and some high ranks will occur in each group; that is, the ranks will be distributed across the two groups more or less evenly. In this situation, the mean ranks will be similar as well. On the other hand, if a large difference occurs between the two groups, one group will have more subjects with higher ranks than the other group, and the mean of the ranks will be higher in that group.
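The ranking step, including the average-rank convention for ties, can be sketched in Python as follows (a hypothetical example; the eight scores below simply mirror the seven tied zeros described in the text):

```python
def rank_with_ties(values):
    """Assign ranks 1..n from lowest to highest, giving tied values
    the average of the ranks they would otherwise occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Seven tied lowest scores each receive the average rank 4, as in the text
print(rank_with_ties([0, 0, 0, 0, 0, 0, 0, 100]))
```

Once every observation has its rank, the ranks themselves can be analyzed with the ordinary two-sample t test, as described above.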
Use the CD-ROM to do the t test on the rank variable in the Harper data set. To be consistent with the t test on the original observations reported in the section titled, “Comparing Means in Two Groups with the t Test,” use α = 0.01 and do a one-tailed test to see if the paracervical block results in lower cramping scores.
Using NCSS, the t test on the rank of cramping scores is 2.42, greater than the critical value of 2.39, so again we reject the null hypothesis and conclude the paracervical block had a beneficial result. The output from NCSS on both the Wilcoxon rank sum test on the original observations and the t test using the ranked data is given in Table 6–6. Note from the shaded lines in Table 6–6 that the P value for the differences is 0.009364 when the Wilcoxon rank sum test is used and 0.008897 when the t test on the ranks is done instead. As we see, using the t test on ranks is a good alternative if a computer program for the Wilcoxon test is not available.
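The close agreement between the Wilcoxon rank sum test and a t test on the ranks can be illustrated with synthetic data. The scores below are randomly generated stand-ins, not the Harper observations; the point is only that the two P values lead to the same conclusion.

```python
# Synthetic cramping scores (hypothetical, not the Harper data set):
# compare the Wilcoxon rank sum (Mann-Whitney) test with a t test
# applied to the ranks of the pooled observations.
import numpy as np
from scipy.stats import mannwhitneyu, rankdata, ttest_ind

rng = np.random.default_rng(0)
no_block = rng.integers(20, 101, size=40)  # higher scores without anesthetic
block = rng.integers(0, 81, size=45)       # lower scores with the block

# Wilcoxon rank sum test (one-sided: block group tends to score lower).
u_stat, p_wilcoxon = mannwhitneyu(block, no_block, alternative="less")

# t test applied to the ranks of the pooled scores.
pooled = np.concatenate([block, no_block])
ranks = rankdata(pooled)
t_stat, p_t_on_ranks = ttest_ind(ranks[:45], ranks[45:], alternative="less")

print(p_wilcoxon, p_t_on_ranks)  # the two P values are typically very close
```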
We used box-and-whisker plots to evaluate the distribution of the observations (Figure 6–5). What do you conclude about the distributions? The medians (denoted by tiny circles) fall midway in the boxes that enclose the 25th to 75th percentiles of scores in both groups, indicating that the distributions are roughly symmetric. The positive tail for the scores for the women receiving a block is a little longer than the negative tail, indicating a slight positive skew, but overall, the distributions are fairly normal. So, it appears that the assumptions for the t test are adequately met in this example, and, as we would expect, the t test and the nonparametric Wilcoxon rank sum test lead to the same conclusion.
Table 6–5. Rank of total cramping scores from the study on paracervical block prior to cryosurgery. 



Table 6–6. Illustration of t test of cramping score to obtain Wilcoxon rank sum and t test of ranks of cramping score. 


The Wilcoxon rank sum test illustrated earlier and the signed rank test discussed in Chapter 5 are excellent alternatives to the t test. When the assumptions are met for the t test and the null hypothesis is not true, the Wilcoxon test is almost as likely as the t test to reject the null hypothesis; statisticians say the Wilcoxon test is almost as powerful as the t test. Furthermore, when the assumptions are not met, the Wilcoxon tests are more powerful than the t test.
DECISIONS ABOUT PROPORTIONS IN TWO INDEPENDENT GROUPS
We now turn to research questions in which the outcome is a counted or categorical variable. As discussed in Chapter 3, proportions or percentages are commonly used to summarize counted data. When the research question involves two independent groups, we can learn whether the proportions are different using any of three different methods:
1. Form a confidence interval for the difference in proportions using the z distribution.
2. Test the hypothesis of equal proportions using the z test.
Figure 6–5. Illustration of box plots to compare distributions of cramping scores. (Data, used with permission, from Harper DM: Paracervical block diminishes cramping associated with cryotherapy. J Fam Pract 1997;44:71–75. Plot produced with NCSS 97, a registered trademark of the Number Cruncher Statistical System; used with permission.) 
3. Test the hypothesis of expected frequencies using a chi-square test.
The first two methods are extensions of the way we formed confidence intervals and used the z test for one proportion in Chapter 5. The chi-square method is new in this chapter, but we like this test because it is very versatile and easy to use. Although each approach gives a different numerical result, they all lead to the same conclusion about differences between two independent groups. The method that investigators decide to use depends primarily on how they think about and state their research question.
Confidence Interval for Comparing Two Independent Proportions
We discussed the survey by Lapidus and colleagues (2002) about domestic violence (DV) education and training and the use of DV screening among a sample of pediatricians and family physicians. Table 3–9 gives information from the study. The investigators wanted to know if the proportion of physicians who screen is different for those who did and did not have DV training. To find a confidence interval for the difference in proportions, we first need to convert the numbers to proportions. In this example, the proportion of physicians with training who screened was 175/202 = 0.866, and the proportion of physicians without training who screened was 155/266 = 0.583.
Recall that the general form for a confidence interval is the observed statistic ± (confidence coefficient × standard error of the statistic).
In Chapter 5 we saw that the z test provides a good approximation to the binomial when the product of the proportion and the sample size is 5 or more. We illustrated this method for research questions about one proportion in Chapter 5, and we can use a similar approach for research questions about two proportions, except that the product of the proportion and the sample size must be at least 5 in each group. To do this we let p_{1} stand for the proportion of physicians trained in DV who screened patients and p_{2} for the proportion of physicians not trained in DV who screened patients. These proportions are estimates of the proportions in the population, and the difference between the two proportions (π_{1} – π_{2} in the population) is estimated by p_{1} – p_{2} or 0.866 – 0.583 = 0.284. This difference is the statistic about which we want to form a confidence interval (the first term in the formula for a confidence interval).
Generally, we form 95% confidence intervals, so referring to the z distribution in Table A–2 and locating the value that defines the central 95% of the z distribution, we find 1.96, a value that is probably becoming familiar by now.
The third term is the standard error of the difference between two proportions. Just as with the difference in two means, it is quite a chore to calculate the standard error of the difference in two proportions, so again we will illustrate the calculations to show the logic of the statistic but expect that you will always use a computer to calculate this value.
Recall from Chapter 5 that the standard error of one proportion is √[p(1 – p)/n].
With two proportions, there are two standard errors and the standard error of the difference p_{1} – p_{2} is a combination of them.
Similar to the way two sample standard deviations are pooled when two means are compared, the two sample proportions are pooled to form a weighted average using the sample sizes as weights. The pooled proportion provides a better estimate to use in the standard error; it is designated simply as p without any subscripts and is calculated by adding the observed frequencies (n_{1} p_{1} + n_{2} p_{2}) and dividing by the sum of the sample sizes, n_{1} + n_{2}. When we substitute p = (n_{1} p_{1} + n_{2} p_{2}) ÷ (n_{1} + n_{2}) for each of p_{1} and p_{2} in the preceding formula, we have the formula for the standard error of the difference, which is the square root of the product of three values: the pooled proportion p, 1 minus the pooled proportion (1 – p), and the sum of the reciprocals of the sample sizes (1/n_{1}) + (1/n_{2}). In symbols, the standard error for the difference in two proportions is SE(p_{1} – p_{2}) = √[p(1 – p)(1/n_{1} + 1/n_{2})].
The formula for the standard error of the difference between two proportions can be thought of as an average of the standard errors in each group. Putting all these pieces together, a 95% confidence interval for the difference in two proportions is (p_{1} – p_{2}) ± 1.96 × SE(p_{1} – p_{2}). To illustrate, first find the standard error of the difference between the proportions of physicians who screened patients for DV. The two proportions are 0.866 and 0.583. The pooled, or average, proportion is therefore p = (175 + 155)/(202 + 266) = 330/468 = 0.705.
As you might expect, the value of the pooled proportion, like the pooled standard deviation, always lies between the two proportions.
Next, we substitute 0.705 for the pooled proportion, p, and use the sample sizes from this study to find the standard error of the difference between the two proportions: √[(0.705)(0.295)(1/202 + 1/266)] = 0.043.
So, the 95% confidence interval for the difference in the two proportions is 0.284 ± 1.96 × 0.043, which (carrying unrounded values through the computation) gives the interval 0.201 to 0.367.
The interpretation of this confidence interval is similar to that for other confidence intervals: Although we observed a difference of 0.284, we have 95% confidence that the interval from 0.201 to 0.367 contains the true difference in the proportion of physicians who screen patients for DV.
Because the entire confidence interval is greater than zero (ie, zero is not within the interval), we can conclude that the proportions are significantly different from each other at P < 0.05 (ie, because it is a 95% confidence interval). If, however, a confidence interval contains zero, there is not sufficient evidence to conclude a difference exists between the proportions. Please confirm these calculations using the CD-ROM, and obtain a 99% confidence interval as well. Are the two proportions significantly different at P < 0.01 as well?
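The confidence interval calculation can be sketched directly from the counts reported for the Lapidus study (175 of 202 trained physicians and 155 of 266 untrained physicians who screen):

```python
# 95% CI for the difference in two independent proportions (DV study counts).
from math import sqrt

n1, x1 = 202, 175   # physicians with DV training; number who screen
n2, x2 = 266, 155   # physicians without DV training; number who screen

p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)                      # weighted average
se = sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))  # SE of the difference

diff = p1 - p2
lower, upper = diff - 1.96 * se, diff + 1.96 * se
print(round(diff, 3), round(lower, 3), round(upper, 3))  # ~0.284 (0.200, 0.367)
```

Small rounding differences in the hand calculation explain the 0.201 lower limit reported in the text.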
The z Test and Two Independent Proportions
Recall that confidence intervals and hypothesis tests lead to the same conclusion. We use the same data on physicians who screen for DV from the study by Lapidus and colleagues (2002) to illustrate the z test^{c} for the difference between two independent proportions. The six-step process for testing a statistical hypothesis follows. The symbols π_{1} and π_{2} stand for the proportions in the two populations of physicians.
Step 1: H_{0}: The proportion of physicians with training who screen for DV is the same as the proportion of physicians without training who screen for DV, or π_{1} = π_{2}.
H_{1}: The proportion of physicians with training who screen for DV is not the same as the proportion of physicians without training who screen for DV, or π_{1} ≠ π_{2}.
Here, a two-tailed or nondirectional test is used because the researcher is interested in knowing whether a difference exists in either direction; that is, whether training results in more or fewer patients being screened.
Step 2: The z test for one proportion, introduced in Chapter 5, can be modified to compare two independent proportions as long as the observed frequencies are ≥ 5 in each group. The test statistic, in words, is the difference between the observed proportions divided by the standard error of the difference. In terms of sample values, z = (p_{1} – p_{2})/√[p(1 – p)(1/n_{1} + 1/n_{2})]
where p_{1} is the proportion in one group, p_{2} is the proportion in the second group, and p (with no subscript) stands for the pooled, or average, proportion (defined in the section titled, “Confidence Interval for Comparing Two Independent Proportions”).
Step 3: Choose the level for a type I error (concluding there is a difference in the screening when there really is no difference). We use α = 0.05 so the findings will be consistent with those based on the 95% confidence interval in the previous section.
Step 4: Determining the critical value for a two-tailed test at α = 0.05, the value of the z distribution that separates the upper and lower 2.5% of the area under the curve from the central 95% is ±1.96 (from Table A–2). We therefore reject the null hypothesis of equal proportions if the observed value of the z statistic is less than the critical value of –1.96 or greater than +1.96. (Before continuing, based on the confidence interval in the previous section, do you expect the value of z to be greater or less than either of these critical values?)
Step 5: Calculations are z = (0.866 – 0.583)/√[(0.705)(0.295)(1/202 + 1/266)] = 0.284/0.043 = 6.60
Step 6: The observed value of z, 6.60, is greater than 1.96, so the null hypothesis—that the proportion of patients screened for DV is the same regardless of whether the physician was trained in DV—is rejected. And we conclude that different proportions of patients were screened for DV. In this situation, patients seen by physicians who had DV training were more likely to be screened for DV than those seen by physicians without DV training. Note the consistency of this conclusion with the 95% confidence interval in the previous section.
To report results in terms of the P value, we find the area of the z distribution closest to 6.60 in Table A–2 and see that no probabilities are given for a z value this large. In this situation, we recommend reporting P < 0.001. The actual probability can be found in NCSS under the Analysis pull-down menu: go to Other and then click on Probability Calculator. Few computer programs give procedures for the z test for two proportions. Instead, they produce an alternative method to compare two proportions: the chi-square test, the subject of the next section. NCSS has a procedure to test two proportions as one of the choices under “Other,” and we suggest you confirm the preceding calculations using it.
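The z test itself is a one-line calculation from the same counts. In this sketch the intermediate values are not rounded, so the resulting z is slightly larger than the 6.60 obtained by hand.

```python
# z test for two independent proportions, using the DV study counts.
from math import sqrt

n1, x1 = 202, 175
n2, x2 = 266, 155
p1, p2 = x1 / n1, x2 / n2
p = (x1 + x2) / (n1 + n2)   # pooled proportion

z = (p1 - p2) / sqrt(p * (1 - p) * (1/n1 + 1/n2))
print(round(z, 2))  # -> 6.66; the text's 6.60 reflects rounded intermediates
```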
Using Chi-Square to Compare Frequencies or Proportions in Two Groups
We can use the chi-square test to compare frequencies or proportions in two or more groups and in many other applications as well. Used in this manner, the test is referred to as the chi-square test for independence. This versatility is one of the reasons researchers so commonly use chi-square. In addition, the calculations are relatively easy to apply to data presented in tables. Like the z approximation, the chi-square test is an approximate test, and using it should be guided by some general principles we will mention shortly.
Lapidus and colleagues (2002) wanted to know whether the proportion of physicians who screened patients for DV was the same, regardless of previous training in screening (see Table 3–9). Actually, we can state this research question two different ways:
1. Is there a difference in the proportions of physicians who screen and do not screen for DV? Stated this way, the chi-square test can be used to test the equality of two proportions.
2. Is there an association (or relationship or dependency) between a physician's prior DV training and whether the physician screens for DV? Stated this way, the chi-square test can be used to test whether one of the variables is associated with the other. When we state the research hypothesis in terms of independence, the chi-square test is generally (and appropriately) called the chi-square test for independence.
In fact, we use the same chi-square test regardless of how we state the research question—an illustration of the test's versatility.
An Overview of the Chi-Square Test
Before using the chi-square test with the data from Lapidus and colleagues, it is useful to have an intuitive understanding of the logic of the test. Table 6–7A contains data from a hypothetical study in which 100 patients are given an experimental treatment and 100 patients receive a control treatment. Fifty patients, or 25%, respond positively; the remaining 150 patients respond negatively. The numbers in the four cells are the observed frequencies in this hypothetical study.
Now, if no relationship exists between treatment and outcome, meaning that treatment and outcome are independent, we would expect approximately 25% of the patients in the treatment group and 25% of the patients in the control group to respond positively. Similarly, we would expect approximately 75% of the patients in the treatment group and approximately 75% in the control group to respond negatively. Thus, if no relationship exists between treatment and outcome, the frequencies should be as listed in Table 6–7B. The numbers in the cells of Table 6–7B are called expected frequencies.
The logic of the chi-square test follows:
1. The total number of observations in each column (treatment or control) and the total number of observations in each row (positive or negative) are considered to be given or fixed. (These column and row totals are also called marginal frequencies.)
2. If we assume that columns and rows are independent, we can calculate the number of observations expected to occur by chance—the expected frequencies. We find the expected frequencies by multiplying the column total by the row total and dividing by the grand total. For instance, in Table 6–7B the number of treated patients expected to be positive by chance is (100 × 50)/200 = 25. We put this expected value in cell (1, 1), where the first 1 refers to the first row and the second 1 refers to the first column.
Table 6–7. Hypothetical data for chi-square. 


3. The chi-square test compares the observed frequency in each cell with the expected frequency. If no relationship exists between the column and row variables (ie, treatment and response), the observed frequencies will be very close to the expected frequencies; they will differ only by small amounts.^{d} In this instance, the value of the chi-square statistic will be small. On the other hand, if a relationship (or dependency) does occur, the observed frequencies will vary quite a bit from the expected frequencies, and the value of the chi-square statistic will be large.
Putting these ideas into symbols, O stands for the observed frequency in a cell and E for the expected frequency in a cell. In each cell, we find the difference and square it (just as we did to find the standard deviation—so, when we add them, the differences do not cancel each other—see Chapter 3). Next, we divide the squared difference by the expected value. At this point we have the following term corresponding to each cell: (O – E)^{2}/E
Finally, we add the terms from each cell to get the chi-square statistic: χ^{2}_{(df)} = Σ [(O – E)^{2}/E]
where χ^{2} stands for the chi-square statistic, and (df) stands for the degrees of freedom.
The Chi-Square Distribution
The chi-square distribution, χ^{2} (lowercase Greek letter chi, pronounced like the “ki” in kite), like the t distribution, has degrees of freedom. In the chi-square test for independence, the number of degrees of freedom is equal to the number of rows minus 1 times the number of columns minus 1, or df = (r – 1)(c – 1), where r is the number of rows and c the number of columns. Figure 6–6 shows the chi-square distribution for several degrees of freedom. As you can see, the chi-square distribution has no negative values. The mean of the chi-square distribution is equal to the degrees of freedom; therefore, as the degrees of freedom increase, the mean moves more to the right. In addition, the standard deviation increases as the degrees of freedom increase, so the chi-square curve spreads out more as the degrees of freedom increase. In fact, as the degrees of freedom become very large, the shape of the chi-square distribution becomes more like the normal distribution.
Figure 6–6. Chi-square distribution corresponding to 1, 4, and 20 degrees of freedom. (Data, used with permission of the authors and publisher, Kline JA, Nelson RD, Jackson RE, Courtney DM: Criteria for the safe use of D-dimer testing in emergency department patients with suspected pulmonary embolism: A multicenter US study. Ann Emergency Med 2002;39:144–152. Plot produced with NCSS; used with permission.) 
To use the chi-square distribution for hypothesis testing, we find the critical value in Table A–5 that separates the area defined by α from that defined by 1 – α. Table A–5 contains only upper-tailed values for χ^{2} because they are the values generally used in hypothesis testing. Because the chi-square distribution is different for each value of the degrees of freedom, different critical values correspond to different degrees of freedom. For example, the critical value for χ^{2}_{(1)} with α = 0.05 is 3.841.
If you have access to a computer program that produces statistical distributions, find the chi-square distribution and change the degrees of freedom. Note how the curve changes as the degrees of freedom change.
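With a statistical package such as scipy (used here as one possibility), the tabled critical value and the behavior of the distribution as the degrees of freedom change can be checked directly:

```python
# Sketch using scipy's chi-square distribution object.
from scipy.stats import chi2

# Critical value separating the upper 5% for 1 degree of freedom (Table A-5).
print(round(chi2.ppf(0.95, df=1), 3))  # -> 3.841

# The mean equals df and the spread grows with df, so the curve shifts
# right and flattens as the degrees of freedom increase.
for df in (1, 4, 20):
    print(df, chi2.mean(df), round(chi2.std(df), 2))
```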
The Chi-Square Test for Independence
Now we apply the chi-square test to the observations in Table 3–9.
Step 1: H_{0}: Training in DV and subsequent screening (ie, rows and columns) are independent.
H_{1}: Training in DV and subsequent screening (ie, rows and columns) are not independent.
Step 2: The chi-square test is appropriate for this research question because the observations are nominal data (frequencies).
Step 3: We use the traditional α of 0.05.
Step 4: The contingency table has two rows and two columns, so df = (2 – 1)(2 – 1) = 1. The critical value in Table A–5 that separates the upper 5% of the χ^{2} distribution from the remaining 95% is 3.841. The chi-square test for independence is almost always a one-tailed test to see whether the observed frequencies vary from the expected frequencies by more than the amount expected by chance. We will reject the null hypothesis of independence if the observed value of χ^{2} is greater than 3.841.
Step 5: The first step in calculating the chi-square statistic is to find the expected frequencies for each cell. The illustration using hypothetical data (see Table 6–7) showed that expected frequencies are found by multiplying the column total by the row total and dividing by the grand total: Expected frequency = (row total × column total)/grand total
See Exercise 7 to learn why expected values are found this way.
As an illustration, multiplying the number of physicians with DV training, 202, by the number who screen for DV, 330, and then dividing by the total number of physicians, 468, gives (202 × 330)/468 = 142.4, the expected frequency for cell (1, 1), abbreviated E(1, 1). The expected frequencies for the remaining cells in Table 3–9 are listed in Table 6–8. None of the expected frequencies is < 5, so we can proceed with the chi-square test. (We explain why expected frequencies should not be too small in the next section.) Then, squaring the difference between the observed and expected frequencies in each cell, dividing by the expected frequency, and adding them all to find χ^{2} gives the following: χ^{2} = (175 – 142.4)^{2}/142.4 + (27 – 59.6)^{2}/59.6 + (155 – 187.6)^{2}/187.6 + (111 – 78.4)^{2}/78.4 = 44.41 (carrying unrounded expected frequencies through the calculation)
Step 6: The observed value of χ^{2}_{(1)}, 44.41, is greater than 3.841, so we easily reject the null hypothesis of independence and conclude that a dependency or relationship exists between DV training and screening for DV. Because this study is not an experimental study, it is not possible to conclude that DV training causes a physician to be more likely to screen patients for DV. We can only say that training and screening are associated. Use the CD-ROM to confirm these calculations and compare the value of χ^{2} with that in Table 6–8.
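The whole test can be sketched with scipy's `chi2_contingency`; the cell counts are reconstructed from the proportions reported in the text (175 and 27 for trained physicians, 155 and 111 for untrained).

```python
# Chi-square test for independence on the 2x2 DV training table.
from scipy.stats import chi2_contingency

#                screen  do not screen
observed = [[175, 27],    # DV training
            [155, 111]]   # no DV training

# correction=False reproduces the uncorrected chi-square computed by hand.
chi2_stat, p_value, df, expected = chi2_contingency(observed, correction=False)
print(round(chi2_stat, 2), df, round(expected[0][0], 1))  # ~44.4, 1, ~142.4
```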
Table 6–8. A 2 × 2 table for the study on DV training and screening. 



Table 6–9. Standard notation for a chi-square 2 × 2 table. 
Using Chi-Square Tests
Because of the widespread use of chi-square tests in the literature, it is worthwhile to discuss several aspects of these tests.
Shortcut Chi-Square Formula for 2 × 2 Tables
A shortcut formula simplifies the calculation of χ^{2} for 2 × 2 tables, eliminating the need to calculate expected frequencies. Table 6–9 gives the setup of the table for the shortcut formula.
The shortcut formula for calculating χ^{2} from a 2 × 2 contingency table is χ^{2} = n(ad – bc)^{2}/[(a + b)(c + d)(a + c)(b + d)]
Using this formula with data gives
This value for χ^{2} agrees (within rounding error) with the value obtained in the previous section. In fact, the two approaches are equivalent for 2 × 2 tables.
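A quick sketch verifying that the shortcut formula reproduces the cell-by-cell result on the DV counts:

```python
# DV training table in the standard a, b, c, d notation:
# a + b = 202 trained, c + d = 266 untrained.
a, b = 175, 27    # trained: screen, do not screen
c, d = 155, 111   # untrained: screen, do not screen
n = a + b + c + d

# chi2 = n(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)]
chi2_short = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2_short, 2))  # -> 44.42, matching the cell-by-cell calculation
```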
Small Expected Frequencies & Fisher's Exact Test
The chi-square procedure, like the test based on the z approximation, is an approximate method. Just as the z test should not be used unless np in both groups is > 5, the chi-square test should not be used when the expected frequencies are small. Look at the formula for chi-square, χ^{2} = Σ [(O – E)^{2}/E].
It is easy to see that a small expected frequency in the denominator of one of the terms in the equation causes that term to be large, which, in turn, inflates the value of chisquare.
How small can expected frequencies be before we must worry about them? Although there is no absolute rule, most statisticians agree that an expected frequency of 2 or less means that the chi-square test should not be used; and many argue that chi-square should not be used if an expected frequency is less than 5. We suggest that if any expected frequency is less than 2, or if more than 20% of the expected frequencies are less than 5, then an alternative procedure called Fisher's exact test should be used for 2 × 2 tables. (We emphasize that the expected values are of concern here, not the observed values.) If the contingency table of observations is larger than 2 × 2, categories should be combined to eliminate most of the expected values < 5.
Fisher's exact test gives the exact probability of the occurrence of the observed frequencies, given the assumption of independence and the size of the marginal frequencies (row and column totals). For example, using the notation in Table 6–9, the probability P of obtaining the observed frequencies in the table is P = [(a + b)!(c + d)!(a + c)!(b + d)!]/(n! a! b! c! d!)
Recall that ! is the symbol for factorial; that is, n! = n(n – 1)(n – 2) …(3)(2)(1).
The null hypothesis tested with both the chi-square test and Fisher's exact test is that the observed frequencies or frequencies more extreme could occur by chance, given the fixed values of the row and column totals. For Fisher's exact test, the probability for each distribution of frequencies more extreme than those observed must therefore also be calculated, and the probabilities of all the more extreme sets are added to the probability of the observed set. Fisher's exact test is especially appropriate with small sample sizes, and most statistical programs automatically provide it as an alternative to chi-square for 2 × 2 tables; see the output from SPSS in Table 6–8.
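Most packages compute Fisher's exact test directly. This sketch uses scipy's `fisher_exact` on a small hypothetical 2 × 2 table (invented counts, chosen so that half of the expected frequencies fall below 5 and chi-square would be inappropriate):

```python
# Fisher's exact test on a small hypothetical table.
from scipy.stats import fisher_exact

#           outcome+  outcome-
table = [[1, 9],   # group 1
         [8, 2]]   # group 2

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(round(p_value, 4))  # exact two-sided P value, well below 0.05
```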
Readers of medical journals need a basic understanding of the purpose of this statistic, not how to calculate it; that is, you need only remember that Fisher's exact test is more appropriate than the chi-square test in 2 × 2 tables when expected frequencies are small.
Continuity Correction
Some investigators report corrected chi-square values, called chi-square with continuity correction or chi-square with Yates' correction. This correction is similar to the one for the z test for one proportion discussed in Chapter 5; it involves subtracting ½ from the absolute value of the difference between observed and expected frequencies in the numerator of χ^{2} before squaring, and it has the effect of making the value for χ^{2} smaller. (In the shortcut formula, n/2 is subtracted from the absolute value of ad – bc prior to squaring.)
A smaller value for χ^{2} means that the null hypothesis will not be rejected as often as it is with the larger, uncorrected chi-square; that is, the test is more conservative. Thus, the risk of a type I error (rejecting the null hypothesis when it is true) is smaller; however, the risk of a type II error (not rejecting the null hypothesis when it is false and should be rejected) then increases. Some statisticians recommend the use of the continuity correction for all 2 × 2 tables, but others caution against its use. Both corrected and uncorrected chi-square statistics are commonly encountered in the medical literature.
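The effect of the continuity correction can be seen by computing both versions on the DV training table; this sketch uses scipy, where `correction=True` applies Yates' correction.

```python
# Uncorrected versus Yates-corrected chi-square on the DV training table.
from scipy.stats import chi2_contingency

observed = [[175, 27], [155, 111]]

chi2_uncorrected, *_ = chi2_contingency(observed, correction=False)
chi2_corrected, *_ = chi2_contingency(observed, correction=True)

# The corrected statistic is always somewhat smaller (more conservative).
print(round(chi2_uncorrected, 2), round(chi2_corrected, 2))
```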
Risk Ratios versus Chi-Square
Both the chi-square test and the z approximation test allow investigators to test a hypothesis about equal proportions or about a relationship between two nominal measures, depending on how the research hypothesis is articulated. It may have occurred to you that the risk ratios (relative risk or odds ratio) introduced in Chapter 3 could also be used with 2 × 2 tables when the question is about an association. The statistic selected depends on the purpose of the analysis. If the objective is to estimate the relationship between two nominal measures, then the relative risk or the odds ratio is appropriate. Furthermore, confidence intervals can be found for relative risks and odds ratios (illustrated in Chapter 8), which, for all practical purposes, accomplish the same end as a significance test. Confidence intervals for risk ratios are being used with increasing frequency in medical journals.
Overuse of Chi-Square
Because the chi-square test is so easy to understand and calculate, it is sometimes used when another method is more appropriate. A common misuse of chi-square tests occurs when two groups are being analyzed and the characteristic of interest is measured on a numerical scale. Instead of correctly using the t test, researchers convert the numerical scale to an ordinal or even binary scale and then use chi-square. As an example, investigators brought the following problem to one of us.
Some patients who undergo a surgical procedure are more likely to have complications than other patients. The investigators collected data on one group of patients who had complications following surgery and on another group of patients who did not have complications, and they wanted to know whether a relationship existed between the patient's age and the patient's likelihood of having a complication. The investigators had formed a 2 × 2 contingency table, with the columns being complication versus no complication, and the rows being patient age ≥ 45 years versus age < 45 years. The investigators had performed a chi-square test for independence. The results, much to their surprise, indicated no relationship between age and complication.
The problem was the arbitrary selection of 45 years as a cutoff point for age. When a t test was performed, the mean age of patients who had complications was significantly greater than the mean age of patients who did not. Forty-five years of age, although meaningful perhaps from a clinical perspective related to other factors, was not the age sensitive to the occurrence of complications.
When numerical variables are analyzed with methods designed for ordinal or categorical variables, the greater specificity or detail of the numerical measurement is wasted. Investigators may opt to categorize a numerical variable, such as age, for graphic or tabular presentation or for use in logistic regression (Chapter 10), but only after investigating whether the categories are appropriate.
FINDING SAMPLE SIZES FOR MEANS AND PROPORTIONS IN TWO GROUPS
In Chapter 5 we discussed the importance of having enough subjects in a study to find significance if a difference really occurs. We saw that a relationship exists between sample size and being able to conclude that a difference exists: As the sample size increases, the power to detect an actual difference also increases. The process of estimating the number of subjects needed in a study is part of what is called power analysis. Knowing the sample sizes needed is helpful in determining whether a negative study is negative because the sample size was too small. More and more journal editors now require authors to provide this key information, as do all funding agencies.
Just as with studies involving only one group, a variety of formulas can be used to estimate how large a sample is needed, and several computer programs are available for this purpose as well. The formulas given in the following section protect against both type I and type II errors for two common situations: when a study involves two means or two proportions.
Finding the Sample Size for Studies About Means in Two Groups
This section presents the process to estimate the approximate sample size for a study comparing the means in two independent groups of subjects. The researcher needs to answer four questions, the first two of which are the same as those presented in Chapter 5 for one group:
1. What level of significance (α level or P value) related to the null hypothesis is wanted?
2. How great should the chances be of detecting an actual difference; that is, what is the desired level of power (equal to 1 – β)?
3. How large should the difference between the mean in one group and the mean in the other group be for the difference to have clinical importance?
4. What is a good estimate of the standard deviations? To simplify this process, we assume that the standard deviations in the two populations are equal.
To summarize, if μ_{1} – μ_{2} is the magnitude of the difference to be detected between the two groups, σ is the estimate of the standard deviation in each group, z_{α} is the two-tailed value of z related to α, and z_{β} is the lower one-tailed value of z related to β, then the sample size needed in each group is n = 2[(z_{α} – z_{β})σ/(μ_{1} – μ_{2})]^{2}
To illustrate this formula, recall that Kline and his colleagues (2002) in Presenting Problem 1 compared various outcomes for 181 patients who had a pulmonary embolism with 742 patients who did not. We found the 95% confidence interval for the difference in pulse oximetry (2.39) in the section titled, “Comparing Two Means Using Confidence Intervals.” The 95% CI, 1.66 to 3.11, does not contain 0, so we conclude that a difference exists between pulse oximetry in the two groups. Suppose the investigators, prior to beginning their study, wished to have a large enough sample of patients to be able to detect a mean difference of 2 or more. Assume they were willing to accept a type I error of 0.05 and wanted to be able to detect a true difference with 0.80 probability (β = 0.20). Based on their clinical experience, they estimated the standard deviation as 6. Using these values, what sample size is needed?
The two-tailed z value for α of 0.05 is ±1.96, and the lower one-tailed z value for β of 0.20 is –0.84 (the critical value separating the lower 20% of the z distribution from the upper 80%). From the given estimates, the sample size for each group is

n = 2[(1.96 – (–0.84))(6)/2]^{2} = 2[(2.80)(3)]^{2} = 2(8.4)^{2} ≈ 141.1
Thus, rounding up, 142 patients are needed in each group if the investigators want to have an 80% chance (or 80% power) of detecting a mean difference of 2 or more in pulse oximetry. One advantage of using computer programs for power analysis is that they permit unequal sample sizes in the two groups. The output from the PASS program for two means, assuming there are approximately 4 times as many patients without a PE as with a PE, is given in Box 6-1. The plot makes it easy to see the relationship between the sample size (N_{1}) and power.
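Readers who prefer to check such calculations in software can script the formula directly. The following Python sketch is our own illustration (not the PASS program mentioned above); it obtains the z values from the standard library rather than from a table:

```python
import math
from statistics import NormalDist

def n_per_group_two_means(alpha, power, difference, sd):
    """Approximate sample size per group for comparing two independent means,
    assuming equal standard deviations and equal group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed z for alpha
    z_beta = NormalDist().inv_cdf(power)            # z for the desired power
    n = 2 * ((z_alpha + z_beta) * sd / difference) ** 2
    return math.ceil(n)                             # always round up

# Kline example: alpha = 0.05, power = 0.80, difference = 2, sd = 6
print(n_per_group_two_means(0.05, 0.80, 2, 6))      # 142 per group
```

Because the exact z values (1.9600 and 0.8416) are used instead of rounded table values, the result can occasionally differ by one subject from a hand calculation.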
Shortcut Estimate of Sample Size
We developed a rule of thumb for quickly estimating the sample sizes needed to compare the means of two groups. First, determine the ratio of the standard deviation to the difference to be detected between the means [σ/(μ_{1} – μ_{2})]; then, square this ratio. For a study with a P value of 0.05, an experiment will have a 90% chance of detecting an actual difference between the two groups if the sample size in each group is approximately 20 times the squared ratio. For a study with the same P value but only an 80% chance of detecting an actual difference, a sample size of approximately 15 times the squared ratio is required. In the previous example, we would have 15 × (6/2)^{2}, or 135 subjects, a slight underestimate. Exercise 7 allows you to learn how this rule of thumb was obtained.
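The rule of thumb reduces to a single line of arithmetic. The sketch below (our illustration; the function name is ours) applies it to the Kline example:

```python
import math

def rule_of_thumb_n(sd, difference, power80=True):
    """Quick estimate of n per group at alpha = 0.05: about 15 times the
    squared ratio of sd to difference for 80% power, about 20 times for 90%."""
    multiplier = 15 if power80 else 20
    return math.ceil(multiplier * (sd / difference) ** 2)

print(rule_of_thumb_n(6, 2))                 # 135; the full formula gives 142
print(rule_of_thumb_n(6, 2, power80=False))  # 180 for 90% power
```

As the text notes, the shortcut slightly underestimates the 142 obtained from the full formula, but it is accurate enough for a quick feasibility check.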
Note that the estimates we calculated assume that the sample sizes are equal in the two groups. As illustrated in Box 6-1, many computer programs, including PASS and nQuery, provide estimates for unequal sample sizes.
Finding the Sample Size for Studies About Proportions in Two Groups
This section presents the formula for estimating the approximate sample size needed in a study with two groups when the outcome is expressed in terms of proportions. Just as with studies involving two means, the researcher must answer four questions.
1. What is the desired level of significance (the α level) related to the null hypothesis?
2. What should the chances be of detecting an actual difference; that is, what is the desired power (1 – β) to be associated with the alternative hypothesis?
3. How large should the difference be between the two proportions for it to be clinically significant?
4. What is a good estimate of the standard deviation in the population? For a proportion, this is easy: The null hypothesis assumes the proportions are equal, and the proportion itself determines the estimated standard deviation, √[π (1 – π)].
To simplify matters, we again assume that the sample sizes are the same in the two groups. The symbol π_{1} denotes the proportion in one group, and π_{2} the proportion in the other group. Then, with π̄ = (π_{1} + π_{2})/2 denoting the mean of the two proportions, the formula for n is

n = {z_{α} √[2π̄(1 – π̄)] – z_{β} √[π_{1}(1 – π_{1}) + π_{2}(1 – π_{2})]}^{2} / (π_{1} – π_{2})^{2}
where z_{α} is the twotailed z value related to the null hypothesis and z_{β} is the lower onetailed z value related to the alternative hypothesis.
To illustrate, we use the study by Lapidus and colleagues (2002) of screening for domestic violence (DV). Among physicians with training in DV, 175 of 202 reported they screen routinely or selectively (0.866), compared with 155 of 266 physicians without DV training (0.583). We found that the 95% confidence interval for the difference in proportions was 0.201 to 0.357, and because the interval does not contain 0, we concluded a difference existed in the proportion who screen for DV. Suppose that the investigators, prior to doing the study, wanted to estimate the sample size needed to detect a significant difference if the proportions who screened were 0.85 and 0.55. They were willing to accept a type I error (falsely concluding that a difference exists when none really does) of 0.05, and they wanted a 0.90 probability of detecting a true difference (ie, 90% power).
BOX 6-1. TWO-SAMPLE T TEST POWER ANALYSIS FOR PULSE OXIMETRY.


Figure. No caption available. Source: Data, used with permission of the authors and publisher, from Kline JA, Nelson RD, Jackson RE, Courtney DM: Criteria for the safe use of D-dimer testing in emergency department patients with suspected pulmonary embolism: A multicenter US study. Ann Emerg Med 2002;39:144–152. Analysis produced with NCSS; used with permission.
The two-tailed z value related to α is ±1.96, and the lower one-tailed z value related to β is –1.28, the value that separates the lower 10% of the z distribution from the upper 90%. Then, the estimated sample size is

n = {(1.96) √[2(0.70)(0.30)] + (1.28) √[(0.85)(0.15) + (0.55)(0.45)]}^{2} / (0.85 – 0.55)^{2} = (1.27 + 0.78)^{2} / 0.09 ≈ 47 in each group
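The two-proportion formula can be scripted in the same way as the two-means formula. This Python sketch is our illustration (the function name is ours); it uses the exact z values from the standard library, including the lower one-tailed z for 90% power, about 1.28:

```python
import math
from statistics import NormalDist

def n_per_group_two_proportions(alpha, power, p1, p2):
    """Approximate sample size per group for comparing two independent
    proportions, assuming equal group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed z for alpha
    z_beta = NormalDist().inv_cdf(power)           # z for the desired power
    p_bar = (p1 + p2) / 2                          # mean proportion under H0
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(numerator ** 2 / (p1 - p2) ** 2)

# Lapidus example: alpha = 0.05, power = 0.90, proportions 0.85 and 0.55
print(n_per_group_two_proportions(0.05, 0.90, 0.85, 0.55))  # 47 in each group
```

Programs such as nQuery typically report a slightly larger n because they apply a continuity correction to the chi-square test.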
We use the nQuery program with data from Lapidus and colleagues to illustrate finding the sample size for the difference in two proportions. The table and plot produced by nQuery are given in Figure 6-7 and indicate that n needs to be slightly larger than our estimate.
Figure 6-7. Two-sample test for proportions power analysis using nQuery Advisor; used with permission.
SUMMARY
This chapter has focused on statistical methods that are useful in determining whether two independent groups differ on an outcome measure. In the next chapter, we extend the discussion to studies that involve more than two groups.
The t test is used when the outcome is measured on a numerical scale. If the distribution of the observations is skewed or if the standard deviations in the two groups are different, the Wilcoxon rank sum test is the procedure of choice. In fact, it is such a good substitute for the t test that some statisticians recommend it for almost all situations.
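For readers curious about the mechanics behind the Wilcoxon rank sum test, here is a minimal sketch using only the Python standard library. It applies the large-sample normal approximation without continuity or tie-variance corrections, so it is illustrative only; a statistical package should be used in practice, and the sample data below are made up:

```python
from collections import defaultdict
from statistics import NormalDist

def rank_sum_test(x, y):
    """Wilcoxon rank sum (Mann-Whitney) test via the normal approximation."""
    pooled = sorted(x + y)
    positions = defaultdict(list)
    for i, v in enumerate(pooled, start=1):
        positions[v].append(i)
    avg_rank = {v: sum(p) / len(p) for v, p in positions.items()}  # ties share ranks
    w = sum(avg_rank[v] for v in x)                # rank sum for the first group
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2                # mean of W under H0
    sd_w = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5   # SD of W under H0
    z = (w - mean_w) / sd_w
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed P value
    return z, p_value

z, p = rank_sum_test([52, 60, 64, 71], [25, 30, 41, 46])
print(round(z, 2), round(p, 3))
```

Because only the ranks of the observations enter the statistic, skewed distributions and unequal variances do not distort the test the way they can distort the t test.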
The chi-square test is used with counts or frequencies when two groups are being analyzed. We discussed what to do when sample sizes are small, commonly referred to as small expected frequencies. We recommend Fisher's exact test with a 2 × 2 table. We briefly touched on some other issues related to the use of chi-square in medical studies.
In Presenting Problem 1, Kline and his colleagues (2002) wanted to know if patients who experienced a pulmonary embolism (PE) differed from those who did not, and they looked at several outcomes. The researchers found a difference in heart rate, systolic blood pressure, pH, and pulse oximetry. Patients who had a PE had higher heart rates and pH, but lower systolic blood pressure and pulse oximetry. A goal of their study was to find a decision rule that would divide patients with suspected PE into a high-risk group in which the D-dimer test should not be used and a low-risk group in which the test is appropriate. We revisit their study in Chapter 12.
We used the study by Harper (1997) to illustrate the t test for two independent groups. Harper wanted to know whether women undergoing cryosurgery who had a paracervical block before the surgery experienced less pain and cramping than women who did not have the block. We compared the scores that women assigned to the degree of cramping they experienced with the procedure. Women who had the paracervical block had significantly lower scores, indicating they experienced less severe cramping. We used the same data to illustrate the Wilcoxon rank sum test and came to the same conclusion. The Wilcoxon test is recommended when assumptions for the t test (normal distribution, equal variances) are not met. The investigator reported that women receiving the paracervical block perceived less cramping than those who did not receive it, a result that is consistent with our analysis. The paracervical block did not decrease the perception of pain, however.
Turning to research questions involving nominal or categorical outcomes, we introduced the z statistic for comparing two proportions and the chi-square test. In Lapidus and colleagues' (2002) study, investigators were interested in learning whether training in domestic violence (DV) and subsequent screening of patients for DV were related. We used the same data to illustrate the construction of confidence intervals and the z test for two proportions and came to the same conclusion, illustrating once more the equivalence between the conclusions reached using confidence intervals and statistical tests.
The chi-square test uses observed frequencies and compares them to the frequencies that would be expected if no differences existed in proportions. We again used the data from Lapidus and colleagues (2002) to illustrate the chi-square test for two groups, that is, for observations that can be displayed in a 2 × 2 table. Once more, the results of the statistical test indicated that a difference existed in proportions of physicians who screened for DV, depending on whether they had been trained to do so.
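The 2 × 2 computation itself is brief. The sketch below (our illustration) uses the shortcut formula for a 2 × 2 table, which is algebraically identical to summing (O – E)^{2}/E over the four cells, applied to the Lapidus counts:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (without continuity correction) for a 2 x 2
    table laid out as [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Lapidus data: 175 of 202 DV-trained physicians screened,
# 155 of 266 untrained physicians screened
chi2 = chi_square_2x2(175, 27, 155, 111)
print(round(chi2, 1))  # far exceeds the critical value of 3.84 (1 df, alpha = 0.05)
```

The statistic is the square of the z statistic for two proportions, which is why the two tests always reach the same conclusion for a 2 × 2 table.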
The importance of sample size calculations was again stressed. We illustrated formulas and computer programs that estimate the sample sizes needed when two independent groups of subjects are being compared.
A summary of the statistical methods discussed in this chapter is given in Appendix C.
EXERCISES
1. How does a decrease in sample size affect the confidence interval? Assume that the means and standard deviations for pulse oximetry in the section titled, "Decisions About Means in Two Independent Groups," stay the same but that only 25 patients were in each group. Recalculate the pooled standard deviation and standard error, and use them to recompute the confidence interval. Is the conclusion the same?
2. Calculate the pooled standard deviation for the total cramping score from Table 6-3.
3. Good and colleagues (1996) used the Barthel index (BI), a standardized scale that measures mobility and activities of daily living, in a group of stroke patients evaluated for sleep apnea. This breathing disorder is characterized by periodic reductions in the depth of breathing (hypopnea), periodic cessation of breathing (apnea), or a continuous reduction in ventilation. The BI was recorded at admission, at discharge, and at 3 and 12 months after stroke onset. Data files are on the CD-ROM in a folder called "Good."
a. Did patients with a desaturation index (DI) < 10 have the same mean BI at discharge as patients with a DI ≥ 10? Answer this question using a 95% confidence interval.
b. Did a significant increase occur in BI from the time of admission until discharge for all the patients in the study (ie, ignoring the desaturation index)? Answer this question using a 95% confidence interval.
4. Show that the pooled variance for two means is the average of the two variances when the sample sizes are equal.
5. Use the data from Kline and colleagues (2002) to compare pulse oximetry in patients who did and those who did not have a PE. Compare the conclusion with the confidence interval in the section titled, “Decisions About Means in Two Independent Groups.”
6. Use the rules for finding the probability of independent events to show why the expected frequency for a cell in the chi-square statistic is found by the following formula: (row total × column total)/grand total.
7. How was the rule of thumb for calculating the sample size for two independent groups found?
8. Refer to the study by Good and colleagues (1996) on patients with stroke. How large a sample is needed to detect a difference of 0.85 versus 0.55 in the proportions discharged home with 80% power?
9. Compute the 90% and 99% confidence intervals for the difference in pulse oximetry for the patients with and without PE (Kline, 2002). Compare these intervals with the 95% interval obtained in the section titled, “Comparing Two Means Using Confidence Intervals.” What is the effect of lower confidence on the width of the interval? Of higher confidence?
10. Suppose investigators compared the number of cardiac procedures performed by 60 cardiologists in large health centers during one year to the number of procedures done by 25 cardiologists in midsized health centers. They found no significant difference between the number of procedures performed by the average cardiologist in large centers and those performed in midsized centers using the t test. When they reanalyzed the data using the Wilcoxon rank sum test, however, the investigators noted a difference. What is the most likely explanation for the different findings?
11. Benson and colleagues (1996) designed a randomized clinical trial to learn whether a vaginal or an abdominal approach is more effective in surgically treating severe uterovaginal prolapse. Over a 2-year period, women were assigned on the basis of a random number table to have pelvic reconstruction surgery by either a vaginal or an abdominal approach. Surgical outcomes were noted as optimally effective, satisfactorily effective, or unsatisfactorily effective based on an assessment of prolapse symptoms and integrity of the vaginal support during a Valsalva strain maneuver. The patients were examined postoperatively at 6 months and then annually for up to 5 years. Other outcome measures included charges for hospital stay, length of stay, and time required in the operating room. Data from this study are given in Table 6-10 and on the CD-ROM. Perform an appropriate statistical procedure to answer the following questions:
a. Do the groups show a difference in the operating room time?
b. Are the variances of operating room times similar in both groups?
Table 6-10. Means and standard deviations on variables from the study on reconstructive surgery for pelvic defects.


12. When testing the variances of pH for those who had a PE and those who did not (Kline, 2002), the F test indicated the variances were unequal, but the Levene test indicated they were not. What is the most likely explanation for this seeming contradiction? Use the data on the CD-ROM to form histograms or box plots for the two groups. What do you notice?
13. Recall that in the Chapter 5 exercises we examined box plots for daily juice consumption by 2- and 5-year-olds (Dennison et al, 1997). We asked you to say whether you thought the two groups drank different amounts of juice. Now, use the t test to learn if the means are different.
14. Group Exercise. Many older patients use numerous medications, and, as patients age, the chances for medication errors increase. Gurwitz and colleagues (2003) undertook a study of all Medicare patients seen by a group of physicians (multispecialty) during 1 year. The primary outcomes were number of adverse drug events, seriousness of the events, and whether they could have been prevented. Obtain a copy of the article to answer the following questions.
a. What was the study design? Why was this design particularly appropriate?
b. What methods did the investigators use to learn the outcomes? Were they sufficient?
c. What statistical approach was used to evaluate the outcome rates?
d. What statistical methods were used to analyze the characteristics in Table 1 in the Gurwitz study?
e. What was the major conclusion from the study? Was this conclusion justified? Would additional information help readers decide whether the conclusion was appropriate?
15. Group Exercise. Physicians and dentists may be at risk for exposure to bloodborne diseases during invasive surgical procedures. In a study that is still relevant, Serrano and coworkers (1991) wanted to determine the incidence of glove perforation during obstetric procedures and identify risk factors. The latex gloves of all members of the surgical teams performing cesarean deliveries, postpartum tubal ligations, and vaginal deliveries were collected for study; 100 unused gloves served as controls. Each glove was tested by inflating it with a pressurized air hose to 1.5–2 times the normal volume and submerging it in water. Perforations were detected by the presence of air bubbles when gentle pressure was applied to the palmar surface. Among the 754 study gloves, 100 had holes; none of the 100 unused control gloves had holes. In analyzing the data, the investigators found that 19 of the gloves with holes were among the 64 gloves worn by scrub technicians. Obtain a copy of this paper from your medical library and use it to help answer the following questions:
a. What is your explanation for the high perforation rate in gloves worn by scrub technicians? What should be done about these gloves in the analysis?
b. Are there other possible sources of bias in the way this study was designed?
c. An analysis reported by the investigators was based on 462 gloves used by house staff. The levels of training, number of gloves used, and number of gloves with holes were as follows: Interns used 262 gloves, 30 with holes; year 2 residents used 71 gloves, 9 with holes; year 3 residents used 58 gloves, 4 with holes; and year 4 residents used 71 gloves, 17 with holes. Confirm that a relationship exists between training level and proportion of perforation, and explain the differences in proportions of perforations.
d. What conclusions do you draw from this study? Do your conclusions agree with those of the investigators?
Footnotes
^{a}To be precise, the confidence interval is interpreted as follows: 95% of such confidence intervals contain the true difference between the two means if repeated random samples of operating room times are selected and 95% confidence intervals are calculated for each sample.
^{b}As an aside, the different names for this statistic occurred when a statistician, Wilcoxon, developed the test at about the same time as a pair of statisticians, Mann and Whitney. Unfortunately for readers of the medical literature, there is still no agreement on which name to use for this test.
^{c}The z test for the difference between two independent proportions is actually an approximate test. That is why we must assume the proportion times the sample size is > 5 in each group. If not, we must use the binomial distribution or, if np is really small, we might use the Poisson distribution (both introduced in Chapter 4).
^{d}We say small amounts because of what is called sampling variability—variation among different samples of patients who could be randomly selected for the study.