
## Basic & Clinical Biostatistics, 4th Edition

### 4. Probability & Related Topics for Making Inferences About Data

KEY CONCEPTS

- Probability is an important concept in statistics. Both objective and subjective probabilities are used in the medical field.
- Basic definitions include the concept of an event or outcome. A number of essential rules tell us how to combine the probabilities of events.
- Bayes' theorem relates to the concept of conditional probability, the probability of an outcome depending on an earlier outcome. Bayes' theorem is part of the reasoning process when interpreting diagnostic procedures.
- Populations are rarely studied; instead, researchers study samples. Several methods of sampling are used in medical research; a key issue is that any method should be random.
- When researchers select random samples and then make measurements, the result is a random variable. This process makes statistical tests and inferences possible.
- One of the purposes of statistics is to use a sample to estimate something about the population. Estimates form the basis of statistical tests.
- The binomial distribution is used to determine the probability of yes/no events: the number of times a given outcome occurs in a given number of attempts.
- The Poisson distribution is used to determine the probability of rare events.
- The normal distribution is used to find the probability that an outcome occurs when the observations have a bell-shaped distribution. It is used in many statistical procedures.
- If many random samples are drawn from a population, a statistic, such as the mean, follows a distribution called a sampling distribution.
- The central limit theorem tells us that means of observations, regardless of how they are distributed, begin to follow a normal distribution as the sample size increases. This is one of the reasons the normal distribution is so important in statistics.
- It is important to distinguish the standard deviation, which describes the spread of individual observations, from the standard error of the mean, which describes the spread of the mean observations.
- Confidence intervals can be formed around an estimate to tell us how much the estimate would vary in repeated samples.

PRESENTING PROBLEM

Presenting Problem 1

Neisseria meningitidis, a gram-negative diplococcus, has as its natural reservoir the human posterior nasopharynx where it can be cultured from 2–15% of healthy individuals during nonepidemic periods. The bacterial organism can be typed into at least 13 serogroups based on capsular antigens. These serogroups can be further subdivided by antibodies to specific subcapsular membrane proteins. In the United States, serogroups B and C have accounted for 90% of meningococcal meningitis cases in recent decades. The major manifestations of meningococcal disease are acute septicemia and purulent meningitis. The age-specific attack rate is greatest for children under 5 years of age.

Epidemiologic surveillance data from the state of Oregon detected an increase in the overall incidence rate of meningococcal disease from 2 cases per 100,000 population during 1987–1992 to 4.5 cases per 100,000 population in 1994 (Diermayer et al, 1999). Epidemiologists from Oregon and the Centers for Disease Control wanted to know if the increased numbers of cases of meningococcal disease were indications of a transition from endemic to epidemic disease. The investigators found a significant rise in serogroup B disease; they also discovered that most of the isolates belonged to the ET-5 clonal strain of this serogroup. In addition, a shift toward disease in older age groups, especially 15- through 19-year-olds, was observed.

Information from the study is given in the section titled, “Basic Definitions and Rules of Probability.” We use these data to illustrate basic concepts of probability and to demonstrate the relationship between time period during the epidemic and site of infection.

Presenting Problem 2

A local blood bank was asked to provide information on the distribution of blood types among males and females. This information is useful in illustrating some basic principles in probability theory. The results are given in “Basic Definitions and Rules of Probability.”

Presenting Problem 3

In the United States, prostate cancer is the second leading cause of death among men who die of neoplasia, accounting for 12.3% of cancer deaths. Controversial management issues include when to treat a patient with radical prostatectomy and when to use definitive radiation therapy. Radical prostatectomy is associated with a high incidence of impotence and occasional urinary incontinence. Radiation therapy produces less impotence but can cause radiation cystitis, proctitis, and dermatitis. Prostate specific antigen (PSA) evaluation, available since 1988, leads to early detection of prostate cancer and of recurrence following treatment and may be a valuable prognostic indicator and measure of tumor control after treatment.

Although radical radiation therapy is used to treat prostate cancer in about 60,000 men each year, only a small number of these men from any single institution have had a follow-up of more than 5 years during the era when the PSA test has been available. Shipley and colleagues (1999) wanted to assess the cancer control rates for men treated with external beam radiation therapy alone by pooling data on 1765 men with clinically localized prostate cancer treated at six institutions. The PSA value, along with the Gleason score (a histologic scoring system in which a low score indicates well-differentiated tumor and a high score poorly differentiated tumor) and tumor palpation state, was used to assess pretreatment prognostic factors in the retrospective, nonrandomized, multiinstitutional pooled analysis. A primary treatment outcome was the measurement of survival free from biochemical recurrence. Biochemical recurrence was defined as three consecutive rises in PSA values or any rise great enough to trigger additional treatment with androgen suppression.

Prognostic indicators including pretreatment PSA values indicate the probability of success of treatment with external beam radiation therapy for subsets of patients with prostate cancer. The probabilities of 5-year survival in men with given levels of pretreatment PSA are given in the section titled, “Sampling Distributions.” We use these rates to illustrate the binomial probability distribution.

Presenting Problem 4

The Coronary Artery Surgery Study, a classic study from 1983, was a prospective, randomized, multicenter collaborative trial of medical and surgical therapy in subsets of patients with stable ischemic heart disease. The study established that the 10-year survival rate in this group of patients was equally good in the medically treated and surgically (coronary revascularization) treated groups (Alderman et al, 1990). A second part of the study compared the effects of medical and surgical treatment on the quality of life.

Over a 5-year period, 780 patients with stable ischemic heart disease were subdivided into three clinical subsets (groups A, B, and C). Patients within each subset were randomly assigned to either medical or surgical treatment. All patients enrolled had 50% or greater stenosis of the left main coronary artery or 70% or greater stenosis of the other operable vessels. In addition, group A had mild angina and an ejection fraction of at least 50%; group B had mild angina and an ejection fraction less than 50%; group C had no angina after myocardial infarction. History, examination, and treadmill testing were done at 6, 18, and 60 months; a follow-up questionnaire was completed at 6-month intervals. Quality of life was evaluated by assessing chest pain status; heart failure; activity limitation; employment status; recreational status; drug therapy; number of hospitalizations; and risk factor alteration, such as smoking status, BP control, and cholesterol level. Data on number of hospitalizations after mean follow-up of 11 years will be used to illustrate the Poisson probability distribution (Rogers et al, 1990).

Presenting Problem 5

An individual's BP has important health implications; hypertension is among the most commonly treated chronic medical problems. To examine variation in BP, Marczak and Paprocki (2001) found the mean and standard deviation in a group of healthy persons. For men and women between the ages of 14 and 70, mean 24-h systolic pressure was 119.7 mm Hg, and the standard deviation was 10.9. We use this information to calculate probabilities of any patient having a given BP.

PURPOSE OF THE CHAPTER

The previous chapter presented methods for summarizing information from studies: graphs, plots, and summary statistics. A major reason for performing clinical research, however, is to generalize the findings from the set of observations on one group of subjects to others who are similar to those subjects. Shipley and colleagues (1999) concluded that the initial level of PSA can be used to estimate freedom from biochemical recurrence of tumor. This conclusion was based on their study and follow-up for at least 5 years of 448 men from six institutions. Studying all patients in the world with T1b, T1c, or T2 tumors (but unknown nodal status) is neither possible nor desirable; therefore, the investigators made inferences to a larger population of patients on the basis of their study of a sample of patients. They cannot be sure that men with a specific level of pretreatment PSA will respond to treatment as the average man did in this study, but they can use the data to find the probability of a positive response.

The concepts in this chapter will enable you to understand what investigators mean when they make statements like the following:

The difference between treatment and control groups was tested by using a t test and found to be significantly greater than zero.

An α value of 0.01 was used for all statistical tests.

The sample sizes were determined to give 90% power of detecting a difference of 30% between treatment and control groups.

Our experience indicates that the concepts underlying statistical inference are not easily absorbed in a first reading. We suggest that you read this chapter and become acquainted with the basic concepts and then, after completing Chapters 5 through 9, read it again. It should be easier to understand the basic ideas of inference using this approach.

THE MEANING OF THE TERM “PROBABILITY”

Assume that an experiment can be repeated many times, with each replication (repetition) called a trial and assume that one or more outcomes can result from each trial. Then, the probability of a given outcome is the number of times that outcome occurs divided by the total number of trials. If the outcome is sure to occur, it has a probability of 1; if an outcome cannot occur, its probability is 0.

An estimate of probability may be determined empirically, or it may be based on a theoretical model. We know that the probability of flipping a fair coin and getting tails is 0.50, or 50%. If a coin is flipped ten times, there is no guarantee, of course, that exactly five tails will be observed; the proportion of tails can range from 0 to 1, although in most cases we expect it to be closer to 0.50 than to 0 or 1. If the coin is flipped 100 times, the chances are even better that the proportion of tails will be close to 0.50, and with 1000 flips, the chances are better still. As the number of flips becomes larger, the proportion of coin flips that result in tails approaches 0.50; therefore, the probability of tails on any one flip is 0.50.
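The convergence of the observed proportion toward the true probability can be seen in a short simulation (a sketch in Python; the function name is ours, not from the text):

```python
import random

def proportion_of_tails(n_flips, seed=0):
    """Flip a fair coin n_flips times and return the proportion of tails."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    tails = sum(rng.random() < 0.5 for _ in range(n_flips))
    return tails / n_flips

# The proportion drifts toward 0.50 as the number of flips grows.
for n in (10, 100, 1000, 100_000):
    print(n, round(proportion_of_tails(n), 3))
```

With 10 flips the proportion may stray well away from 0.50; with 100,000 flips it is very close, which is the empirical (frequentist) meaning of the statement "the probability of tails is 0.50."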

This definition of probability is sometimes called objective probability, as opposed to subjective probability, which reflects a person's opinion, hunch, or best guess about whether an outcome will occur. Subjective probabilities are important in medicine because they form the basis of a physician's opinion about whether a patient has a specific disease. In Chapter 12 we discuss how this estimate, based on information gained in the history and physical examination, changes as the result of diagnostic procedures. Research has also examined the way physicians use probability in speaking to patients, and Goodman (1999) discusses interesting aspects of the history of probability in an accompanying editorial.

Basic Definitions & Rules of Probability

Probability concepts are helpful for understanding and interpreting data presented in tables and graphs in published articles. In addition, the concept of probability lets us make statements about how much confidence we have in such estimates as means, proportions, or relative risks (introduced in the previous chapter). Understanding probability is essential for understanding the meaning of P values given in journal articles.

We use two examples to illustrate some definitions and rules for determining probabilities: Presenting Problem 1 on meningococcal disease (Table 4-1) and the information given in Table 4-2 on gender and blood type. All illustrations of probability assume the observation has been randomly selected from a population of observations. We discuss these concepts in more detail in the next section.

In probability, an experiment is defined as any planned process of data collection. For Presenting Problem 1, the experiment is the process of determining the site of infection in patients with meningococcal disease. An experiment consists of a number of independent trials (replications) under the same conditions; in this example, a trial consists of determining the site of infection for an individual person. Each trial can result in one of four outcomes: sepsis, meningitis, both sepsis and meningitis, or unknown.

Table 4-1. Characteristics of serogroup B cases, Oregon, 1987–1996.^a

| Characteristic | Preepidemic 1987–1992, Count | Column % | Early Epidemic 1993–1994, Count | Column % | Recent Epidemic 1995–1996, Count | Column % |
|---|---|---|---|---|---|---|
| **Sex** | | | | | | |
| Men | 75 | 50 | 59 | 50 | 75 | 53 |
| Women | 75 | 50 | 58 | 50 | 66 | 47 |
| **Race** | | | | | | |
| White | 120 | 80 | 94 | 80 | 110 | 78 |
| African American | 5 | 3 | 0 | 0 | 1 | 1 |
| Hispanic | 2 | 1 | 10 | 9 | 11 | 8 |
| Native American | 2 | 1 | 2 | 2 | 0 | 0 |
| Asian | 0 | 0 | 0 | 0 | 2 | 1 |
| Unknown | 21 | 14 | 11 | 9 | 17 | 12 |
| **Site of infection** | | | | | | |
| Sepsis | 66 | 44 | 45 | 38 | 40 | 28 |
| Meningitis | 39 | 26 | 32 | 27 | 39 | 28 |
| Both | 39 | 26 | 32 | 27 | 34 | 24 |
| Unknown | 6 | 4 | 8 | 7 | 28 | 20 |
| **Died during epidemic** | | | | | | |
| No | 141 | 94 | 108 | 92 | 132 | 94 |
| Yes | 9 | 6 | 9 | 8 | 9 | 6 |

^a Data are presented as numbers and percentages.
Source: Adapted, with permission, from Diermayer M, Hedberg K, Hoesly F, Fischer M, Perkins B, Reeves M, et al: Epidemic serogroup B meningococcal disease in Oregon. JAMA 1999;281:1493–1497. Produced with SPSS; used with permission.

The probability of a particular outcome, say outcome A, is written P(A). The data from Table 4-1 have been condensed into a table on the site of infection with total numbers and are given in Table 4-3. For example, in Table 4-3, if outcome A is meningitis alone, the probability that a randomly selected person from the study has meningitis without sepsis as the site of infection is

P(A) = 110/408 = 0.27

In Presenting Problem 2, the probabilities of different outcomes are already computed. The outcomes of each trial to determine blood type are O, A, B, and AB. From Table 4-2, the probability that a randomly selected person has type A blood is

P(A blood type) = 0.43

The blood type data illustrate two important features of probability:

1. The probability of each outcome (blood type) is greater than or equal to 0.

2. The sum of the probabilities of the various outcomes is 1.

Table 4-2. Distribution of blood type by gender.

| Blood Type | Males | Females | Total |
|---|---|---|---|
| O | 0.21 | 0.21 | 0.42 |
| A | 0.215 | 0.215 | 0.43 |
| B | 0.055 | 0.055 | 0.11 |
| AB | 0.02 | 0.02 | 0.04 |
| Total | 0.50 | 0.50 | 1.00 |

Entries are probabilities.

Events may be defined either as a single outcome or a set of outcomes. For example, the outcomes for the site of infection in the meningitis study are sepsis, meningitis, both, or unknown, but we may wish to define an event as having known meningitis versus not having known meningitis. The event of known meningitis contains the two outcomes of meningitis alone plus both (meningitis and sepsis), and the event of not having known meningitis also contains two outcomes (sepsis and unknown).

Sometimes, we want to know the probability that an event will not happen; an event opposite to the event of interest is called a complementary event. For example, the complementary event to "having known meningitis" is "not having known meningitis." The probability of the complement is

P(not having known meningitis) = (151 + 42)/408 = 193/408 = 0.47

Note that the probability of a complementary event may also be found as 1 minus the probability of the event itself, and this calculation may be easier in some situations. To illustrate,

P(not having known meningitis) = 1 − P(known meningitis) = 1 − 215/408 = 1 − 0.53 = 0.47

Table 4-3. Site of infection for serogroup B cases, Oregon, 1987–1996.

| Site of Infection | Preepidemic 1987–1992 | Early Epidemic 1993–1994 | Recent Epidemic 1995–1996 | Total |
|---|---|---|---|---|
| Sepsis | 66 | 45 | 40 | 151 |
| Meningitis | 39 | 32 | 39 | 110 |
| Both | 39 | 32 | 34 | 105 |
| Unknown | 6 | 8 | 28 | 42 |
| Total | 150 | 117 | 141 | 408 |

Source: Adapted, with permission, from Diermayer M, Hedberg K, Hoesly F, Fischer M, Perkins B, Reeves M, et al: Epidemic serogroup B meningococcal disease in Oregon. JAMA 1999;281:1493–1497. Produced with SPSS; used with permission.

Mutually Exclusive Events & the Addition Rule

Two or more events are mutually exclusive if the occurrence of one precludes the occurrence of the others. For example, a person cannot have both blood type O and blood type A. By definition, all complementary events are also mutually exclusive; however, events can be mutually exclusive without being complementary if three or more events are possible.

As we indicated earlier, what constitutes an event is a matter of definition. Let us define the experiment in Presenting Problem 2 so that each outcome (blood type O, A, B, or AB) is a separate event. The probability of two mutually exclusive events occurring is the probability that either one event occurs or the other event occurs. This probability is found by adding the probabilities of the two events, which is called the addition rule for probabilities. For example, the probability that a randomly selected person has either blood type O or blood type A is

P(type O or type A) = P(type O) + P(type A) = 0.42 + 0.43 = 0.85

Does the addition rule work for more than two events? The answer is yes, as long as they are all mutually exclusive. We discuss the approach to use with nonmutually exclusive events in the section titled, “Nonmutually Exclusive Events and the Modified Addition Rule.”
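As a quick sketch (in Python, with the Table 4-2 probabilities typed in by hand; variable names are ours), the addition rule for mutually exclusive events amounts to simple summation:

```python
# Probabilities of the four mutually exclusive blood types (Table 4-2 totals)
p = {"O": 0.42, "A": 0.43, "B": 0.11, "AB": 0.04}

# Addition rule: P(O or A) = P(O) + P(A) for mutually exclusive events
p_O_or_A = p["O"] + p["A"]
print(round(p_O_or_A, 2))  # 0.85

# The rule extends to any number of mutually exclusive events; summing all
# four outcomes gives 1, the second basic property of probability.
assert abs(sum(p.values()) - 1.0) < 1e-9
```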

Independent Events & the Multiplication Rule

Two different events are independent events if the outcome of one event has no effect on the outcome of the second. Using the blood type example, let us also define a second event as the gender of the person; this event consists of the outcomes male and female. In this example, gender and blood type are independent events; the sex of a person does not affect the person's blood type, and vice versa. The probability of two independent events is the probability that both events occur and is found by multiplying the probabilities of the two events, which is called the multiplication rule for probabilities. The probability of being male and of having blood type O is

P(male and type O) = P(male) × P(type O) = 0.50 × 0.42 = 0.21

The probability of being male, 0.50, and the probability of having blood type O, 0.42, are both called marginal probabilities because they appear on the margins of a probability table. The probability of being male and of having blood type O, 0.21, is called a joint probability; it is the probability of both male and type O occurring jointly.

Is having an unknown site of infection independent from the time period of the epidemic in the Diermayer study? Table 4-3 gives the data we need to answer this question. If two events are independent, the product of the marginal probabilities will equal the joint probability in all instances. To show that two events are not independent, we need demonstrate only one instance in which the product of the marginal probabilities is not equal to the joint probability. For example, to show that having an unknown site of infection and pre-epidemic period are not independent, find the joint probability of a randomly selected person having an unknown site and being diagnosed in the pre-epidemic period. Table 4-3 shows that

P(unknown site and preepidemic period) = 6/408 = 0.015

However, the product of the marginal probabilities does not yield the same result; that is,

P(unknown site) × P(preepidemic period) = (42/408) × (150/408) = 0.103 × 0.368 = 0.038

We could show that the product of the marginal probabilities is not equal to the joint probability for any of the combinations in this example, but we need show only one instance to prove that two events are not independent.
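This check can be carried out directly from the counts in Table 4-3; the following Python sketch (helper names are ours) compares one joint probability with the product of its marginals:

```python
# Counts from Table 4-3 (site of infection by time period)
table = {
    "Preepidemic": {"Sepsis": 66, "Meningitis": 39, "Both": 39, "Unknown": 6},
    "Early":       {"Sepsis": 45, "Meningitis": 32, "Both": 32, "Unknown": 8},
    "Recent":      {"Sepsis": 40, "Meningitis": 39, "Both": 34, "Unknown": 28},
}
total = sum(sum(row.values()) for row in table.values())  # 408

def joint(period, site):
    """Joint probability of a given period and site."""
    return table[period][site] / total

def marginal_site(site):
    """Marginal probability of a site, summed over periods."""
    return sum(table[p][site] for p in table) / total

def marginal_period(period):
    """Marginal probability of a period, summed over sites."""
    return sum(table[period].values()) / total

# One counterexample is enough to show nonindependence:
j = joint("Preepidemic", "Unknown")                             # 6/408
m = marginal_period("Preepidemic") * marginal_site("Unknown")   # (150/408)(42/408)
print(round(j, 3), round(m, 3))  # 0.015 0.038
```

Because 0.015 ≠ 0.038, site of infection and time period are not independent, matching the conclusion in the text.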

Nonindependent Events & the Modified Multiplication Rule

Finding the joint probability of two events when they are not independent is a bit more complex than simply multiplying the two marginal probabilities. When two events are not independent, the occurrence of one event depends on whether the other event has occurred. Let A stand for the event "known meningitis" and B for the event "recent epidemic" (in which known meningitis is having either meningitis alone or meningitis with sepsis). We want to know the probability of event A given event B, written P(A | B), where the vertical line, |, is read as "given." In other words, we want to know the probability of event A, assuming that event B has happened. From the data in Table 4-3, the probability of known meningitis, given that the period of interest is the recent epidemic, is

P(known meningitis | recent epidemic) = (39 + 34)/141 = 73/141 = 0.52

This probability, called a conditional probability, is the probability of one event given that another event has occurred. Put another way, the probability of a patient having known meningitis is conditional on the period of the epidemic; the conditional probability is substituted for P(known meningitis) in the multiplication rule. If we put these expressions together, we can find the joint probability of having known meningitis and contracting the disease in the recent epidemic:

P(known meningitis and recent epidemic) = P(known meningitis | recent epidemic) × P(recent epidemic) = (73/141) × (141/408) = 73/408 = 0.18

The probability of having known meningitis during the recent epidemic can also be determined by finding the conditional probability of contracting the disease during the recent epidemic period, given known meningitis, and substituting that expression in the multiplication rule for P(recent epidemic). To illustrate,

P(known meningitis and recent epidemic) = P(recent epidemic | known meningitis) × P(known meningitis) = (73/215) × (215/408) = 73/408 = 0.18
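Both orderings of the modified multiplication rule can be verified numerically from the Table 4-3 counts; a minimal Python sketch (variable names are ours):

```python
# Counts from Table 4-3 needed for the conditional-probability example
known_meningitis_recent = 39 + 34   # meningitis alone + both, recent epidemic
recent_total = 141
known_meningitis_total = 110 + 105  # meningitis alone + both, all periods
grand_total = 408

# P(known meningitis | recent epidemic)
p_km_given_recent = known_meningitis_recent / recent_total

# Modified multiplication rule: both orderings give the same joint probability.
joint1 = p_km_given_recent * (recent_total / grand_total)
joint2 = (known_meningitis_recent / known_meningitis_total) * (known_meningitis_total / grand_total)
print(round(joint1, 3), round(joint2, 3))  # 0.179 0.179
```

Either way the joint probability reduces to 73/408, which is what counting cases directly in the table would give.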

Nonmutually Exclusive Events & the Modified Addition Rule

Remember that two or more mutually exclusive events cannot occur together, and the addition rule applies for the calculation of the probability that one or another of the events occurs. Now we find the probability that either of two events occurs when they are not mutually exclusive. For example, gender and blood type O are nonmutually exclusive events because the occurrence of one does not preclude the occurrence of the other. The addition rule must be modified in this situation; otherwise, the probability that both events occur will be added into the calculation twice.

In Table 4-2, the probability of being male is 0.50 and the probability of blood type O is 0.42. The probability of being male or of having blood type O is not 0.50 + 0.42, however, because in this sum, males with type O blood have been counted twice. The joint probability of being male and having blood type O, 0.21, must therefore be subtracted. The calculation is

P(male or type O) = P(male) + P(type O) − P(male and type O) = 0.50 + 0.42 − 0.21 = 0.71

Of course, if we do not know that P(male and type O) = 0.21, we must use the multiplication rule (for independent events, in this case) to determine this probability.
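The modified addition rule is easy to verify with the blood type figures; a brief Python sketch (variable names are ours):

```python
p_male = 0.50
p_type_O = 0.42

# Gender and blood type are independent here, so the joint probability
# comes from the multiplication rule:
p_male_and_O = p_male * p_type_O   # 0.21

# Modified addition rule: subtract the joint probability so that
# males with type O blood are not counted twice.
p_male_or_O = p_male + p_type_O - p_male_and_O
print(round(p_male_or_O, 2))  # 0.71
```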

Summary of Rules & an Extension

Let us summarize the rules presented thus far so we can extend them to obtain a particularly useful rule for combining probabilities called Bayes' theorem. Remember that questions about mutual exclusiveness use the word “or” and the addition rule; questions about independence use the word “and” and the multiplication rule. We use letters to represent events; A, B, C, and D are four different events with probability P(A), P(B), P(C), and P(D).

The addition rule for the occurrence of either of two or more events is as follows: If A, B, and C are mutually exclusive, then

P(A or B or C) = P(A) + P(B) + P(C)

If two events such as A and D are not mutually exclusive, then

P(A or D) = P(A) + P(D) − P(A and D)

The multiplication rule for the occurrence of both of two or more events is as follows: If A, B, and C are independent, then

P(A and B and C) = P(A) × P(B) × P(C)

If two events such as B and D are not independent, then

P(B and D) = P(B) × P(D | B) = P(D) × P(B | D)

The multiplication rule for probabilities when events are not independent can be used to derive one form of an important formula called Bayes' theorem. Because P(B and D) equals both P(B | D) × P(D) and P(B) × P(D | B), these latter two expressions are equal. Assuming P(B) and P(D) are not equal to zero, we can solve for one in terms of the other, as follows:

P(B | D) = P(B) × P(D | B) / P(D)

which is found by dividing both sides of the equation by P(D). Similarly,

P(D | B) = P(D) × P(B | D) / P(B)

In the equation for P(B | D), P(B) in the right-hand side of the equation is sometimes called the prior probability, because its value is known prior to the calculation; P(B | D) is called the posterior probability, because its value is known only after the calculation.

The two formulas of Bayes' theorem are important because investigators frequently know only one of the pertinent probabilities and must determine the other. Examples are diagnosis and management, discussed in detail in Chapter 12.
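A minimal Python sketch of Bayes' theorem, applied to the Table 4-3 figures (the helper function and variable names are ours, not from the text):

```python
def bayes(p_b, p_d_given_b, p_d):
    """Bayes' theorem: P(B | D) = P(B) * P(D | B) / P(D)."""
    return p_b * p_d_given_b / p_d

# Recover P(recent epidemic | known meningitis) from the reverse
# conditional probability, using the counts in Table 4-3.
p_recent = 141 / 408            # prior: P(recent epidemic)
p_km = 215 / 408                # P(known meningitis)
p_km_given_recent = 73 / 141    # P(known meningitis | recent epidemic)

p_recent_given_km = bayes(p_recent, p_km_given_recent, p_km)
print(round(p_recent_given_km, 2))  # 73/215, about 0.34
```

Here P(recent epidemic) plays the role of the prior probability and P(recent epidemic | known meningitis) the posterior, in the sense described above.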

A Comment on Terminology

Although in everyday use the terms probability, odds, and likelihood are sometimes used synonymously, mathematicians do not use them that way. Odds is defined as the probability that an event occurs divided by the probability the event does not occur. For example, the odds that a person has blood type O are 0.42/(1 − 0.42) = 0.72 to 1, but "to 1" is not always stated explicitly. This interpretation is consistent with the meaning of the odds ratio, discussed in Chapter 3. It is also consistent with the use of odds in gaming events such as football games and horse races.
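The probability-to-odds conversion and its inverse can be written in a couple of lines (Python; function names are ours):

```python
def probability_to_odds(p):
    """Odds = p / (1 - p)."""
    return p / (1 - p)

def odds_to_probability(odds):
    """Inverse conversion: p = odds / (1 + odds)."""
    return odds / (1 + odds)

# Odds that a person has blood type O (Table 4-2):
odds_O = probability_to_odds(0.42)
print(round(odds_O, 2))  # 0.72, read as "0.72 to 1"
```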

Likelihood may be related to Bayes' theorem for conditional probabilities. Suppose a physician is trying to determine which of three likely diseases a patient has: myocardial infarction, pneumonia, or reflux esophagitis. Chest pain can appear with any one of these three diseases; and the physician needs to know the probability that chest pain occurs with myocardial infarction, the probability that chest pain occurs with pneumonia, and the probability that chest pain occurs with reflux esophagitis. The probabilities of a given outcome (chest pain) when evaluated under different hypotheses (myocardial infarction, pneumonia, and reflux esophagitis) are called likelihoods of the hypotheses (or diseases).

POPULATIONS & SAMPLES

A major purpose of doing research is to infer, or generalize, from a sample to a larger population. This process of inference is accomplished by using statistical methods based on probability. Population is the term statisticians use to describe a large set or collection of items that have something in common. In the health field, population generally refers to patients or other living organisms, but the term can also be used to denote collections of inanimate objects, such as sets of autopsy reports, hospital charges, or birth certificates. A sample is a subset of the population, selected so as to be representative of the larger population.

There are many good reasons for studying a sample instead of an entire population, and the four commonly used methods for selecting a sample are discussed in this section. Before turning to those topics, however, we note that the term “population” is frequently misused to describe what is, in fact, a sample. For example, researchers sometimes refer to the “population of patients in this study.” After you have read this book, you will be able to spot such errors when you see them in the medical literature. If you want more information, Levy and Lemeshow (1999) provide a comprehensive treatment of sampling.

Reasons for Sampling

There are at least six reasons to study samples instead of populations:

1. Samples can be studied more quickly than populations. Speed can be important if a physician needs to determine something quickly, such as a vaccine or treatment for a new disease.

2. A study of a sample is less expensive than studying an entire population, because a smaller number of items or subjects are examined. This consideration is especially important in the design of large studies that require a lengthy follow-up.

3. A study of an entire population (census) is impossible in most situations. Sometimes, the process of the study destroys or depletes the item being studied. For example, in a study of cartilage healing in limbs of rats after 6 weeks of limb immobilization, the animals may be sacrificed in order to perform histologic studies. On other occasions, the desire is to infer to future events, such as the study of men with prostate cancer. In these cases, a study of a population is impossible.

4. Sample results are often more accurate than results based on a population. For samples, more time and resources can be spent on training the people who perform observations and collect data. In addition, more expensive procedures that improve accuracy can be used for a sample because fewer procedures are required.

5. If samples are properly selected, probability methods can be used to estimate the error in the resulting statistics. It is this aspect of sampling that permits investigators to make probability statements about observations in a study.

6. Samples can be selected to reduce heterogeneity. For example, systemic lupus erythematosus (SLE) has many clinical manifestations, resulting in a heterogeneous population. A sample of the population with specified characteristics is more appropriate than the entire population for the study of certain aspects of the disease.

To summarize, bigger does not always mean better in terms of sample sizes. Thus, investigators must plan the sample size appropriate for their study prior to beginning research. This process is called determining the power of a study and is discussed in detail in later chapters. See Abramson (1999) for an introductory discussion of sampling.

Methods of Sampling

The best way to ensure that a sample will lead to reliable and valid inferences is to use probability samples, in which the probability of being included in the sample is known for each subject in the population. Four commonly used probability sampling methods in medicine are simple random sampling, systematic sampling, stratified sampling, and cluster sampling, all of which use random processes.

The following example illustrates each method: Consider a physician applying for a grant for a study that involves measuring the tracheal diameter on radiographs. The physician wants to convince the granting agency that these measurements are reliable. To estimate intrarater reliability, the physician will select a sample of chest x-ray films from those performed during the previous year, remeasure the tracheal diameter, and compare the new measurement with the original one on file in the patient's chart. The physician has a population of 3400 radiographs, and we assume that the physician has learned that a sample of 200 films is sufficient to provide an accurate estimate of intrarater reliability. Now the physician must select the sample for the reliability study.

Simple Random Sampling

A simple random sample is one in which every subject (every film in the example) has an equal probability of being selected for the study. The recommended way to select a simple random sample is to use a table of random numbers or a computer-generated list of random numbers. For this approach, each x-ray film must have an identification (ID) number, and a list of ID numbers, called a sampling frame, must be available. For the sake of simplicity, assume that the radiographs are numbered from 1 to 3400. Using a random number table, after first identifying a starting place in the table at random, the physician can select the first 200 numbers between 1 and 3400. The x-ray films with the ID numbers corresponding to these 200 random numbers make up the simple random sample. If a computer-generated list of random numbers is available, the physician can request 200 numbers between 1 and 3400.

To illustrate the process with a random number table, a portion of Table A–1 in Appendix A is reproduced as Table 4-4. One way to select a starting point is by tossing a die to select a row and a column at random. Tossing a die twice determines, first, which block of rows and, second, which individual row within the block contains our number. For example, if we throw a 2 and a 3, we begin in the second block down, third row, beginning with the number 83. (If, on our second throw, we had thrown a 6, we would toss the die again, because there are only five rows.) Now, we must select a beginning column at random, again by tossing the die twice to select a block and a column within the block. For example, if we toss a 3 and a 1, we use the third block (across) of columns and the first column, headed by the number 1. The starting point in this example is therefore located where the row beginning with 83 and the column beginning with 1 intersect at the number 6 (underlined in Table 4-4).

Because there are 3400 radiographs, we must read four-digit numbers; the first ten numbers are 6221, 7678, 9781, 2624, 8060, 7562, 5288, 1071, 3988, and 8549. The numbers less than 3401 are the IDs of the films to be used in the sample. In the first ten numbers selected, only two are less than 3401; so we use films with the ID numbers 2624 and 1071. This procedure continues until we have selected 200 radiographs. When the number in the bottom row (7819) is reached, we go to the top of that same column and move one digit to the right for numbers 6811, 1465, 3226, and so on.

If a number less than 3401 occurs twice, the x-ray film with that ID number can be selected for the sample and used in the study a second time (called sampling with replacement). In this case, the final sample of 200 will be 200 measurements rather than 200 radiographs. Frequently, however, when a number occurs twice, it is ignored the second time and the next eligible number is used instead (called sampling without replacement). The differences between these two procedures are negligible when we sample from a large population.
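The same selection can be sketched in software instead of with a printed table. The following is a minimal illustration (not part of the original study) using Python's standard `random` module, showing both sampling without replacement and sampling with replacement from the 3400 film IDs:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

population_ids = range(1, 3401)  # films numbered 1 to 3400

# Sampling without replacement: each film can appear at most once.
sample_without = random.sample(population_ids, k=200)

# Sampling with replacement: a film may be selected (and remeasured) twice.
sample_with = random.choices(population_ids, k=200)

print(len(sample_without), len(set(sample_without)))  # 200 distinct IDs
print(len(sample_with))  # 200 draws, possibly with repeats
```

With a large population such as this one, the two procedures rarely differ in practice, just as the text notes.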

Systematic Sampling

A systematic random sample is one in which every kth item is selected; k is determined by dividing the number of items in the sampling frame by the desired sample size. For example, 3400 radiographs divided by 200 is 17, so every 17th x-ray film is sampled. In this approach, we must first select a number randomly between 1 and 17, and we then select every 17th film. Suppose we randomly select the number 12 from a random number table. Then, the systematic sample consists of radiographs with ID numbers 12, 29, 46, 63, 80, and so on; each subsequent number is determined by adding 17 to the last ID number.

Systematic sampling should not be used when a cyclic repetition is inherent in the sampling frame. For example, systematic sampling is not appropriate for selecting months of the year in a study of the frequency of different types of accidents, because some accidents occur more often at certain times of the year. For instance, skiing injuries and automobile accidents most often occur in cold-weather months, whereas swimming injuries and farming accidents most often occur in warm-weather months.

Table 4-4. Random numbers.

927415 956121 168117 169280 326569 266541
926937 515107 014658 159944 821115 317592
867169 388342 832261 993050 639410 698969
867169 542747 032683 131188 926198 371071
512500 843384 085361 398488 774767 383837

062454 423050 670884 840940 845839 979662
806702 881309 772977 367506 729850 457758
837815 163631 622143 938278 231305 219737
926839 453853 767825 284716 916182 467113
854813 731620 978100 589512 147694 389180

851595 452454 262448 688990 461777 647487
449353 556695 806050 123754 722070 935916
169116 586865 756231 469281 258737 989450
139470 358095 528858 660128 342072 681203
433775 761861 107191 515960 759056 150336

221922 232624 398839 495004 881970 792001
740207 078048 854928 875559 246288 000144
525873 755998 866034 444933 785944 018016
734185 499711 254256 616625 243045 251938
773112 463857 781983 078184 380752 492215

Stratified Sampling

A stratified random sample is one in which the population is first divided into relevant strata (subgroups), and a random sample is then selected from each stratum. In the radiograph example, the physician may wish to stratify on the age of patients, because the trachea varies in size with age and measuring the diameter accurately in young patients may be difficult. The population of radiographs may be divided into infants younger than 1 year old, children from 1 year old to less than 6 years old, children from 6 to younger than 16 years old, and subjects 16 years of age or older; a random sample is then selected from each age stratum. Other commonly used strata in medicine besides age include gender of patient, severity or stage of disease, and duration of disease. Characteristics used to stratify should be related to the measurement of interest, in which case stratified random sampling is the most efficient, meaning that it requires the smallest sample size.

Cluster Sampling

A cluster random sample results from a two-stage process in which the population is divided into clusters and a subset of the clusters is randomly selected. Clusters are commonly based on geographic areas or districts, so this approach is used more often in epidemiologic research than in clinical studies. For example, the sample for a household survey taken in a city may be selected by using city blocks as clusters; a random sample of city blocks is selected, and all households (or a random sample of households) within the selected city blocks are surveyed. In multicenter trials, the institutions selected to participate in the study constitute the clusters; patients from each institution can be selected using another random-sampling procedure. Cluster sampling is somewhat less efficient than the other sampling methods because it requires a larger sample size, but in some situations, such as in multicenter trials, it is the method of choice for obtaining adequate numbers of patients.

Nonprobability Sampling

The sampling methods just discussed are all based on probability, but nonprobability sampling methods also exist, such as convenience samples or quota samples. Nonprobability samples are those in which the probability that a subject is selected is unknown and may reflect selection biases of the person doing the study; they do not fulfill the requirements of randomness needed to estimate sampling errors. When we use the term “sample” in the context of observational studies, we will assume that the sample has been randomly selected in an appropriate way.

Random Assignment

Random sampling methods are used when a sample of subjects is selected from a population of possible subjects in observational studies, such as cohort, case–control, and cross-sectional studies. In experimental studies such as randomized clinical trials, subjects are first selected for inclusion in the study on the basis of appropriate criteria; they are then assigned to different treatment modalities. If the assignment of subjects to treatments is done by using random methods, the process is called random assignment. Random assignment may also occur by randomly assigning treatments to subjects. In either case, random assignment helps ensure that the groups receiving the different treatment modalities are as similar as possible. Thus, any differences in outcome at the conclusion of the study are more likely to be the result of differences in the treatments used in the study rather than differences in the compositions of the groups.

Random assignment is best carried out by using random numbers. As an example, consider the CASS study (1983), in which patients meeting the entry criteria were divided into clinical subsets and then randomly assigned to either medical or surgical treatment. Random assignment in this study could have been accomplished by using a list of random numbers (obtained from a computer or a random number table) and assigning the random numbers to patients as they entered the trial. If a study involves several investigators at different sites, such as in a multicenter trial, the investigator preparing to enter an eligible patient in the study may call a central office to learn which treatment assignment is next. As an alternative, separately randomized lists may be generated for each site. Of course, in double-blind studies, someone other than the investigator must keep the list of random assignments.

Suppose investigators in the CASS study wanted an equal number of patients at each site participating in the study. For this design, the assignment of random numbers might have been balanced within blocks of patients of a predetermined size. For example, balancing patients within blocks of 12 would guarantee that every time 12 patients entered the study at a given site, 6 patients received the medical treatment and 6 received the surgical treatment. Within the block of 12 patients, however, assignment would be random until 6 patients were assigned to one or the other of the treatments.

A study design may match subjects on important characteristics, such as gender, age group, or severity of disease, and then make the random assignment. This stratified assignment controls for possible confounding effects of the characteristic(s); it is equivalent to stratified random sampling in observational studies.

Many types of biases may result in studies in which patients are not randomly assigned to treatment modalities. For instance, early studies comparing medical and surgical treatment for coronary artery disease did not randomly assign patients to treatments and were criticized as a result. Some critics claimed that sicker patients were not candidates for surgery, and thus, the group receiving surgery was biased by having healthier subjects. Other critics stated that the healthier patients were given medical treatment because their disease was not as serious as that of the sicker patients. In nonrandomized studies, the problem is determining which biases are operating and which conclusions are appropriate; in fact, the CASS study was designed partly in response to these criticisms. A description of different kinds of biases that threaten the validity of studies is given in Chapter 13.

Using and Interpreting Random Samples

In actual clinical studies, patients are not always randomly selected from the population about which the investigator wishes to make inferences. Instead, the clinical researcher often uses all patients at hand who meet the entry criteria for the study. This practical procedure is used especially when studies involve rather uncommon conditions. Colton (1974) makes a useful distinction between the target population and the sampled population. The target population is the population to which the investigator wishes to generalize; the sampled population is the population from which the sample was actually drawn. Figure 4-1 presents a scheme of these concepts.

For example, Shipley and colleagues (1999) in Presenting Problem 3 clearly wished to generalize their findings about survival to all men with localized prostate cancer, such as patients who live in other locations and perhaps even patients who do not yet have the disease. The sample was the set of patients at six medical centers with T1b, T1c, and T2 tumors treated between 1988 and 1995 using external beam radiation. Statistical inference permits generalization from the sample to the population sampled. In order to make inferences from the population sampled to the target population, we must ask whether the population sampled is representative of the target population. A population (or sample) is representative of the target population if the distribution of important characteristics in the sampled population is the same as that in the target population. This judgment is clinical, not statistical. It points to the importance of always reading the Method section of journal articles to learn what population was actually sampled so that you can determine the representativeness of that population.

 Figure 4-1. Target and sampled populations.

Population Parameters & Sample Statistics

Statisticians use precise language to describe characteristics of populations and samples. Measures of central tendency and variation, such as the mean and the standard deviation, are fixed and invariant characteristics in populations and are called parameters. In samples, however, the observed mean or standard deviation calculated on the basis of the sample information is actually an estimate of the population mean or standard deviation; these estimates are called statistics. Statisticians customarily use Greek letters for population parameters and Roman letters for sample statistics. Some of the frequently encountered symbols used in this text are summarized in Table 4-5.

RANDOM VARIABLES & PROBABILITY DISTRIBUTIONS

The characteristic of interest in a study is called a variable. Diermayer and colleagues (1999) examined several variables for the study described in Presenting Problem 1, such as sex, race, site of infection, and period in the epidemic during which the patient contracted the disease. The term “variable” makes sense because the value of the characteristic varies from one subject to another. This variation results from inherent biologic variation among individuals and from errors, called measurement errors, made in measuring and recording a subject's value on a characteristic. A random variable is a variable in a study in which subjects are randomly selected. If the subjects in Presenting Problem 1 are a random sample selected from a larger population of citizens, then sex, race, and site of infection are examples of random variables.

Table 4-5. Commonly used symbols for parameters and statistics.

Characteristic        Parameter Symbol   Statistic Symbol
Mean                  μ                  X̅
Standard deviation    σ                  SD
Variance              σ²                 s²
Correlation           ρ                  r
Proportion            π                  p

Just as values of characteristics, such as site of infection or PSA level, can be summarized in frequency distributions, values of a random variable can be summarized in a frequency distribution called a probability distribution. For example, if X is a random variable defined as the PSA level prior to treatment in Presenting Problem 3, X can take on any value between 0.1 and 2500; and we can determine the probability that the random variable X has any given value or range of values. For instance, from Box 4-1, the probability that X < 10 is 130/799, or 0.163 (or about 1 in 6). In some applications, a formula or rule will adequately describe a distribution; the formula can then be used to calculate the probability of interest. In other situations, a theoretical probability distribution provides a good fit to the distribution of the variable of interest.

Several theoretical probability distributions are important in statistics, and we shall examine three that are useful in medicine. Both the binomial and the Poisson are discrete probability distributions; that is, the associated random variable takes only integer values, 0, 1, 2, …, n. The normal (gaussian) distribution is a continuous probability distribution; that is, the associated random variable has values measured on a continuous scale. We will examine the binomial and Poisson distributions briefly, using examples from the presenting problems to illustrate each; then we will discuss the normal distribution in greater detail.

The Binomial Distribution

Suppose an event can have only binary outcomes (eg, yes and no, or positive and negative), denoted A and B. The probability of A is denoted by π, or P(A) = π, and this probability stays the same each time the event occurs. The probability of B must therefore be 1 – π, because B occurs if A does not. If an experiment involving this event is repeated n times and the outcome is independent from one trial to another, what is the probability that outcome A occurs exactly X times? Or equivalently, what proportion of the n outcomes will be A? These questions frequently are of interest, especially in basic science research, and they can be answered with the binomial distribution.

Box 4-1. Estimated rates of no biochemical recurrence according to pretreatment prostate-specific antigen values.

Figure. No Caption available.

Number of Patients at Risk by Pretreatment PSA Valuesa

PSA Group   Patients at Risk
Total       1607  1600  1552  1176  804  448  237  104  39  14
1            799   795   775   587  387  233  119   57  23   6
2            419   418   407   303  209   98   49   20   4   1
3            163   162   158   126   93   50   25   10   5   3
4            226   225   212   160  115   67   44   17   7   4

a Data represent 1607 patients with stage T1b, T1c, T2, and NX tumors; P < 0.001 for all groups. PSA = prostate-specific antigen.
Source: Reproduced, with permission, from Shipley WU, Thomas HD, Sandler HM, Hanks GE, Zietman AL, Perez CA, et al: Radiation therapy for clinically localized prostate cancer: A multiinstitutional pooled analysis. JAMA 1999;281:1598–1604.

Basic principles of the binomial distribution were developed by the 17th century Swiss mathematician Jakob Bernoulli, who made many contributions to probability theory. He was the author of what is generally acknowledged as the first book devoted to probability, published in 1713. In fact, in his honor, each trial involving a binomial probability is sometimes called a Bernoulli trial, and a sequence of trials is called a Bernoulli process. The binomial distribution gives the probability that a specified outcome occurs in a given number of independent trials. The binomial distribution can be used to model the inheritability of a particular trait in genetics, to estimate the occurrence of a specific reaction (eg, the single packet, or quantal release, of acetylcholine at the neuromuscular junction), or to estimate the death of a cancer cell in an in vitro test of a new chemotherapeutic agent.

We use the information collected by Shipley and colleagues (1999) in Presenting Problem 3 to illustrate the binomial distribution. Assume, for a moment, that the entire population of men with a localized prostate tumor and a pretreatment PSA < 10 has been studied, and the probability of 5-year survival is equal to 0.8 (we use 0.8 for computational convenience, rather than 0.81 as reported in the study). Let Srepresent the event of 5-year survival and D represent death before 5 years; then, π = P(S) = 0.8 and 1 – π = P(D) = 0.2. Consider a group of n = 2 men with a localized prostate tumor and a pretreatment PSA < 10. What is the probability that exactly two men live 5 years? That exactly one lives 5 years? That none lives 5 years? These probabilities are found by using the multiplication and addition rules outlined earlier in this chapter.

The probability that exactly two men live 5 years is found by using the multiplication rule for independent events. We know that P(S) = 0.8 for patient 1 and P(S) = 0.8 for patient 2. Because the survival of one patient is independent from (has no effect on) the survival of the other patient, the probability of both surviving is

P(S and S) = P(S) × P(S) = (0.8)(0.8) = 0.64

The event of exactly one patient living 5 years can occur in two ways: patient 1 survives 5 years and patient 2 does not, or patient 2 survives 5 years and patient 1 does not. These two events are mutually exclusive; therefore, after using the multiplication rule to obtain the probability of each event, we can use the addition rule for mutually exclusive events to combine the probabilities as follows:

P(exactly one survives) = P(S)P(D) + P(D)P(S) = (0.8)(0.2) + (0.2)(0.8) = 0.16 + 0.16 = 0.32

These computational steps are summarized in Table 4-6. Note that the total probability is

(0.8 + 0.2)² = (0.8)² + 2(0.8)(0.2) + (0.2)² = 0.64 + 0.32 + 0.04 = 1

which you may recognize as the binomial formula, (a + b)² = a² + 2ab + b².

The same process can be applied for a group of patients of any size or for any number of trials, but it becomes quite tedious. An easier technique is to use the formula for the binomial distribution, which follows. The probability of X outcomes in a group of size n, if each outcome has probability π and is independent from all other outcomes, is given by

P(X) = {n! / [X!(n − X)!]} π^X (1 − π)^(n−X)

where ! is the symbol for factorial; n! is called n factorial and is equal to the product n(n – 1)(n – 2)…(3)(2)(1). For example, 4! = (4)(3)(2)(1) = 24. The number 0! is defined as 1. The symbol πX indicates that the probability is raised to the power X, and (1 – π)n–X means that 1 minus the probability is raised to the power n – X. The expression n!/[X!(n – X)!] is sometimes referred to as the formula for combinations because it gives the number of combinations (or assortments) of X items possible among the n items in the group.

 Table 4-6. Summary of probabilities for two patients.

To verify that the probability that exactly X = 1 of n = 2 patients survives 5 years is 0.32, we use the formula:

P(1) = {2! / [1!(2 − 1)!]} (0.8)¹(0.2)¹ = 2(0.8)(0.2) = 0.32

To summarize, the binomial distribution is useful for answering questions about the probability of X number of occurrences in n independent trials when there is a constant probability π of success on each trial. For example, suppose a new series of men with prostate tumors is begun with ten patients. We can use the binomial distribution to calculate the probability that any particular number of them will survive 5 years. For instance, the probability that all ten will survive 5 years is

P(10) = {10! / [10!(10 − 10)!]} (0.8)¹⁰(0.2)⁰ = (0.8)¹⁰ = 0.107

Similarly, the probability that exactly eight patients will survive 5 years is

P(8) = {10! / [8!(10 − 8)!]} (0.8)⁸(0.2)² = 45(0.168)(0.04) = 0.302

Table 4-7 lists the probabilities for X = 0, 1, 2, 3, …, 10; a plot of the binomial distribution when n = 10 and π = 0.8 is given in Figure 4-2. The mean of the binomial distribution is nπ; so (10)(0.8) = 8 is the mean number of patients surviving 5 years in this example. The standard deviation is √[nπ(1 − π)] = √[(10)(0.8)(0.2)] = √1.6 ≈ 1.26.

Table 4-7. Probabilities for binomial distribution with n = 10 and π = 0.8.

Number of Patients Surviving (X)   n!/[X!(n − X)!]   π^X     (1 − π)^(n−X)   P(X)a
 0                                   1               1       0.0000001       0
 1                                  10               0.8     0.0000005       0
 2                                  45               0.64    0.0000026       0.0001
 3                                 120               0.512   0.0000128       0.0008
 4                                 210               0.410   0.000064        0.0055
 5                                 252               0.328   0.00032         0.0264
 6                                 210               0.262   0.0016          0.0881
 7                                 120               0.210   0.008           0.2013
 8                                  45               0.168   0.04            0.3020
 9                                  10               0.134   0.2             0.2684
10                                   1               0.107   1               0.1074
a Rounded to four decimal places.

Thus, the only two pieces of information needed to define a binomial distribution are n and π, which are called the parameters of the binomial distribution. Studies involving dichotomous, or binary, variables often use a proportion rather than a number (eg, the proportion of patients surviving a given length of time rather than the number of patients). When a proportion is used instead of a number of successes, the same two pieces of information (n and π) are needed. Because the proportion is found by dividing X by n, however, the mean of the distribution of the proportion becomes π, and the standard deviation becomes

 Figure 4-2. Binomial distribution for n = 10 and π = 0.8.

Even using the formula for the binomial distribution becomes time-consuming, especially if the numbers are large. Also, the formula gives the probability of observing exactly X successes, and interest frequently lies in knowing the probability of X or more successes or of X or fewer successes. For example, to find the probability that eight or more patients will survive 5 or more years, we must use the formula to find the separate probabilities that eight will survive, nine will survive, and ten will survive and then sum these results; from Table 4-7, we obtain P(X ≥ 8) = P(X = 8) + P(X = 9) + P(X = 10) = 0.3020 + 0.2684 + 0.1074 = 0.6778. Tables giving probabilities for the binomial distribution are presented in many elementary texts. Much research in the health field is conducted with sample sizes large enough to use an approximation to the binomial distribution; this approximation is discussed in Chapter 5.

The Poisson Distribution

The Poisson distribution is named for the French mathematician who derived it, Siméon D. Poisson. Like the binomial, the Poisson distribution is a discrete distribution applicable when the outcome is the number of times an event occurs. The Poisson distributioncan be used to determine the probability of rare events; it gives the probability that an outcome occurs a specified number of times when the number of trials is large and the probability of any one occurrence is small. For instance, the Poisson distribution is used to plan the number of beds a hospital needs in its intensive care unit, the number of ambulances needed on call, or the number of operators needed on a switchboard to ensure that an adequate number of resources is available. It can also be used to model the number of cells in a given volume of fluid, the number of bacterial colonies growing in a certain amount of medium, or the emission of radioactive particles from a specified amount of radioactive material.

Consider a random variable representing the number of times an event occurs in a given time or space interval. Then the probability of exactly X occurrences is given by the formula:

P(X) = (λ^X e^(−λ)) / X!

in which λ (the lowercase Greek letter lambda) is the value of both the mean and the variance of the Poisson distribution, and e is the base of the natural logarithms, approximately equal to 2.718. The term λ is called the parameter of the Poisson distribution, just as n and π are the parameters of the binomial distribution. Only one piece of information, λ, is therefore needed to characterize any given Poisson distribution.

A random variable having a Poisson distribution was used in the Coronary Artery Surgery Study (Rogers et al, 1990) summarized in Presenting Problem 4. The number of hospitalizations for each group of patients (medical and surgical) followed Poisson distributions. This model is appropriate because the chance that a patient goes into the hospital during any one time interval is small and can be assumed to be independent from patient to patient. After mean follow-up of 11 years, the 390 patients randomized to the medical group were hospitalized a total of 1256 times; the 390 patients randomized to the surgical group were hospitalized a total of 1487 times. The mean number of hospitalizations for medical patients is 1256/390 = 3.22, and the mean for the surgical patients is 1487/390 = 3.81. We can use this information and the formula for the Poisson model to calculate probabilities of numbers of hospitalizations. For example, the probability that a patient in the medical group has zero hospitalizations is

P(0) = (3.22⁰ e^(−3.22)) / 0! = e^(−3.22) = 0.040

The probability that a patient has exactly one hospitalization is

P(1) = (3.22¹ e^(−3.22)) / 1! = (3.22)(0.040) = 0.129

The calculations for the Poisson distribution when λ = 3.22 and X = 0, 1, 2,…, 7 are given in Table 4-8.

Figure 4-3 is a graph of the Poisson distribution for λ = 3.22. The mean of the distribution is between 3 and 4 (actually, it is 3.22). Note the slight positive skew of the Poisson distribution; the skew becomes more pronounced as λ becomes smaller.

The Normal (Gaussian) Distribution

We now turn to the most famous probability distribution in statistics, called the normal, or gaussian, distribution (or bell-shaped curve). The normal curve was first discovered by French mathematician Abraham de Moivre and published in 1733. Two mathematician-astronomers, however, Pierre-Simon Laplace from France and Karl Friedrich Gauss from Germany, were responsible for establishing the scientific principles of the normal distribution. Many consider Laplace to have made the greatest contributions to probability theory, but Gauss' name was given to the distribution after he applied it to the theory of the motions of heavenly bodies. Some statisticians prefer to use the term gaussian instead of “normal” because the latter term has the unfortunate (and incorrect) connotation that the normal curve describes the way characteristics are distributed in populations composed of “normal”—as opposed to sick—individuals. We use the term “normal” in this text, however, because it is more frequently used in the medical literature.

Figure 4-3. Poisson distribution for λ = 3.22.

Table 4-8. Probabilities for Poisson distribution with λ = 3.22.

Number of Hospitalizations (X)   3.22^X    e^(−3.22)   X!     P(X)a
0                                   1.00   0.040          1   0.040
1                                   3.22   0.040          1   0.129
2                                  10.37   0.040          2   0.207
3                                  33.39   0.040          6   0.223
4                                 107.50   0.040         24   0.179
5                                 346.16   0.040        120   0.115
6                                1114.64   0.040        720   0.062
7                                3589.15   0.040       5040   0.028
a Rounded to three decimal places.

Describing the Normal Distribution

The normal distribution is continuous, so it can take on any value (not just integers, as do the binomial and Poisson distributions). It is a smooth, bell-shaped curve and is symmetric about the mean of the distribution, symbolized by μ (Greek letter mu). The curve is shown in Figure 4-4. The standard deviation of the distribution is symbolized by σ (Greek letter sigma); σ is the horizontal distance between the mean and the point of inflection on the curve. The point of inflection is the point where the curve changes from convex to concave. The mean and the standard deviation (or variance) are the two parameters of the normal distribution and completely determine the location on the number line and the shape of a normal curve. Thus, many different normal curves are possible, one each for every value of the mean and the standard deviation. Because the normal distribution is a probability distribution, the area under the curve is equal to 1. (Recall that one of the properties of probability is that the sum of the probabilities for any given set of events is equal to 1.) Because it is a symmetric distribution, half the area is on the left of the mean and half is on the right.

 Figure 4-4. Normal distribution and percentage of area under the curve.

Given a random variable X that can take on any value between negative and positive infinity (–∞ and +∞), the formula for the normal distribution is as follows:

f(X) = [1 / (σ√(2π))] exp[−(X − μ)² / (2σ²)]

where exp indicates that the bracketed quantity is the exponent of e, the base of the natural logarithms, and π ≈ 3.1416. The function depends only on the mean μ and the standard deviation σ because they are the only quantities that vary.

Because the area under the curve is equal to 1, we can use the curve for calculating probabilities. For example, to find the probability that an observation falls between a and b on the curve in Figure 4-5, we integrate the preceding equation between a and b, replacing the limits –∞ and +∞ with a and b, respectively. (Integration is a mathematical technique in calculus used to find area under a curve.)

The Standard Normal (z) Distribution

Fortunately, there is no need to integrate this function because tables for it are available. So that we do not need a different table for every value of μ and σ, however, we use the standard normal curve (distribution), which has a mean of 0 and a standard deviation of 1, as shown in Figure 4-6. This curve is also called the z distribution. Table A-2 (see Appendix A) gives the area under the curve between –z and +z, the sum of the areas to the left of –z and the right of +z, and the area to either the left of –z or the right of +z.

 Figure 4-5. Area under a normal curve between a and b.

Before we use Table A–2, look at the standard normal distribution in Figure 4-6 and estimate the proportion (or percentage) of these areas:

1. Above 1

2. Below -1

3. Above 2

4. Below -2

5. Between -1 and 1

6. Between -2 and 2

Now turn to Table A–2 and find the designated areas. The answers follow.

1. 0.159 of the area is to the right of 1 (from the fourth column in Table A–2).

2. Table A–2 does not list values for z less than 0; however, because the distribution is symmetric about 0, the area below -1 is the same as the area to the right of 1, which is 0.159.

3. 0.023 of the area is to the right of 2 (from the fourth column in Table A–2).

4. The same reasoning as in answer 2 applies here; so 0.023 of the area is to the left of -2.

5. 0.683 of the area is between -1 and 1 (from the second column in Table A–2).

6. 0.954 of the area is between -2 and 2 (from the second column in Table A–2).
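The areas listed in Table A–2 can also be computed directly. The sketch below uses the error function (`math.erf`) from Python's standard library to build the cumulative standard normal area; this is a standard identity, not a method from the text:

```python
from math import erf, sqrt

def phi(z):
    # Cumulative area under the standard normal curve up to z,
    # via the identity Phi(z) = (1 + erf(z / sqrt(2))) / 2.
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(1 - phi(1), 3))        # area above 1
print(round(phi(-1), 3))           # area below -1 (same, by symmetry)
print(round(phi(1) - phi(-1), 3))  # area between -1 and 1
print(round(phi(2) - phi(-2), 3))  # area between -2 and 2
```

The printed values reproduce answers 1, 2, 5, and 6 above (0.159, 0.159, 0.683, and 0.954).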

When the mean of a gaussian distribution is not 0 and the standard deviation is not 1, a simple transformation, called the z transformation, must be made so that we can use the standard normal table. The z transformation expresses the deviation from the mean in standard deviation units. That is, any normal distribution can be transformed to the standard normal distribution by using the following steps:

1. Move the distribution up or down the number line so that the mean is 0. This step is accomplished by subtracting the mean μ from the value for X.

2. Make the distribution either narrower or wider so that the standard deviation is equal to 1. This step is accomplished by dividing by σ.

To summarize, the transformed value is

z = (X − μ) / σ

and is variously called a z score, a normal deviate, a standard score, or a critical ratio.

 Figure 4-6. Standard normal (z) distribution.

Examples Using the Standard Normal Distribution

To illustrate the standard normal distribution, we consider Presenting Problem 5. To facilitate computations, we assume systolic BP in normal healthy individuals is normally distributed with μ = 120 and σ = 10 mm Hg (rather than 119.7 and 10.7 observed in the study) (Table 4-9). Make the appropriate transformations to answer the following questions. (Hint: Make sketches of the distribution to be sure you are finding the correct area.)

1. What area of the curve is above 130 mm Hg?

 Figure 4-7. Finding areas under a curve using normal distribution.

2. What area of the curve is above 140 mm Hg?

3. What area of the curve is between 100 and 140 mm Hg?

4. What area of the curve is above 150 mm Hg?

Table 4-9. Mean blood pressures.

| Ages | Men: Systolic | Men: Diastolic | Women: Systolic | Women: Diastolic |
|------|---------------|----------------|-----------------|------------------|
| 16   | 115           | 70             | 112             | 69               |
| 19   | 119           | 71             | 114             | 70               |
| 24   | 122           | 73             | 115             | 71               |
| 29   | 122           | 75             | 116             | 72               |
| 39   | 123           | 76             | 118             | 74               |
| 49   | 125           | 78             | 123             | 76               |
| 59   | 128           | 79             | 128             | 79               |
| 69   | 132           | 79             | 134             | 80               |

5. What area of the curve is either below 90 mm Hg or above 150 mm Hg?

6. What is the value of the systolic BP that divides the area under the curve into the lower 95% and the upper 5%?

7. What is the value of the systolic BP that divides the area under the curve into the lower 97.5% and the upper 2.5%?

The answers, referring to the sketches in Figure 4-7, are shown in the following list.

1. z = (130 – 120)/10 = 1.00, and the area above 1.00 is 0.159. So 15.9% of normal healthy individuals have a systolic BP above 1 standard deviation (> 130 mm Hg).

2. z = (140 – 120)/10 = 2.00, and the area above 2.00 is 0.023. So 2.3% have a systolic BP above 2 standard deviations (> 140 mm Hg).

3. z1 = (100 – 120)/10 = -2.00, and z2 = (140 – 120)/10 = 2.00; the area between -2 and +2 is 0.954. So 95.4% have a systolic BP between -2 and +2 standard deviations (between 100 and 140 mm Hg).

4. z = (150 – 120)/10 = 3.00, and the area above 3.00 is 0.001. So only 0.1% have a systolic BP above 3 standard deviations (> 150 mm Hg).

5. z1 = (90 – 120)/10 = -3.00, and z2 = 3.00; the area below -3 and above +3 is 0.003. So only 0.3% have a systolic BP either below or above 3 standard deviations (< 90 or > 150 mm Hg).

6. This problem is a bit more difficult and must be worked backward. The value of z, obtained from Table A–2, that divides the lower 0.95 of the area from the upper 0.05 is 1.645. Substituting this value for z in the formula and solving for X yields

X = μ + zσ = 120 + 1.645(10) = 136.45

A systolic BP of 136.45 mm Hg is therefore at the 95th percentile (a more specific BP measurement than is generally made). So 95% of normal, healthy people have a systolic BP of 136.45 mm Hg or lower.

7. Working backward again, we obtain the value 1.96 for z. Substituting and solving for X yields

X = 120 + 1.96(10) = 139.6

Thus, a systolic BP of 139.6 mm Hg divides the distribution of normal, healthy individuals into the lower 97.5% and the upper 2.5%.
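These areas and percentiles can be checked numerically. The sketch below uses a small helper of our own (not from the text) that computes the standard normal cumulative distribution from the error function in Python's standard library:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 120, 10  # assumed population mean and SD of systolic BP

# 1. Area above 130 mm Hg (z = 1.00)
print(round(1 - phi((130 - mu) / sigma), 3))  # 0.159
# 3. Area between 100 and 140 mm Hg (z = -2 to +2)
print(round(phi(2) - phi(-2), 3))             # 0.954
# 6. Working backward: the 95th percentile is mu + 1.645 * sigma
print(round(mu + 1.645 * sigma, 2))           # 136.45
```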

From the results of the previous exercises, we can state some important guidelines for using the normal distribution. As mentioned in Chapter 3, the normal distribution has the following distinguishing features:

1. The mean ±1 standard deviation contains approximately 68.3% of the area under the normal curve.

2. The mean ±2 standard deviations contains approximately 95% of the area under the normal curve.

3. The mean ±3 standard deviations contains approximately 99.7% of the area under the normal curve.

Although these features indicate that the normal distribution is a valuable tool in statistics, its value goes beyond merely describing distributions. In actuality, few characteristics are normally distributed. The systolic BP data in Presenting Problem 5 surely are not exactly normally distributed in the population at large. In some populations, data are positively skewed: More people are found with systolic pressures above 120 mm Hg than below. Elveback and coworkers (1970) showed that many common laboratory values are not normally distributed; consequently, using the mean ±2 SD may cause substantially more or less than 5% of the population to lie outside 2 standard deviations. Used judiciously, however, the three guidelines are good rules of thumb about characteristics that have approximately normal distributions.

The major importance of the normal distribution is in the role it plays in statistical inference. In the next section, we show that the normal distribution forms the basis for making statistical inferences even when the population is not normally distributed. The following point is very important and will be made several times: Statistical inference generally involves mean values of a population, not values related to individuals. The examples we just discussed deal with individuals and, if we are to make probability statements about individuals using the mean and standard deviation rules, the distribution of the characteristic of interest must be approximately normally distributed.

SAMPLING DISTRIBUTIONS

We just learned that the binomial, Poisson, and normal distributions can be used to determine how likely it is that any specific measurement is in the population. Now we turn to another type of distribution, called a sampling distribution, that is very important in statistics. Understanding sampling distributions is essential for grasping the logic underlying the prototypical statements from the literature. After we have a basic comprehension of sampling distributions, we will have the tools to learn about estimation and hypothesis testing, methods that permit investigators to generalize study results to the population that the sample represents. Throughout, we assume that the sample has been selected using one of the proper methods of random sampling discussed in the section titled, “Populations and Samples.”

The distribution of individual observations is very different from the distribution of means, which is called a sampling distribution. Gelber and colleagues (1997) collected data on heart rate variation to deep breathing and the Valsalva ratio in order to establish population norms. A national sample of 490 subjects was the basis of establishing norm values for heart rate variation, but clearly the authors wished to generalize from this sample to all healthy adults. If another sample of 490 healthy individuals were evaluated, it is unlikely that exactly this distribution would be observed.

Although the focus in this study was on the normal range, defined by the central 95% of the observed distribution, the researchers were also interested in the mean heart rate variation. The mean in another sample is likely to be less (or more) than the 50.17 observed in their sample, and they might wish to know how much the mean can be expected to differ. To find out, they could randomly select many samples from the target population of patients, compute the mean in each sample, and then examine the distribution of means to estimate the amount of variation that can be expected from one sample to another. This distribution of means is called the sampling distribution of the mean. It would be very tedious, however, to have to take many samples in order to estimate the variability of the mean. The sampling distribution of the mean has several desirable characteristics, not the least of which is that it permits us to answer questions about a mean with only one sample.

In the following section, we use a simple hypothetical example to illustrate how a sampling distribution can be generated. Then we show that we need not generate a sampling distribution in practice; instead, we can use statistical theory to answer questions about a single observed mean.

The Sampling Distribution of the Mean

Four features define a sampling distribution. The first is the statistic of interest, for example, the mean, standard deviation, or proportion. Because the sampling distribution of the mean plays such a key role in statistics, we use it to illustrate the concept. The second defining feature is a random selection of the sample. The third—and very important—feature is the size of the random sample. The fourth feature is specification of the population being sampled.

To illustrate, suppose a physician is trying to decide whether to begin mailing reminders to patients who have waited more than a year to schedule their annual examination. The physician reviews the files of all patients who came in for an annual checkup during the past month and determines how many months had passed since their previous visit. To keep calculations simple, we use a very small population size of five patients. Table 4-10 lists the number of months since the last examination for the five patients in this population. The following discussion presents details about generating and using a sampling distribution for this example.

Generating a Sampling Distribution

To generate a sampling distribution from the population of five patients, we select all possible samples of two patients per sample and calculate the mean number of months since the last examination for each sample. For a population of five, 25 different possible samples of two can be selected. That is, patient 1 (12 months since last checkup) can be selected as the first observation and returned to the sample; then, patient 1 (12 months), or patient 2 (13 months), or patient 3 (14 months), and so on, can be selected as the second observation. The 25 different possible samples and the mean number of months since the patient's last visit for each sample are given in Table 4-11.

Table 4-10. Population of months since last examination.

| Patient | Number of Months Since Last Examination |
|---------|-----------------------------------------|
| 1       | 12                                      |
| 2       | 13                                      |
| 3       | 14                                      |
| 4       | 15                                      |
| 5       | 16                                      |

Comparing the Population Distribution with the Sampling Distribution

Figure 4-8 is a graph of the population of patients and the number of months since their last examination. The probability distribution in this population is uniform, because every length of time has the same (or uniform) probability of occurrence; because of its shape, this distribution is also referred to as rectangular. The mean in this population is 14 months, and the standard deviation is 1.41 months (see Exercise 8).

Figure 4-8. Distribution of population values of number of months since last office visit (data from Table 4-10).

Table 4-11. Twenty-five samples of size 2 patients each.

| Sample | Patients Selected | Number of Months for Each | Mean |
|--------|-------------------|---------------------------|------|
| 1      | 1, 1              | 12, 12                    | 12.0 |
| 2      | 1, 2              | 12, 13                    | 12.5 |
| 3      | 1, 3              | 12, 14                    | 13.0 |
| 4      | 1, 4              | 12, 15                    | 13.5 |
| 5      | 1, 5              | 12, 16                    | 14.0 |
| 6      | 2, 1              | 13, 12                    | 12.5 |
| 7      | 2, 2              | 13, 13                    | 13.0 |
| 8      | 2, 3              | 13, 14                    | 13.5 |
| 9      | 2, 4              | 13, 15                    | 14.0 |
| 10     | 2, 5              | 13, 16                    | 14.5 |
| 11     | 3, 1              | 14, 12                    | 13.0 |
| 12     | 3, 2              | 14, 13                    | 13.5 |
| 13     | 3, 3              | 14, 14                    | 14.0 |
| 14     | 3, 4              | 14, 15                    | 14.5 |
| 15     | 3, 5              | 14, 16                    | 15.0 |
| 16     | 4, 1              | 15, 12                    | 13.5 |
| 17     | 4, 2              | 15, 13                    | 14.0 |
| 18     | 4, 3              | 15, 14                    | 14.5 |
| 19     | 4, 4              | 15, 15                    | 15.0 |
| 20     | 4, 5              | 15, 16                    | 15.5 |
| 21     | 5, 1              | 16, 12                    | 14.0 |
| 22     | 5, 2              | 16, 13                    | 14.5 |
| 23     | 5, 3              | 16, 14                    | 15.0 |
| 24     | 5, 4              | 16, 15                    | 15.5 |
| 25     | 5, 5              | 16, 16                    | 16.0 |

Figure 4-9 is a graph of the sampling distribution of the mean number of months since the last visit for a sample of size 2. The sampling distribution of means is certainly not uniform; it is shaped somewhat like a pyramid. The following are three important characteristics of this sampling distribution:

1. The mean of the 25 separate means is 14 months, the same as the mean in the population.

2. The variability in the sampling distribution of means is less than the variability in the original population. The standard deviation in the population is 1.41; the standard deviation of the means is 1.00.

3. The shape of the sampling distribution of means, even for a sample of size 2, is beginning to “approach” the shape of the normal distribution, although the shape of the population distribution is rectangular, not normal.
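These three characteristics can be verified by enumerating the 25 samples directly; a short sketch using only the standard library:

```python
from itertools import product
from statistics import mean, pstdev

population = [12, 13, 14, 15, 16]  # months since last exam (Table 4-10)

# All 25 ordered samples of size 2, drawn with replacement.
sample_means = [mean(s) for s in product(population, repeat=2)]

print(mean(sample_means))    # 14.0 -- equals the population mean
print(pstdev(population))    # ~1.41 -- SD of individual observations
print(pstdev(sample_means))  # 1.0  -- SD of the 25 sample means
# Probability that a sample mean is 15 months or more:
print(sum(m >= 15 for m in sample_means) / 25)  # 0.24
```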

Using the Sampling Distribution

The sampling distribution of the mean is extremely useful because it allows us to make statements about the probability that specific observations will occur. For example, using the sampling distribution in Figure 4-9, we can ask questions such as “If the mean number of months since the previous checkup is really 14, how likely is a random sample of n = 2 patients in which the mean is 15 or more months?” From the sampling distribution, we see that a mean of 15 or more can occur 6 times out of 25, or 24% of the time. A random sample with a mean of 15 or more is therefore not all that unusual.

 Figure 4-9. Distribution of mean number of months since last office visit for n = 2 (data from Table 4-11).

In medical studies, the sampling distribution of the mean can answer questions such as “If there really is no difference between the therapies, how often would the observed outcome (or something more extreme) occur simply by chance?”

The Central Limit Theorem

Generating the sampling distribution for the mean each time an investigator wants to ask a statistical question would be too time-consuming, but this process is not necessary. Instead, statistical theory can be used to determine the sampling distribution of the mean in any particular situation. These properties of the sampling distribution are the basis for one of the most important theorems in statistics, called the central limit theorem. A mathematical proof of the central limit theorem is not possible in this text, but we will advance some empirical arguments that hopefully convince you that the theory is valid. The following list details the features of the central limit theorem.

Given a population with mean μ and standard deviation σ, the sampling distribution of the mean based on repeated random samples of size n has the following properties:

1. The mean of the sampling distribution, or the mean of the means, is equal to the population mean μ based on the individual observations.

2. The standard deviation in the sampling distribution of the mean is equal to

σ/√n

This quantity, called the standard error of the mean, plays an important role in many of the statistical procedures discussed in several later chapters. The standard error of the mean is variously written as

SE(X̅), SEM, or σX̅

or sometimes simply SE, if it is clear the mean is being referred to.

3. If the distribution in the population is normal, then the sampling distribution of the mean is also normal. More importantly, for sufficiently large sample sizes, the sampling distribution of the mean is approximately normally distributed, regardless of the shape of the original population distribution.

The central limit theorem is illustrated for four different population distributions in Figure 4-10. In row A, the shape of the population distribution is uniform, or rectangular, as in our example of the number of months since a previous physical examination. Row B is a bimodal distribution in which extreme values of the random variable are more likely to occur than middle values. Results from opinion polls in which people rate their agreement with political issues sometimes have this distribution, especially if the issue polarizes people. Bimodal distributions also occur in biology when two populations are mixed, as they are for ages of people who have Crohn's disease. Modal ages for these populations are mid-20s and late 40s to early 50s. In row C, the distribution is negatively skewed because of some small outlying values. This distribution can model a random variable, such as age of patients diagnosed with breast cancer. Finally, row D is similar to the normal distribution.

The second column of distributions in Figure 4-10 illustrates the sampling distributions of the mean when samples of size 2 are randomly selected from the parent populations. In row A, the pyramid shape is the same as in the example on months since a patient's last examination. Note that, even for the bimodal population distribution in row B, the sampling distribution of means begins to approach the shape of the normal distribution. This bell shape is more evident in the third column of Figure 4-10, in which the sampling distributions are based on sample sizes of 10. Finally, in the fourth column, for sample sizes of 30, all sampling distributions resemble the normal distribution.

A sample of 30 is commonly used as a cutoff value because sampling distributions of the mean based on sample sizes of 30 or more are considered to be normally distributed. A sample this large is not always needed, however. If the parent population is normally distributed, the means of samples of any size will be normally distributed. In nonnormal parent populations, large sample sizes are required with extremely skewed population distributions; smaller sample sizes can be used with moderately skewed distributions. Fortunately, guidelines about sample sizes have been developed, and they will be pointed out as they arise in our discussion.

In Figure 4-10, also note that in every case the mean of the sampling distributions is the same as the mean of the parent population distribution. The variability of the means decreases as the sample size increases, however, so the standard error of the mean decreases as well. Another feature to note is that the relationship between sample size and standard error of the mean is not linear; it is based on the square root of the sample size, not the sample size itself. It is therefore necessary to quadruple, not double, the sample size in order to reduce the standard error by half.
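The square-root relationship can be illustrated in a couple of lines (the function name is ours):

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)

print(standard_error(10, 25))   # 2.0
print(standard_error(10, 100))  # 1.0 -- quadrupling n halves the SE
```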

Points to Remember

Several points deserve reemphasis. In practice, selecting repeated samples of size n and generating a sampling distribution for the mean is not necessary. Instead, only one sample is selected, the sample mean is calculated (as an estimate of the population mean), and, if the sample size is 30 or more, the central limit theorem is invoked to argue that the sampling distribution of the mean is known and does not need to be generated. Then, because the mean has a known distribution, statistical questions can be addressed.

Standard Deviation versus Standard Error

The value σ measures the standard deviation in the population and is based on measurements of individuals. That is, the standard deviation tells us how much variability can be expected among individuals. The standard error of the mean, however, is the standard deviation of the means in a sampling distribution; it tells us how much variability can be expected among means in future samples.

For example, earlier in this chapter we used the fact that systolic BP is approximately normally distributed in normal healthy populations, with mean 120 mm Hg and standard deviation 10, to illustrate how areas under the curve are related to probabilities. We also demonstrated that the interval defined by the mean ±2 SD contains approximately 95% of the individual observations when the observations have a normal distribution. Because the central limit theorem tells us that a sample mean is normally distributed (when the sample size is 30 or more), we can use these same properties to relate areas under the normal curve to probabilities when the sample mean instead of an individual value is of interest. Also, we will soon see that the interval defined by the sample mean ±2 SE generally contains about 95% of the means (not the individuals) that would be observed if samples of the same size were repeatedly selected.

The Use of the Standard Deviation in Research Reports

Authors of research reports sometimes present data in terms of the mean and standard deviation. At other times, authors report the mean and standard error of the mean.

This practice is especially prominent in graphs. Although some journal editors now require authors to use the standard deviation (Bartko, 1985), many articles still use the standard error of the mean. There are two reasons to prefer the standard deviation over the standard error. First, the standard error is a function of the sample size, so it can be made smaller simply by increasing n. Second, the interval (mean ±2 SE) will contain approximately 95% of the means of samples, but it will never contain 95% of the observations on individuals; in the latter situation, the mean ±2 SD is needed. By definition, the standard error pertains to means, not to individuals. When physicians consider applying research results, they generally wish to apply them to individuals in their practice, not to groups of individuals. The standard deviation is therefore generally the more appropriate measure to report.

 Figure 4-10. Illustration of ramifications of central limit theorem.

Other Sampling Distributions

Statistics other than the mean, such as standard deviations, medians, proportions, and correlations, also have sampling distributions. In each case, the statistical issue is the same: How can the statistic of interest be expected to vary across different samples of the same size?

Although the sampling distribution of the mean is approximately normally distributed, the sampling distributions of most other statistics are not. In fact, the sampling distribution of the mean assumes that the value of the population standard deviation σ is known. In actuality, it is rarely known; therefore, the population standard deviation is estimated by the sample standard deviation SD, and the SD is used in place of the population value in the calculation of the standard error; that is, the standard error in the population is estimated by

When the SD is used, the sampling distribution of the mean actually follows a t distribution instead of the normal distribution. This important distribution is similar to the normal distribution and is discussed in detail in Chapters 5 and 6.

As other examples, the sampling distribution of the ratio of two variances (squared standard deviations) follows an F distribution, a theoretical distribution presented in Chapters 6 and 7. The proportion, which is based on the binomial distribution, is normally distributed under certain circumstances, as we shall see in Chapter 5. For the correlation to follow the normal distribution, a transformation must be applied, as illustrated in Chapter 8. Nevertheless, one property that all sampling distributions have in common is having a standard error, and the variation of the statistic in its sampling distribution is called the standard error of the statistic. Thus, the standard error of the mean is just one of many standard errors, albeit the one most commonly used in medical research.

Applications Using the Sampling Distribution of the Mean

Let us turn to some applications of the concepts introduced so far in this chapter. Recall that the critical ratio (or z score) transforms a normally distributed random variable with mean μ and standard deviation σ to the standard normal (z) distribution with mean 0 and standard deviation 1 by subtracting the mean and dividing by the standard deviation:

z = (X − μ)/σ

When we are interested in the mean rather than individual observations, the mean itself is the entity transformed. According to the central limit theorem, the mean of the sampling distribution is still μ, but the standard deviation of the mean is the standard error of the mean. The critical ratio that transforms a mean to a distribution with mean 0 and standard deviation 1 is therefore

z = (X̅ − μ)/(σ/√n)

The use of the critical ratio is illustrated in the following examples.

Example 1: Suppose a health care provider studies a randomly selected group of 25 men and women between 20 and 39 years of age and finds that their mean systolic BP is 124 mm Hg. How often would a sample of 25 patients have a mean systolic BP this high or higher? Using the data from Presenting Problem 5 on mean BP (Society of Actuaries, 1980) as a guide (see Table 4-9), we assume that systolic BP is a normally distributed random variable with a known mean of 120 mm Hg and a standard deviation of 10 mm Hg in the population of normal healthy adults. The provider's question is equivalent to asking: If repeated samples of 25 individuals are randomly selected from the population, what proportion of samples will have mean values greater than 124 mm Hg?

Solution: The sampling distribution of the mean is normal because the population of BPs is normally distributed. The mean is 120 mm Hg, and the SE (based on the known standard deviation) is equal to σ/√n = 10/√25 = 2. Therefore, the critical ratio is

z = (X̅ − μ)/(σ/√n) = (124 − 120)/2 = 2.00

From column 4 of Table A–2 (Appendix A) for the normal curve, the proportion of the z distribution area above 2.0 is 0.023; therefore, 2.3% of random samples with n = 25 can be expected to have a mean systolic BP of 124 mm Hg or higher. Figure 4-11A illustrates how the distribution of means is transformed to the critical ratio.
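A numerical check of this solution, using the error function to evaluate the normal CDF (the helper is our own):

```python
import math

def phi(z):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 120, 10, 25
se = sigma / math.sqrt(n)    # 10 / 5 = 2.0
z = (124 - mu) / se          # 2.0
print(round(1 - phi(z), 3))  # 0.023 -- proportion of samples above 124 mm Hg
```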

Example 2: Suppose a health care provider wants to detect adverse effects on systolic BP in a random sample of 25 patients using a drug that causes vasoconstriction. The provider decides that a mean systolic BP in the upper 5% of the distribution is cause for alarm; therefore, the provider must determine the value that divides the upper 5% of the sampling distribution from the lower 95%.

Solution: The solution to this example requires working backward from the area under the standard normal curve to find the value of the mean. The value of z that divides the area into the lower 95% and the upper 5% is 1.645 (we find 0.05 in column 4 of Table A–2 and read 1.645 in column 1). Substituting this value for z in the critical ratio and then solving for the mean yields

X̅ = μ + z(σ/√n) = 120 + 1.645(10/√25) = 120 + 3.29 = 123.29

A mean systolic BP of 123.29 is the value that divides the sampling distribution into the lower 95% and the upper 5%. So, there is cause for alarm if the mean in the sample of 25 patients surpasses this value (see Figure 4-11B).
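The same backward calculation can be written out directly; a minimal sketch:

```python
import math

mu, sigma, n = 120, 10, 25
se = sigma / math.sqrt(n)  # 2.0
cutoff = mu + 1.645 * se   # mean dividing the lower 95% from the upper 5%
print(round(cutoff, 2))    # 123.29
```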

 Figure 4-11. Using normal distribution to draw conclusions about systolic BP in healthy adults.

Example 3: Continuing with Examples 1 and 2, suppose the health care provider does not know how many patients should be included in a study of the drug's effect. After some consideration, the provider decides that, 90% of the time, the mean systolic BP in the sample of patients must not rise above 122 mm Hg. How large a random sample is required so that 90% of the means in samples of this size will be 122 mm Hg or less?

Solution: The answer to this question requires determining n so that only 10% of the sample means exceed μ = 120 by 2 or more, that is, X̅ − μ = 2. The value of z in Table A–2 that divides the area into the lower 90% and the upper 10% is 1.28. Using z = 1.28 and solving for n yields

1.28 = 2/(10/√n), so √n = 1.28(10)/2 = 6.4 and n = 40.96

Thus, a random sample of 41 individuals is needed for a sampling distribution of means in which no more than 10% of the mean systolic BPs are above 122 mm Hg (see Figure 4-11C).

Example 4: A study by Gelber and colleagues (1997) found a mean heart rate variation of 49.7 beats/min with a standard deviation of 23.4 in 580 normal healthy subjects. What proportion of individuals can be expected to have a heart rate variation between 27 and 73, assuming a normal distribution?

Solution: This question involves individuals, and the critical ratio for individual values of X must be used. To simplify calculations, we round off the mean to 50 and the standard deviation to 23. The transformed values of the z distribution for X = 27 and X = 73 are

z₁ = (27 − 50)/23 = −1.00 and z₂ = (73 − 50)/23 = +1.00

The proportion of area under the normal curve between -1 and +1, from Table A–2, column 2, is 0.683. Therefore, 68.3% of normal healthy individuals can be expected to have a heart rate variation between 27 and 73 (Figure 4-12A).

Example 5: If repeated samples of six healthy individuals are randomly selected in the Gelber study, what proportion will have a mean heart rate variation between 27 and 73 beats/min?

Solution: This question concerns means, not individuals, so the critical ratio for means must be used to find appropriate areas under the curve. For X̅ = 27,

z = (27 − 50)/(23/√6) = −23/9.39 ≈ −2.5

Similarly, for X̅ = 73, z = +2.5. We must therefore find the area between -2.5 and +2.5. From Table A–2, the area is 0.988. Therefore, 98.8% of the area lies between ą2.5, and 98.8% of the mean heart rate variation values in samples with six subjects will fall between 27 and 73 beats/min (see Figure 4-12B).

Examples 4 and 5 illustrate the contrast between drawing conclusions about individuals and drawing conclusions about means.

Example 6: For 100 healthy individuals in repeated samples, what proportion of the samples will have mean values between 27 and 73 beats/min?

Solution: We will not do computations for this example; from the previous calculations, we can see that the proportion of means is very large. (The z values are ±10, which go beyond the scale of Table A–2.)

Example 7: What mean value of heart rate variation divides the sampling distribution for 16 individuals into the central 95% and the upper and lower 2.5%?

Solution: The value of z is ±1.96 from Table A–2. First we substitute -1.96 in the critical ratio to get

X̅ = μ + z(σ/√n) = 50 − 1.96(23/√16) = 50 − 1.96(5.75) = 38.73

Similarly, using +1.96 gives X = 61.27. Thus, 61.27 beats/min divides the upper 2.5% of the sampling distribution of heart rate variation from the remainder of the distribution, and 38.73 beats/min divides the lower 2.5% from the remainder (see Figure 4-12C).

Example 8: What size sample is needed to ensure that 95% of the sample means for heart rate variation will be within 3 beats/min of the population mean?

Solution: To obtain the central 95% of any normal distribution, we use z = 1.96, as in Example 7. Substituting 1.96 into the formula for z and solving for n yields

1.96 = 3/(23/√n), so √n = 1.96(23)/3 = 15.03 and n = 225.8

Thus, a sample of 226 individuals is needed to ensure that 95% of the sample means are within 3 beats/min of the population mean. Note that sample sizes are always rounded up to the next whole number.
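The sample-size calculation generalizes to n = (zσ/margin)², rounded up to the next whole number; a sketch:

```python
import math

# Solve 1.96 = margin / (sigma / sqrt(n)) for n, rounding up.
z, sigma, margin = 1.96, 23, 3
n = math.ceil((z * sigma / margin) ** 2)
print(n)  # 226
```

The same formula with z = 1.28, σ = 10, and margin = 2 reproduces the n = 41 from Example 3.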

 Figure 4-12. Using normal distribution to draw conclusions about heart rate variation in healthy adults.

These examples illustrate how the normal distribution can be used to draw conclusions about distributions of individuals and of means. Although some questions were deliberately contrived to illustrate the concepts, the important point is to understand the logic involved in these solutions. The exercises provide additional practice in solving problems of these types.

ESTIMATION & HYPOTHESIS TESTING

We discussed the process of making inferences from data in this chapter, and now we can begin to illustrate the inference process itself. There are two approaches to statistical inference: estimating parameters and testing hypotheses.

The Need for Estimates

Suppose we wish to evaluate the relationship between toxic reactions to drugs and fractures resulting from falls among elderly patients. For logistic and economic reasons, we cannot study the entire population of elderly patients to determine the proportion who have toxic drug reactions and fractures. Instead, we conduct a cohort study with a random sample of elderly patients followed for a specified period. The proportion of patients in the sample who experience drug reactions and fractures can be determined and used as an estimate of the proportion of drug reactions and fractures in the population; that is, the sample proportion is an estimate of the population proportion π.

In another study, we may be interested in the mean rather than the proportion, so the mean in the sample is used as an estimate of the population mean μ. For example, in a study of a low-calorie diet for weight loss, suppose the mean weight loss in a random sample of patients is 20 lb; this value is an estimate of the mean weight loss in the population of subjects represented by the sample.

Both the sample proportion and the sample mean are called point estimates because they involve a specific number rather than an interval or a range. Other point estimates are the sample standard deviation SD as an estimate of σ and the sample correlation r as an estimate of the population correlation ρ.

Properties of Good Estimates

A good estimate should have certain properties; one is that it should be unbiased, meaning that systematic error does not occur. Recall that when we developed a sampling distribution for the mean, we found that the mean of the mean values in the sampling distribution is equal to the population mean. Thus, the mean of a sampling distribution of means is an unbiased estimate. Both the mean and the median are unbiased estimates of the population mean μ. However, the sample standard deviation SD is not an unbiased estimate of the population standard deviation σ if n is used in the denominator. Recall that the formula for SD uses n – 1 in the denominator (see Chapter 3). Using n in the denominator of SD produces an estimate of σ that is systematically too small; using n – 1 makes the SD an unbiased estimate of σ.
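The systematic underestimate from using n in the denominator can be seen in a small simulation (a sketch under the assumption of sampling from a standard normal; the sample and replication counts are arbitrary choices of ours):

```python
import random
from statistics import pvariance, variance

random.seed(1)

# Draw many small samples from a standard normal (true variance = 1)
# and average the two competing variance estimates.
n_samples, n = 5000, 5
biased = unbiased = 0.0
for _ in range(n_samples):
    sample = [random.gauss(0, 1) for _ in range(n)]
    biased += pvariance(sample)    # divides by n
    unbiased += variance(sample)   # divides by n - 1

# The n-denominator average comes out near (n - 1)/n = 0.8,
# systematically too small; the n - 1 version averages near 1.
print(biased / n_samples, unbiased / n_samples)
```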

Another property of a good estimate is small variability from one sample to another; this property is called minimum variance. One reason the mean is used more often than the median as a measure of central tendency is that the standard error of the median is approximately 25% larger than the standard error of the mean when the distribution of observations is approximately normal. Thus, the median has greater variability from one sample to another, and the chances are greater, in any one sample, of obtaining a median value that is farther away from the population mean than the sample mean is. For this reason, the mean is the recommended statistic when the distribution of observations follows a normal distribution. (If the distribution of observations is quite skewed, however, the median is the better statistic, as we discussed in Chapter 3, because the median has minimum variance in skewed distributions.)

Confidence Intervals and Confidence Limits

Sometimes, instead of giving a simple point estimate, investigators wish to indicate the variability the estimate would have in other samples. To indicate this variability, they use interval estimates. A shortcoming of point estimates, such as a mean weight loss of 20 lb, is that they do not have an associated probability indicating how likely the value is. In contrast, we can associate a probability with interval estimates, such as the interval from, say, 15 to 25 lb. Interval estimates are called confidence intervals; they define an upper limit (25 lb) and a lower limit (15 lb) with an associated probability. The ends of the confidence interval (15 and 25 lb) are called the confidence limits.
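As a sketch of the idea, the following code computes an approximate 95% confidence interval for a mean weight loss; both the data and the use of the normal critical value 1.96 are assumptions for illustration (later chapters refine the choice of critical value):

```python
import statistics

# Hypothetical weight-loss data in pounds (invented for illustration).
losses = [18, 22, 25, 15, 20, 19, 24, 17, 21, 23]
n = len(losses)
mean = statistics.mean(losses)
se = statistics.stdev(losses) / n ** 0.5  # standard error of the mean

# Approximate 95% confidence limits using the normal value 1.96.
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(round(mean, 1), (round(lower, 1), round(upper, 1)))  # 20.4 (18.4, 22.4)
```

The interval (18.4, 22.4) conveys what the point estimate 20.4 alone cannot: how much the estimate would be expected to vary in other samples of the same size.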

Confidence intervals can be established for any population parameter. You may commonly encounter confidence intervals for the mean, proportion, relative risk, odds ratio, and correlation, as well as for the difference between two means, two proportions, and so on. Confidence intervals for these parameters will be introduced in subsequent chapters.

Hypothesis Testing

As with estimation and confidence limits, the purpose of a hypothesis test is to permit generalizations from a sample to the population from which it came. Both statistical hypothesis testing and estimation make certain assumptions about the population and then use probabilities to estimate the likelihood of the results obtained in the sample, given the assumptions about the population. Again, both assume a random sample has been properly selected.

Statistical hypothesis testing involves stating a null hypothesis and an alternative hypothesis and then performing a statistical test to decide which hypothesis the data support. Generally the goal is to reject the null hypothesis in favor of the alternative. Like the term “probability,” the term “hypothesis” has a more precise meaning in statistics than in everyday use, as we will see in the following chapters.

The next several chapters will help clarify the ideas presented in this chapter, because we shall reiterate the concepts and illustrate the process of estimation and hypothesis testing using a variety of published studies. Although these concepts are difficult to understand, they become easier with practice.

SUMMARY

This chapter focused on several concepts that explain why the results of one study involving a certain set of subjects can be used to draw conclusions about other similar subjects. These concepts include probability, sampling, probability distributions, and sampling distributions. We began with examples to illustrate how the rules for calculating probabilities can help us determine the distribution of characteristics in samples of people (eg, the distribution of blood types in men and women; the distribution of heart rate variation).

The addition rule, multiplication rule, and modifications of these rules for nonmutually exclusive and nonindependent events were also illustrated. The addition rule is used to add the probabilities of two or more mutually exclusive events. If the events are not mutually exclusive, the probability of their joint occurrence must be subtracted from the sum. The multiplication rule is used to multiply the probabilities of two or more independent events. If the events are not independent, they are said to be conditional; Bayes' theorem is used to obtain the probability of conditional events. Application of the multiplication rule allowed us to conclude that gender and blood type are independently distributed in humans. The site of infection, however, was not independent of the time during an epidemic at which an individual contracted serogroup B meningococcal disease.
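These rules can be illustrated with a small hypothetical table; the counts below are invented and are not those of Table 4-2:

```python
# Hypothetical 2x2 counts: gender by blood type O versus not O.
male_O, male_not_O = 45, 55
female_O, female_not_O = 45, 55
total = male_O + male_not_O + female_O + female_not_O

p_male = (male_O + male_not_O) / total   # marginal probability of male
p_O = (male_O + female_O) / total        # marginal probability of type O

# Multiplication rule for independent events: P(male and O) = P(male) x P(O).
p_male_and_O = p_male * p_O
# In this (contrived) table the product matches the observed cell proportion,
# so gender and blood type are independent here.
assert abs(p_male_and_O - male_O / total) < 1e-12

# Addition rule for events that are not mutually exclusive:
# P(male or O) = P(male) + P(O) - P(male and O).
p_male_or_O = p_male + p_O - p_male_and_O
print(p_male_and_O, p_male_or_O)
```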

The advantages and disadvantages of different methods of random sampling were illustrated for a study involving the measurement of tracheal diameters. A simple random sample was obtained by randomly selecting radiographs corresponding to random numbers taken from a random number table. Systematic sampling was illustrated by selecting every 17th x-ray film. We noted that systematic sampling is easy to use and is appropriate as long as there is no cyclical component to the data. Radiographs from different age groups were used to illustrate stratified random sampling. Stratified sampling is the most efficient method and is therefore used in many large studies. In clinical trials, investigators must randomly assign patients to experimental and control conditions (rather than randomly select patients) so that biases threatening the validity of the study conclusions are minimized.
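The three sampling schemes can be sketched in a few lines of code; the sampling frame, age strata, and sample sizes below are all hypothetical:

```python
import random

random.seed(7)

# Hypothetical sampling frame of 1000 radiograph IDs.
frame = list(range(1, 1001))

# Simple random sample of 60 radiographs.
simple = random.sample(frame, 60)

# Systematic sample: a random starting point, then every 17th film.
start = random.randrange(17)
systematic = frame[start::17]

# Stratified sample: hypothetical age-group strata, 15 films from each.
strata = {"0-9": frame[:250], "10-19": frame[250:500],
          "20-39": frame[500:750], "40+": frame[750:]}
stratified = {group: random.sample(ids, 15) for group, ids in strata.items()}

print(len(simple), len(systematic),
      sum(len(v) for v in stratified.values()))
```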

Three important probability distributions were presented: binomial, Poisson, and normal (gaussian). The binomial distribution is used to model events that have a binary outcome (ie, either the outcome occurs or it does not) and to determine the probability of outcomes of interest. We used the binomial distribution to obtain the probabilities that a specified number of men with localized prostate tumor survive at least 5 years.
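As an illustration of the binomial formula, the sketch below computes the probability of exactly k successes in n trials; the 0.80 survival proportion and the sample of 10 men are assumed here, not taken from the study:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical: if the 5-year survival proportion were 0.80, the
# probability that exactly 7 of 10 men survive at least 5 years is:
p7 = binom_pmf(7, 10, 0.80)
print(round(p7, 4))  # 0.2013
```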

The Poisson distribution is used to determine probabilities for rare events. In the CASS study of coronary artery disease, hospitalization of patients during the 10-year follow-up period was relatively rare. We calculated the probability of hospitalization for patients randomly assigned to medical treatment. Exercise 5 asks for calculations for similar probabilities for the surgical group.
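The Poisson probability of exactly k events when the mean count is λ is e^(−λ) λ^k / k!. A sketch with an assumed mean follows; the chapter's actual hospitalization rates are left to the text and Exercise 5:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean count is lam."""
    return exp(-lam) * lam ** k / factorial(k)

# Hypothetical mean of 2.5 hospitalizations per patient over follow-up.
lam = 2.5
print(round(poisson_pmf(5, lam), 3))  # 0.067
```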

The normal distribution is used to determine the probability of characteristics measured on a continuous numerical scale. When the distribution of the characteristics is approximately bell-shaped, the normal distribution can be used to show how representative or extreme an observation is. We used the normal distribution to determine percentages of the population expected to have systolic BPs above and below certain levels. We also found the level of systolic BP that divides the population of normal, healthy adults into the lower 95% and the upper 5%.
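A sketch of these normal-distribution calculations, using assumed blood pressure values rather than the chapter's actual figures:

```python
from statistics import NormalDist

# Hypothetical: suppose systolic BP in healthy adults is approximately
# normal with mean 120 mm Hg and standard deviation 10 mm Hg.
bp = NormalDist(mu=120, sigma=10)

# Proportion of the population expected above 140 mm Hg (z = 2).
print(round(1 - bp.cdf(140), 3))   # 0.023

# The BP level dividing the lower 95% from the upper 5%.
print(round(bp.inv_cdf(0.95), 1))  # 136.4
```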

We emphasized the importance of the normal distribution in making inferences to other samples and discussed the sampling distribution of the mean. If we know the sampling distribution of the mean, we can observe and measure only one random sample, draw conclusions from that sample, and generalize the conclusions to what would happen if we had observed many similar samples. Relying on sampling theory saves time and effort and allows research to proceed.

We presented the central limit theorem, which says that the distribution of the mean follows a normal distribution, regardless of the shape of the parent population, as long as the sample sizes are large enough. Generally, a sample of 30 observations or more is large enough. We used the values of heart rate variation from a study by Gelber and colleagues (1997) and values of BP from the Society of Actuaries (1980) to illustrate use of the normal distribution as the sampling distribution of the mean.
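The theorem can be checked by simulation: draw samples of size 30 from a clearly skewed population and examine the distribution of their means (an illustration, not from the text):

```python
import random
import statistics

random.seed(3)

# A clearly skewed parent population: exponential with mean 1.
def skewed_draw():
    return random.expovariate(1.0)

# Distribution of means of 5000 samples, each of size 30.
means = [statistics.mean(skewed_draw() for _ in range(30))
         for _ in range(5000)]

# The means center on the population mean (1.0), and their spread is
# close to sigma / sqrt(n) = 1 / sqrt(30), about 0.18.
print(round(statistics.mean(means), 2))
print(round(statistics.stdev(means), 2))
```

Plotting `means` as a histogram would show the familiar bell shape even though the parent population is strongly skewed to the right.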

Estimation and hypothesis testing are two methods for making inferences about a value in a population of subjects by using observations from a random sample of subjects. In subsequent chapters, we illustrate both confidence intervals and hypothesis tests. We also demonstrate the consistency of conclusions drawn regardless of the approach used.

EXERCISES

1.

a. Show that gender and blood type are independent; that is, that the joint probability is the product of the two marginal probabilities for each cell in Table 4-2.

b. What happens if you use the multiplication rule with conditional probability when two events are independent? Use the gender and blood type data for males, type O, to illustrate this point.

2. The term “aplastic anemia” refers to a severe pancytopenia (anemia, neutropenia, thrombocytopenia) resulting from an acellular or markedly hypocellular bone marrow. Patients with severe disease have a high risk of dying from bleeding or infections. Allogeneic bone marrow transplantation is probably the treatment of choice for patients under 40 years of age with severe disease who have a human leukocyte antigen (HLA)-matched donor.

Researchers reported results of bone marrow transplantation into 50 patients with severe aplastic anemia who did not receive a transfusion of blood products until just before the marrow transplantation (Anasetti et al, 1986). The probability of 10-year survival in this group of nontransfused patients was 82%; the survival rate was 43–50% for patients studied earlier who had received multiple transfusions. Table 4-12 gives the incidence of acute graft-versus-host disease, chronic graft-versus-host disease, and death in subgroups of patients defined according to serum titers of antibodies to cytomegalovirus from this study. Use the table to answer the following questions.

a. What is the probability of chronic graft-versus-host disease?

b. What is the probability of acute graft-versus-host disease?

c.  If a patient seroconverts, what is the probability that the patient has acute graft-versus-host disease?

d. How likely is it that a patient who died was seropositive?

e. What proportion of patients was seronegative? If this value were the actual proportion in the population, how likely would it be for 4 of 8 new patients to be seronegative?

3. Refer to Table 4-1 on the 150 patients in the pre-epidemic time period for the development of serogroup B meningococcal disease. Assume a patient is selected at random from the patients in this study.

a. What is the probability a patient selected at random had sepsis as the only site of infection?

b. What is the probability a patient selected at random had sepsis as one of the sites of infection?

c.  If race and sex are independent, how many of the white patients can be expected to be male?

Table 4-12. Incidence of graft-versus-host disease.

| Condition | Sero-Negative^a | Sero-Converters^b | Sero-Positive^c |
| --- | --- | --- | --- |
| Acute graft-versus-host disease | 6/17 | 2/18 | 2/12 |
| Chronic graft-versus-host disease | 7/14 | 3/18 | 2/10 |
| Death | 3/7 | 3/18 | 2/12 |

^a Patients who had titers of less than 1:8 before transplant and never showed consistent titer increases. One patient received marrow from a cytomegalovirus-seropositive donor and 16 patients from seronegative donors.

^b Initially seronegative patients who became seropositive within 100 days after transplant. Six patients received marrow from cytomegalovirus-seropositive donors and 10 from cytomegalovirus-seronegative donors. Serum titers in 2 donors were not determined for antibodies to cytomegalovirus.

^c Patients with titers of more than 1:8 before transplant. Within this group, seven patients had fourfold increases in serum titers of antibodies to cytomegalovirus and one other patient showed cultures of virus within 3 months of transplantation. Two of the eight patients developed acute graft-versus-host disease, one had chronic graft-versus-host disease, and one died.

Source: Adapted and reproduced, with permission, from Anasetti C, Doney KC, Storb R, Meyers JD, Farewell VT, Buckner CD, et al: Marrow transplantation for severe aplastic anemia. Ann Intern Med 1986;104:461–466.

4. A plastic surgeon wants to compare the number of successful skin grafts in her series of burn patients with the number in other burn patients. A literature survey indicates that approximately 30% of the grafts become infected but that 80% survive. She has had 7 of 8 skin grafts survive in her series of patients and has had one infection.

a. How likely is only 1 out of 8 infections?

b. How likely is survival in 7 of 8 grafts?

5. Use the Poisson distribution to estimate the probability that a surgical patient in the CASS study would have five hospitalizations in the 10 years of follow-up reported by Rogers and coworkers (1990). (Recall that the 390 surgical patients had a total of 1487 hospitalizations.) Compare this estimate to that for patients treated medically.

6. The values of serum sodium in healthy adults approximately follow a normal distribution with a mean of 141 mEq/L and a standard deviation of 3 mEq/L.

a. What is the probability that a normal healthy adult will have a serum sodium value above 147 mEq/L?

b. What is the probability that a normal healthy adult will have a serum sodium value below 130 mEq/L?

c.  What is the probability that a normal healthy adult will have a serum sodium value between 132 and 150 mEq/L?

d. What serum sodium level is necessary to put someone in the top 1% of the distribution?

e. What serum sodium level is necessary to put someone in the bottom 10% of the distribution?

7. Calculate the binomial distribution for each set of parameters: n = 6, π = 0.1; n = 6, π = 0.3; n = 6, π = 0.5. Draw a graph of each distribution, and state your conclusions about the shapes.

8.

a. Calculate the mean and the standard deviation of the number of months since a patient's last office visit from Table 4-10.

b. Calculate the mean and the standard deviation of the sampling distribution of the mean number of months from Table 4-11. Verify that the standard deviation in the sampling distribution of means (SE) is equal to the standard deviation in the population (found in part a) divided by the square root of the sample size, 2.

9. Assume that serum chloride has a mean of 100 mEq/L and a standard deviation of 3 in normal healthy populations.

a. What proportion of the population has serum chloride levels greater than 103 and less than 97 mEq/L?

b. If repeated samples of 36 were selected, what proportion of them would have means less than 99 and greater than 101 mEq/L?

10.    The relationship between alcohol consumption and psoriasis is unclear. Some studies have suggested that psoriasis is more common among people who are heavy alcohol drinkers, but this opinion is not universally accepted. To clarify the nature of the association between alcohol intake and psoriasis, Poikolainen and colleagues (1990) undertook a case–control study of patients between the ages of 19 and 50 who were seen in outpatient clinics. Cases were men who had psoriasis, and controls were men who had other skin diseases. Subjects completed questionnaires assessing their life styles and alcohol consumption for the 12 months before the onset of disease and for the 12 months immediately before the study. Use the information in Table 4-13 on the frequency of intoxication among patients with psoriasis.

a. What is the probability a patient selected at random from the group of 131 will be intoxicated more than twice a week, assuming the standard deviation is the actual population value σ? Hint: Remember to convert the standard error to the standard deviation.

b. How many times a year would a patient need to be intoxicated in order to be in the top 5% of all patients?

Table 4-13. Alcohol intake (g/day) and frequency of intoxication (times/year) before onset of skin disease among patients with psoriasis and controls.

| Measure | Mean | SEM | Number of Cases | P value^a |
| --- | --- | --- | --- | --- |
| **Alcohol intake** | | | | |
| Patients with psoriasis | 42.9 | 7.2 | 142 | 0.004 |
| Controls | 21.0 | 2.1 | 265 | |
| **Frequency of intoxication** | | | | |
| Patients with psoriasis | 61.6 | 6.2 | 131 | 0.007 |
| Controls | 42.6 | 3.3 | 247 | |

^a Two-sided t-test; separate variance estimate.

Source: Reproduced with permission from Table III in Poikolainen K, Reunala T, Karvonen J, Lauharanta J, Karkkaimen P: Alcohol intake: A risk factor for psoriasis in young and middle-aged men? Br Med J 1990;300:780–783.

11.    The Association of American Medical Colleges reported that the debt in 2002 for graduates from U.S. medical schools was: mean \$104,000 and median \$100,000; 5% of the graduates had a debt of \$200,000 or higher. Assuming debt is normally distributed, what is the approximate value of the standard deviation?

Footnote

aThe probability of three or more events that are not mutually exclusive or not independent involves complex calculations beyond the scope of this book. Interested readers can consult any introductory book on probability.
