PRESENTING PROBLEMS
Presenting Problem 1
Fellowship and residency program directors may want to know which attributes of their program attract applicants. It can be difficult for applicants to choose among programs, and certain factors are more important to some applicants than others. Program directors want to highlight the features of their program that are attractive to the kinds of applicants they would like to have. Caiola and Litaker (2000) wanted to learn more about the factors that appeal to internal medicine residents when choosing a fellowship in general internal medicine (GIM). A group of faculty at their institution developed a 36-item questionnaire to learn how the interview process, program location, and family considerations affected fellowship selection. We discuss how Caiola and Litaker selected the sample of people to complete their questionnaire, how they designed the questionnaire, and their findings. Data are on the CD-ROM in a folder called “Caiola.”
Presenting Problem 2
Numerous studies have demonstrated the protective effect of exercise against many chronic diseases including heart disease, hypertension, diabetes mellitus, stroke, and osteoporosis. Physicians have an important role in educating their patients about the benefits of regular exercise.
Rogers and colleagues (2002) wanted to determine the percentage of internal medicine residents who counseled a majority of their patients about exercise and factors that influenced their exercise counseling behavior. They sent a self-administered questionnaire to all internal medicine residents at six U.S. training programs. Data collected included demographic information, counseling practices, perceived exercise benefits, attitude toward counseling, barriers to counseling, and personal exercise habits. We discuss the response rate they obtained in their study, how they selected the sample, and their conclusions, using data on the CD-ROM in a folder called “Rogers.”
Presenting Problem 3
A study by Lapidus and his colleagues (2002) was a presenting problem in Chapters 3 and 6. Recall that they undertook a survey to assess domestic violence (DV) education and training and whether pediatricians and family physicians screen emergency department patients for DV. Domestic violence was defined as “past or current physical, sexual, emotional, or verbal harm to a woman caused by a spouse, partner, or family member.” Please see Chapters 3 and 6 for more detail. We describe how they obtained their sample and some of their findings using data in the folder called “Lapidus.”
Presenting Problem 4
Urinary tract infections (UTIs) are among the most common bacterial infections, accounting for 7 million outpatient visits and 1 million hospital admissions in the United States each year. They are usually caused by Escherichia coli and are often treated empirically with trimethoprim–sulfamethoxazole. Recent guidelines from the Infectious Disease Society of America recommend this drug as standard therapy with specific fluoroquinolones as second-line choices. Two important factors in the selection of treatment are the health care costs of drug therapy and emergence of resistance to trimethoprim–sulfamethoxazole among E. coli strains causing UTIs. Huang and Stafford (2002) used survey data from the National Ambulatory Medical Care Survey (NAMCS) to assess the demographics and clinical characteristics of women who visit primary care physicians and specialists for UTIs.
THE RESEARCH QUESTIONS
With a little common sense and the information from this chapter, you can develop questionnaires to evaluate courses, learn student preferences in an educational program, and support other undertakings in which you want to gather information locally. For more extensive projects, especially those related to a research project, you may wish to consult a biostatistician, epidemiologist, or other professional with survey expertise.
Everyone knows about surveys. In fact, the average person's opinion of statistics is based largely on the results of polls and marketing surveys. Many people associate surveys with the U.S. census, which occurs every 10 years, but a census is not really a survey. A census queries everyone in the population, whereas a survey queries only a sample from the population. Survey research became more prominent after World War II as manufacturers began to market products to the public and hired marketing experts to learn what would make their products sell. At the same time, political pollsters began to have an increasing presence, gauging public opinion about specific issues and learning which candidate was leading during an election campaign. Today surveys seem omnipresent, and, as a result, many people think that survey research is easy—just design some questions and get some people to answer them. The purpose of this chapter is to acquaint you with some of the issues in designing, administering, and interpreting surveys.
The term poll is generally associated with political polls, but it simply means asking questions in order to learn information or opinions. The Gallup Organization and the Roper Center are two of the largest polling organizations in the United States. They both use proper methods to design surveys and to select random samples of respondents.
A large number of excellent resources exist for those who want to design a survey. We have selected some salient issues to discuss in this chapter, and we have used a number of resources which you may wish to consult if you desire further information: Dillman (2000), Fink and Kosecoff (1998), Litwin and Fink (1995), Korn and Graubard (1999), and Rea and Parker (1997).
Determining the Research Question
The need to know the answer to a question, such as those posed in the Presenting Problems, generally motivates survey research. As with any other research question, a crucial first step is to review the literature to learn what is known about the topic and what methods have been used previously.
It may be difficult to specify the precise issues that should be addressed in a survey. For instance, suppose a researcher wants to know how health care workers view the elderly, and after performing a MEDLINE search, can find little information on the subject. In this situation, focus groups may help refine the issues. Focus groups are interviews of a small number of people, generally 6–10, rather than individual interviews, and are done in person or over the telephone. One can get a great deal of information during a focus group session by obtaining in-depth responses to a few general questions. For information about focus groups, see Kruger and Casey (2000) and Morgan (1997). Of course, asking an expert, if one is available, is always advisable—in developing a research question as well as in reviewing a draft questionnaire.
Decide on the Survey Method
Most surveys use either self-administered questionnaires (completed in person or via mail or email) or interviews (conducted in person or over the telephone). Advantages and disadvantages exist for each method, some of which are illustrated in Table 11-1.
When interviews are used for survey research, they are often called structured interviews, in which the same questions are administered to each subject in the same order, and no coaching is permitted. In other words, the researcher tries to make the questions and the manner in which they are asked identical for all persons being interviewed. In other situations, less-structured interviews that permit the investigator to probe areas can yield rich results, although the data do not lend themselves to the quantitative methods we have used in this book. A famous example was the pioneering work in human sexuality by Alfred Kinsey in the 1940s in which he interviewed many people to learn about their sexual habits. Book 4 in the Survey Kit by Litwin and Fink (1995) is devoted to how to conduct interviews in person and by telephone.
The majority of the studies in the medical literature use self-administered questionnaires, as did the investigators in the first three Presenting Problems. Book 3 in the Survey Kit by Litwin and Fink (1995) discusses how to conduct self-administered and mail surveys. Dillman (2000) covers email and Internet surveys. Researchers are also turning to national data bases to answer survey questions, such as the study by Huang and Stafford (2002).
Surveys are generally thought of as cross-sectional, measuring the current situation, and certainly most surveys fit this category. Questionnaires are used in cohort and case–control studies as well, however. The Nurses' Health Study is a good example; this study began in the late 1970s with follow-up questionnaires mailed to over 115,000 nurses every 2 years (Colditz et al, 1997).
Table 11-1. Advantages and disadvantages of different survey methods.
Developing the Questions
Many ways to ask questions exist, and survey designers want to state questions so that they are clear and obtain the desired information. We discuss a number of examples, some of which are modeled on the text by Dillman (2000) and subsequently modified to reflect medical content by Dr. Laura Q. Rogers.
Decide on the format of questions: open-ended versus closed
Open-ended questions are ones that permit the subject to respond in his or her own words. They are sometimes appropriate, as shown in Table 11-2, such as when the topic has not been studied before and it is important to obtain answers that have not been cued by any listed responses. Many researchers analyze the content of the answers and try to classify them into categories. A primary advantage of open-ended questions is the capability to report some of the prototypic answers using a subject's own words. Closed-response questions are more difficult to write, but their payoff is the ease with which the answers can be analyzed and reported. Many of the statistical methods discussed in this text can be used to analyze survey responses. A compromise approach can be taken by using a set of responses gleaned from the literature or focus groups, but also adding an opportunity for subjects to write in a brief answer if none of the listed responses fits. These questions generally have an "Other—please explain" option at the end of the response set.
Table 11-2. Open versus closed questions.
An example of the same question in open and in closed form is given in Box 11-1. Does the closed response set have an answer with which you would be comfortable?
Decide on the scale for the answer
If closed questions are used, the researcher needs to decide the level of detail required in the answers. The scale helps determine which methods can be used to analyze the results. Sometimes knowing only whether a subject agrees with the question is sufficient (nominal), such as whether a large city location is important in the choice of a training program (Box 11-2). An ordinal scale may be used if knowing gradations of importance is desired. Alternatively, an ordinal scale giving a choice of city sizes provides more specific information. At the other end of the spectrum is an open-ended question in which the subject provides a number (numerical). Depending on the question, we generally recommend that researchers collect data in as much detail as is reasonable; for instance, it is always possible to form categories of age for tables or graphs later (see Chapter 3). On the other hand, too much detail is not necessary and leads to unreliable data. For the question about city size, we would opt for one of the two ordinal response sets.
Balancing positive and negative categories
Questions using an ordinal response should provide as many positive as negative options. Box 11-3 shows two sets of options regarding students' satisfaction with the Introduction to Clinical Medicine course. The first set places the neutral answer in the fourth position and provides only one answer in which students can express dissatisfaction. The second set is more likely to provide valid responses, with the neutral answer in the middle position and the same number of positive and negative options.
Box 11-1. OPEN VERSUS CLOSED QUESTIONS.
Avoid vague qualifiers
What are the potential problems in using terms like sometimes, often, and rarely? Box 11-4 illustrates two questions that ask about the amount of alcohol consumed weekly. The choices specifying actual numbers of drinks are much less ambiguous and will be easier to interpret.
Use mutually exclusive categories
In filling out a questionnaire, how many times have you encountered a question in which you don't understand the issue being addressed? This may occur because the options are not mutually exclusive. For example, the question in Box 11-5 is confusing because it mixes educational activities and sources of information. The easiest solution is to form two questions, each of which has mutually exclusive answers.
Potentially objectionable questions
Some questions are viewed as very personal, and people hesitate to divulge the information. Examples include income, sexual activity, and personal habits; see Box 11-6. These questions can be dealt with in two ways. First is to soften the manner in which the question is asked, perhaps by asking for less detail; it is generally better to obtain an approximate answer than none at all. Second is placement in the questionnaire itself; some survey experts recommend placement near the end of the questionnaire, after people have become comfortable with answering questions.
Box 11-2. CHOOSING THE SCALE FOR THE QUESTION.
Check all-that-apply items
Items in which subjects can choose as many options as they wish are often used when survey questions ask about qualities of a product or why consumers selected a specific model, school, or service. They do not force the subject to single out one best feature. They can, however, be tricky to analyze. The best approach is to treat each option as a yes/no variable and calculate the percentage of respondents who selected each one. Rogers and colleagues (2002) took this approach when they wanted to know which questions physicians routinely ask of patients who are not exercising adequately. Box 11-7 shows how they used the responses ask/don't ask for each question; a sketch of this analysis follows Box 11-7.
Box 11-3. BALANCED RESPONSES.
Box 11-4. AVOIDING VAGUE QUALIFIERS.
Box 11-5. NONHOMOGENEOUS AND MUTUALLY EXCLUSIVE CATEGORIES.
Using “don't know”
No consensus exists on the use of a "Don't Know" or "Undecided" category. Some researchers do not like to provide an opportunity for a subject not to commit to an answer. Others point out that not providing this type of option forces respondents to indicate opinions they do not really hold. If these categories are used, placing them at the end of the response set rather than in the middle has been shown to increase the completion rate of questions by 9%.
Ranking versus rating
In general, subjects find ranking scales confusing when they are asked to rank a number of choices from 1 through the total number of options. We believe that ranking options is fine when the researcher wants to know which choice is viewed as the "best" by most subjects. Selecting the top three choices also works well; beyond that, however, many people begin to have difficulty. For instance, some people omit one number or rank two options with the same number. Furthermore, although most people can discriminate among a limited number of choices, they have trouble when the list becomes long. Rating scales have difficulties as well, including the tendency for responses to cluster at one end of the scale.
Box 11-6. SOFTENING POTENTIALLY OBJECTIONABLE QUESTIONS.
Summary of suggestions for writing questions
Responses will be more valid and response rates will be higher if user-friendly questions are used on a survey. User-friendly questions have clear instructions with short, specific questions and are stated in a neutral manner. They avoid jargon and abbreviations that may be confusing, check-all-that-apply questions, and questions with too much detail (eg, how many minutes do you spend…).
Box 11-7. ILLUSTRATION OF YES/NO SCALES INSTEAD OF CHECK-ALL-THAT-APPLY.
For your patients who you believe do NOT exercise adequately, about which of the following aspects do you ask routinely?
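As a minimal sketch of that analysis (the item names and 0/1 responses below are hypothetical illustrations, not the Rogers data), each option is coded as its own yes/no variable and the percentage of respondents who selected it is reported:

```python
# Treat each check-all-that-apply option as a separate yes/no variable and
# report the percentage of respondents who checked it.
# Item names and response values are hypothetical.
responses = [
    {"current_activity": 1, "past_exercise": 0, "barriers_to_exercise": 1},
    {"current_activity": 1, "past_exercise": 1, "barriers_to_exercise": 0},
    {"current_activity": 0, "past_exercise": 0, "barriers_to_exercise": 1},
]

options = ["current_activity", "past_exercise", "barriers_to_exercise"]
n = len(responses)
for option in options:
    pct = 100 * sum(r[option] for r in responses) / n
    print(f"{option}: asked by {pct:.0f}% of respondents")
```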
Issues Regarding Scales
If questions with ordinal scales are used, the researcher must decide among different options. Likert scales are commonly used on many surveys. They allow answers that range from "strongly disagree" to "strongly agree," from "most important" to "least important," and so on. We used Likert scales in the questions in Boxes 11-2 and 11-3 and listed the options vertically. When several questions use the same scale, the options may be listed in a row. Box 11-8 shows an excerpt from the survey by Caiola and Litaker (2000) in which several questions are asked about the quality of a fellowship. Note that the researchers provide an opportunity for respondents to add another quality if they wish.
No consensus exists in the survey literature about the number of categories to use. An even number of categories, such as in the first question in Box 11-9, forces a respondent to choose some level of importance or unimportance, even if he or she is totally neutral. Our personal preference is for an odd number of categories, either five or seven, as illustrated in the second and third questions in Box 11-9. The choice of the number of categories can have a major effect on conclusions.
QUESTIONNAIRE LAYOUT
No hard and fast rules dictate the way a questionnaire is formatted, but some general, common-sense guidelines apply. The first issue is length; although shorter surveys are generally preferable to longer surveys, Dillman (2000, page 305) cites several studies that show that questionnaires up to four pages in length have similar response rates.
Box 11-8. ILLUSTRATION OF LIKERT SCALE.
Perceived Quality of a Fellowship Program:
Q-3. How important were the following factors as you decided on your fellowship program? (Please circle ONE number, with 1 being Not Very Important and 5 being Very Important)
Well-designed questions place instructions where needed, even when it means repeating the instructions at the top of a continuation page. Avoid skipping or branching questions if possible; if not, use directional arrows and other visual guides to assist the subject. Most researchers opt to place easier questions first and to list questions in logical order so that questions about the same topic are grouped together. When scales are used, subjects are less confused when the scale direction is listed in a consistent manner, such as from very important to very unimportant. Although no consensus exists on whether to place demographic items at the beginning or end of the questionnaire, we opt for the latter. Lapidus and colleagues (2002) asked demographic questions at the beginning of their questionnaire, and Rogers and colleagues (2002) placed demographics at the end in a section called “The next group of questions is about you.”
RELIABILITY AND VALIDITY OF SURVEY INSTRUMENTS
We advise researchers to use existing questionnaires or instruments if possible. Not only does this save time and effort in developing the questionnaire, it also avoids the need to establish the reliability and validity of the instrument. Furthermore, it is possible to make more direct comparisons between different studies if they use the same instrument. If an existing instrument meets 80% or more of the needs of a researcher, we recommend using it; additional questions that are deemed to be crucial can be included on a separate page.
Ways to Measure Reliability
We briefly discussed intra- and interrater reliability in Chapter 5 when we introduced the kappa statistic as a measure of agreement between observers. Questionnaires need to be reliable as well. The basic question answered by reliability is how reproducible the findings would be if the same measurement were repeatedly made on the same subject.
Five different types of reliability are listed in Table 11-3, along with methods to measure each type. An instrument's capacity to provide the same measurement on different occasions is called the test–retest reliability. Because it is difficult to administer the same instrument to the same people on more than one occasion, questionnaire developers and testing agencies often use internal consistency reliability as an estimate of test–retest reliability. Reliability as measured in these ways can be thought of as the correlation between two scores; it ranges from 0 to 1.00; an acceptable level of reliability is 0.80 or higher.
The internal consistency reliability of the items on the instrument or questionnaire indicates how strongly the items are related to one another; that is, whether they are measuring a single characteristic. Testing agencies that create examinations, such as the SAT or National Board USMLE Examinations, sometimes refer to the internal consistency reliability as Cronbach's alpha. (Note that this is not the same alpha as we use to measure a type I error in hypothesis testing.) Rogers and colleagues (2002) developed their questionnaire from previously used questionnaires, so they needed to establish the reliability of the specific instrument they used. They reported a Cronbach's alpha of 0.87 for the nine questions that measured confidence in providing exercise counseling, indicating that, if the questionnaire were given to the same subjects twice, the correlation between their confidence scores would be approximately 0.87. Testing agencies sometimes use alternative forms of a test in which the items on the tests differ, but they measure the same thing and have the same level of difficulty. Alternative forms reliability is more of an issue with tests than with questionnaires.
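Cronbach's alpha can be computed directly from the item responses as alpha = (k/(k - 1)) x (1 - sum of the item variances / variance of the total score), where k is the number of items. The sketch below applies this standard formula to a few hypothetical Likert responses; it is an illustration only, not the Rogers data.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects x k_items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of five subjects to four Likert items (1-5 scale)
scores = [[4, 5, 4, 4],
          [2, 2, 3, 2],
          [5, 5, 5, 4],
          [3, 3, 2, 3],
          [4, 4, 4, 5]]
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```

With these hypothetical scores the items are highly consistent; as noted above, values of 0.80 or higher are generally considered acceptable.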
Box 11-9. NUMBER OF RESPONSE OPTIONS.
Our discussion of intra- and interrater reliability in Chapter 5 referred to nominal measurement, such as positive vs negative. If the measurement is numerical, such as the score on a test or a section of a questionnaire, intra- and interobserver agreement is measured by the intraclass correlation coefficient. The ordinary correlation coefficient is sensitive only to random error, also called noise, in a measurement. The intraclass correlation, however, is sensitive to both random error and systematic error, also called statistical bias. Statistical bias occurs when, for example, one observer consistently scores candidates higher than another observer by the same amount. More details about psychological measurement can be found in classic texts on measurement, such as the one by Anastasi and Urbina (1997).
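To make the distinction concrete, the sketch below (ours, not from any of the studies) uses hypothetical ratings in which a second observer scores every subject exactly 15 points higher than the first. The Pearson correlation is a perfect 1.0, but an intraclass correlation computed for absolute agreement (the common two-way ICC(2,1) form) is much lower because it penalizes the systematic difference.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings is an (n_subjects x k_raters) array."""
    y = np.asarray(ratings, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ms_rows = k * ((y.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((y.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0, keepdims=True) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err
                                 + k * (ms_cols - ms_err) / n)

rater_a = np.array([70.0, 75, 80, 85, 90, 95])
rater_b = rater_a + 15        # systematic bias: always 15 points higher
pearson = np.corrcoef(rater_a, rater_b)[0, 1]
icc = icc_2_1(np.column_stack([rater_a, rater_b]))
print(f"Pearson r = {pearson:.2f}, ICC(2,1) = {icc:.2f}")   # r = 1.00, ICC well below 1
```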
Validity of Measurements
Validity is a term for how well an instrument (or measurement procedure) measures what it purports to measure. The issue of validity motivates many questions about measuring specific characteristics in medicine, such as how accurate the arthritis functional status scale is for indicating a patient's level of activity; how accurate the National Board Examination is in measuring students' knowledge of medicine; or, in the case of questionnaires, how well the Medical Outcomes Study Short Form 36 (MOS SF-36), discussed in Chapter 2, measures quality of life.
Commonly used measures of validity are content, face, criterion, and construct validity. For a test or questionnaire, content validity indicates the degree to which the items on the instrument are representative of the knowledge being tested or the characteristic being investigated. Face validity refers to the degree to which a questionnaire or test appears to be measuring what it is supposed to measure. In other words, a questionnaire about domestic violence training should have questions related to that issue.
Table 11-3. Types of reliability.
Criterion validity refers to the instrument's capacity to predict another characteristic (the criterion) that is associated with the characteristic being measured. For example, the criterion validity of the MOS SF-36 could be measured by comparing its quality-of-life score with subject interviews and physical examinations. The criterion may indicate either a concurrent or a future state. Ideally, criterion validity is established by comparing the measurement to a gold standard, if one exists. The final type of validity, construct validity, is more abstract and difficult to define. It consists of demonstrating that the instrument is related to other instruments that assess the same characteristic and not related to instruments that assess other characteristics. It is generally established by using several instruments or tests on the same group of individuals and investigating the pattern of relationships among the measurements (Table 11-4).
ADMINISTRATION OF SURVEYS
Several issues related to the design and administration of questionnaires, along with recommendations from the survey research literature, are listed in Table 11-5.
Pilot Testing
Almost everyone who has consulted with one of the authors (BD) on questionnaire design has lamented, once the results are in, that they wish they had included another question, or asked a question in a different way, or provided other response options, and so on. It is almost impossible to carry out a perfect survey, but many problems can be caught by pilot testing the instrument. A pilot test is carried out after the questionnaire is designed but before it is printed or prepared for administration. Pilot testing may reveal that the reading level is not appropriate for the intended subjects, that some questions are unclear or are objectionable and need to be modified, or that instructions are unclear. A large sample is not required to pilot test an instrument; it is more important to choose people who will provide feedback after completing the questionnaire. In most cases, the best subjects will not be your friends or colleagues, unless, of course, they are representative of the group who will receive the survey. If a large number of changes are required, it may be necessary to repeat the pilot test with another group of subjects.
Table 11-4. Types of validity.
Response Rates
Lapidus and colleagues sent questionnaires to 525 pediatricians and 378 family practitioners identified from lists obtained from the Connecticut chapters of the American Academy of Pediatrics and the American Academy of Family Physicians. After the first mailing, they received responses from 24% of the physicians. Had they stopped at this point, what level of confidence would you have in their findings? Would you question whether the 24% who responded were representative of all physicians who received the survey?
A high response rate increases our confidence in the validity of the results and the likelihood that the results will be used. Lapidus and colleagues sent out two additional mailings and ultimately increased their response rate to 49%. If demographic information on the sampled group is available, researchers can compare characteristics, such as sex, age, and other important variables, between those who did and did not return the questionnaire to learn whether the distributions are similar. Unfortunately, Lapidus and colleagues did not have this type of information available on the physicians who did not respond.
Many suggestions on ways to go about follow-up have been made. The consensus indicates that the best response occurs with three to four follow-ups at approximately 2-week intervals. Each subsequent follow-up has a smaller yield, so it may be wise to stop with fewer follow-ups. Lapidus and colleagues (2002) received an additional 19% and 5% response, respectively, with their second and third mailings.
Chances are, a fourth mailing would have little yield. An alternative to an additional follow-up questionnaire is to send reminder postcards 1–2 weeks after the initial mailing; they may increase responses by 3–4%. For truly intractable nonresponders, telephone follow-up with a shortened questionnaire containing only key questions is worthwhile if resources permit.
Table 11-5. Summary of issues related to administration of questionnaires.
Colleagues and students often ask what constitutes a good response rate. Far too many surveys report response rates less than 40%. With effort, it should be possible to obtain responses of 50% or more. If people are very interested in the topic, response rates often approach 70%, but it is possible to obtain as much as 85% or more in a highly selected sample or a “captive population.” For this reason, some teachers require students to complete a questionnaire before receiving a grade; in continuing-education courses, someone often stands at the door to collect the course evaluations as attendees leave. Caiola and Litaker (2000) sent surveys to fellowship directors to distribute to their fellows and obtained a 75% response rate. Mail surveys of physicians, on the average, have a response rate of approximately 60% (Cummings et al, 2001). The documentation for the National Ambulatory Medical Care Survey data base used by Huang and Stafford (2002) had annual response rates ranging from 68% to 74%.
Advance Notification
Many survey specialists recommend notifying the people who are to receive the survey in advance of its administration. Prenotification may be done by letter, telephone, or, increasingly, email. The prenotification should include information on who is doing the survey, what its purpose is, why the subject has been selected to receive the survey, how the results will be used, whether responses will be anonymous, and when the questionnaire will be mailed (or emailed) or the interview scheduled. Prenotification has been reported to increase response rates by 7–8%.
Cover Letters and Return Envelopes
If a questionnaire is administered directly, such as to a group of physicians in a continuing-education course, it is not necessary to include a cover letter. Otherwise, one is essential. Cover letters should be short, relevant, on letterhead, and signed. Information on the letter includes the purpose of the survey, why the recipient's response is important, and how the data will be used. Some researchers offer to share the results. To maintain anonymity, a separate postcard can be included for the recipient to return indicating a desire for a copy of the results. More questionnaires are returned when a stamped envelope is included; this practice is reported to increase response rates by 6–9%.
Incentives
Providing incentives to complete a questionnaire is controversial, but research has shown that it may increase response rates more than any other single action, except repeated follow-ups. Monetary incentives, even modest ones of a few dollars, are reported to increase responses by 16–30% and nonmonetary incentives by up to 8%. Response rates are similar, regardless of whether the incentive is sent with the questionnaire or as a reward after it is returned. Material incentives also increase response rates, but by only about half as much, and lotteries or chances to win tickets or prizes have relatively little effect. Dillman (2000) provides an extensive discussion of these and other issues related to improving response rates.
Anonymity and Confidentiality
The cover letter should contain information on anonymity or confidentiality. Depending on the purpose and sensitivity of the questionnaire, it may be advisable to make the returns completely anonymous. The researcher can still keep track of who returns the questionnaire by asking the responder to mail a separate postcard at the time he or she returns the questionnaire. Only the postcard, and not the questionnaire itself, contains any information that can be used to identify the respondent. This practice permits the researcher to remove the responder's name from the follow-up list, thereby saving administrative costs and avoiding annoying the responder with unnecessary reminders.
Confidentiality of responses is a different issue. Identifying questionnaires with a number or code makes it easier to know who has returned them and streamlines the follow-up process. Regardless, subjects always need to be assured that their responses will be kept confidential and that no individual's response can be identified in any information reported or otherwise communicated. With the increasing protection for human subjects, confidentiality is almost always required by institutional review boards (IRBs) as a prerequisite to approving the survey.
SELECTING THE SAMPLE & DETERMINING N
Properly done survey research is based on the principle that a randomly selected sample can represent the responses of all people. Randomly selected samples are, in fact, more accurate than questioning everyone. For instance, the U.S. census is well known to have a problem with undercounting certain populations, and the American Statistical Association has recommended the use of sampling to deal with this shortcoming. The U.S. census continues to use a census rather than random samples, largely for political rather than statistical reasons; the results are used to reallocate seats in the House of Representatives and form the foundation for allocation of federal funds to states. Research has shown that properly selected samples provide more accurate estimates of underrepresented populations than the census, and random samples are generally used to correct the census undercount.
Review of Sampling Methods
We discussed random sampling methods in Chapter 4 and very briefly review them here. A simple random sample is one in which every subject has an equal chance of being selected.
Random samples are typically selected using computer-generated random numbers or a table of random numbers, as we illustrated in Chapter 4. A systematic random sample is one in which every kth item is selected, where k is determined by dividing the number of items in the sampling frame by the desired sample size. Systematic sampling is simple to use and effective as long as no cyclic repetition is inherent in the sampling frame.
A stratified random sample is one in which the population is first divided into strata or subgroups, and a random sample is then selected from each stratum. This method, if properly used, requires the smallest sample size, and it is the method used for most sponsored surveys and by most professional polling organizations. The data base used by Huang and Stafford (2002) was a random sample of U.S. physicians stratified by geographic region and specialty. Stratified sampling requires great care in analyzing the data, because a response from one physician may represent 10 other physicians, whereas the response from a second physician may represent 25 other physicians. A cluster random sample occurs when the population is divided into clusters and a subset of the clusters is randomly selected. Cluster sampling requires a larger sample size than other methods and presents even greater challenges for analysis. Recall that nonprobability sampling to obtain quota or convenience samples does not fulfill the requirements of randomness needed to estimate sampling errors and use statistical methods.
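As a brief illustration of how these selection schemes can be implemented (the sampling frame and strata below are hypothetical), the following sketch draws simple, systematic, and stratified random samples:

```python
import random

random.seed(11)
frame = [f"subject_{i:03d}" for i in range(1, 501)]   # hypothetical sampling frame

# Simple random sample: every subject has an equal chance of selection.
simple = random.sample(frame, 50)

# Systematic random sample: every kth subject after a random start,
# where k = frame size / desired sample size.
k = len(frame) // 50
start = random.randrange(k)
systematic = frame[start::k]

# Stratified random sample: divide the frame into strata, then sample within each.
strata = {"urban": frame[:300], "rural": frame[300:]}
stratified = {name: random.sample(members, 25) for name, members in strata.items()}

print(len(simple), len(systematic), list(stratified))
```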
Finding the Sample Size for a Survey
Many national survey organizations, such as Gallup, use approximately 1000 people in a given poll, and, as previously noted, large sponsored surveys typically use complicated sampling designs. However, many of the studies published in the medical literature use samples selected by the investigators.
When determining a sample size, we first ask, "What are the study outcomes?" Is it the percentage of people who respond a given way? The "average" response (based on a scale of some sort)? The difference between two or more types of respondents (either percentages or averages)? Or the relationship among questions, such as whether people who exercise regularly also eat a proper diet? By this point in the book, you probably recognize these situations as the ones we covered in Chapter 5 (a proportion or mean), Chapters 6 and 7 (comparing two or more proportions or means), and Chapter 8 (correlations among variables). In each of these chapters we illustrated methods for finding a sample size, so we review them very briefly here and refer you back to the previous chapters for more detail. Recall that power is the probability of finding an effect if one truly exists; a power analysis determines the sample size needed to give a reasonable chance of doing so.
Caiola and Litaker (2000) wanted to estimate the percentage of general internal medicine (GIM) fellows who felt that a research reputation was important in choosing a fellowship. They might have expected that approximately 50% would view research as an important factor. Using the PASS program, we see from Box 11-10 that a sample of 110 is sufficient for a 95% (actually 96.5%) confidence interval if the proportion is around 50% and the population size is 146. Caiola and Litaker had responses from 109 of the 146 eligible fellows, so they can be assured that their finding of 53% is within ±3.5% of the true proportion.
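As a rough check on the Box 11-10 calculation, the sketch below uses the normal approximation with a finite population correction; the assumed precision of ±0.05 is ours, and exact methods such as those used by PASS may give slightly different numbers.

```python
from math import sqrt
from scipy.stats import norm

# Rough check of Box 11-10 using a normal approximation with a finite
# population correction (FPC). The +/-0.05 precision is an assumption.
n, N, p = 110, 146, 0.50
se = sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))     # FPC-adjusted standard error

half_width_95 = norm.ppf(0.975) * se
print(f"95% CI half-width: {half_width_95:.3f}")          # about +/-0.047

# Confidence level achieved when the half-width is fixed at 0.05
conf = 2 * norm.cdf(0.05 / se) - 1
print(f"Confidence level for +/-0.05 precision: {conf:.1%}")   # about 96.5%
```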
Box 11-10. POWER FOR A PROPORTION.
Confidence Interval of a Proportion
Box 11-11. SAMPLE SIZE FOR CONFIDENCE INTERVAL FOR A MEAN.
When the sample size is 139, a two-sided 95.0% confidence interval for a single mean will extend 0.5 from the observed mean, assuming that the standard deviation is known to be 3.0 and the confidence interval is based on the large sample z statistic.
Source: Data, used with permission from Rogers LQ, Bailey JE, Gutin B, Johnson KC, Levine MA, Milan F, et al: Teaching resident physicians to provide exercise counseling: A needs assessment. Acad Med 2002; 77: 841–844. Output produced with NCSS, used with permission.
Box 11-12. SAMPLE SIZE FOR COMPARING TWO PROPORTIONS.
One goal of the proposed study is to test the null hypothesis that the proportion positive is identical in the two populations. The criterion for significance (alpha) has been set at 0.05. The test is 2-tailed, which means that an effect in either direction will be interpreted. With the proposed sample size of 180 and 180 for the two groups, the study will have power of 82.4% to yield a statistically significant result. This computation assumes that the difference in proportions is -0.15 (specifically, 0.50 versus 0.65).
Box 11-13. SAMPLE SIZE FOR A CORRELATION COEFFICIENT.
One Correlation Power Analysis
Summary Statements
A sample size of 100 achieves 86% power to detect a difference of 0.30 between the null hypothesis correlation of 0.30 and the alternative hypothesis correlation of 0.00 using a two-sided hypothesis test with a significance level of 0.05.
None of our presenting problems involved estimating means, but, for the sake of illustration, we'll assume that Rogers and colleagues (2002) wanted to estimate the mean score for the sum of the six questions dealing with the ability of exercise to prevent disease or improve health: decreases overall mortality, decreases coronary heart disease mortality, decreases COPD mortality, improves blood sugar control, improves lipoprotein profile, prevents hip fractures. Suppose they want to form a 95% confidence interval about the mean score for these six items. Box 11-11 shows the output from nQuery when the standard deviation of the sum is estimated to be 3, and the desired confidence interval is within ±0.5. A sample size of 139 is sufficient, less than the 251 in the Rogers study.
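A minimal sketch of that calculation, using the large-sample z interval described in Box 11-11 (n is the square of z times the standard deviation divided by the desired half-width, rounded up):

```python
from math import ceil
from scipy.stats import norm

# Sample size so that a two-sided 95% CI for a mean extends E on each side,
# assuming the standard deviation is known (large-sample z interval).
sigma = 3.0   # assumed standard deviation of the six-item sum
E = 0.5       # desired half-width of the confidence interval
z = norm.ppf(0.975)
n = ceil((z * sigma / E) ** 2)
print(n)      # 139, matching the nQuery statement in Box 11-11
```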
We use the Power and Precision program to estimate the sample size needed by Lapidus and colleagues (2002) to compare the proportions of physicians with and without domestic violence (DV) training who screen their patients for DV. The output in Box 11-12 shows that a sample of 180 physicians who received training and 180 who did not is sufficient to detect a 15% difference (65% vs 50%) in screening behavior.
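A sketch of the usual normal-approximation power calculation for comparing two independent proportions; with 180 per group, proportions of 0.50 and 0.65, and a two-sided alpha of 0.05, it gives a power of roughly 82%, close to the value reported in Box 11-12 (specialized software may use slightly different formulas):

```python
from math import sqrt
from scipy.stats import norm

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z test comparing two proportions."""
    pbar = (p1 + p2) / 2
    se0 = sqrt(2 * pbar * (1 - pbar) / n_per_group)                      # SE under H0
    se1 = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf((abs(p1 - p2) - z_alpha * se0) / se1)

print(f"{power_two_proportions(0.50, 0.65, 180):.1%}")   # roughly 82%
```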
To illustrate finding the sample size for a correlation, we return to the study by Rogers and colleagues (2002). They wanted to calculate the correlation between a physician's confidence in his or her exercise counseling skills and the percentage of patients actually counseled. PASS indicates that a sample of 100 is sufficient to detect a correlation of 0.30, and again, the Rogers study more than meets this requirement (Box 11-13).
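The corresponding calculation for a correlation uses Fisher's z transformation; with n = 100, r = 0.30, and a two-sided alpha of 0.05, the approximate power is about 86%, consistent with the PASS summary shown in Box 11-13:

```python
from math import atanh, sqrt
from scipy.stats import norm

def power_correlation(r, n, alpha=0.05):
    """Approximate power to detect a correlation r versus 0, via Fisher's z."""
    z_r = atanh(r)                    # Fisher transformation of r
    se = 1 / sqrt(n - 3)
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(z_r) / se - z_alpha)

print(f"{power_correlation(0.30, 100):.1%}")   # about 86%
```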
Table 11-6. Analysis of Caiola and Litaker data.
ANALYSIS OF SURVEY RESULTS
Almost all of the statistical procedures we discussed in Chapters 3–10 are used to analyze survey data. Confidence limits for proportions, means, the difference between proportions or means, and correlations are all relevant, in addition to chi-square and other nonparametric tests; t tests; analysis of variance; and regression, including logistic regression. Procedures that are rarely used include those that analyze time-dependent outcomes, such as Kaplan–Meier curves and the Cox model, because surveys rarely ask subjects the length of time until an event occurs.
Analyzing Caiola and Litaker
Caiola and Litaker listed the mean and standard deviation for selected questions, but they did not perform any statistical analyses. We have reproduced a selection of their results in Table 11-6. What was the most important factor? The least important? If you wanted to analyze the answers statistically, what procedure would you use? (See Exercise 2.)
Analyzing Rogers and Colleagues
Rogers and her colleagues (2002) wanted to know what factors are related to a physician's decision to counsel patients about exercise. Factors associated with the percentage of patients counseled are listed in Table 2 of the published article; we have reproduced a portion in Table 11-7. It is apparent that having confidence in one's counseling skills and believing that exercise counseling is a high priority have the highest correlations.
Table 11-7. Analysis of data from Rogers and colleagues.
Table 11-8. Regression analysis from Rogers and colleagues.
The authors also developed some regression models. We have replicated a stepwise regression to predict the percent of patients that are counseled. Before looking at Table 11-8, can you predict which variable will enter the regression equation first? It is the belief that exercise counseling is of high priority, because it had the highest correlation (0.32) with the dependent variable. In the second step, having confidence in one's ability to counsel enters the regression equation. Finally, in the third step, the physician's level of training adds incrementally to the prediction. Why did only three variables enter the regression equation? See Exercise 4.
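A minimal sketch of forward stepwise selection on simulated data (the variable names, coefficients, and values below are hypothetical, not the Rogers data set): at each step, the candidate with the smallest p value when added to the current model enters, and selection stops when no remaining candidate is significant.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 250

# Hypothetical predictors and outcome, loosely mimicking the study's variables
variables = {
    "high_priority": rng.normal(size=n),
    "confidence":    rng.normal(size=n),
    "training_year": rng.integers(1, 4, size=n).astype(float),
}
y = (10 * variables["high_priority"] + 7 * variables["confidence"]
     + 4 * variables["training_year"] + rng.normal(scale=15, size=n))

selected, remaining = [], list(variables)
while remaining:
    # p value of each remaining candidate when added to the current model
    pvals = {}
    for name in remaining:
        X = sm.add_constant(np.column_stack([variables[v] for v in selected + [name]]))
        pvals[name] = sm.OLS(y, X).fit().pvalues[-1]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:          # stop when no candidate reaches significance
        break
    selected.append(best)
    remaining.remove(best)

print("Order of entry:", selected)
```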
Analyzing Lapidus and Colleagues
We reproduced a portion of Table 2 from the article by Lapidus and colleagues (2002) in Table 3-9. Lapidus and colleagues also used logistic regression to learn which variables were significantly associated with screening for domestic violence (DV). We have replicated the analysis, and part of it is shown in Table 11-9. Two variables—an urban location and any training in DV—are significant predictors of whether physicians screen their patients for DV. The Wald statistic in Table 11-9 is used to test the significance of the regression coefficient (B). Recall that it is necessary to exponentiate, or find the antilog of, B to obtain an estimate of the odds ratio. The odds ratios and their 95% confidence limits are given in the last three columns. Clearly, previous training in DV is the most predictive of future screening behavior, with those having training almost 5 times more likely to screen.
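The conversion from a logistic regression coefficient to an odds ratio and its confidence interval follows directly from B and its standard error; the values of B and SE below are hypothetical stand-ins, not the numbers in Table 11-9:

```python
from math import exp
from scipy.stats import norm

# Convert a logistic regression coefficient B and its standard error SE
# into a Wald statistic, an odds ratio, and a 95% confidence interval.
# B and SE below are hypothetical, not the values from Table 11-9.
B, se = 1.6, 0.35
z = norm.ppf(0.975)
wald = (B / se) ** 2                      # Wald chi-square for testing B = 0
odds_ratio = exp(B)
ci_low, ci_high = exp(B - z * se), exp(B + z * se)
print(f"Wald = {wald:.1f}, OR = {odds_ratio:.2f} "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f})")
```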
Table 11-9. Logistic regression to predict DV screening.
Analyzing Huang and Stafford
A situation we have not discussed before is the use of weights in analyzing data. As we mentioned with stratified samples, not all subjects represent the same number of nonsampled subjects. For instance, suppose a researcher wants to survey 500 physicians in each of the states of Illinois, New York, and California. The estimated 2002 census for these states was 12.5, 19, and 35 million, respectively. Assuming that the number of physicians is similarly distributed, a randomly sampled physician in California would "represent" almost twice as many other physicians as in New York and almost three times as many as in Illinois. Thus, a single response would have these approximate weights: 1 if from Illinois, 1.5 if from New York, and 2.8 if from California. These weights are used in any analysis that involves responses from more than one state.
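A small sketch of how such weights enter an analysis, using the hypothetical three-state example above: each response is multiplied by its weight, and the weighted proportion is the weighted sum of responses divided by the sum of the weights.

```python
# Weighted estimate of a proportion from the hypothetical three-state sample above.
# Each respondent's answer (1 = yes, 0 = no) is multiplied by the weight for
# that respondent's state before averaging.
weights = {"Illinois": 1.0, "New York": 1.5, "California": 2.8}

# Hypothetical responses: (state, answered_yes)
responses = [("Illinois", 1), ("Illinois", 0), ("New York", 1),
             ("California", 1), ("California", 0), ("California", 1)]

weighted_yes = sum(weights[state] * yes for state, yes in responses)
total_weight = sum(weights[state] for state, _ in responses)
print(f"Weighted proportion yes: {weighted_yes / total_weight:.2f}")
print(f"Unweighted proportion yes: {sum(y for _, y in responses) / len(responses):.2f}")
```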
Huang and Stafford (2002), in their study of the treatment of urinary tract infections (UTI) in women, had to include weights to reflect a complicated sampling plan. Their data were from the National Ambulatory Medical Care Survey (NAMCS) from 1989 through 1998. Each year NAMCS uses master lists from the American Medical Association and American Osteopathic Association to select a random sample of physicians, stratified by geographic area and specialty. For each participating physician, patient visits during a randomly selected week are systematically sampled. Huang and Stafford selected all female patients, ages 18 to 75 years, from this database who had an ICD-9 code for acute cystitis or UTI. They subsequently excluded any women who had codes for urologic procedures, pregnancy, diabetes mellitus, cancer, and several other comorbid conditions. The result was a sample of 1478 patient visits. Because of the complex sampling system, Huang and Stafford had to include different weights for each individual woman according to the analysis they were doing. These authors were kind enough to share their data set with us, but we opted not to include it on the CD-ROM because of the complications involved with the analyses.
SUMMARY
We discussed four survey articles in this chapter. Results from the survey of GIM fellows by Caiola and Litaker (2000) indicated that program location was the top reason for selecting a fellowship. Other important factors were opportunities for research, the availability of a mentor, and the national reputation of the program. The researchers pointed out that their study had a limitation because they excluded programs that concentrate on informatics or epidemiology. They also concluded that surveying applicants for a GIM fellowship rather than GIM fellows might have provided additional information.
Of the 313 questionnaires sent in the study by Rogers and colleagues (2002), 251 were returned, representing an 80% response rate. Their results showed that even though nearly all of the residents understood the beneficial health effects of exercise, only 15% counseled more than 80% of their clinic patients. Factors most strongly associated with exercise counseling were the physician's confidence in exercise counseling, a perception of exercise as having a high priority, and postgraduate years of training. The authors highlight the inadequate percentage of resident physicians who provide exercise counseling and stress the need to address factors that influence counseling behavior in educational programs.
Overall, 49% of the physicians surveyed about domestic violence screening responded after a total of three mailings (Lapidus et al, 2002). Results revealed that 33% of responding physicians did not screen for DV at all. Only 12% reported routine screening at all well-child visits, and 61% screened only select women. Physicians in an HMO, hospital-based, university-based, or public practice were more likely to screen routinely for DV than those in a suburban setting. Previous DV training was the strongest predictor of both routine and selective screening—physicians with previous DV training were 5 times more likely to screen for DV than those with no prior training. This survey reveals that few primary care physicians routinely screen their patients for DV and suggests that physician training may be an effective intervention for improving DV screening rates. The authors acknowledged the possibility of response bias but could not assess it because there was no information on the demographic or practice characteristics of nonresponders.
Huang and Stafford (2002) evaluated changes in the demographics of patients and trends in physician practices, such as the frequency of ordering urinalysis and antibiotic selection over time. The frequency of urinalysis ordering declined from 90% of visits in 1989–1990 to 81% of visits in 1997–1998. An antibiotic was prescribed or renewed at 67% of visits. The most commonly prescribed antibiotics were trimethoprim or trimethoprim–sulfamethoxazole, selected fluoroquinolones, and nitrofurantoin. The use of trimethoprim or trimethoprim–sulfamethoxazole declined from 49% at the beginning of the decade to 24% at the end, whereas use of recommended fluoroquinolones and nitrofurantoin increased over time. Different specialties had dissimilar antibiotic prescribing patterns. The authors concluded that this survey, which provides a national and longitudinal description of patients, physicians, and treatment choices for UTI visits, raises concerns about antimicrobial resistance and costs of treatment. Fluoroquinolone resistance may increase because of increased use of this class of antibiotics and a broader use of newer drugs in this class. Cost implications are dramatic—in 1999 a 10-day course of trimethoprim–sulfamethoxazole cost $1.79, but 10 days of ciprofloxacin cost $70.98.
We reviewed some of the important factors when designing and administering a survey. We provided only a general overview of survey research, but provided a number of references for those readers who want to learn more. We recommend that researchers consider using a standard questionnaire if at all possible; it is much faster because the work of writing questions and pilot testing has already been done. If standard questionnaires are used, researchers can compare their results with other studies, and the reliability and validity of the instrument has generally been established.
We expect that surveys using email and the Internet will become increasingly popular. Already some researchers are using them, and others use fax machines for responses to short questionnaires. Dillman (2000) is an excellent source of information on email and Internet surveys. Several first-rate software packages are available for those interested in designing questionnaires and administering them via the Internet. We used SurveySolutions by Perseus to produce some of the problem questions in this chapter and their revised versions. SurveySolutions also enables a completed questionnaire to be used with email or on a Website. Other software packages for Web-based surveys include DreamWeaver by Macromedia, FrontPage by Microsoft, and ezsurvey by Raosoft.
A number of excellent resources are available on the Internet. They include:
American Statistical Association Section on Survey Research
http://www.amstat.org/sections/srms/
StatPac Survey Software
http://www.statpac.com/surveys/
NCS Pearson
http://www.pearsonncs.com/research-notes/index.htm
National Multiple Sclerosis Society
http://www.nationalmssociety.org/MUCS_glossary.asp
EXERCISES
1. What are the difficulties, if any, with the following surveys:
a. Questionnaires handed out on an airplane to measure customer satisfaction
b. A poll in which people sign onto the Internet and vote on an issue
c. Satisfaction questionnaires mailed to all patients by hospitals and clinics
d. Requiring students to hand in a course evaluation before receiving their final grade
2. If you wanted to analyze the information in Table 11-6 using statistical methods, what procedure(s) would you use?
3. The caption at the bottom of Table 2 in the published article by Rogers and colleagues (2002) states that a more conservative P value of 0.01 was used to determine the statistical significance of the correlations. Why do you think the authors did this? Do alternative options exist for researchers?
4. Refer to Tables 11-7 and 11-8, the correlations and multiple regression performed by Rogers and colleagues. The three largest correlations in Table 11-7 are confidence in counseling skills, feeling successful at counseling, and perceiving exercise counseling as a high priority. The first two variables entered the stepwise regression (Table 11-8); why didn't the high-priority variable also enter the regression?
5. A pediatrician is planning to survey the parents of a random sample of patients and wants to know how many hours their child watches TV each month. How would you ask this question?
6. A clinic manager wants to survey a random sample of patients to learn how they view some recent changes made in the clinic operation. The manager has drafted a questionnaire and wants you to review it. One of the questions asks, “Do you agree that the new clinic hours are an improvement over the old ones?” What advice will you give the manager about the wording of this question?
Table 11-10. Factors associated with antibiotic prescribing for uncomplicated urinary tract infections (UTIs) among primary care physicians (n = 561).
7. Suppose you would like to know how far physicians are willing to travel to attend a continuing-education course, assuming that some number of hours is required each year. In addition, you want to learn which topics they would like to have included in future programs. You plan to publish the results of your study. How would you select the sample of physicians to include in your survey?
a. All physicians who attended last year's programs
b. All physicians who attend the two upcoming programs
c. A random sample of physicians who attended last year's programs
d. A random sample of physicians obtained from a list maintained by the state medical society
e. A random sample of physicians in each county obtained from a list maintained by the county medical societies
8. Huang and Stafford (2002) prepared a table of patient and visit characteristics that were associated with specific antibiotics prescribed for uncomplicated urinary tract infections (UTI); the table is reproduced in Table 11-10.
a. What patient characteristics are significantly associated with trimethoprim–sulfamethoxazole?
b. What does it mean that South (under region) is the referent?
c. Is primary care specialty significantly related to prescribing any of the drugs?