Experimental Design and Statistics
Nathan Leon Pace
1. Statistics and mathematics are the language of scientific medicine.
2. Good research planning includes a clear biologic hypothesis, the specification of outcome variables, the choice of anticipated statistical methods, and sample size planning.
3. To avoid bias in the performance of clinical research, the crucial elements of good research design include concurrent control groups, random allocation of subjects to treatment groups, and blinding of random allocation to patients, caregivers, and outcome assessors.
4. Descriptive (e.g., mean, standard deviation) and inferential statistics (e.g., t test, confidence interval) are both essential methods for the presentation of research results.
5. The central limit theorem allows the use of parametric statistics for most statistical testing.
6. Systematic review and meta-analysis can synthesize and summarize the results of smaller, nonsignificant individual studies and permit more powerful inferences.
Medical journals are replete with numbers. These include weights, lengths, pressures, volumes, flows, concentrations, counts, temperatures, rates, currents, energies, and forces. The analysis and interpretation of these numbers require the use of statistical techniques. The design of the experiment to acquire these numbers is also part of statistical competence. The need for these statistical techniques is mandated by the nature of our universe, which is both orderly and random at the same time. The methods of probability and statistics have been formulated to solve concrete problems, such as betting on cards, understanding biologic inheritance, and improving food processing. Studies in anesthesia have even inspired new statistics. The development of statistical techniques is manifest in the increasing use of more sophisticated research designs and statistical tests in anesthesia research.
If a physician is to be a practitioner of scientific medicine, he or she must read the language of science to be able to independently assess and interpret the scientific report. Without exception, the language of the medical report is increasingly statistical. Readers of the anesthesia literature, whether in a community hospital or a university environment, cannot and should not totally depend on the editors of journals to banish all errors of statistical analysis and interpretation. In addition, questions about simple statistics regularly appear in examinations required of anesthesiologists. Finally, certain statistical methods have everyday applications in clinical medicine. This chapter briefly scans some elements of experimental design and statistical analysis.
Design of Research Studies
The scientific investigator should view himself or herself as an experimenter and not merely as a naturalist. The naturalist goes out into the field ready to capture and report the numbers that flit into view; this is a worthy activity, typified by the case report. Case reports engender interest, suspicion, doubt, wonder, and perhaps the desire to experiment; however, the case report is not sufficient evidence to advance scientific medicine. The experimenter attempts to constrain and control, as much as
possible, the environment in which he or she collects numbers to test a hypothesis. The elements of experimental design are intended to prevent and minimize the possibility of bias, that is, a deviation of results or inferences from the truth.
Two words of great importance to statisticians are population and sample. In statistical language, each has a specialized meaning. Instead of referring only to the count of individuals in a geographic or political region, population refers to any target group of things (animate or inanimate) in which there is interest. For anesthesia researchers, a typical target population might be mothers in the first stage of labor or head-trauma victims undergoing craniotomy. A target population could also be cell cultures, isolated organ preparations, or hospital bills. A sample is a subset of the target population. Samples are taken because of the impossibility of observing the entire population; it is generally not affordable, convenient, or practical to examine more than a relatively small fraction of the population. Nevertheless, the researcher wishes to generalize from the results of the small sample group to the entire population.
Although the subjects of a population are alike in at least one way, these population members are generally quite diverse in other ways. Because the researcher can work only with a subset of the population, he or she hopes that the sample of subjects in the experiment is representative of the population's diversity. Head-injury patients can have open or closed wounds, a variety of coexisting diseases, and normal or increased intracranial pressure. These subgroups within a population are called strata. Often the researcher wishes to increase the sameness or homogeneity of the target population by further restricting it to just a few strata; perhaps only closed and not open head injuries will be included. Restricting the target population to eliminate too much diversity must be balanced against the desire to have the results be applicable to the broadest possible population of patients.
The best hope for a representative sample of the population would be realized if every subject in the population had the same chance of being in the experiment; this is called random sampling. If there were several strata of importance, random sampling from each stratum would be appropriate. Unfortunately, in most clinical anesthesia studies researchers are limited to using those patients who happen to show up at their hospitals; this is called convenience sampling. Convenience sampling is also subject to the nuances of the surgical schedule, the goodwill of the referring physician and attending surgeon, and the willingness of the patient to cooperate. At best, the convenience sample is representative of patients at that institution, with no assurance that these patients are similar to those elsewhere. Convenience sampling is also the rule in studying new anesthetic drugs; such studies are typically performed on healthy, young volunteers.
The researcher must define the conditions to which the sample members will be exposed. Particularly in clinical research, one must decide whether these conditions should be rigidly standardized or whether the experimental circumstances should be adjusted or individualized to the patient. In anesthetic drug research, should a fixed dose be given to all members of the sample or should the dose be adjusted to produce an effect or to achieve a specific end point? Standardizing the treatment groups by fixed doses simplifies the research work. There are risks to this standardization, however: (1) a fixed dose may produce excessive numbers of side effects in some patients, (2) a fixed dose may be therapeutically insufficient in others, and (3) a treatment standardized for an experimental protocol may be so artificial that it has no broad clinical relevance, even if demonstrated to be superior. The researcher should carefully choose and report the adjustment/individualization of experimental treatments.
Even if a researcher is studying just one experimental group, the results of the experiment are usually not interpreted solely in terms of that one group but are also contrasted and compared with other experimental groups. Examining the effects of a new drug on blood pressure during anesthetic induction is important, but what is more important is comparing those results with the effects of one or more standard drugs commonly used in the same situation. Where can the researcher obtain these comparative data? There are several possibilities: (1) each patient could receive the standard drug under identical experimental circumstances at another time, (2) another group of patients receiving the standard drug could be studied simultaneously, (3) a group of patients could have been studied previously with the standard drug under similar circumstances, and (4) literature reports of the effects of the drug under related but not necessarily identical circumstances could be used. Under the first two possibilities, the control group is contemporaneous—either a self-control (crossover) or parallel control group. The second two possibilities are examples of the use of historical controls.
Because historical controls already exist, they are convenient and seemingly cheap to use. Unfortunately, the history of medicine is littered with the “debris” of therapies enthusiastically accepted on the basis of comparison with past experience. A classic example is operative ligation of the internal mammary artery for the treatment of angina pectoris—a procedure now known to be of no value. Although proposed as a method to improve coronary artery blood flow, the procedure was shown to lack benefit in a trial in which some patients had the ligation and some had a sham procedure; both groups showed benefit.1 There is now firm empirical evidence that studies using historical controls usually show a favorable outcome for a new therapy, whereas studies with concurrent controls, that is, parallel control group or self-control, less often reveal a benefit.2 Nothing seems to increase the enthusiasm for a new treatment as much as the omission of a concurrent control group. If the outcome with an old treatment is not studied simultaneously with the outcome of a new treatment, one cannot know if any differences in results are a consequence of the two treatments, or of unsuspected and unknowable differences between the patients, or of other changes over time in the general medical environment. One possible exception would be in studying a disease that is uniformly fatal (100% mortality) over a very short time.
Random Allocation of Treatment Groups
Having accepted the necessity of an experiment with a control group, the question arises as to the method by which each subject should be assigned to the predetermined experimental groups. Should it depend on the whim of the investigator, the day of the week, the preference of a referring physician, the wish of the patient, the assignment of the previous subject, the availability of a study drug, a hospital chart number, or some other arbitrary criterion? All such methods have been used and are still used, but all can ruin the purity and usefulness of the experiment. It is important to remember the purpose of sampling: by exposing a small number of subjects from the target
population to the various experimental conditions, one hopes to make conclusions about the entire population. Thus, the experimental groups should be as similar as possible to each other in reflecting the target population; if the groups are different, bias is introduced into the experiment. Although randomly allocating subjects of a sample to one or another of the experimental groups requires additional work, this principle prevents selection bias by the researcher, minimizes (but cannot always prevent) the possibility that important differences exist among the experimental groups, and disarms the critics' complaints about research methods. Random allocation is most commonly accomplished by the use of computer-generated random numbers.
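As the text notes, random allocation is most commonly accomplished with computer-generated random numbers. A minimal sketch in Python follows; the function name, group labels, and seed are illustrative choices, not part of any standard. Repeating the labels before shuffling keeps the group sizes as equal as possible, and a fixed seed makes the allocation list reproducible for audit.

```python
import random

def randomize(subjects, groups=("A", "B"), seed=None):
    """Balanced random allocation of subjects to treatment groups.

    Repeating the group labels to cover every subject and then
    shuffling keeps group sizes as equal as possible; a fixed seed
    makes the allocation reproducible.
    """
    rng = random.Random(seed)
    labels = (list(groups) * (len(subjects) // len(groups) + 1))[:len(subjects)]
    rng.shuffle(labels)
    return dict(zip(subjects, labels))

# Allocate 8 hypothetical subjects to two groups.
allocation = randomize(list(range(1, 9)), seed=42)
```

In practice the allocation list is generated before the trial starts and concealed from those enrolling patients, in keeping with the concealment of random allocation discussed below.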
Blinding refers to masking from patients and experimenters the experimental group to which each subject has been or will be assigned. In clinical trials, the necessity for blinding starts even before a patient is enrolled in the research study; this is called the concealment of random allocation. There is good evidence that, if the process of random allocation is accessible to view, the referring physicians, the research team members, or both are tempted to manipulate the entrance of specific patients into the study to influence their assignment to a specific treatment group3; they do so having formed a personal opinion about the relative merits of the treatment groups and desiring to get the “best” for someone they favor. This creates bias in the experimental groups.
Each subject should remain, if possible, ignorant of the assigned treatment group after entrance into the research protocol. The patient's expectation of improvement, a placebo effect, is a real and useful part of clinical care. But when studying a new treatment, one must ensure that the fame or infamy of the treatments does not induce a bias in outcome by changing patient expectations. A researcher's knowledge of the treatment assignment can bias his or her ability to administer the research protocol and to observe and record data faithfully; this is true for clinical, animal, and in vitro research. If the treatment group is known, those who observe data cannot trust themselves to record the data impartially and dispassionately. The appellations single-blind and double-blind to describe blinding are commonly used in research reports, but often applied inconsistently; the researcher should carefully plan and report exactly who is blinded.
Types of Research Design
Ultimately, research design consists of choosing what subjects to study, what experimental conditions and constraints to enforce, and which observations to collect at what intervals. A few key features in this research design largely determine the strength of scientific inference on the collected data. These key features allow the classification of research reports (Table 9-1). This classification reveals the variety of experimental approaches and indicates strengths and weaknesses of the same design applied to many research problems.
The first distinction is between longitudinal and cross-sectional studies. The former is the study of changes over time, whereas the latter describes a phenomenon at a certain point in time. For example, reporting the frequency with which certain drugs are used during anesthesia is a cross-sectional study, whereas investigating the hemodynamic effects of different drugs during anesthesia is a longitudinal one.
Longitudinal studies are next classified by the method with which the research subjects are selected. These methods for choosing research subjects can be either prospective or retrospective; these two approaches are also known as cohort (prospective) or case-control (retrospective). A prospective study assembles groups of subjects by some input characteristic that is thought to change an output characteristic; a typical input characteristic would be the opioid drug administered during anesthesia; for example, remifentanil or fentanyl. A retrospective study gathers subjects by an output characteristic; an output characteristic is the status of the subject after an event; for example, the occurrence of a myocardial infarction. A prospective (cohort) study would be one in which a group of patients undergoing neurologic surgery was divided in two groups, given two different opioids (remifentanil or fentanyl), and followed for the development of a perioperative myocardial infarction. In a retrospective (case-control) study, patients who suffered a perioperative myocardial infarction would be identified from hospital records; a group of subjects of similar age, gender, and disease who did not suffer a perioperative myocardial infarction also would be chosen, and the two groups would then be compared for the relative use of the two opioids (remifentanil or fentanyl). Retrospective studies are a primary tool of epidemiology. A case-control study can often identify an association between an input and output characteristic, but the causal link or relationship between the two is more difficult to specify.
Table 9-1 Classification of Clinical Research Reports
Prospective studies are further divided into those in which the investigator performs a deliberate intervention and those in which the investigator merely observes. In a study of deliberate intervention, the investigator would choose several anesthetic maintenance techniques and compare the incidence of postoperative nausea and vomiting. If it was performed as an observational study, the investigator would observe a group of patients receiving anesthetics chosen at the discretion of each patient's anesthesiologist and compare the incidence of postoperative nausea and vomiting among the anesthetics used. Obviously, in this example of an observational study, there has been an intervention; an anesthetic has been given. The crucial distinction is whether the investigator controlled the intervention. An observational study may reveal differences among treatment groups, but whether such differences are the consequence of the treatments or of other differences among the patients receiving the treatments will remain obscure.
Studies of deliberate intervention are further subdivided into those with concurrent controls and those with historical controls. Concurrent controls are either a simultaneous parallel control group or a self-control study; historical controls include previous studies and literature reports. A randomized controlled trial is thus a longitudinal, prospective study of deliberate intervention with concurrent controls.
Although most of this discussion about experimental design has focused on human experimentation, the same principles apply and should be followed in animal experimentation. The randomized, controlled clinical trial is the most potent scientific tool for evaluating medical treatment; randomization into treatment groups is relied on to equally weight the subjects of
the treatment groups for baseline attributes that might predispose or protect the subjects from the outcome of interest.
Data and Descriptive Statistics
Statistics is a method for working with sets of numbers, a set being a group of objects. Statistics involves the description of number sets, the comparison of number sets with theoretical models, comparison between number sets, and comparison of recently acquired number sets with those from the past. A typical scientific hypothesis asks which of two methods (treatments), X and Y, is better. A statistical hypothesis is formulated concerning the sets of numbers collected under the conditions of treatments X and Y. Statistics provides methods for deciding if the set of values associated with X is different from the set of values associated with Y. Statistical methods are necessary because there are sources of variation in any data set, including random biologic variation and measurement error. These errors in the data cause difficulties in avoiding bias and in being precise. Bias keeps the true value from being known and fosters incorrect decisions; precision deals with the problem of the data scatter and with quantifying the uncertainty about the value in the population from which a sample is drawn. These statistical methods are relatively independent of the particular field of study. Regardless of whether the numbers in sets X and Y are systolic pressures, body weights, or serum chlorides, the approach for comparing sets X and Y is usually the same.
Data collected in an experiment include the defining characteristics of the experiment and the values of events or attributes that vary over time or conditions. The former are called explanatory variables and the latter are called response variables. The researcher records his or her observations on data sheets or case record forms, which may be one to many pages in length, and assembles them together for statistical analysis. Variables such as gender, age, and doses of accompanying drugs reflect the variability of the experimental subjects. Explanatory variables, it is hoped, explain the systematic variations in the response variables. In a sense, the response variables depend on the explanatory variables.
Response variables are also called dependent variables. Response variables reflect the primary properties of experimental interest in the subjects. Research in anesthesiology is particularly likely to have repeated measurement variables; that is, a particular measurement recorded more than once for each individual. Some variables can be both explanatory and response; these are called intermediate response variables. Suppose an experiment is conducted comparing electrocardiography and myocardial responses between five doses of an opioid. One might analyze how ST segments depended on the dose of opioids; here, maximum ST segment depression is a response variable. Maximum ST segment depression might also be used as an explanatory variable to address the subtler question of the extent to which the effect of an opioid dose on postoperative myocardial infarction can be accounted for by ST segment changes.
The mathematical characteristics of the possible values of a variable fit into five classifications (Table 9-2). Properly assigning a variable to the correct data type is essential for choosing the correct statistical technique. For interval variables, there is equal distance between successive intervals; the difference between 15 and 10 is the same as the difference between 25 and 20. Discrete interval data can have only integer values; for example, number of living children. Continuous interval data are measured on a continuum and can be a decimal fraction; for example, blood pressure can be described as accurately as desired (e.g., 136, 136.1, or 136.14 mm Hg). The same statistical techniques are used for discrete and continuous data.
Putting observations into two or more discrete categories derives categorical variables; for statistical analysis, numeric values are assigned as labels to the categories. Dichotomous data allow only two possible values; for example, male versus female. Ordinal data have three or more categories that can logically be ranked or ordered; however, the ranking or ordering of the variable indicates only relative and not absolute differences between values; there is not necessarily the same difference between American Society of Anesthesiologists Physical Status score I and II as there is between III and IV. Although ordinal data are often treated as interval data in choosing a statistical technique, such analysis may be suspect; alternative techniques for ordinal data are available. Nominal variables are placed into categories that have no logical ordering. The eye colors blue, hazel, and brown might be assigned the numbers 1, 2, and 3, but it is nonsense to say that blue < hazel < brown.
Table 9-2 Data Types
A typical hypothetical data set could be a sample of ages (the response or dependent variable) of 12 residents in an anesthesia training program (the population). Although the results of a particular experiment might be presented by repeatedly showing the entire set of numbers, there are concise ways of summarizing the information content of the data set into a few numbers. These numbers are called sample or summary statistics; summary statistics are calculated using the numbers of the sample. By convention, the symbols of summary statistics are roman letters. The two summary statistics most frequently used for interval variables are the central location and the variability, but there are other summary statistics. Other data types have analogous summary statistics. Although the first purpose of descriptive statistics is to describe the sample of numbers obtained, there is also the desire to use the summary statistics from the sample to characterize the population from which the sample was obtained. For example, what can be said about the age of all anesthesia residents from the information in a sample? The population also has measures of central location and variability called the parameters of the population; Greek letters denote population parameters. Usually, the population parameters cannot be directly calculated because data from all population members cannot be obtained. The beauty of properly chosen summary statistics is that they are the best possible estimators of the population parameters.
These sampling statistics can be used in conjunction with a probability density function to provide additional descriptions of the sample and its population. Also commonly described as a probability distribution, a probability density function is an algebraic equation, f(x), which gives a theoretical percentage distribution of x. Each value of x has a probability of occurrence given by f(x). The most important probability distribution is the normal or Gaussian function:

f(x) = [1/(σ√(2π))] e^[−(x − µ)²/(2σ²)]

There are two parameters (population mean and population variance) in the equation of the normal function, denoted µ and σ². Often called the normal equation, it can be plotted and produces the familiar bell-shaped curve. Why are the mathematical properties of this curve so important to biostatistics? First, it has been empirically noted that when a biologic variable is sampled repeatedly, the pattern of the numbers plotted as a histogram resembles the normal curve; thus, most biologic data are said to follow or to obey a normal distribution. Second, if it is reasonable to assume that a sample is from a normal population, the mathematical properties of the normal equation can be used with the sampling statistic estimators of the population parameters to describe the sample and the population. Third, a mathematical theorem (the central limit theorem) allows the use of the assumption of normality for certain purposes, even if the population is not normally distributed.
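The central limit theorem can be illustrated with a small simulation: even when individual observations come from a distinctly non-normal population, the means of repeated samples pile up in a roughly normal pattern around the population mean. The sketch below uses only the Python standard library; the choice of a uniform population and the sample sizes are arbitrary illustrations.

```python
import random
import statistics

rng = random.Random(0)

# Population: uniform on [0, 1) -- decidedly non-normal.
# Its mean is 0.5 and its variance is 1/12.
def sample_mean(n):
    return statistics.fmean(rng.random() for _ in range(n))

# The central limit theorem predicts that means of repeated samples of
# size 30 are approximately normal around 0.5, with standard deviation
# sqrt((1/12) / 30), roughly 0.053.
means = [sample_mean(30) for _ in range(2000)]
grand_mean = statistics.fmean(means)
spread = statistics.stdev(means)
```

A histogram of `means` would show the familiar bell shape even though a histogram of the raw uniform draws would be flat.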
The three most common summary statistics of central location for interval variables are the arithmetic mean, the median, and the mode. The mean is merely the average of the numbers in the data set. Being a summary statistic of the sample, the arithmetic mean is denoted x̄ (the Roman letter x under a bar):

x̄ = (Σxᵢ)/n

where n is the count of objects in the sample. If all values in the population could be obtained, then the population mean µ could be calculated similarly. Because all values of the population cannot be obtained, the sample mean is used. (Statisticians describe the sample mean as the unbiased, consistent, minimum variance, sufficient estimator of the population mean. Estimators are denoted by a hat over a roman letter; thus, the sample mean x̄ is the estimator of the population mean µ.)
The median is the middlemost number or the number that divides the sample into two equal parts—first, ranking the sample values from lowest to highest and then counting up halfway to obtain the median. The concept of ranking is used in nonparametric statistics. A virtue of the median is that it is hardly affected by a few extremely high or low values. The mode is the most popular number of a sample; that is, the number that occurs most frequently. A sample may have ties for the most common value and be bi- or polymodal; these modes may be widely separated or adjacent. The raw data should be inspected for this unusual appearance. The mode is always mentioned in discussions of descriptive statistics, but it is rarely used in statistical practice.
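The three measures of central location can be computed directly with the Python standard library. The ages below are invented purely for illustration; they show how a single outlier pulls the mean upward while barely moving the median or mode.

```python
import statistics

# Hypothetical ages of 12 residents (illustrative numbers only).
ages = [25, 26, 26, 27, 28, 28, 28, 29, 30, 31, 33, 47]

mean_age = statistics.fmean(ages)     # pulled upward by the outlier, 47
median_age = statistics.median(ages)  # middlemost value; resistant to 47
mode_age = statistics.mode(ages)      # the most frequently occurring value
```

Here the mean is about 29.8, while the median and mode are both 28, illustrating why the median is preferred when a distribution is skewed by extreme values.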
Spread or Variability
Any set of interval data has variability unless all the numbers are identical. The range of ages from lowest to highest expresses the largest difference. This spread, diversity, and variability can also be expressed in a concise manner. Variability is specified by calculating the deviation or deviate of each individual xᵢ from the center (mean) of all the xᵢ's. The sum of the squared deviates is always positive unless all set values are identical. This sum is then divided by the number of individual measurements. The result is the average squared deviation; the average squared deviation is ubiquitous in statistics.
The concept of describing the spread of a set of numbers by calculating the average distance from each number to the center of the numbers applies to both a sample and a population; this average squared distance is called the variance. The population variance is a parameter and is represented by σ². As with the population mean, the population variance is not usually known and cannot be calculated. Just as the sample mean is used in place of the population mean, the sample variance is used in place of the population variance. The sample variance SD² is calculated from the sample values:

SD² = Σ(xᵢ − x̄)²/(n − 1)

Statistical theory demonstrates that if the divisor in the formula for SD² is (n − 1) rather than n, the sample variance is an unbiased estimator of the population variance. While the variance is used extensively in statistical calculations, the units of variance are squared units of the original observations. The square root of the variance has the same units as the original observations; the square roots of the sample and population variances are called the sample (SD) and population (σ) standard deviations.
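The (n − 1) divisor can be checked in a few lines. This sketch, with made-up observations, computes the sample variance by hand and confirms it against the standard library's `statistics.variance`, which uses the same unbiased divisor.

```python
import statistics

observations = [4.0, 5.0, 6.0, 7.0, 8.0]
n = len(observations)
xbar = statistics.fmean(observations)

# Sample variance with the unbiased (n - 1) divisor, by hand ...
sd2 = sum((x - xbar) ** 2 for x in observations) / (n - 1)
# ... which is exactly what statistics.variance computes.
assert sd2 == statistics.variance(observations)

sd = sd2 ** 0.5  # standard deviation, in the units of the observations
```

For these five values the squared deviates are 4, 1, 0, 1, and 4; their sum of 10 divided by (n − 1) = 4 gives a sample variance of 2.5.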
It was previously mentioned that most biologic observations appear to come from populations with normal distributions. By accepting this assumption of a normal distribution, further meaning can be given to the sample summary statistics (mean and SD) that have been calculated. This involves the use of the expression x̄ ± k × SD, where k = 1, 2, 3, and so forth. If the population from which the sample is taken is unimodal and roughly symmetric, then the bounds for k = 1, 2, and 3 encompass roughly 68%, 95%, and 99% of the sample and population members.
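The 68%/95%/99% rule can be verified empirically. This sketch draws a large sample from a hypothetical normal population (the mean of 100 and SD of 15 are arbitrary choices) and counts how much of the sample falls within k standard deviations of the sample mean.

```python
import random
import statistics

rng = random.Random(0)

# A large sample from a roughly normal population (mean 100, SD 15).
sample = [rng.gauss(100, 15) for _ in range(10_000)]
xbar = statistics.fmean(sample)
sd = statistics.stdev(sample)

# Fraction of the sample inside xbar +/- k * SD for k = 1, 2, 3;
# for a normal population these approach 0.68, 0.95, and 0.997.
coverage = {
    k: sum(xbar - k * sd <= x <= xbar + k * sd for x in sample) / len(sample)
    for k in (1, 2, 3)
}
```

With 10,000 draws the observed fractions land very close to the theoretical values, which is why x̄ ± 2 × SD is so often quoted as covering about 95% of a normal population.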
Hypotheses and Parameters
The researcher starts work with some intuitive feel for the phenomenon to be studied. Whether stated explicitly or not, this is the biologic hypothesis; it is a statement of experimental expectations to be accomplished by the use of experimental tools, instruments, or methods accessible to the research team. An example would be the hope that isoflurane would produce less myocardial ischemia than fentanyl; the experimental method might be the electrocardiography determination of ST segment changes. The biologic hypothesis of the researcher becomes a statistical hypothesis during research planning. The researcher measures quantities that can vary—variables such as heart rate or temperature or ST segment change—in samples from populations of interest. In a statistical hypothesis, statements are made about the relationship among parameters of one or more populations. (To restate, a parameter is a number describing a variable of a population; Greek letters are used to denote parameters.) The typical statistical hypothesis can be established in a somewhat rote fashion for every research project, regardless of the methods, materials, or goals. The most frequently used method of setting up the algebraic formulation of the statistical hypothesis is to create two mutually exclusive statements about some parameters of the study population (Table 9-3); estimates for the values for these parameters are acquired by sampling data. In the hypothetical example comparing isoflurane and fentanyl, φ1 and φ2 would represent the ST segment changes with isoflurane and with fentanyl. The null hypothesis is the hypothesis of no difference of ST segment changes between isoflurane and fentanyl. The alternative hypothesis is usually nondirectional, that is, either φ1 < φ2 or φ1 > φ2; this is known as a two-tail alternative hypothesis. This is a more conservative alternative hypothesis than assuming that the inequality can only be either less than or greater than.
Logic of Proof
One particular decision strategy is used most commonly to choose between the null and alternative hypothesis. The decision strategy is similar to a method of indirect proof used in mathematics called reductio ad absurdum (proof by contradiction). If a theorem cannot be proved directly, assume that it is not true; show that the falsity of this theorem will lead to contradictions and absurdities; thus, reject the original assumption of the falseness of the theorem. For statistics, the approach is to assume that the null hypothesis is true even though the goal of the experiment is to show that there is a difference. One examines the consequences of this assumption by examining the actual sample values obtained for the variable(s) of interest. This is done by calculating what is called a sample test statistic; sample test statistics are calculated from the sample numbers. Associated with a sample test statistic is a probability. One also chooses the level of significance; the level of significance is the probability level considered too low to warrant support of the null hypothesis being tested. If sample values are sufficiently unlikely to have occurred by chance (i.e., the probability of the sample test statistic is less than the chosen level of significance), the null hypothesis is rejected; otherwise, the null hypothesis is not rejected.
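The decision strategy can be made concrete with a small example. The z test below is a simplified stand-in for whatever test a study would actually use (a t test would replace it for small samples); the point is the logic itself: compute a sample test statistic, obtain the probability of a value at least that extreme under the null hypothesis, and reject the null hypothesis when that probability falls below the chosen level of significance. The data are invented.

```python
import math
import statistics

def two_sided_p(x, y):
    """Large-sample z test of the null hypothesis of equal means.

    A sketch only: compute a test statistic, then the probability of a
    value at least that extreme under the null hypothesis, using the
    standard normal distribution via the error function.
    """
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    z = (statistics.fmean(x) - statistics.fmean(y)) / se
    # P(|Z| >= |z|) for a standard normal Z.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

alpha = 0.05                        # chosen level of significance
x = [float(v) for v in range(30)]   # hypothetical group 1
y = [v + 20.0 for v in x]           # hypothetical group 2, clearly shifted
p = two_sided_p(x, y)
reject_null = p < alpha             # such data are very unlikely under the null
```

When the two groups have identical means, the same function returns a p value of 1.0 and the null hypothesis is not rejected.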
Table 9-3 Algebraic Statement of Statistical Hypotheses
Because statistics deals with probabilities, not certainties, there is a chance that the decision concerning the null hypothesis is erroneous. These errors are best displayed in table form (Table 9-4); condition 1 and condition 2 could be different drugs, two doses of the same drug, or different patient groups. Of the four possible outcomes, two decisions are clearly undesirable. The error of wrongly rejecting the null hypothesis (false-positive) is called the type I or alpha error. The experimenter should choose a probability value for alpha before collecting data; the experimenter decides how cautious to be against falsely claiming a difference. The most common choice for the value of alpha is 0.05. What are the consequences of choosing an alpha of 0.05? Assuming that there is, in fact, no difference between the two conditions and that the experiment is to be repeated 20 times, then during one of these experimental replications (5% of 20) a mistaken conclusion that there is a difference would be made. The probability of a type I error depends on the chosen level of significance and the existence or nonexistence of a difference between the two experimental conditions. The smaller the chosen alpha, the smaller will be the risk of a type I error.
Table 9-4 Errors in Hypothesis Testing: the Two-Way Truth Table
The error of failing to reject a false null hypothesis (false-negative) is called a type II or beta error. (The power of a test is 1 minus beta.) The probability of a type II error depends on four factors. First, and unfortunately, the smaller the alpha, the greater the chance of a false-negative conclusion; this fact keeps the experimenter from automatically choosing a very small alpha. Second, the more variability there is in the populations being compared, the greater the chance of a type II error. This is analogous to listening to a noisy radio broadcast; the more static there is, the harder it will be to discriminate between words. Third, increasing the number of subjects will lower the probability of a type II error. The fourth and most important factor is the magnitude of the difference between the two experimental conditions. The probability of a type II error goes from very high, when there is only a small difference, to extremely low, when the two conditions produce large differences in population parameters.
Sample Size Calculations
Formerly, researchers typically ignored the type II error in experimental design. The practical importance of worrying about type II errors reached the consciousness of the medical research community several decades ago. Some controlled clinical trials that claimed to find no advantage of new therapies compared with standard therapies lacked sufficient statistical power to discriminate between the experimental groups and would have missed an important therapeutic improvement. There are four options for decreasing type II error (increasing statistical power): (1) raise alpha, (2) reduce population variability, (3) make the sample bigger, and (4) make the difference between the conditions greater. Under most circumstances, only the sample size can be varied. Sample size planning has become an important part of research design for controlled clinical trials. Some published research still fails the test of adequate sample size planning.
The testing of hypotheses or significance testing has been the main focus of inferential statistics. Hypothesis testing allows the experimenter to use data from the sample to make inferences about the population. Statisticians have created formulas that use the values of the samples to calculate test statistics. Statisticians have also explored the properties of various theoretical probability distributions. Depending on the assumptions about how data are collected, the appropriate probability distribution is chosen as the source of critical values to accept or reject the null hypothesis. If the value of the test statistic calculated from the sample(s) is greater than the critical value, the null hypothesis is rejected. The critical value is chosen from the appropriate probability distribution after the magnitude of the type I error is specified.
There are parameters within the equation that generate any particular probability distribution; for the normal probability distribution, the parameters are µ and σ2. For the normal distribution, each set of values for µ and σ2 will generate a different shape for the bell-like normal curve. All probability distributions contain one or more parameters and can be plotted as curves; these parameters may be discrete (integer only) or continuous. Each value or combination of values for these parameters will create a different curve for the probability distribution being used. Thus, each probability distribution is actually a family of probability curves. Some additional parameters of theoretical probability distributions have been given the special name degrees of freedom and are represented by Latin letters such as m, n, and s.
Associated with the formula for computing a test statistic is a rule for assigning integer values to the one or more parameters called degrees of freedom. The number of degrees of freedom and the value for each degree of freedom depend on (1) the number of subjects, (2) the number of experimental groups, (3) the specifics of the statistical hypothesis, and (4) the type of statistical test. The correct curve of the probability distribution from which to obtain a critical value for comparison with the value of the test statistic is obtained with the values of one or more degrees of freedom.
To accept or reject the null hypothesis, the following steps are performed: (1) confirm that experimental data conform to the assumptions of the intended statistical test; (2) choose a significance level (alpha); (3) calculate the test statistic; (4) determine the degree(s) of freedom; (5) find the critical value for the chosen alpha and the degree(s) of freedom from the appropriate probability distribution; (6) if the test statistic exceeds the critical value, reject the null hypothesis; (7) if the test statistic does not exceed the critical value, do not reject the null hypothesis. There are general guidelines that relate the variable type and the experimental design to the choice of statistical test (Table 9-5).
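The seven steps just listed can be traced in a short sketch. This is an illustration only, with hypothetical heart-rate samples, an unpaired t test computed by hand with Python's standard library, and a critical value read from a t table as the text describes.

```python
# Illustrative sketch of the seven hypothesis-testing steps using
# hypothetical heart-rate data and an unpaired t test (pure Python).
import math
from statistics import mean, stdev

group1 = [66, 71, 68, 74, 70, 69, 72, 67]   # hypothetical heart rates
group2 = [75, 80, 78, 74, 79, 77, 81, 76]

alpha = 0.05                                 # step 2: choose alpha
n1, n2 = len(group1), len(group2)
# step 3: unpaired t statistic with a pooled variance
sp2 = ((n1 - 1) * stdev(group1) ** 2 + (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2)
t = (mean(group1) - mean(group2)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2                             # step 4: degrees of freedom
critical = 2.145                             # step 5: from a t table, alpha = 0.05, df = 14
reject = abs(t) > critical                   # steps 6 and 7
print(df, round(t, 2), reject)
```

Step 1, confirming that the data meet the assumptions of the test, is done by inspection of the data before any of this arithmetic.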
The other major area of statistical inference is the estimation of parameters with associated confidence intervals (CIs). In statistics, a CI is an interval estimate of a population parameter. A CI describes how well the population parameter is estimated by a particular sample statistic such as the mean. (The technical definition of the CI of the mean is more rigorous. A 95% CI implies that if the experiment were done over and over again, 95 of each 100 CIs would be expected to contain the true value of the mean.) CIs are a range of the following form: summary statistic ± (confidence factor) × (precision factor).
Table 9-5 When to Use What
The precision factor is derived from the sample itself, whereas the confidence factor is taken from a probability distribution and also depends on the specified confidence level chosen. For a sample of interval data taken from a normally distributed population for which CIs are to be calculated for x̄, the precision factor is called the standard error of the mean (SE) and is obtained by dividing the SD by the square root of the sample size: SE = SD/√n.
The confidence factors are the same as those used for the dispersion or spread of the sample and are obtained from the normal distribution. The CIs for confidence factors 1, 2, and 3 have roughly a 68%, 95%, and 99% chance of containing the population mean. Strictly speaking, when the SD must be estimated from sample values, the confidence factors should be taken from the t distribution, another probability distribution. These coefficients will be larger than those used previously. This is usually ignored if the sample size is reasonable; for example, n > 25. Even when the sample size is only five or greater, the use of the coefficients 1, 2, and 3 is simple and sufficiently accurate for quick mental calculations of CIs on parameter estimates.
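These quick mental calculations can be sketched as follows, with a hypothetical sample of interval data; the SE is the SD divided by the square root of the sample size, and the coefficients 1, 2, and 3 give the approximate 68%, 95%, and 99% CIs described above.

```python
# Sketch: SE of the mean and the quick 1-2-3 confidence intervals
# (hypothetical sample of interval data).
import math
from statistics import mean, stdev

sample = [98, 102, 100, 97, 103, 101, 99, 100, 104]
n = len(sample)
xbar = mean(sample)
se = stdev(sample) / math.sqrt(n)   # SE = SD / sqrt(n)

for k, level in [(1, "68%"), (2, "95%"), (3, "99%")]:
    print(f"{level} CI: ({xbar - k * se:.2f}, {xbar + k * se:.2f})")
```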
Almost all research reports include the use of SE, regardless of the probability distribution of the populations sampled. This use is a consequence of the central limit theorem, one of the most remarkable theorems in all of mathematics. The central limit theorem states that the SE can always be used, if the sample size is sufficiently large, to specify CIs around the sample mean. These CIs are calculated as previously described. This is true even if the population distribution is so different from normal that SD cannot be used to characterize the dispersion of the population members. Only rough guidelines can be given for the necessary sample size; for interval data, 25 and above is large enough and 4 and below is too small.
Although the SE is often discussed along with other descriptive statistics, it is really an inferential statistic. SE and SD are usually mentioned together because of their similarities of computation, but there is often confusion about their use in research reports in the form “mean ± number.” Some confusion results from the failure of the author to specify whether the number after the ± sign is the one or the other. More important, the choice between using SD and using SE has become controversial. Because SE is always less than SD, it has been argued that authors seek to deceive by using SE to make the data look better than they really are. The choice is actually simple. When describing the spread, scatter, or dispersion of the sample, use SD; when describing the precision with which the population mean is known, use SE.
Confidence Intervals on Proportions
Categorical binary data, also called enumeration data, provide counts of subject responses. Given a sample of subjects of whom some have a certain characteristic (e.g., death, female sex), a ratio of responders to the number of subjects can be easily calculated as p = x/n; this ratio or rate can be expressed as a decimal fraction or as a percentage. It should be clear that this is a measure of central location of binary data in the same way that µ was a measure of central location for continuous data. In the population from which the sample is taken, the ratio of responders to total subjects is a population parameter, denoted π; π is the measure of central location for the population. (This is not related to the geometric constant π = 3.14159….) As with other data types, π is usually not known but must be estimated from the sample. The sample ratio p is the best estimate of π. The probability of binary data is provided by the binomial distribution function.
Because the population is not generally known, the experimenter usually wishes to estimate π by the sample ratio p and to specify with what confidence π is known. If the sample is sufficiently large (n × p ≥ 5; n × (1 − p) ≥ 5), advantage is taken of the central limit theorem to derive an SE of the proportion: SE = √(p(1 − p)/n). This sample SE is exactly analogous to the sample SE of the mean for interval data, except that it is an SE of the proportion. Just as a 95% CI of the mean was calculated, so may a CI on the proportion be obtained. Larger samples will make the CI more precise.
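A sketch with invented counts, 30 responders among 100 subjects: the large-sample conditions are checked first, and the coefficient 2 gives an approximate 95% CI, just as for the mean.

```python
# Sketch: approximate 95% CI on a proportion via the central limit theorem
# (hypothetical counts; valid when n*p >= 5 and n*(1 - p) >= 5).
import math

x, n = 30, 100                      # 30 responders among 100 subjects
p = x / n
assert n * p >= 5 and n * (1 - p) >= 5
se = math.sqrt(p * (1 - p) / n)     # SE of the proportion
lo, hi = p - 2 * se, p + 2 * se     # approximate 95% CI
print(f"p = {p:.2f}, 95% CI ({lo:.3f}, {hi:.3f})")
```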
Statistical Tests and Models
Dichotomous Data Testing
In the experiment negating the value of mammary artery ligation, five of eight patients (62.5%) having ligation showed benefit while five of nine patients (55.6%) having sham surgery also had benefit.1 Is this difference real? This experiment sampled patients from two populations—those having the real procedure and those having the sham procedure. A variety of statistical techniques allow a comparison of the success rate. These include Fisher's exact test and (Pearson's) chi-square test. The chi-square test offers the advantage of being computationally simpler; it can also analyze contingency tables with more than two rows and two columns; however, certain assumptions of sample size and response rate are not met by this experiment. Fisher's exact test fails to reject the null hypothesis for these data.
The results of such experiments are often presented as rate ratios. The rate of improvement for the experimental group (5/8 = 62.5%) is divided by the rate of improvement for the control group (5/9 = 55.6%). A rate ratio of 1.00 (100%) fails to show a difference of benefit or harm between the two groups. In this example the rate ratio is 1.125. Thus, the experimental group had a 12.5% greater chance of improvement compared with the control group. A CI can be calculated for the rate ratio; in this example it is (0.40, 3.13), thus widely spread to either side of the rate ratio of no difference. (If such an experiment were performed now, the sample size would be much larger to ensure adequate statistical power.)
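The rate-ratio arithmetic from the ligation example can be reproduced directly (the CI calculation, which depends on the method chosen, is omitted here):

```python
# Sketch: rate ratio for the mammary artery ligation experiment
# described in the text (5/8 improved with ligation, 5/9 with sham).
a, n1 = 5, 8        # improved / total, ligation group
c, n2 = 5, 9        # improved / total, sham group

rate_exp = a / n1   # 62.5%
rate_ctl = c / n2   # 55.6%
rr = rate_exp / rate_ctl
print(round(rr, 3))
```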
Interval Data Testing
Parametric statistics are the usual choice in the analysis of interval data, both discrete and continuous. The purpose of such analysis is to test the hypothesis of a difference between population means. The population means are unknown and are estimated by the sample means. A typical example would be the comparison of the mean heart rates of patients receiving and not receiving atropine. Parametric test statistics have been developed by using the properties of the normal probability distribution and two related probability distributions, the t and the F distributions. In using such parametric methods, the assumption is made that the samples are drawn from normally distributed populations. The parametric test statistics that have been created for interval data all have the form of a ratio. In general terms, the numerator of this ratio is the variability of the means of the samples; the denominator of this ratio is the variability among all the members of the samples. These variabilities are similar to the variances developed for descriptive statistics. The test statistic is thus a ratio of variabilities or variances. All parametric test statistics are used in the same fashion; if the test statistic ratio becomes large, the null hypothesis of no difference is rejected. The critical values against which to compare the test statistic are taken from tables of the three relevant probability distributions (normal, t, or F). In hypothesis testing at least one of the population means is unknown, but the population variance(s) may or may not be known. Parametric statistics can be divided into two groups according to whether or not the population variances are known. If the population variance is known, the test statistic used is called the z score; critical values are obtained from the normal distribution. In most biomedical applications, the population variance is rarely known and the z score is little used.
An important advance in statistical inference came early in the 20th century with the creation of Student's t test statistic and the t distribution, which allowed the testing of hypotheses when the population variance is not known. The most common use of Student's t test is to compare the mean values of two populations. There are two types of t test. If each subject has two measurements taken, for example, one before (xi) and one after (yi) a drug, then a one-sample or paired t test procedure is used; each control measurement taken before drug administration is paired with a measurement in the same patient after drug administration. Of course, this is a self-control experiment. This pairing of measurements in the same patient reduces variability and increases statistical power. The difference di = xi − yi of each pair of values is calculated, and then the average difference d̄ is computed. In the formula for Student's t statistic, the numerator is d̄, whereas the denominator is the SE of d̄ (the SD of the differences divided by the square root of the number of pairs).
All t statistics are created in this way; the numerator is the difference of two means, whereas the denominator is the SE of the two means. If the difference between the two means is large compared with their variability, then the null hypothesis of no difference is rejected. The critical values for the t statistic are taken from the t probability distribution. The t distribution is symmetric and bell-shaped but more spread out than the normal distribution. The t distribution has a single integer parameter; for a paired t test, the value of this single degree of freedom is the sample size minus one. There can be some confusion about the use of the letter t. It refers both to the value of the test statistic calculated by the formula and to the critical value from the theoretical probability distribution. The critical t value is determined by looking in a t table after a significance level is chosen and the degree of freedom is computed.
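A paired t test can be sketched with hypothetical before/after measurements; the numerator is d̄ and the denominator is its SE, exactly as described above.

```python
# Sketch: paired t test on hypothetical before/after measurements.
import math
from statistics import mean, stdev

before = [80, 76, 84, 79, 82, 77, 81, 78]
after = [72, 70, 75, 74, 76, 71, 74, 73]

d = [x - y for x, y in zip(before, after)]   # per-subject differences
dbar = mean(d)
se_dbar = stdev(d) / math.sqrt(len(d))       # SE of the mean difference
t = dbar / se_dbar
df = len(d) - 1                              # sample size minus one
print(df, round(t, 2))
```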
More commonly, measurements are taken on two separate groups of subjects. For example, one group receives blood pressure treatment with sample values xi, whereas no treatment is given to a control group with sample values yi. The number of subjects in each group might or might not be identical; regardless of this, in no sense is an individual measurement in the first group matched or paired with a specific measurement in the second group. An unpaired or two-sample t test is used to compare the means of the two groups. The numerator of the t statistic is x̄ − ȳ. The denominator is a pooled SE of the difference of the two means, formed as a weighted average of the SDs of each sample, so that the test statistic again has the form of a difference of means divided by its SE. The degrees of freedom for an unpaired t test are calculated as the sum of the subjects of the two groups minus two. As with the paired t test, if the t ratio becomes large, the null hypothesis is rejected.
Analysis of Variance
Experiments in anesthesia, whether they are with humans or with animals, may not be limited to one or two groups of data for each variable. It is very common to follow a variable longitudinally; heart rate, for example, might be measured five times before and during anesthetic induction. These are also called repeated measurement experiments; the experimenter will wish to compare changes between the initial heart rate measurement and those obtained during induction. The experimental design might also include several groups receiving different induction drugs; for example, comparing heart rate across groups immediately after laryngoscopy. Researchers have mistakenly handled these analysis problems with just the t test. If heart rate is collected five times, these collection times could be labeled A, B, C, D, and E. Then A could be compared with B, C, D, and E; B could be compared with C, D, and E; and so forth. The total of possible pairings is ten; thus, ten paired t tests could be calculated for all the possible pairings of A, B, C, D, and E. A similar approach can be used for comparing more than two groups for unpaired data.
The use of t tests in this fashion is inappropriate. In testing a statistical hypothesis, the experimenter sets the level of type I error; this is usually chosen to be 0.05. When using many t tests, as in the example given earlier, the error rate for the whole set of t tests is much higher than 0.05, even though the type I error is set at 0.05 for each individual comparison. In fact, the type I error rate for all t tests taken together, that is, the chance of finding at least one of the multiple t test statistics significant merely by chance, is given by the formula α′ = 1 − 0.95^k, where k is the number of tests. If 13 t tests are performed (k = 13), the real error rate is 49%. Applying t tests over and over again to all the possible pairings of a variable will misleadingly identify statistical significance when in fact there is none.
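The inflation of the type I error rate is easy to verify numerically with the formula above:

```python
# Sketch: familywise type I error when k independent tests are each
# run at alpha = 0.05, using alpha' = 1 - 0.95**k.
for k in (1, 5, 10, 13):
    inflated = 1 - 0.95 ** k
    print(k, round(inflated, 2))
```

With 13 tests, the familywise rate reaches about 49%, as stated in the text.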
The most versatile approach for handling comparisons of means between more than two groups or between several measurements in the same group is called analysis of variance and is frequently cited by the acronym ANOVA. Analysis of variance consists of rules for creating test statistics on means when there are more than two groups. These test statistics are called F ratios, after Ronald Fisher; the critical values for the F test statistic are taken from the F probability distribution that Fisher derived.
Suppose that data for three groups are obtained. What can be said about the mean values of the three target populations? The F test is actually asking several questions simultaneously: is group 1 different from group 2; is group 2 different from group 3; and is group 1 different from group 3? As with the t test, the F test statistic is a ratio; in general terms, the numerator expresses the variability of the mean values of the three groups, whereas the denominator expresses the average variability or difference of each sample value from the mean of all sample values. The formulas to create the test statistic are computationally elegant but are rather hard to appreciate intuitively. The F statistic has two degrees of freedom, denoted m and n; the value of m is a function of the number of experimental groups; the value of n is a function of the number of subjects in all experimental groups. The analysis of multigroup data is not necessarily finished after the ANOVA is calculated. If the null hypothesis is rejected and it is accepted that there are differences among the groups tested, how can it be decided where the differences are? A variety of techniques are available to make what are called multiple comparisons after the ANOVA test is performed.
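A one-way ANOVA F ratio can be sketched for three hypothetical groups; the numerator is the between-group variability of the means and the denominator is the within-group variability, per the description above.

```python
# Sketch: one-way ANOVA F ratio for three hypothetical groups.
from statistics import mean

groups = [
    [70, 72, 68, 71, 69],
    [75, 78, 76, 74, 77],
    [69, 71, 70, 68, 72],
]
grand = mean([v for g in groups for v in g])   # grand mean of all values
k = len(groups)
n = sum(len(g) for g in groups)

ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

f_ratio = (ss_between / (k - 1)) / (ss_within / (n - k))
print(k - 1, n - k, round(f_ratio, 2))   # degrees of freedom m, n and F
```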
Robustness and Nonparametric Tests
Most statistical tests depend on certain assumptions about the nature of the distribution of values in the underlying populations from which experimental samples are taken. For the parametric statistics, that is, t tests and analysis of variance, it is assumed that the populations follow the normal distribution. However, for some data, experience or historical reasons suggest that these assumptions of a normal distribution do not hold; some examples include proportions, percentages, and response times. What should the experimenter do if he or she fears that the data are not normally distributed?
The experimenter might choose to ignore the problem of nonnormal data and inhomogeneity of variance, hoping that everything will work out. Such insouciance is actually a very practical and reasonable approach to the problem. Parametric statistics are called robust statistics; they stand up to much adversity. To a statistician, robustness implies that the magnitude of type I errors is not seriously affected by ill-conditioned data. Parametric statistics are sufficiently robust that the accuracy of decisions reached by means of t tests and analysis of variance remains very credible, even for moderately severe departures from the assumptions.
Another possibility would be to use statistics that do not require any assumptions about probability distributions of the populations. Such statistics are known as nonparametric tests; they can be used whenever there is very serious concern about the shape of the data. Nonparametric statistics are also the tests of choice for ordinal data. The basic concept behind nonparametric statistics is the ability to rank or order the observations; nonparametric tests are also called order statistics.
Most nonparametric statistics still require the use of theoretical probability distributions; the critical values that must be exceeded by the test statistic are taken from the binomial, normal, and chi-square distributions, depending on the nonparametric test being used. The nonparametric sign test, Mann-Whitney rank sum test, and Kruskal-Wallis one-way analysis of variance are analogous to the paired t test, unpaired t test, and one-way analysis of variance, respectively. The currently available nonparametric tests are not used more commonly because they do not adapt well to complex statistical models and because they are less able than parametric tests to distinguish between the null and alternative hypotheses if the data are, in fact, normally distributed.
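The ordering idea behind these tests can be sketched with the rank sum that underlies the Mann-Whitney test. The data here are hypothetical response times without ties; real implementations also handle ties and look up critical values.

```python
# Sketch: the rank-sum construction behind the Mann-Whitney test
# (hypothetical response times, no ties).
group1 = [1.2, 3.4, 2.2, 5.1]
group2 = [6.3, 7.0, 4.8, 8.2]

pooled = sorted(group1 + group2)
rank = {v: i + 1 for i, v in enumerate(pooled)}   # rank each pooled value
w = sum(rank[v] for v in group1)                  # rank sum of group 1
u = w - len(group1) * (len(group1) + 1) / 2       # Mann-Whitney U statistic
print(w, u)
```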
Often the goal of an experiment is to predict the value of one characteristic from knowledge of another characteristic; the most commonly used technique for this purpose is regression analysis. Experiments for this purpose capture data pairs (x, y); these data should be displayed in a scatter plot. In the simplest type, a straight line (linear relationship) is assumed between two variables; one (y), the response or dependent variable, is considered a function of the other (x), the explanatory or independent variable. This is expressed as the linear regression equation y = a + bx; the parameters of the regression equation are a and b. The parameter b is the slope of the straight line relating x and y; for each 1-unit change in x, there is a b-unit change in y. The parameter a is the intercept (value of y when x equals 0). Estimates of the parameters are obtained by the method of least squares, which chooses values of a and b that minimize the squared vertical distances from the data pairs to the fitted line. The parameter of greatest interest in regression is usually the slope, especially whether the slope is nonzero; a zero-valued slope implies that x and y are not related. A t test statistic is used to check the statistical significance of the slope.
With one additional assumption, the same (x, y) data pairs are usually also subjected to correlation analysis. The correlation coefficient r is a measure of the covariation of x and y; r ranges from −1 to 1. A zero-valued r indicates no correlation.
The test of the statistical significance of r is equivalent to the test for the significance of the regression slope b. The squared value of r or coefficient of determination (r2) has a very useful interpretation: the fraction of the variation of y explained by the variation of x.
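Least-squares slope, intercept, and the correlation coefficient can be computed from first principles. The data pairs below are hypothetical; the formulas (b = Sxy/Sxx, a = ȳ − b·x̄, r = Sxy/√(Sxx·Syy)) are the standard ones.

```python
# Sketch: simple linear regression and correlation for hypothetical (x, y) pairs.
from statistics import mean

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

xbar, ybar = mean(xs), mean(ys)
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b = sxy / sxx               # slope: b-unit change in y per 1-unit change in x
a = ybar - b * xbar         # intercept: value of y when x = 0
r = sxy / (sxx * syy) ** 0.5
print(round(b, 3), round(a, 3), round(r ** 2, 3))
```

Here r² is the fraction of the variation of y explained by the variation of x, as described above.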
Regression methods can be extended to data sets in which one response variable is thought to be linearly related to many explanatory variables; this is called multiple variable linear regression. This regression includes methods for choosing which of the explanatory variables have a statistically significant regression slope. Other extensions of regression include the typically sigmoidally shaped regression of a binary outcome (e.g., movement) versus anesthetic dose. There are multiple methods for regression of binary outcomes, the most common being logistic regression.
A researcher or reader should not be satisfied to see only the statistical results of regression and correlation. The statistician Anscombe4 created four hypothetical data sets to illustrate the importance of visual inspection of data. Each data set has 11 paired (x, y) observations (Fig. 9-1). For the data (x2, y2), the relationship between x and y is curvilinear; for (x4, y4), there is no relationship between x and y; for (x3, y3), there is a near perfect correlation between x and y except for one (x, y) pair. All regression and correlation values of the four data sets including means, SDs, slopes, intercepts, standard errors of regression parameters, statistical significance of regression parameters, and correlation coefficients are equal. Yet, these are clearly four different patterns that can only be detected by visual inspection. Even this simplest form of linear regression is based on the strong assumption of an underlying linear relationship between x and y; failure of that assumption leads to erroneous statistical inference.
Systematic Reviews and Meta-Analyses
Reports using a new type of research method, the systematic review (SR) with an accompanying meta-analysis (MA), have become commonplace over the last 25 years in anesthesia journals.5 (As of November 2007, a literature search for “(‘systematic review’ OR meta-analysis) AND anesthesia” in PubMed at the National Library of Medicine returned 334 citations, out of a total of 36,026 citations for all SRs or MAs.) In systematic reviews, a focused question drives the research, for example, (1) Transient neurologic symptoms (TNS) following spinal anaesthesia with lidocaine versus other local anaesthetics6 or (2) Ventilation with lower tidal volumes versus traditional tidal volumes in adults for acute lung injury and acute respiratory distress syndrome.7 These titles reveal some of the research design of a systematic review. There is a population of interest: (1) patients having spinal anesthesia and (2) adults (with) acute lung injury and acute respiratory distress syndrome. There is a comparison of two interventions: (1) lidocaine versus other local anaesthetics and (2) ventilation with lower tidal volumes versus traditional tidal volumes. There is an outcome for choosing success or failure of the interventions: (1) occurrence of TNS and (2) 28-day mortality (listed in text).
Figure 9-1. Four scatter plots from the Anscombe data sets.4 For each data set, n = 11, x̄ = 9.00, SDx = 3.31, ȳ = 7.50, SDy = 2.03, y = 3.00 + 0.50x, SEa = 1.12, SEb = 0.12, r2 = 0.67, and so forth. All statistics are equal up to the fourth decimal place.
To answer the experimental question, data are obtained from controlled trials (usually randomized) already in the medical literature rather than from newly conducted clinical trials; the basic unit of analysis of this observational research is the published study. The researchers, also called the review authors, proceed through a structured protocol, which includes in part: (1) choice of study inclusion/exclusion criteria, (2) explicitly defined literature searching, (3) abstraction of data from included studies, (4) appraisal of data quality, (5) systematic pooling of data, and (6) discussion of inferences. This structured protocol is intended to minimize bias. Even randomized controlled trials may have sources of bias such as (1) selection bias: systematic differences between the patients receiving each intervention; (2) performance bias: systematic differences in care being given to study patients other than the preplanned interventions being evaluated; (3) attrition bias: systematic differences in the withdrawal of patients from each of the two intervention groups; and (4) detection bias: systematic differences in the ascertainment and recording of outcomes. The main focus of bias detection in the trials incorporated into a SR is (1) the randomization process, (2) the concealment of random allocation, (3) the use of blinding, and (4) the reporting/analysis of dropouts.8
Binary outcomes (yes/no, alive/dead, presence/absence) within a study are usually compared by the relative risk (rate ratio) statistic. If there is sufficient clinical similarity among the included studies, a summary relative risk of the overall effect of the comparison treatments is estimated by meta-analysis; meta-analysis is a set of statistical techniques for combining results from different studies.8 The calculations for the statistical analyses of a meta-analysis are unfamiliar to most, but are not difficult. The results of a meta-analysis are usually presented in a figure called a forest plot (Fig. 9-2). The far left column identifies the included studies and the observed data. The horizontal lines and diamond shapes are graphical representations of individual study relative risk and summary relative risk, respectively; the far right column of the figure lists the relative risks with 95% CIs for the individual studies and the summary statistics. There are also descriptive and inferential statistics concerning the statistical heterogeneity of the meta-analysis and the significance of the summary statistics.
An examination of Figure 9-2 shows that many of the individual studies (11 of 14) had wide, nonsignificant confidence intervals that touch or cross the relative risk of identity (RR = 1). However, the overall relative risk calculated from all studies was 7.16 with a 95% CI [4.2, 12.75]. The power of summary statistics to combine evidence is clear. The review authors concluded: “Lidocaine can cause transient neurologic symptoms (TNS) in every seventh patient who receives spinal anesthesia. The relative risk of developing TNS is about seven times higher for lidocaine than for bupivacaine, prilocaine, and procaine. These painful symptoms disappear completely by the tenth postoperative day.”6
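The pooling behind a summary relative risk can be sketched with a deliberately simplified fixed-effect model: log relative risks from each study are combined with inverse-variance weights. The study counts below are invented for illustration, and production software such as RevMan implements more elaborate methods.

```python
# Sketch: fixed-effect meta-analysis of relative risks from three
# hypothetical trials, each given as (events_exp, n_exp, events_ctl, n_ctl).
import math

studies = [(10, 100, 4, 100), (8, 80, 3, 80), (15, 120, 5, 120)]

num = den = 0.0
for a, n1, c, n2 in studies:
    log_rr = math.log((a / n1) / (c / n2))
    var = 1 / a - 1 / n1 + 1 / c - 1 / n2   # approximate variance of log RR
    w = 1 / var                              # inverse-variance weight
    num += w * log_rr
    den += w

summary_rr = math.exp(num / den)
se = math.sqrt(1 / den)
lo = math.exp(num / den - 1.96 * se)         # 95% CI, lower limit
hi = math.exp(num / den + 1.96 * se)         # 95% CI, upper limit
print(round(summary_rr, 2), round(lo, 2), round(hi, 2))
```

As in the forest plot, the summary estimate is more precise (narrower CI) than most of the individual studies.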
The production of SRs comes from several sources. Many come from the individual initiative of researchers who publish their results as stand-alone reports in the journals of medicine and anesthesia. The American Society of Anesthesiologists has
developed a process for the creation of practice parameters that includes, among other things, a variant form of SR. The most prominent proponent of SRs is the Cochrane Collaboration, Oxford, United Kingdom. “The Cochrane Collaboration is an international not-for-profit and independent organization, dedicated to making up-to-date, accurate information about the effects of healthcare readily available worldwide. It produces and disseminates systematic reviews of healthcare interventions and promotes the search for evidence in the form of clinical trials and other studies of interventions. The Cochrane Collaboration was founded in 1993 and named after the British epidemiologist, Archie Cochrane.”b
Figure 9-2. Forest plot. (Modified from Graph 02/01 in Zaric D, Christiansen C, Pace NL, Punjasawadwong Y: Transient neurologic symptoms (TNS) following spinal anaesthesia with lidocaine versus other local anaesthetics (Cochrane Review). In: The Cochrane Library, Issue 3. Chichester, UK, John Wiley & Sons, Ltd., 2004. Copyright Cochrane Library, reproduced with permission.)
There are more than 50 collaborative review groups that provide the editorial control and supervision of SRs; one of these, located in Copenhagen,9 “… produce(s) and disseminate systematic reviews of healthcare interventions in anesthesia, perioperative medicine, intensive care medicine, emergency
medicine, prehospital medicine and resuscitation.”c The Cochrane Collaboration has extensive documentation and tutorials available electronically explaining the techniques of SRs and MA; for the creation of Cochrane SRs, software (titled RevMan) for the management of data and for the MA is freely available and downloadable from the Cochrane Collaboration Web site.
Interpretation of Results
Scientific studies do not end with the statistical test. The experimenter must offer an opinion on the generalizability of his or her work to the rest of the world. Even if there is a statistically significant difference, the experimenter must decide if this difference is medically or physiologically important. Statistical significance does not always equate with biologic relevance. The questions an experimenter should ask about the interpretation of results are highly dependent on the specifics of the experiment. First, even small, clinically unimportant differences between groups can be detected if the sample size is sufficiently large. On the other hand, if the sample size is small, one must always worry that identified or unidentified confounding variables may explain any difference; as the sample size decreases, randomization is less successful in assuring homogeneous groups. Second, if the experimental groups are given three or more doses of a drug, do the results suggest a steadily increasing or decreasing dose-response relationship? Suppose the observed effect for an intermediate dose is either much higher or much lower than that for both the highest and lowest dose; a dose-response relationship may exist, but some skepticism about the experimental methods is warranted. Third, for clinical studies comparing different drugs, devices, and operations on patient outcome, are the patients, clinical care, and studied therapies sufficiently similar to those provided at other locations to be of interest to a wide group of practitioners? This is the distinction between efficacy—does it work under the best (research) circumstances—and effectiveness—does it work under the typical circumstances of routine clinical care?
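The first point can be illustrated numerically. The sketch below (hypothetical numbers, not drawn from any study in this chapter) uses a two-sample z test to show that the same clinically trivial mean difference is nonsignificant with 100 patients per group yet "significant" with 10,000 per group, purely because of sample size.

```python
import math

def two_sample_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z test with equal group sizes
    and a common standard deviation (normal approximation, large n)."""
    se = sd * math.sqrt(2.0 / n_per_group)          # SE of the difference
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2.0))       # two-sided tail area

# A 0.5 unit mean difference with SD = 10 (e.g., half a mmHg of blood
# pressure): trivial clinically, but its p-value depends entirely on n.
p_small_trial = two_sample_p(0.5, 10, 100)     # nonsignificant (p > 0.05)
p_huge_trial = two_sample_p(0.5, 10, 10_000)   # significant (p < 0.05)
```

The effect size is identical in both calls; only the standard error shrinks as n grows. This is why a significant p-value alone says nothing about clinical importance.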
Finally, in comparing alternative therapies, the confidence that a claim for a superior therapy is true depends on the study design. The strength of the evidence concerning efficacy will be least for an anecdotal case report; next in importance will be a retrospective study, then a prospective series of patients compared with historical controls, and finally a randomized, controlled clinical trial. The greatest strength for a therapeutic claim is a series of randomized, controlled clinical trials confirming the same hypothesis. There is now considerable enthusiasm for the formal synthesis and combining of results from two or more trials in a systematic review.
Guidelines for Reading Journal Articles
Thousands of journal articles relevant to anesthesia are published each year. No one can read them all. How should the clinician determine which articles are useful? All that is possible is to learn to rapidly skip over most articles and concentrate on the few selected for their importance to the reader. Those few should be chosen according to their relevance and credibility. Relevance is determined by the specifics of one's anesthetic practice. Credibility is a function of the merits of the research methods, the experimental design, and the statistical analysis; the more proficient one's statistical skills, the more rapidly one can accept or reject the credibility of a research article.
Six easily remembered appraisal criteria for clinical studies can be fashioned from the words WHY, HOW, WHO, WHAT, HOW MANY, and SO WHAT: (1) WHY: Is the biologic hypothesis clearly stated? (2) HOW: What is the research design? (3) WHO: Is the target population clearly defined? (4) WHAT: How was the therapy administered and the data collected? (5) HOW MANY: Are the test statistics convincing? (6) SO WHAT: Is it clinically relevant to my patients? Although the statistical knowledge of most physicians is limited, these skills of critical appraisal of the literature can be learned and can tremendously increase the efficiency and benefit of journal reading.
Accompanying the exponential growth of medical information since World War II has been the creation of a wealth of biostatistical knowledge. Textbooks oriented toward medical statistics and with expositions of basic, intermediate, and advanced statistics abound.10,11,12,13,14,15 There are new journals of biomedical statistics, including Clinical Trials, Statistics in Medicine, and Statistical Methods in Medical Research, whose audiences are both statisticians and biomedical researchers. Some medical journals, for example, the British Medical Journal, regularly publish expositions of both basic and newer advanced statistical methods. Extensive Internet resources including electronic textbooks of basic statistical methods, online statistical calculators, standard data sets, reviews of statistical software, and so on can be easily found.
Statistics and Anesthesia
One intent of this chapter is to present the basic scope of support that the discipline of statistics can provide to anesthesia research. Journals of anesthesia now include many newer methods that have not been described. To mention just four: (1) studies of the pharmacokinetics and pharmacodynamics of a drug or a combination of drugs typically use linear mixed effects or generalized linear mixed effects models, (2) techniques of survival analysis are applied to hospital discharge times or postoperative morbidity/mortality outcomes, (3) methods of interim analysis or sequential trial design are used in randomized controlled trials to stop futile or dangerous treatments, and (4) propensity analysis reduces the possible biases in epidemiology research.
Although an intuitive understanding of certain basic principles is emphasized, these basic principles are not necessarily simple and have been developed by statisticians with great mathematical rigor. Academic anesthesia needs more workers to immerse themselves in these statistical fundamentals. Having done so, these statistically knowledgeable academic anesthesiologists will be prepared to improve their own research projects, to assist their colleagues in research, to efficiently seek consultation from the professional statistician, to strengthen the editorial review of journal articles, and to expound to the clinical reader the whys and wherefores of statistics. The clinical reader also needs to expend his or her own effort to acquire some basic statistical skills. Journals are increasingly difficult to understand without some basic statistical understanding. Some clinical problems can be best understood with a perspective based on probability. Finally, understanding principles of experimental design can prevent premature acceptance of new therapies based on faulty studies.
References
1. Cobb LA, Thomas GI, Dillard DH, et al: An evaluation of internal-mammary-artery ligation by a double-blind technic. N Engl J Med 1959; 260: 1115
2. Sacks H, Chalmers TC, Smith HJ: Randomized versus historical controls for clinical trials. Am J Med 1982; 72: 233
3. Schulz KF, Chalmers I, Hayes RJ, et al: Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995; 273: 408
4. Anscombe FJ: Graphs in statistical analysis. Am Stat 1973; 27: 17
5. Carlisle JB: Systematic reviews: How they work and how to use them. Anaesthesia 2007; 62: 702
6. Zaric D, Christiansen C, Pace NL, et al: Transient neurologic symptoms (TNS) following spinal anaesthesia with lidocaine versus other local anaesthetics. Cochrane Database Syst Rev 2005, Oct 19; CD003006
7. Petrucci N, Iacovelli W: Lung protective ventilation strategy for the acute respiratory distress syndrome. Cochrane Database Syst Rev 2007, Jul 18; CD003844
8. Pace N: The meta-analysis of a systematic review, Evidence-Based Anaesthesia and Intensive Care. Edited by Møller A, Pedersen T, Cracknell J. New York, Cambridge University Press, 2006, pp 46
9. Pedersen T, Møller A: The Cochrane Collaboration and the Cochrane Anaesthesia Review Group, Evidence-Based Anaesthesia and Intensive Care. Edited by Møller A, Pedersen T, Cracknell J. New York, Cambridge University Press, 2006, pp 77
10. Altman DG, Trevor B, Gardner MJ, et al: Statistics with Confidence: Confidence Intervals and Statistical Guidelines. New York, John Wiley & Sons, 2000
11. Campbell MJ, Machin D: Medical Statistics: A Commonsense Approach. New York, John Wiley & Sons, 1999
12. Dawson B, Trapp RG, Trapp R: Basic & Clinical Biostatistics. New York, McGraw-Hill Medical, 2004
13. Riffenburgh RH: Statistics in Medicine. San Diego, Academic Press, 2005
14. Glantz SA: Primer of Biostatistics. New York, McGraw-Hill Medical, 2005
15. Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. Edited by Guyatt G, Rennie D. Chicago, American Medical Association, 2002
Editors: Barash, Paul G.; Cullen, Bruce F.; Stoelting, Robert K.; Cahalan, Michael K.; Stock, M. Christine