AAOS Comprehensive Orthopaedic Review

Section 2 - General Knowledge

Chapter 23. Statistics: Practical Applications for Orthopaedics

I. Presentation of Study Results

A. Terminology


1. Absolute risk increase—The difference in the absolute risk (percentage or proportion of patients with an outcome) between exposed and unexposed, or experimental and control group patients; typically used with regard to harmful exposure.


2. Absolute risk reduction—The difference in the absolute risk (percentage or proportion of patients with an outcome) between exposed (experimental event rate) and unexposed (control event rate); used only with regard to a beneficial exposure or intervention.


3. Bayesian analysis—An analysis that starts with a particular probability of an event (the prior probability) and incorporates new information to generate a revised probability (a posterior probability).


4. Blind (or blinded or masked)—Type of study/assessment in which the participant of interest is unaware of whether patients have been assigned to the experimental group or the control group. Patients, clinicians, those monitoring outcomes, those adjudicating outcomes, data analysts, and those writing the paper all can be blinded or masked. To avoid confusion, the term masked is preferred in studies in which vision loss of patients is an outcome of interest.


5. Dichotomous outcome—A yes or no outcome (ie, an outcome that either happens or does not happen), such as reoperation, infection, or death.


6. Dichotomous variable—A variable that can take one of two values, such as male or female, dead or alive, infection present or not present.


7. Effect size—The difference in outcome between the intervention group and the control group divided by some measure of the variability, typically the standard deviation.


8. Hawthorne effect—Human behavior that is changed when participants are aware that their behavior is being observed. In a therapeutic study, the Hawthorne effect might affect results such that the treatment is deemed effective when it actually is ineffective. In a diagnostic study, the Hawthorne effect might be responsible if the patient does not have the target condition but the test suggests the patient does.


9. Intention-to-treat principle, or intention-to-treat analysis—Analyzing patient outcomes based on the group into which they were randomized, regardless of whether the patient actually received the planned intervention. This type of analysis preserves the power of randomization so that important unknown factors that influence outcome are likely to be distributed equally in each comparison group.


10. Meta-analysis—An overview that incorporates a quantitative strategy for combining the results of multiple studies into a single pooled or summary estimate.


11. Null hypothesis—In the hypothesis-testing framework, a null hypothesis is the starting hypothesis the statistical test is designed to consider and, possibly, reject.


12. Number needed to harm—The number of patients who would need to be treated over a specific period of time before one adverse side effect of the treatment would be expected to occur. The number needed to harm is the inverse of the absolute risk increase.


13. Number needed to treat—The number of patients who would need to be treated over a specific period of time to prevent one bad outcome. When discussing number needed to treat, it is important to specify the treatment, its duration, and the bad outcome being prevented. The number needed to treat is the inverse of the absolute risk reduction.


14. Odds—The ratio of probability of occurrence to probability of nonoccurrence of an event.



[Figure 1. Presentation of study results. A hypothetical example of a study evaluating infection rates in 200 patients with a treatment and control group is presented. A 2 × 2 table is constructed, and multiple approaches to describing the results are presented.]

15. Odds ratio—The ratio of the odds of an event occurring in an exposed group to the odds of the same event occurring in a group that is not exposed.


16. Relative risk—The ratio of the risk of an event occurring among an exposed population to the risk of it occurring among the unexposed.


17. Relative risk reduction—An estimate of the proportion of baseline risk that is removed by the therapy; calculated by dividing the absolute risk reduction by the absolute risk in the control group.


18. Reliability—Refers to consistency or reproducibility of data.


19. Treatment effect—The results of comparative clinical studies can be expressed using various treatment effect measures. Examples are absolute risk reduction, relative risk reduction, odds ratio, number needed to treat, and effect size. The appropriate measure to use to express a treatment effect and the appropriate calculation method to use—whether probabilities, means, or medians—depends on the type of outcome variable used to measure health outcomes. For example, relative risk reduction is used for dichotomous variables, whereas effect sizes are normally used for continuous variables.


20. Continuous variable—A variable with a potentially infinite number of possible values. Examples include range of motion, blood pressure.


21. Categorical variable—A variable with possible values in several different categories. An example would be types of fractures.


B. Figure 1 illustrates a typical presentation of study results.
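
The measures defined in the terminology list can be computed directly from a 2 × 2 table. The following is a minimal Python sketch, not the Figure 1 data: the counts (10 of 100 infections with treatment, 20 of 100 with control) and the function name are hypothetical and chosen only for illustration.

```python
def two_by_two_measures(events_trt, n_trt, events_ctl, n_ctl):
    """Summarize a dichotomous outcome from a two-group study."""
    eer = events_trt / n_trt          # experimental event rate (absolute risk)
    cer = events_ctl / n_ctl          # control event rate (absolute risk)
    arr = cer - eer                   # absolute risk reduction
    odds_trt = eer / (1 - eer)        # odds = P(event) / P(no event)
    odds_ctl = cer / (1 - cer)
    return {
        "relative_risk": eer / cer,
        "absolute_risk_reduction": arr,
        "relative_risk_reduction": arr / cer,
        "odds_ratio": odds_trt / odds_ctl,
        # NNT is the inverse of the ARR; it is undefined when the ARR is zero
        "number_needed_to_treat": 1 / arr if arr != 0 else float("inf"),
    }

# Hypothetical counts, for illustration only
results = two_by_two_measures(10, 100, 20, 100)
```

With these assumed counts the relative risk and relative risk reduction are both 0.5, the absolute risk reduction is 0.10, and about 10 patients would need to be treated to prevent one infection.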


C. Bias in research


1. Definition—Bias is a systematic tendency to produce an outcome that differs from the underlying truth.


a. Channeling effect, or channeling bias—The tendency of clinicians to prescribe treatment based on a patient's prognosis. As a result of the behavior, comparisons between treated and untreated patients will yield a biased estimate of treatment effect.


b. Data completeness bias—May occur when an information system (eg, a hospital database) is used to enter data directly for the treatment group but manual data entry is used for the control group.


c. Detection bias, or surveillance bias—The tendency to look more carefully for an outcome in one of two or more groups being compared.


d. Incorporation bias—When investigators study a diagnostic test that incorporates features of the target outcome.


e. Interviewer bias—Greater probing or any subjectivity that affects how the interview is conducted by an interviewer in one of two or more groups being compared.


f. Publication bias—When the publication of research results depends on the direction of the study results and whether they are statistically significant.


g. Recall bias—When patients who experience an adverse outcome have a different likelihood of recalling an exposure than do the patients who do not have an adverse outcome, independent of the true extent of exposure.


h. Surveillance bias—See detection bias, above.


i. Verification bias—When the results of a diagnostic test influence whether patients undergo confirmation by the reference standard.


2. Limiting bias—In a clinical research study, bias can be limited by randomization, concealment of treatment allocation, and blinding.


a. Random allocation (randomization)


i. Random allocation is the allocation of individuals to groups by chance, usually by using a table of random numbers. Similarly, a random sample is one derived by selecting sampling units (eg, individual patients) such that each unit has an independent and fixed (generally equal) chance of selection.


ii. Random allocation should not be confused with systematic allocation (eg, on even and odd days of the month) or allocation at the convenience or discretion of the investigator.


b. Concealment of treatment allocation


i. Allocation is concealed when individuals cannot determine the treatment allocation of the next patient to be enrolled in a clinical trial, such as when remote telephone- or Internet-based randomization systems are used to assign treatment.


ii. Use of even/odd days or hospital chart numbers to allocate patients to a treatment group is not considered concealed allocation.


c. Blinding (see definition in I.A.4).
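
The contrast between random and systematic allocation can be sketched in Python. This is an illustration under stated assumptions, not a trial-ready system: the function names are hypothetical, a software random number generator stands in for a table of random numbers, and the seed is used only to make the demonstration reproducible. In a real trial the allocation list would be held remotely so that enrolling clinicians cannot see the next assignment.

```python
import random

def simple_randomization(n_patients, seed=None):
    """Assign each patient to a group by an independent, equal-chance draw."""
    rng = random.Random(seed)  # seeded only so the demo can be reproduced
    return [rng.choice(["experimental", "control"]) for _ in range(n_patients)]

def alternate_day_allocation(enrollment_days):
    """Systematic (NOT random) allocation by odd or even day of the month.

    Anyone who knows the date can predict the assignment, so this scheme
    is neither random nor concealed.
    """
    return ["experimental" if day % 2 else "control" for day in enrollment_days]

allocation = simple_randomization(20, seed=2024)
```

Simple randomization does not guarantee equal group sizes in small trials; it guarantees only that each assignment is unpredictable.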

II. Basic Statistical Concepts

A. A statistician should be consulted when a study or analysis of a study is planned.


B. Hypothesis testing


1. Null hypothesis


a. The investigator starts with a null hypothesis that the statistical test is designed to consider and, possibly, disprove. Typically, the null hypothesis is that there is no difference between treatments being compared. Therefore, the investigator starts with the assumption that the treatments are equally effective and adheres to this position unless data make it untenable.


b. In a randomized trial in which investigators compare an experimental treatment with a placebo control, the null hypothesis can be stated as follows: "The true difference in effect on the outcome of interest between the experimental and control treatments is zero."


C. Errors in hypothesis testing—Any comparative study can have one of four possible outcomes (Figure 2).


[Figure 2. Errors in hypothesis testing. A 2 × 2 table is used to depict the results of a study comparing two treatments (difference, no difference) and the "truth" (whether or not there is a difference in actuality). Common errors are presented, including type I and II errors. (Reproduced with permission from Bhandari M, Devereaux PJ, Swiontkowski M, et al: Internal fixation compared with arthroplasty for displaced fractures of the femoral neck. J Bone Joint Surg Am 2003;85: 1673-1681.)]

1. A true positive result (the study correctly identifies a true difference between treatments)


2. A true negative result (the study correctly identifies no difference between treatments)


3. A false-negative result, called a type II (β) error (the study incorrectly concludes no difference between treatments when a difference really exists). By convention, the error rate is set at 0.20 (20% false-negative rate). Study power (see section D, below) is derived from the 1 - β error rate (1 - 0.2 = 0.80, or 80%).


4. A false-positive result, called a type I (α) error (the study incorrectly concludes a difference between treatments exists when the effects are really due to chance). By convention, most studies in orthopaedics adopt an α-error rate of 0.05. Thus, when no true difference exists, investigators can expect a false-positive result about 5% of the time.


D. Study power—The ability of a study to detect the difference between two interventions if one in fact exists. Power equals 1 - β and is typically a function of the magnitude of the treatment effect, the variability of the outcome, the designated type I (α) error rate, and the sample size n.
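
How power depends on effect size, variability, α, and sample size can be sketched with a normal approximation for a two-sided comparison of two means. The function name and the input values below are hypothetical, and the approximation is an assumption of this sketch rather than a method prescribed by the text.

```python
from math import sqrt
from statistics import NormalDist

def power_two_means(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two means.

    Uses a normal approximation and ignores the negligible chance of
    rejecting the null hypothesis in the wrong direction.
    """
    z = NormalDist()
    standard_error = sigma * sqrt(2 / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(delta) / standard_error - z_crit)

# Hypothetical inputs: true difference 5, standard deviation 10, 63 per group
power = power_two_means(delta=5, sigma=10, n_per_group=63)
```

Under these assumed inputs the power is about 0.80; increasing the sample size or the true difference raises it.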


E. P value—The P value is defined as the probability, under the assumption of no difference (null hypothesis), of obtaining a result equal to or more extreme than what was actually observed if the experiment were repeated over and over. By convention, the threshold for significance has been set arbitrarily at 0.05. A statistically significant result, therefore, is one that is sufficiently unlikely to be due to chance alone that the investigator is ready to reject the null hypothesis.
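
One way to make this definition concrete is an exact permutation test, which is a technique swapped in here for illustration and not one named in the text. Under the null hypothesis the group labels are interchangeable, so the sketch enumerates every possible relabeling and counts how often the difference in means is at least as extreme as the one observed. The function name and data are hypothetical, and full enumeration is practical only for small samples.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation P value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    count = extreme = 0
    for idx in combinations(range(len(pooled)), n_a):
        count += 1
        # small tolerance guards against floating-point ties
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            extreme += 1
    return extreme / count

p_value = permutation_p_value([1, 2, 3], [4, 5, 6])
```

For these six made-up values, only 2 of the 20 possible relabelings are as extreme as the observed split, so the P value is 0.10.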


[Table 1. Common Statistical Tests]


F. Confidence interval—Range defined by two values within which it is probable (to a specified percentage) that the true value lies for the whole population of patients from whom the study patients were selected. Various percentages can be used, but by convention, the 95% confidence interval is typically reported in clinical research.
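
A confidence interval for a sample mean can be sketched as follows. The sketch assumes a normal approximation (mean ± z × standard error); for small samples a t-distribution critical value would give a somewhat wider interval. The function name and the hospital-stay values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m - half_width, m + half_width

# Hypothetical hospital stays (days)
low, high = mean_confidence_interval([4, 5, 5, 6, 7, 7, 8, 9, 10, 11])
```

For these made-up data the sample mean is 7.2 days and the approximate 95% interval runs from about 5.8 to 8.6 days.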

III. Basic Statistical Inference

A. Normal distribution (Table 1)


1. Definition—A normal distribution is a distribution of continuous data that forms a bell-shaped curve; ie, one with many values near the mean and progressively fewer values toward the extremes.


2. Several statistical tests assume a normal distribution. If a sample is not normally distributed, a separate set of statistical tests should be applied. These tests are referred to as nonparametric tests because they do not rely on parameters such as the mean and standard deviation.


B. Descriptive statistics


1. Measures of central tendency


a. Mean—The sample mean is equal to the sum of the measurements divided by the number of observations.


b. Median—The median of a set of measurements is the number that falls in the middle when the measurements are ranked in order.


c. Mode—The mode is the most frequently occurring number in a set of measurements.


d. Use of measures of central tendency


i. Continuous variables (such as blood pressure or body weight) can be summarized with a mean if the data are normally distributed.


ii. If the data are not normally distributed, the median may be a better summary statistic.


iii. Ordinal categorical variables (eg, pain grade [0, 1, 2, 3, 4, 5]) can be summarized with a median.


2. Measures of spread


a. Standard deviation—Derived from the square root of the sample variance. The variance is calculated as the average of the squares of the deviations of the measurements about their mean; for a sample, the sum of squared deviations is divided by n - 1 rather than n.


b. Range—Reflects the smallest and largest reported values.
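
The descriptive statistics above can be computed with Python's standard statistics module. The hospital-stay values are hypothetical and deliberately include one outlier to show why the median may summarize skewed data better than the mean.

```python
from statistics import mean, median, mode, stdev

# Hypothetical lengths of stay (days); the value 20 is an outlier
stay_days = [3, 4, 4, 5, 6, 7, 20]

summary = {
    "mean": mean(stay_days),                  # pulled upward by the outlier
    "median": median(stay_days),              # middle value after ranking
    "mode": mode(stay_days),                  # most frequent value
    "standard_deviation": stdev(stay_days),   # sample SD (n - 1 denominator)
    "range": (min(stay_days), max(stay_days)),
}
```

Here the mean is 7 days but the median is 5 days, because the single long stay shifts the mean and not the median.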


C. Comparing means


1. Comparing two means


a. When two independent samples of normally distributed continuous data are compared, the t-test (published under the pseudonym Student, hence the common name Student's t-test) is used.


b. When the data are not normally distributed, a nonparametric test such as the Mann-Whitney U test (also called the Wilcoxon rank sum test) can be used. When the measurements are paired, such as left and right knees, a paired t-test is most appropriate. The nonparametric correlate of the paired t-test is the Wilcoxon signed rank test.


2. Comparing three or more means—When three or more different means are compared (eg, hospital stay among patients treated for tibial fracture with plate fixation, intramedullary nail, and external fixation), the test of choice is single-factor analysis of variance.
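
The test statistics for comparing two means can be sketched with the standard library. The sketch assumes equal variances for the independent-samples case (the classic pooled form) and returns only the t statistic and its degrees of freedom; converting these to a P value requires the t distribution, which a statistics package such as SciPy provides. Function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Independent-samples t statistic (equal variances assumed) and df."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df

def paired_t_statistic(first, second):
    """Paired t statistic computed on within-pair differences, and df."""
    diffs = [x - y for x, y in zip(first, second)]
    n = len(diffs)
    return mean(diffs) / sqrt(variance(diffs) / n), n - 1

t_ind, df_ind = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
t_pair, df_pair = paired_t_statistic([10, 12, 9, 11], [8, 11, 9, 8])
```

The paired version analyzes one difference per pair (eg, left knee minus right knee), which removes between-patient variability from the comparison.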


D. Comparing proportions


1. Independent proportions—The chi-square (χ²) test is a simple method of comparing two proportions, such as a difference in infection rates (%) between two groups. A Yates correction factor is sometimes used to account for small sample sizes, but when counts are very small (eg, fewer than 5 events in any of the treatment or control groups), the χ² test is unreliable and the Fisher exact test is the test of choice.


2. Paired proportions—When proportions are paired (eg, before and after study on the same patients), a McNemar test is used to examine differences between groups.
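
Both tests for independent proportions can be sketched for a 2 × 2 table laid out as [[a, b], [c, d]], where rows are groups and columns are event and no event. The function names and counts are hypothetical. The chi-square version omits the Yates correction, and the two-sided Fisher P value here sums the probabilities of all tables no more likely than the observed one, which is one common convention.

```python
from math import comb, erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no Yates correction) and its P value (1 df)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact P value for a 2 x 2 table."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):  # hypergeometric probability of x events in row 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: 10/100 versus 20/100 infections
chi2, p_chi2 = chi_square_2x2(10, 90, 20, 80)
# Hypothetical tiny study where the chi-square test would be unreliable
p_fisher = fisher_exact_2x2(3, 0, 0, 3)
```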


E. Regression and correlation


1. Regression analysis—Used to predict (or estimate) the association between a response variable (dependent variable) and a series of known explanatory (independent) variables.


a. Simple regression—When a single independent variable is used.


b. Multiple regression—When multiple independent variables are used.


c. Logistic regression—When the response variable is dichotomous (yes or no; infection present or not).


d. Cox proportional hazards regression—Used in survival analysis to assess the relationship between two or more variables and a single dependent variable (the time to an event).


2. Correlation—The strength of the relationship between two variables (eg, age versus hospital stay in patients with ankle fractures) can be summarized in a single number, the correlation coefficient, denoted by the letter r. The correlation coefficient can range from -1.0 to 1.0.


a. A correlation coefficient of -1.0 represents the strongest possible negative relationship, in which the person who scores the highest on one variable scores the lowest on the other variable.


b. A correlation coefficient of 1.0 represents the strongest possible positive relationship.


c. A correlation coefficient of 0 denotes no relationship between the two variables.


d. The Pearson correlation r is used to assess the relationship between two normally distributed continuous variables. If either variable is not normally distributed, the Spearman correlation is the better option.
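
The two correlation coefficients can be sketched in plain Python. The Spearman coefficient is computed here as the Pearson coefficient of the ranks, with tied values sharing the average of their ranks. Function names and data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]   # increases with x, but not linearly
r = pearson_r(x, y)
rho = spearman_rho(x, y)
```

Because y always increases with x, the Spearman coefficient is 1.0, whereas the Pearson coefficient is slightly below 1.0 because the relationship is not a straight line.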


F. Survival analysis


1. Time-to-event analysis involves estimating the probability that an event will occur at various time points.


2. Survival analysis estimates the probability of survival as a function of time from a discrete start point (time of injury, time of operation).


3. Survival curves, also called Kaplan-Meier curves, are often used to report the survival of one or more groups of patients over time.
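
A Kaplan-Meier estimate can be sketched as a running product: at each time when an event occurs, the current survival probability is multiplied by the fraction of patients at risk who did not have the event. Patients who are censored (eg, lost to follow-up) leave the at-risk group without counting as events. The function name and data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event (eg, revision) occurred at times[i],
    or 0 if the patient was censored at that time.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical follow-up (years); the second patient is censored at year 2
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

With these made-up data the estimated survival is 0.75 after year 1, 0.375 after year 3, and 0 after year 4.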

IV. Determining Sample Size for a Comparative Study

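A. Difference in means—The anticipated sample size per group for this continuous outcome measure is determined by the following equation:

$$ n = \frac{2\sigma^{2}\,(Z_{\alpha} + Z_{\beta})^{2}}{\delta^{2}} $$

   where Zα = standard normal deviate for the type I error rate, Zβ = standard normal deviate for the type II error rate, σ = standard deviation, and δ = mean difference between groups.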
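
The calculation can be sketched in Python using a commonly used per-group formula, written out in the docstring. That formula, the two-sided α, the function name, and the input values are assumptions of this sketch.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_means(delta, sigma, alpha=0.05, power=0.80):
    """Patients needed per group: n = 2 * sigma^2 * (Z_alpha + Z_beta)^2 / delta^2.

    Z_alpha is the standard normal deviate for a two-sided type I error
    rate alpha; Z_beta is the deviate for power = 1 - beta.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # about 0.84 for 80% power
    return ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# Hypothetical inputs: detect a 5-point difference when the SD is 10
n_per_group = sample_size_two_means(delta=5, sigma=10)
```

Under these assumptions 63 patients per group are needed; halving the difference to be detected roughly quadruples the requirement.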


B. Difference in proportions—For dichotomous variables, the following sample size calculation is used:

$$ n = \frac{P_A(100 - P_A) + P_B(100 - P_B)}{(P_A - P_B)^{2}} \times f(\alpha,\beta) $$

   where PA, PB = % successes in A and B, and f(α,β) = function of type I and II errors.
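
A Python sketch of this calculation follows. It assumes that f(α,β) is the squared sum of the standard normal deviates for a two-sided α and for the desired power, and it takes the success rates as proportions between 0 and 1, not percentages. The function name and inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p_a, p_b, alpha=0.05, power=0.80):
    """Patients needed per group to compare two proportions.

    n = [p_a(1 - p_a) + p_b(1 - p_b)] / (p_a - p_b)^2 * f(alpha, beta),
    with f(alpha, beta) = (Z_alpha + Z_beta)^2 (an assumption of this sketch).
    """
    z = NormalDist()
    f_alpha_beta = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil(variance_sum / (p_a - p_b) ** 2 * f_alpha_beta)

# Hypothetical inputs: infection rates of 10% versus 20%
n_per_group = sample_size_two_proportions(0.10, 0.20)
```

With these assumed rates, about 197 patients per group are needed for 80% power at α = 0.05.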

V. Reliability

A. Test-retest reliability measures the extent to which the same observer rating a subject on multiple occasions achieves similar results. Because time elapses between ratings, the characteristics being rated may also change. For example, the range of motion of a hip may change substantially over a 4-week period.


B. Intra-observer reliability is the same as test-retest reliability except that the characteristics being rated are fixed. Because time is the only factor that varies between administrations, this form of study design will typically yield a higher reliability estimate than test-retest or inter-observer reliability studies.


C. Inter-observer reliability measures the extent to which two or more observers obtain similar scores when rating the same subject. Inter-observer reliability is the broadest and—when error related to observers is highly relevant—the most clinically useful measure of reliability.


1. The κ coefficient, or κ statistic, the most commonly reported statistic in orthopaedic fracture reliability studies, can be thought of as a measure of agreement beyond chance.


2. The κ coefficient has a maximum value of 1.0 (indicating perfect agreement). κ = 0.0 indicates no agreement beyond chance; negative values indicate agreement worse than chance.


3. The κ coefficient can be used when data are categorical (categories of answers such as definitely healed, possibly healed, or not healed) or binary (a yes or no answer such as infection or no infection).
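
Agreement beyond chance can be sketched with Cohen's κ for two observers, which is the simplest member of the κ family and is chosen here for illustration. It compares the observed proportion of agreement with the agreement expected if each observer assigned categories independently at his or her own rates. The function name and ratings are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_1, rater_2):
    """Cohen's kappa for two raters scoring the same subjects.

    Undefined (division by zero) if chance agreement is 100%, ie, both
    raters put every subject in the same single category.
    """
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical radiograph ratings by two observers
kappa = cohen_kappa(
    ["healed", "healed", "not healed", "not healed"],
    ["healed", "not healed", "not healed", "not healed"],
)
```

Here the observers agree on 3 of 4 radiographs (75%), but half of that agreement would be expected by chance, so κ is 0.5.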


D. Intraclass correlation coefficients (ICCs) are a set of related measures of reliability that yield a value that is closest to the formal definition of reliability. One ICC measures the proportion of total variability that is due to true between-subject variability. ICCs are used when data are continuous.

VI. Diagnostic Tests

A. Definition of terms


1. Specificity—The proportion of individuals who are truly free of a designated disorder who are so identified by the test.


2. Sensitivity—The proportion of individuals who truly have a designated disorder who are so identified by the test.


3. Positive predictive value—The proportion of individuals with a positive test who have the disease.


4. Negative predictive value—The proportion of individuals with a negative test who are free of the disease.


5. Likelihood ratios—For a screening or diagnostic test (including clinical signs or symptoms), the likelihood ratio expresses the relative likelihood that a given test result would be expected in a patient with (as opposed to one without) a disorder of interest.


6. Accuracy—For a screening or diagnostic test, its accuracy is its overall ability to identify patients with disease (true positives) and without disease (true negatives) in the study population.



[Figure 3. Diagnostic tests. A 2 × 2 table depicts C-reactive protein thresholds for diagnosing infection. Several test characteristics are presented, including sensitivity, specificity, and likelihood ratios.]

B. Figure 3 illustrates the application of these concepts to C-reactive protein thresholds.
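
The test characteristics defined above can be computed from the four cells of a 2 × 2 table. The counts below are hypothetical and are not the Figure 3 data. The second function shows how a likelihood ratio revises a pretest probability by way of odds, in the manner of the Bayesian analysis defined in section I.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Test characteristics from true/false positives and negatives."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "likelihood_ratio_positive": sensitivity / (1 - specificity),
        "likelihood_ratio_negative": (1 - sensitivity) / specificity,
    }

def posttest_probability(pretest_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Hypothetical counts: 100 infected and 100 uninfected patients
measures = diagnostic_measures(tp=80, fp=10, fn=20, tn=90)
revised = posttest_probability(0.20, measures["likelihood_ratio_positive"])
```

With these assumed counts the positive likelihood ratio is 8, so a positive test raises a 20% pretest probability of infection to about 67%.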

Top Testing Facts

1. Bias in clinical research is best defined as a systematic deviation from the truth.


2. Randomization, concealment of allocation, and blinding are key methodologic principles to limit bias in clinical research.


3. Type II (β) error occurs when a study concludes there is no difference between treatments when in fact a difference really exists.


4. The power of a study is its ability to find a difference between treatments when a true difference exists.


5. The P value is defined as the probability, under the assumption of no difference (null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.


6. A 95% confidence interval is the range of values within which it is 95% probable that the true effect lies.


7. Two means can be compared with a Student's t-test.


8. Two proportions can be compared statistically with a chi-square (χ²) test.


9. The specificity of a test is the proportion of individuals who are free of the disorder who are so identified by the test.


10. The sensitivity of a test is the proportion of individuals who have a designated disorder who are so identified by the test.

