Thompson & Thompson Genetics in Medicine, 8th Edition

CHAPTER 10. Identifying the Genetic Basis for Human Disease

This chapter provides an overview of how geneticists study families and populations to identify genetic contributions to disease. Whether a disease is inherited in a recognizable mendelian pattern, as illustrated in Chapter 7, or just occurs at a higher frequency in relatives of affected individuals, as explored in Chapter 8, it is the different genetic and genomic variants carried by affected family members or affected individuals in the population that either cause disease directly or influence their susceptibility to disease. Genome research has provided geneticists with a catalogue of all known human genes, knowledge of their location and structure, and an ever-growing list of tens of millions of variants in DNA sequence found among individuals in different populations. As we saw in previous chapters, some of these variants are common, others are rare, and still others differ in frequency among different ethnic groups. Whereas some variants clearly have functional consequences, others are certainly neutral. For most, their significance for human health and disease is unknown.

In Chapter 4, we dealt with the effect of mutation, which alters one or more genes or loci to generate variant alleles and polymorphisms. And in Chapters 7 and 8, we examined the role of genetic factors in the pathogenesis of various mendelian or complex disorders. In this chapter, we discuss how geneticists go about discovering the particular genes implicated in disease and the variants they contain that underlie or contribute to human diseases, focusing on three approaches.

• The first approach, linkage analysis, is family-based. Linkage analysis takes explicit advantage of family pedigrees to follow the inheritance of a disease among family members and to test for consistent, repeated coinheritance of the disease with a particular genomic region or even with a specific variant or variants, whenever the disease is passed on in a family.

• The second approach, association analysis, is population-based. Association analysis does not depend explicitly on pedigrees but instead takes advantage of the entire history of a population to look for increased or decreased frequency of a particular allele or set of alleles in a sample of affected individuals taken from the population, compared with a control set of unaffected people from that same population. It is particularly useful for complex diseases that do not show a mendelian inheritance pattern.

• The third approach involves direct genome sequencing of affected individuals and their parents and/or other individuals in the family or population. This approach is particularly useful for rare mendelian disorders in which linkage analysis is not possible because there are simply not enough such families to do linkage analysis or because the disorder is a genetic lethal that always results from new mutations and is never inherited. In these situations, sequencing the genome (or just the coding exons of every gene, the exome) of an affected individual and sifting through the resulting billions (or in the case of the exome, tens of millions) of bases of DNA has been successfully used to find the gene responsible for the disorder. This new approach takes advantage of recently developed technology that has reduced the cost of DNA sequencing a millionfold from what it was when the original reference genome was being prepared during the Human Genome Project.

Use of linkage, association, and sequencing to map and identify disease genes has had an enormous impact on our understanding of the pathogenesis and pathophysiology of many diseases. In time, knowledge of the genetic contributions to disease will also suggest new methods of prevention, management, and treatment.

Genetic Basis for Linkage Analysis and Association

A fundamental feature of human biology is that each generation reproduces by combining haploid gametes containing 23 chromosomes, resulting from independent assortment and recombination of homologous chromosomes (see Chapter 2). To understand fully the concepts underlying genetic linkage analysis and tests for association, it is necessary to review briefly the behavior of chromosomes and genes during meiosis as they are passed from one generation to the next. Some of this information repeats the classic material on gametogenesis presented in Chapter 2, illustrating it with new information that has become available as a result of the Human Genome Project and its applications to the study of human variation.

Independent Assortment and Homologous Recombination in Meiosis

During meiosis I, homologous chromosomes line up in pairs along the meiotic spindle. The paternal and maternal homologues exchange homologous segments by crossing over and creating new chromosomes that are a “patchwork” consisting of alternating portions of the grandmother's chromosomes and the grandfather's chromosomes (see Fig. 2-15). In the family illustrated in Figure 10-1, examples of recombined chromosomes are shown in the offspring (generation II) of the couple in generation I. Also shown is that the individual in generation III inherits a maternal chromosome that contains segments derived from all four of his maternal grandparents' chromosomes. The creation of such patchwork chromosomes emphasizes the notion of human genetic individuality: each chromosome inherited by a child from a parent is never exactly the same as either of the two copies of that chromosome in the parent.

FIGURE 10-1 The effect of recombination on the origin of various portions of a chromosome. Because of crossing over in meiosis, the copy of the chromosome the boy (generation III) inherited from his mother is a mosaic of segments of all four of his grandparents' copies of that chromosome.

Although any two homologous chromosomes generally look identical under the microscope, they differ substantially at the DNA sequence level. As discussed in Chapter 4, these differences at the same position (locus) on a pair of homologous chromosomes are alleles. Alleles that are common (generally considered to be those carried by approximately 2% or more of the population) constitute a polymorphism, and linkage analysis in families (as we will explore later in the chapter) requires following the inheritance of specific alleles as they are passed down in a family. Allelic variants on homologous chromosomes allow geneticists to trace each segment of a chromosome inherited by a particular child to determine if and where recombination events have occurred along the homologous chromosomes. Several tens of millions of variants are available to serve as genetic markers for this purpose. It is a truism now in human genetics to say that it is essentially always possible to determine with confidence, through a series of analyses outlined in this chapter, whether a given allele or segment of the genome in a patient has been inherited from his or her father or mother. This advance, a singular product of the Human Genome Project, is an essential feature of genetic analysis to determine the precise genetic basis of disease.

Alleles at Loci on Different Chromosomes Assort Independently

Assume there are two polymorphic loci, 1 and 2, on different chromosomes, with alleles A and a at locus 1 and alleles B and b at locus 2 (Fig. 10-2). Suppose an individual's genotype at these loci is Aa and Bb; that is, she is heterozygous at both loci, with alleles A and B inherited from her father and alleles a and b inherited from her mother. The two different chromosomes will line up on the metaphase plate at meiosis I in one of two combinations with equal likelihood. After recombination and chromosomal segregation are complete, there will be four possible combinations of alleles, AB, ab, Ab, and aB, in a gamete; each combination is as likely to occur as any other, a phenomenon known as independent assortment. Because AB gametes contain only her paternally derived alleles, and ab gametes only her maternally derived alleles, these gametes are designated parental. In contrast, Ab or aB gametes, each containing one paternally derived allele and one maternally derived allele, are termed nonparental gametes. On average, half (50%) of gametes will be parental (AB or ab) and 50% nonparental (Ab or aB).

FIGURE 10-2 Independent assortment of alleles at two loci, 1 and 2, when they are located on different chromosomes. Assume that alleles A and B were inherited from one parent, a and b from the other. The two chromosomes can line up on the metaphase plate in meiosis I in one of two equally likely combinations, resulting in independent assortment of the alleles on these two chromosomes.

Alleles at Loci on the Same Chromosome Assort Independently If at Least One Crossover between Them Always Occurs

Now suppose that an individual is heterozygous at two loci 1 and 2, with alleles A and B paternally derived and a and b maternally derived, but the loci are on the same chromosome (Fig. 10-3). Genes that reside on the same chromosome are said to be syntenic (literally, “on the same thread”), regardless of how close together or how far apart they lie on that chromosome.

FIGURE 10-3 Crossing over between homologous chromosomes (black horizontal lines) in meiosis is shown between chromatids of two homologous chromosomes on the left. Crossovers result in new combinations of maternally and paternally derived alleles on the recombinant chromosomes present in gametes, shown on the right. If no crossing over occurs in the interval between loci 1 and 2, only parental (nonrecombinant) allele combinations, AB and ab, occur in the offspring. If one or two crossovers occur in the interval between the loci, half the gametes will contain a nonrecombinant combination of alleles and half the recombinant combination. The same is true if more than two crossovers occur between the loci (not illustrated here). NR, Nonrecombinant; R, recombinant.

How will these alleles behave during meiosis? We know that between one and four crossovers occur between homologous chromosomes during meiosis I when there are two chromatids per homologous chromosome. If no crossing over occurs within the segment of the chromatids between the loci 1 and 2 (and ignoring whatever happens in segments outside the interval between these loci), then the chromosomes we see in the gametes will be AB and ab, which are the same as the original parental chromosomes; a parental chromosome is therefore a nonrecombinant chromosome. If crossing over occurs at least once in the segment between the loci, the resulting chromatids may be either nonrecombinant or Ab and aB, which are not the same as the parental chromosomes; such a nonparental chromosome is therefore a recombinant chromosome (shown in Fig. 10-3). One, two, or more recombinations occurring between two loci at the four-chromatid stage result in gametes that are 50% nonrecombinant (parental) and 50% recombinant (nonparental), which are precisely the same proportions one sees with independent assortment of alleles at loci on different chromosomes. Thus, if two syntenic loci are sufficiently far apart on the same chromosome to ensure that there is going to be at least one crossover between them in every meiosis, the ratio of recombinant to nonrecombinant genotypes will be, on average, 1 : 1, just as if the loci were on separate chromosomes and assorting independently.
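The 50% ceiling can be checked by brute-force enumeration. The sketch below is a simplified model rather than anything taken from the text: it assumes no chromatid interference, so that each crossover in the interval between loci 1 and 2 involves one strand carrying paternal material and one carrying maternal material at that point, chosen with equal probability. The function name `recombinant_fraction` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def recombinant_fraction(n_crossovers):
    """Exact expected fraction of recombinant chromatids between loci 1 and 2.

    Simplifying assumption (not from the text): no chromatid interference, so
    every crossover picks one strand carrying paternal material and one
    carrying maternal material at that point, each with equal probability.
    """
    proximal = ["P", "P", "M", "M"]  # parental origin at locus 1, four chromatids
    total = Fraction(0)
    n_outcomes = 0
    # Each crossover needs two binary choices: which P strand and which M strand.
    for choices in product(range(2), repeat=2 * n_crossovers):
        distal = list(proximal)  # origin of the material carried toward locus 2
        for k in range(n_crossovers):
            p_strands = [i for i, o in enumerate(distal) if o == "P"]
            m_strands = [i for i, o in enumerate(distal) if o == "M"]
            i = p_strands[choices[2 * k]]
            j = m_strands[choices[2 * k + 1]]
            distal[i], distal[j] = distal[j], distal[i]  # exchange distal segments
        recombinant = sum(d != p for d, p in zip(distal, proximal))
        total += Fraction(recombinant, 4)
        n_outcomes += 1
    return total / n_outcomes

for n in range(4):
    print(n, recombinant_fraction(n))
```

Under this model the expected recombinant fraction is 0 with no crossover in the interval and exactly 1/2 for one, two, or three crossovers, because two-strand, three-strand, and four-strand multiple crossovers average out.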

Recombination Frequency and Map Distance

Frequency of Recombination as a Measure of Distance between Loci

Suppose now that two loci are on the same chromosome but are either far apart, very close together, or somewhere in between (Fig. 10-4). As we just saw, when the loci are far apart (see Fig. 10-4A), at least one crossover will occur in the segment of the chromosome between loci 1 and 2, and there will be gametes of both the nonrecombinant genotypes AB and ab and recombinant genotypes Ab and aB, in equal proportions (on average) in the offspring. On the other hand, if two loci are so close together on the same chromosome that crossovers never occur between them, there will be no recombination; the nonrecombinant genotypes (parental chromosomes AB and ab in Fig. 10-4B) are transmitted together all of the time, and the frequency of the recombinant genotypes Ab and aB will be 0. In between these two extremes is the situation in which two loci are far enough apart that one recombination between the loci occurs in some meioses but not in others (see Fig. 10-4C). In this situation, we observe nonrecombinant combinations of alleles in the offspring when no crossover occurred and recombinant combinations when a recombination has occurred, but the frequency of recombinant chromosomes at the two loci will fall between 0% and 50%. The crucial point is that the closer together two loci are, the smaller the recombination frequency, and the fewer recombinant genotypes are seen in the offspring.

FIGURE 10-4 Assortment of alleles at two loci, 1 and 2, when they are located on the same chromosome. A, The loci are far apart and at least one crossover between them is likely to occur in every meiosis. B, The loci are so close together that crossing over between them is not observed, regardless of the presence of crossovers elsewhere on the chromosome. C, The loci are close together on the same chromosome but far enough apart that crossing over occurs in the interval between the two loci only in some meioses but not in most others.

Detecting Recombination Events Requires Heterozygosity and Knowledge of Phase

Detecting the recombination events between loci requires that (1) a parent be heterozygous (informative) at both loci and (2) we know which allele at locus 1 is on the same chromosome as which allele at locus 2. In an individual who is heterozygous at two syntenic loci, one with alleles A and a, the other B and b, which allele at the first locus is on the same chromosome with which allele at the second locus defines what is referred to as the phase (Fig. 10-5). The set of alleles on the same homologue (A and B, or a and b) are said to be in coupling (or cis) and form what is referred to as a haplotype (see Chapters 7 and 8). In contrast, alleles on the different homologues (A and b, or a and B) are in repulsion (or trans) (see Fig. 10-5).

FIGURE 10-5 Possible phases of alleles A and a and alleles B and b.

Figure 10-6 shows a pedigree of a family with multiple individuals affected by autosomal dominant retinitis pigmentosa (RP), a degenerative disease of the retina that causes progressive blindness in association with abnormal retinal pigmentation. As shown, individual I-1 is heterozygous at both marker locus 1 (with alleles A and a) and marker locus 2 (with alleles B and b), as well as heterozygous for the disorder (D is the dominant disease allele, d is the recessive normal allele). The alleles A-D-B form one haplotype, and a-d-b the other. Because we know her spouse is homozygous at all three loci and can only pass on the a, b, and d alleles, we can easily determine which alleles the children received from their mother and thus trace the inheritance of her RP-causing allele or her normal allele at that locus, as well as the alleles at both marker loci in her children. Close inspection of Figure 10-6 allows one to determine whether each child has inherited a recombinant or a nonrecombinant haplotype from the mother.

FIGURE 10-6 Coinheritance of the gene for an autosomal dominant form of retinitis pigmentosa (RP), with marker locus 2 and not with marker locus 1. Only the mother's contribution to the children's genotypes is shown. The mother (I-1) is affected with this dominant disease and is heterozygous at the RP9 locus (Dd) as well as at loci 1 and 2. She carries the A and B alleles on the same chromosome as the mutant RP9 allele (D). The unaffected father is homozygous normal (dd) at the RP9 locus as well as at the two marker loci (AA and bb); his contributions to his offspring are not considered further. Two of the three affected offspring have inherited the B allele at locus 2 from their mother, whereas individual II-3 inherited the b allele. The five unaffected offspring have also inherited the b allele. Thus seven of eight offspring are nonrecombinant between the RP9 locus and locus 2. However, individuals II-2, II-4, II-6, and II-8 are recombinant for RP9 and locus 1, indicating that meiotic crossover has occurred between these two loci.

However, if the mother (I-1) had been homozygous bb at locus 2, then all children would inherit a maternal b allele, regardless of whether they received a mutant D or normal d allele at the RP9 locus. Because she is not informative at locus 2 in this scenario, it would be impossible to determine whether recombination had occurred. Similarly, if the information provided for the family in Figure 10-6 was simply that individual I-1 was heterozygous, Bb, at locus 2 and heterozygous for an autosomal dominant form of RP, but the phase was not known, one could not determine which of her children were nonrecombinant between the RP9 locus and locus 2 and which of her children were recombinant. Thus determination of who is or is not a recombinant requires that we know whether the B or b allele at locus 2 was on the same chromosome as the mutant D allele for RP in individual I-1 (see Fig. 10-6).

Linkage and Recombination Frequency

Linkage is the term used to describe a departure from the independent assortment of two loci, or, in other words, the tendency for alleles at loci that are close together on the same chromosome to be transmitted together, as an intact unit, through meiosis. Analysis of linkage depends on determining the frequency of recombination as a measure of how close two loci are to each other on a chromosome. A common notation for recombination frequency (as a proportion, not a percentage) is the Greek letter theta, θ, where θ varies from 0 (no recombination at all) to 0.5 (independent assortment). If two loci are so close together that θ = 0 between them (as in Fig. 10-4B), they are said to be completely linked; if they are so far apart that θ = 0.5 (as in Fig. 10-4A), they are assorting independently and are unlinked. In between these two extremes are various degrees of linkage.
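As a rough illustration, θ between the RP9 locus and locus 2 can be estimated from the eight children summarized in the Figure 10-6 caption. This is a minimal sketch: it is a naive proportion from only eight meioses, not a formal linkage analysis, and the function name `estimate_theta` and the data layout are hypothetical. The same data give a very different answer if the opposite phase is assumed, which is why phase must be known.

```python
# Maternal alleles at (RP9, locus 2) in the eight children, as summarized in
# the Figure 10-6 caption: two affected with B, one affected with b (II-3),
# and five unaffected with b. Individual identifiers are omitted.
children = [("D", "B")] * 2 + [("D", "b")] + [("d", "b")] * 5

def estimate_theta(maternal_haplotypes, parental_phase):
    """Naive estimate of theta: the proportion of recombinant haplotypes."""
    recombinants = sum(1 for hap in maternal_haplotypes if hap not in parental_phase)
    return recombinants / len(maternal_haplotypes)

# Phase stated in the caption: D is on the same chromosome as B.
known_phase = {("D", "B"), ("d", "b")}
# Hypothetical alternative, used only to show why phase matters.
opposite_phase = {("D", "b"), ("d", "B")}

print(estimate_theta(children, known_phase))     # 0.125 (1 of 8 recombinant)
print(estimate_theta(children, opposite_phase))  # 0.875 (7 of 8 "recombinant")
```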

Genetic Maps and Physical Maps

The map distance between two loci is a theoretical concept that is based on actual data: the extent of observed recombination, θ, between the loci. Map distance is measured in units called centimorgans (cM), defined as the genetic length over which, on average, one crossover occurs in 1% of meioses. (The centimorgan is 1/100 of a “morgan,” named after Thomas Hunt Morgan, who first observed genetic recombination in the fruit fly Drosophila.) Therefore a recombination fraction of 1% (i.e., θ = 0.01) translates approximately into a map distance of 1 cM. As we discussed before in this chapter, the recombination frequency between two loci increases proportionately with the distance between two loci only up to a point because, once markers are far enough apart that at least one recombination will always occur, the observed recombination frequency will equal 50% (θ = 0.5), no matter how far apart physically the two loci are.
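The way θ tracks map distance at short range and then levels off at 0.5 can be illustrated with a map function. The sketch below uses Haldane's map function, θ = ½(1 − e^(−2d)) with d in morgans, which is not mentioned in the text; it is one common model and assumes that crossovers occur independently of one another (no interference). The function name `haldane_theta` is hypothetical.

```python
import math

def haldane_theta(map_distance_cm):
    """Recombination fraction predicted by Haldane's map function.

    Assumption (not from the text): crossovers occur independently along the
    chromosome, so theta = 0.5 * (1 - exp(-2d)) with d measured in morgans.
    """
    d_morgans = map_distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

for cm in (1, 10, 50, 100, 300):
    print(f"{cm:>4} cM -> theta = {haldane_theta(cm):.4f}")
```

At 1 cM the predicted θ is very close to 0.01, whereas at several hundred centimorgans it approaches but never exceeds 0.5.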

To accurately measure true genetic map distance between two widely spaced loci, therefore, one has to use markers spaced at short genetic distances (1 cM or less) in the interval between these two loci, and then add up the values of θ between the intervening markers, because the values of θ between pairs of closely neighboring markers will be good approximations of the genetic distances between them. Using this approach, the genetic length of an entire human genome has been measured and, interestingly, found to differ between the sexes. When measured in female meiosis, genetic length of the human genome is approximately 60% greater (≈4596 cM) than when it is measured in male meiosis (2868 cM), and this sex difference is consistent and uniform across each autosome. The sex-averaged genetic length of the entire haploid human genome, which is estimated to contain approximately 3.3 billion base pairs of DNA, or ≈3300 Mb (see Chapter 2), is 3790 cM, for an average of approximately 1.15 cM/Mb. The reason for the observed increased recombination per unit length of DNA in females compared with males is unknown, although one might speculate that it has to do with the increased opportunity for crossing over afforded by the many years that female gamete precursors remain in meiosis I before ovulation (see Chapter 2).
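The figures quoted in this paragraph can be checked with a few lines of arithmetic. This is only a consistency check on the numbers given above; the variable names are illustrative.

```python
# Genetic lengths and genome size as quoted in the text.
female_cm = 4596
male_cm = 2868
sex_averaged_cm = 3790
genome_mb = 3300

# Relative excess of the female map over the male map.
female_excess = female_cm / male_cm - 1
# Average genetic distance per unit of physical distance.
average_cm_per_mb = sex_averaged_cm / genome_mb

print(f"female map is {female_excess:.0%} longer than the male map")  # 60%
print(f"average = {average_cm_per_mb:.2f} cM/Mb")                     # 1.15 cM/Mb
```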

Pairwise measurements of recombination between genetic markers separated by 1 Mb or more give a fairly constant ratio of genetic distance to physical distance of approximately 1 cM/Mb. However, when recombination is measured at much higher resolution, such as between markers spaced less than 100 kb apart, recombination per unit length becomes nonuniform and can range over four orders of magnitude (0.01 to 100 cM/Mb). When viewed on the scale of a few tens of kilobase pairs of DNA, the apparent linear relationship between physical distance in base pairs and recombination between polymorphic markers located millions of base pairs of DNA apart is, in fact, the result of an averaging of so-called hot spots of recombination interspersed among regions of little or no recombination. Hot spots occupy only approximately 6% of sequence in the genome and yet account for approximately 60% of all the meiotic recombination in the human genome. The biological basis for these recombination hot spots is unknown. The impact of this nonuniformity of recombination at high resolution is discussed next, as we address the phenomenon of linkage disequilibrium.

Linkage Disequilibrium

It is generally the case that the alleles at two loci will not show any preferred phase in the population if the loci are linked but at a distance of 0.1 to 1 cM or more. For example, suppose loci 1 and 2 are 1 cM apart. Suppose further that allele A is present on 50% of the chromosomes in a population and allele a on the other 50% of chromosomes, whereas at locus 2, a disease susceptibility allele S is present on 10% of chromosomes and the protective allele s is on 90% (Fig. 10-7). Because the frequency of the A-S haplotype, freq(A-S), is simply the product of the frequencies of the two alleles, freq(A) × freq(S) = 0.5 × 0.1 = 0.05, the alleles are said to be in linkage equilibrium (see Fig. 10-7A). That is, the frequencies of the four possible haplotypes, A-S, A-s, a-S, and a-s, follow directly from the allele frequencies of A, a, S, and s.

FIGURE 10-7 Tables demonstrating how the same allele frequencies can result in different haplotype frequencies indicative of linkage equilibrium, strong linkage disequilibrium, or partial linkage disequilibrium. A, Under linkage equilibrium, haplotype frequencies are as expected from the product of the relevant allele frequencies. B, Loci 1 and 2 are located very close to one another, and alleles at these loci show strong linkage disequilibrium. Haplotype A-S is absent and a-s is less frequent (0.4 instead of 0.45) compared to what is expected from allele frequencies. C, Alleles at loci 1 and 2 show partial linkage disequilibrium. Haplotypes A-S and a-s are underrepresented compared to what is expected from allele frequencies. Note that the allele frequencies for A and a at locus 1 and for S and s at locus 2 are the same in all three tables; it is the way the alleles are distributed in haplotypes, shown in the central four cells of the table, that differs.

However, as we examine haplotypes involving loci that are very close together, we find that knowing the allele frequencies for these loci individually does not allow us to predict the four haplotype frequencies. The frequency of any one of the haplotypes, freq(A-S) for example, may not be equal to the product of the frequencies of the individual alleles that make up that haplotype; in this situation, freq(A-S) ≠ freq(A) × freq(S), and the alleles are thus said to be in linkage disequilibrium (LD). The deviation (“delta”) between the expected and actual haplotype frequencies is called D and is given by:

D = freq(A-S) − freq(A) × freq(S)

D ≠ 0 is equivalent to saying the alleles are in LD, whereas D = 0 means the alleles are in linkage equilibrium.

Examples of LD are illustrated in Figures 10-7B and 10-7C. Suppose one discovers that all chromosomes carrying allele S also have allele a, whereas none has allele A (see Fig. 10-7B). Then allele S and allele a are said to be in complete LD. As a second example, suppose the A-S haplotype is present on only 1% of chromosomes in the population (see Fig. 10-7C). The A-S haplotype has a frequency much below what one would expect on the basis of the frequencies of alleles A and S in the population as a whole, and D < 0, whereas the haplotype a-S has a frequency much greater than expected and D > 0. In other words, chromosomes carrying the susceptibility allele S are enriched for allele a at the expense of allele A, compared with chromosomes that carry the protective allele s. Note, however, that the individual allele frequencies are unchanged; it is only how they are distributed into haplotypes that differs, and this is what determines if there is LD.
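The three scenarios of Figure 10-7 can be reproduced numerically. The sketch below fixes the allele frequencies at freq(A) = 0.5 and freq(S) = 0.1, as in the text, fills in the four haplotype frequencies from a chosen value of freq(A-S), and reports for each haplotype the deviation of the observed frequency from the product of its allele frequencies. The function names are hypothetical, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

FREQ_A = Fraction(1, 2)    # freq(A); freq(a) = 1 - FREQ_A
FREQ_S = Fraction(1, 10)   # freq(S); freq(s) = 1 - FREQ_S

def haplotype_table(freq_AS):
    """Fill in all four haplotype frequencies from freq(A-S), keeping the
    allele frequencies (the table margins) fixed."""
    return {
        "A-S": freq_AS,
        "A-s": FREQ_A - freq_AS,
        "a-S": FREQ_S - freq_AS,
        "a-s": 1 - FREQ_A - FREQ_S + freq_AS,
    }

def deviation(haplotype, observed):
    """Observed haplotype frequency minus the product of its allele frequencies."""
    allele_1, allele_2 = haplotype.split("-")
    p1 = FREQ_A if allele_1 == "A" else 1 - FREQ_A
    p2 = FREQ_S if allele_2 == "S" else 1 - FREQ_S
    return observed - p1 * p2

scenarios = [
    ("A: equilibrium", Fraction(5, 100)),
    ("B: complete LD", Fraction(0)),
    ("C: partial LD", Fraction(1, 100)),
]
for label, freq_AS in scenarios:
    table = haplotype_table(freq_AS)
    print(label)
    for hap, freq in table.items():
        print(f"  {hap}: freq = {float(freq):.2f}, deviation = {float(deviation(hap, freq)):+.2f}")
```

In scenario C the deviation is −0.04 for A-S and a-s and +0.04 for A-s and a-S, matching the statement that A-S is underrepresented and a-S overrepresented while the allele frequencies stay the same.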

Linkage Disequilibrium Has Both Biological and Historical Causes

What causes LD? When a disease allele first enters the population (by mutation or by immigration of a founder who carries the disease allele), the particular set of alleles at polymorphic loci linked to (i.e., syntenic with) the disease locus constitutes a disease-containing haplotype in which the disease allele is located (Fig. 10-8). The degree to which this original disease-containing haplotype will persist over time depends in part on the probability that recombination moves the disease allele off of the original haplotype and onto chromosomes with different sets of alleles at these linked loci. The speed with which recombination will move the disease allele onto a new haplotype depends on a number of factors:

• The number of generations (and therefore the number of opportunities for recombination) since the mutation first appeared.

• The frequency of recombination per generation between the loci. The smaller the value of θ, the greater is the chance that the disease-containing haplotype will persist intact.

• Processes of natural selection for or against particular haplotypes. If a haplotype combination undergoes either positive selection (and therefore is preferentially passed on) or experiences negative selection (and therefore is less readily passed on), it will be either overrepresented or underrepresented in that population.
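The first two factors can be made concrete with a standard population-genetic approximation that is not stated in the text and is used here as an assumption: in a large, randomly mating population with no selection, D shrinks by a factor of (1 − θ) each generation. The function name `ld_after` and the example values are hypothetical.

```python
def ld_after(d0, theta, generations):
    """Expected LD after a number of generations, starting from d0.

    Assumptions (not from the text): large randomly mating population,
    no selection, no new mutation; D_t = D_0 * (1 - theta) ** t.
    """
    return d0 * (1.0 - theta) ** generations

d0 = 0.05  # illustrative starting value of D
for theta in (0.0001, 0.001, 0.01, 0.5):
    remaining = [ld_after(d0, theta, t) / d0 for t in (10, 100, 1000)]
    summary = ", ".join(f"{r:.3f}" for r in remaining)
    print(f"theta = {theta}: fraction of D left after 10, 100, 1000 generations = {summary}")
```

Tightly linked loci (small θ) retain most of their initial LD for hundreds of generations, whereas unlinked loci (θ = 0.5) lose half of it in every generation.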

FIGURE 10-8 A, With each generation, meiotic recombination exchanges the alleles that were initially present at polymorphic loci on a chromosome on which a disease-associated mutation arose for other alleles present on the homologous chromosome. Over many generations, the only alleles that remain in coupling phase with the mutation are those at loci so close to the mutant locus that recombination between the loci is very rare. These alleles are in linkage disequilibrium with the mutation and constitute a disease-associated haplotype. B, Affected individuals in the current generation (arrows) carry the mutation (X) in linkage disequilibrium with the disease-associated haplotype (individuals in blue). Depending on the age of the mutation and other population genetic factors, a disease-associated haplotype ordinarily spans a region of DNA of a few kb to a few hundred kb. See Sources & Acknowledgments.

Measuring Linkage Disequilibrium

Although conceptually valuable, the discrepancy, D, between the expected and observed frequencies of haplotypes is not a good way to quantify LD because it varies not only with degree of LD but also with the allele frequencies themselves. To quantify varying degrees of LD, therefore, geneticists often use a measure derived from D, referred to as D′ (see Box). D′ is designed to vary from 0, indicating linkage equilibrium, to a maximum of ±1, indicating very strong LD. Because LD is a result not only of genetic distance but also of the amount of time during which recombination had a chance to occur and the possible effects of selection for or against particular haplotypes, different populations living in different environments and with different histories can have different values of D′ between the same two alleles at the same loci in the genome.

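D′ = D / F

where D = freq(A-S) − freq(A) × freq(S)

and F is a correction factor that helps account for the allele frequencies.

The value of F depends on whether D itself is a positive or negative number.

F = the smaller of freq(A) × freq(s) or freq(a) × freq(S), if D > 0

F = the smaller of freq(A) × freq(S) or freq(a) × freq(s), if D < 0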
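A minimal sketch of the D′ calculation follows, assuming the standard normalization in which F is the largest magnitude that D could take given the allele frequencies: the smaller of freq(A) × freq(s) and freq(a) × freq(S) when D is positive, and the smaller of freq(A) × freq(S) and freq(a) × freq(s) when D is negative. The function name `d_prime` is hypothetical, and the example values are the allele and haplotype frequencies used in Figure 10-7.

```python
from fractions import Fraction

def d_prime(freq_AS, freq_A, freq_S):
    """Normalized LD: D divided by its maximum possible magnitude F.

    Assumes D = freq(A-S) - freq(A) * freq(S) and the standard choice of F
    described in the lead-in.
    """
    d = freq_AS - freq_A * freq_S
    if d == 0:
        return d  # linkage equilibrium
    freq_a, freq_s = 1 - freq_A, 1 - freq_S
    if d > 0:
        f = min(freq_A * freq_s, freq_a * freq_S)
    else:
        f = min(freq_A * freq_S, freq_a * freq_s)
    return d / f

freq_A, freq_S = Fraction(1, 2), Fraction(1, 10)
print(d_prime(Fraction(5, 100), freq_A, freq_S))  # 0    (Fig. 10-7A)
print(d_prime(Fraction(0), freq_A, freq_S))       # -1   (Fig. 10-7B)
print(d_prime(Fraction(1, 100), freq_A, freq_S))  # -4/5 (Fig. 10-7C)
```

Because D is divided by the largest value it could reach for the given allele frequencies, D′ stays between −1 and +1 regardless of how common or rare the alleles are.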

Clusters of Alleles Form Blocks Defined by Linkage Disequilibrium

Analysis of pairwise measurements of D′ for neighboring variants, particularly single nucleotide polymorphisms (SNPs), across the genome reveals a complex genetic architecture for LD. Contiguous SNPs can be grouped into clusters of varying size in which the SNPs in any one cluster show high levels of LD with each other but not with SNPs outside that cluster (Fig. 10-9). For example, the nine polymorphic loci in cluster 1 (see Fig. 10-9A), each consisting of two alleles, have the potential to generate 2^9 = 512 different haplotypes; yet, only five haplotypes constitute 98% of all haplotypes seen. The values of |D′| between SNPs within the cluster are well above 0.8. Clusters of loci with alleles in high LD across segments of only a few kilobase pairs to a few dozen kilobase pairs are termed LD blocks.

FIGURE 10-9 A, A 145-kb region of chromosome 4 containing 14 single nucleotide polymorphisms (SNPs). In cluster 1, containing SNPs 1 through 9, five of the 2^9 = 512 theoretically possible haplotypes are responsible for 98% of all the haplotypes in the population, reflecting substantial linkage disequilibrium (LD) among these SNP loci. Similarly, in cluster 2, only three of the 2^4 = 16 theoretically possible haplotypes involving SNPs 11 to 14 represent 99% of all the haplotypes found. In contrast, alleles at SNP 10 are found in linkage equilibrium with the SNPs in cluster 1 and cluster 2. B, A schematic diagram in which each red box contains the pairwise measurement of the degree of LD between two SNPs (e.g., the arrow points to the box, outlined in black, containing the value of D′ for SNPs 2 and 7). The higher the degree of LD, the darker the color in the box, with maximum D′ values of 1.0 occurring when there is complete LD. Two LD blocks are detectable, the first containing SNPs 1 through 9, and the second SNPs 11 through 14. Between blocks, the 14-kb region containing SNP 10 shows no LD with neighboring SNPs 9 or 11 or with any of the other SNP loci. C, A graph of the ratio of map distance to physical distance (cM/Mb), showing that a recombination hot spot is present in the region between SNP 10 and cluster 2, with values of recombination that are fifty- to sixtyfold above the average of approximately 1.15 cM/Mb for the genome. See Sources & Acknowledgments.

The size of an LD block encompassing alleles at a particular set of polymorphic loci is not identical in all populations. African populations have smaller blocks, averaging 7.3 kb per block across the genome, compared with 16.3 kb in Europeans; Chinese and Japanese block sizes are comparable to each other and are intermediate, averaging 13.2 kb. This difference in block size is almost certainly the result of the smaller number of generations since the founding of the non-African populations compared with populations in Africa, thereby limiting the time in which there has been opportunity for recombination to break up regions of LD.

Is there a biological basis for LD blocks, or are they simply genetic phenomena reflecting human (and genome) history? It appears that biology does contribute to LD block structure in that the boundaries between LD blocks often coincide with meiotic recombination hot spots, discussed earlier (see Fig. 10-9C). Such recombination hot spots would break up any haplotypes spanning them into two shorter haplotypes more rapidly than average, resulting in linkage equilibrium between SNPs on one side and the other side of the hot spot. The correlation is by no means exact, and many apparent boundaries between LD blocks are not located over evident recombination hot spots. This lack of perfect correlation should not be surprising, given what we have already surmised about LD: it is affected not only by how likely a recombination event is (i.e., where the hot spots are) but also by the age of the population, the frequency of the haplotypes originally present in the founding members of that population, and whether there has been either positive or negative selection for particular haplotypes.

Mapping Human Disease Genes

Why Map Disease Genes?

In clinical medicine, a disease state is defined by a collection of phenotypic findings seen in a patient or group of patients. Designating such a disease as “genetic”—and thus inferring the existence of a gene responsible for or contributing to the disease—comes from detailed genetic analysis, applying the principles outlined in Chapters 7 and 8. However, surmising the existence of a gene or genes in such a way does not tell us which of the perhaps 40,000 to 50,000 coding and noncoding genes in the genome is involved, what the function of that gene or genes might be, or how that gene or genes cause or contribute to the disease.

Disease gene mapping is often a critical first step in identifying the gene or genes in which variants are responsible for causing or increasing susceptibility to disease. Mapping the gene focuses attention on a region of the genome in which to carry out a systematic analysis of all the genes in that region to find the mutations or variants that contribute to the disease. Once the gene is identified that harbors the DNA variants responsible for either causing a mendelian disorder or increasing susceptibility to a genetically complex disease, the full spectrum of variation in that gene can be studied. In this way, we can determine the degree of allelic heterogeneity, the penetrance of different alleles, whether there is a correlation between certain alleles and various aspects of the phenotype (genotype-phenotype correlation), and the frequency of disease-causing or predisposing variants in various populations.

Other patients with the same or similar disorders can be examined to see whether they also harbor mutations in the same gene; finding that some of them do not would indicate that there is locus heterogeneity for that disorder. Once the gene and variants in that gene are identified in affected individuals, highly specific methods of diagnosis, including prenatal diagnosis, and carrier screening can be offered to patients and their families.

The variants associated with disease can then be modeled in other organisms, which allows us to use powerful genetic, biochemical, and physiological tools to better understand how the disease comes about. Finally, armed with an understanding of gene function and how the alleles associated with disease affect that function, we can begin to develop specific therapies, including gene replacement therapy, to prevent or ameliorate the disorder. Indeed, much of the material in the next few chapters about the etiology, pathogenesis, mechanism, and treatment of various diseases begins with gene mapping. Here, we examine the major approaches used to discover genes involved in genetic disease, as outlined at the beginning of this chapter.

Mapping Human Disease Genes by Linkage Analysis

Determining Whether Two Loci Are Linked

Linkage analysis is a method of mapping genes that uses studies of recombination in families to determine whether two genes show linkage when passed on from one generation to the next. We use information from the known or suspected mendelian inheritance pattern (dominant, recessive, X-linked) to determine which of the individuals in a family have inherited a recombinant or a nonrecombinant chromosome.

To decide whether two loci are linked and, if so, how close or far apart they are, we rely on two pieces of information. First, using the family data in hand, we need to estimate θ, the recombination frequency between the two loci, because that will tell us how close or far apart they are. Next, we need to ascertain whether θ is statistically significantly different from 0.5, because determining whether two loci are linked is equivalent to asking whether the recombination fraction between them differs significantly from the 0.5 fraction expected for unlinked loci. Estimating θ and, at the same time, determining the statistical significance of any deviation of θ from 0.5, relies on a statistical tool called the likelihood ratio (as discussed later in this chapter).

Linkage analysis begins with a set of actual family data with N individuals. Based on a mendelian inheritance model, count the number of chromosomes, r, that show recombination between the allele causing the disease and alleles at various polymorphic loci around the genome (so-called “markers”). The number of chromosomes that do not show a recombination is therefore N − r. The recombination fraction θ can be considered to be the unknown probability, with each meiosis, that a recombination will occur between the two loci; the probability that no recombination occurs is therefore 1 − θ. Because each meiosis is an independent event, one multiplies the probability of a recombination, θ, or of no recombination, (1 − θ), for each chromosome. The formula for the likelihood (which is just the probability) of observing this number of recombinant and nonrecombinant chromosomes when θ is unknown is therefore given by $\frac{N!}{r!\,(N-r)!}\,\theta^{r}(1-\theta)^{N-r}$. (The factorial term, $\frac{N!}{r!\,(N-r)!}$, is necessary to account for all the possible birth orders in which the recombinant and nonrecombinant children can appear in the pedigree.) Calculate a second likelihood based on the null hypothesis that the two loci are unlinked, that is, make θ = 0.50. The ratio of the likelihood of the family data supporting linkage with unknown θ to the likelihood that the loci are unlinked is the odds in favor of linkage and is given by:

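$$\frac{\text{Likelihood of the data if the loci are linked at } \theta}{\text{Likelihood of the data if the loci are unlinked } (\theta = 0.5)} = \frac{\frac{N!}{r!\,(N-r)!}\,\theta^{r}(1-\theta)^{N-r}}{\frac{N!}{r!\,(N-r)!}\,(0.5)^{r}(0.5)^{N-r}}$$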

Fortunately, the factorial terms are always the same in the numerator and denominator of the likelihood ratio, and therefore they cancel each other out and can be ignored. If θ = 0.5, the numerator and denominator are the same and the odds equal 1.

Statistical theory tells us that when the values of the likelihood ratio for all values of θ between 0 and 0.5 are calculated, the value of θ that gives the greatest value of this likelihood ratio is, in fact, the best estimate of the recombination fraction you can make given the data and is referred to as θmax. By convention, the computed likelihood ratio for different values of θ is usually expressed as its $\log_{10}$ and is called the LOD score (Z), where LOD stands for “Logarithm of the ODds.” The use of logarithms allows likelihood ratios calculated from different families to be combined by simple addition instead of having to multiply them together.

How is LOD score analysis actually carried out in families with mendelian disorders? (See Box.) Return to the family shown in Figure 10-6, in which the mother has an autosomal dominant form of retinitis pigmentosa (RP). There are dozens of different forms of this disease, many of which have been mapped to specific sites within the genome and the genes for which have now been identified. Typically, when a new family comes to clinical attention, one does not know which form of RP a patient has. In this family, the mother is also heterozygous for two marker loci on chromosome 7, locus 1 in distal 7q and locus 2 in 7p14. Suppose we know (from other family data) that the disease allele D is in coupling with allele A at locus 1 and allele B at locus 2. Given this phase, one can see that there has been recombination between RP and locus 2 in only one of her eight children, her daughter II-3. The alleles at the disease locus, however, show no tendency to follow the alleles at locus 1 or alleles at any of the other hundreds of marker loci tested on the other autosomes. Thus, although the RP locus involved in this family could in principle have mapped anywhere in the human genome, one now begins to suspect on the basis of the linkage data that the responsible RP locus lies in the region of chromosome 7 near marker locus 2.

To provide a quantitative assessment of this suspicion, suppose we let θ be the “true” recombination fraction between RP and locus 2, the fraction we would see if we had unlimited numbers of offspring to test. The likelihood ratio for this family is therefore

$$\frac{\theta^{1}(1-\theta)^{7}}{(0.5)^{8}}$$

and reaches a maximum LOD score of Zmax = 1.1 at θmax = 0.125.

The value of θ that maximizes the likelihood ratio, θmax, may be the best estimate one can make for θ given the data, but how good an estimate is it? The magnitude of the LOD score provides an assessment of how good an estimate of θmax you have made. By convention, a LOD score of +3 or greater (equivalent to odds of 1000 : 1 or greater in favor of linkage) is considered firm evidence that two loci are linked, that is, that θmax is statistically significantly different from 0.5. In our RP example, 7/8 of the offspring are nonrecombinant and 1/8 are recombinant. The θmax = 0.125, but the LOD score is only 1.1, enough to raise a suspicion of linkage but insufficient to prove linkage because Zmax falls far short of 3.
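The calculation for this family can be checked numerically. The following is a minimal sketch, assuming a phase-known pedigree in which every child can be scored as recombinant or nonrecombinant; the function names `lod_score` and `max_lod` and the grid-search approach are illustrative choices, not part of the original text.

```python
import math

def lod_score(theta: float, recombinants: int, meioses: int) -> float:
    """LOD score Z(theta) for a phase-known pedigree.

    The factorial term cancels between numerator and denominator, so only
    theta^r * (1 - theta)^(N - r) / 0.5^N remains.
    """
    nonrecombinants = meioses - recombinants
    likelihood_linked = theta ** recombinants * (1.0 - theta) ** nonrecombinants
    likelihood_unlinked = 0.5 ** meioses
    if likelihood_linked == 0.0:
        # e.g., theta = 0 when at least one recombinant was observed
        return float("-inf")
    return math.log10(likelihood_linked / likelihood_unlinked)

def max_lod(recombinants: int, meioses: int, steps: int = 1000):
    """Grid search over theta in [0, 0.5]; returns (theta_max, Z_max)."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    theta_max = max(grid, key=lambda t: lod_score(t, recombinants, meioses))
    return theta_max, lod_score(theta_max, recombinants, meioses)

# RP family from the text: 1 recombinant among 8 children
theta_max, z_max = max_lod(recombinants=1, meioses=8)
print(f"theta_max = {theta_max:.3f}, Z_max = {z_max:.2f}")
```

For one recombinant in eight meioses, the search recovers θmax = 0.125 and Zmax ≈ 1.1, matching the values quoted above. At θ = 0.5 the function returns 0, corresponding to odds of 1.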

Linkage Analysis of Mendelian Diseases

Linkage analysis is used when there is a particular mode of inheritance (autosomal dominant, autosomal recessive, or X-linked) that explains the inheritance pattern.

LOD score analysis allows mapping of genes in which mutations cause diseases that follow mendelian inheritance.

The LOD score gives both:

• A best estimate of the recombination frequency, θmax, between a marker locus and the disease locus; and

• An assessment of how strong the evidence is for linkage at that value of θmax. Values of the LOD score Z above 3 are considered strong evidence.

Linkage at a particular θmax of a disease gene locus to a marker with known physical location implies that the disease gene locus must be near the marker. The smaller the θmax is, the closer the disease locus is to the linked marker locus.

Combining LOD Score Information across Families

In the same way that each meiosis in a family that produces a nonrecombinant or recombinant offspring is an independent event, so too are the meioses that occur in different families. We can therefore multiply the likelihoods in the numerators and denominators of each family's likelihood odds ratio together. Suppose two additional families with RP were studied and one showed no recombination between locus 2 and RP in four children and the other showed no recombination in five children. The individual LOD scores can be generated for each family and added together (Table 10-1). Because the maximum LOD score Zmax exceeds 3 at θmax ≈ 0.06, the RP gene in this group of families is linked to locus 2 at a recombination distance of ≈0.06. Because the genomic location of marker locus 2 is known to be at 7p14, the RP in these families can be mapped to the 7p14 region and likely involves the RP9 gene, one of the already identified loci for a form of autosomal dominant RP.

TABLE 10-1

LOD Score for Three Families with Retinitis Pigmentosa

image

Individual Zmax for each family is shown in bold. The overall Zmax = 3.47 at θmax = 0.06.
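The additivity of LOD scores across families can be sketched in a few lines. This example assumes all three pedigrees are phase-known and uses the counts given in the text (one recombinant in eight children, none in four, none in five); the names `families`, `family_lod`, and `combined_lod` are hypothetical.

```python
import math

# (recombinants, scored meioses) for the three RP families in the text
families = [(1, 8), (0, 4), (0, 5)]

def family_lod(theta: float, recombinants: int, meioses: int) -> float:
    """LOD score for one phase-known family, computed on the log scale."""
    if theta == 0.0 and recombinants > 0:
        return float("-inf")  # an observed recombinant rules out theta = 0
    recombinant_term = recombinants * math.log10(theta) if recombinants else 0.0
    nonrecombinant_term = (meioses - recombinants) * math.log10(1.0 - theta)
    return recombinant_term + nonrecombinant_term + meioses * math.log10(2.0)

def combined_lod(theta: float) -> float:
    """Independent families: multiply likelihood ratios, i.e., add LOD scores."""
    return sum(family_lod(theta, r, n) for r, n in families)

grid = [i / 1000 for i in range(0, 501)]  # theta from 0 to 0.5
theta_max = max(grid, key=combined_lod)
z_max = combined_lod(theta_max)
print(f"theta_max = {theta_max:.3f}, combined Z_max = {z_max:.2f}")
```

With one recombinant among 17 scored meioses in total, the combined score peaks near θ = 1/17 ≈ 0.06 with Zmax ≈ 3.47, in agreement with the overall values reported for Table 10-1.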

If, however, some of the families being used for the study were to have RP due to mutations at a different locus, the LOD scores between families would diverge, with some showing a trend to being positive at small values of θ and others showing strongly negative LOD scores at these values. Thus, in linkage analysis involving more than one family, unsuspected locus heterogeneity can obscure what may be real evidence for linkage in a subset of families.

Phase-Known and Phase-Unknown Pedigrees

In the RP example just discussed, we assumed that we knew the phase of marker alleles on chromosome 7 in the affected mother in that family. Let us now look at the implications of knowing phase in more detail.

Consider the three-generation family with autosomal dominant neurofibromatosis, type 1 (NF1) (Case 34) in Figure 10-10. The affected mother, II-2, is heterozygous at both the NF1 locus (D/d) and a marker locus (A/a), but (as shown in Fig. 10-10A) we have no genotype information on her parents. The two affected children received the A allele along with the D disease allele, and the one unaffected child received the a allele along with the normal d allele. Without knowing the phase of these alleles in the mother, either all three offspring are recombinants or all three are nonrecombinants. Because both possibilities are equally likely in the absence of any other information, we consider the phase on her two chromosomes to be D-a and d-A half of the time and D-A and d-a the other half (which assumes the alleles in these haplotypes are in linkage equilibrium). To calculate the overall likelihood of this pedigree, we then add the likelihood calculated assuming one phase in the mother to the likelihood calculated assuming the other phase. Therefore, the overall likelihood = $\tfrac{1}{2}(1-\theta)^{3}\theta^{0} + \tfrac{1}{2}\theta^{3}(1-\theta)^{0}$ and the likelihood ratio for this pedigree, then, is:

$$\frac{\tfrac{1}{2}(1-\theta)^{3}\theta^{0} + \tfrac{1}{2}\theta^{3}(1-\theta)^{0}}{(0.5)^{3}}$$

giving a maximum LOD score of Zmax = 0.602 at θmax = 0.

image

FIGURE 10-10 Two pedigrees of autosomal dominant neurofibromatosis, type 1 (NF1). A, Phase of the disease allele D and marker alleles A and a in individual II-2 is unknown. B, Availability of genotype information for generation I allows a determination that the disease allele D and marker allele A are in coupling in individual II-2. NR, Non-recombinant; R, recombinant.

If, however, additional genotype information in the maternal grandfather I-1 becomes available (as in Fig. 10-10B), the phase can now be determined to be D-A (i.e., the NF1 allele D was in coupling with the A allele in individual II-2). In light of this new information, the three children can now be scored definitively as nonrecombinants, and we no longer have to consider the possibility of the opposite phase. The numerator of the likelihood ratio now becomes $(1-\theta)^{3}\theta^{0}$ and the maximum LOD score Zmax = 0.903 at θmax = 0. Thus knowing the phase increases the power of the data available to test for linkage.
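The gain in power from knowing phase can be verified with a short calculation. This is a sketch under the assumptions stated in the text (the two phases are equally likely when phase is unknown); the function names are illustrative only.

```python
import math

def lod_phase_known(theta: float, recombinants: int, meioses: int) -> float:
    """LOD score when each child can be scored as recombinant or not."""
    linked = theta ** recombinants * (1 - theta) ** (meioses - recombinants)
    return math.log10(linked / 0.5 ** meioses)

def lod_phase_unknown(theta: float, recombinants_if_phase_1: int, meioses: int) -> float:
    """LOD score averaging over the two equally likely parental phases.

    Children that are recombinant under one phase are nonrecombinant
    under the other, so the two terms mirror each other.
    """
    k = recombinants_if_phase_1
    n = meioses
    phase_1 = theta ** k * (1 - theta) ** (n - k)
    phase_2 = theta ** (n - k) * (1 - theta) ** k
    return math.log10((0.5 * phase_1 + 0.5 * phase_2) / 0.5 ** n)

# NF1 family: three children, all nonrecombinant under one phase
z_unknown = lod_phase_unknown(0.0, recombinants_if_phase_1=0, meioses=3)
z_known = lod_phase_known(0.0, recombinants=0, meioses=3)
print(f"phase unknown: {z_unknown:.3f}, phase known: {z_known:.3f}")
```

At θ = 0 the phase-unknown likelihood ratio is 4 (LOD 0.602), whereas the phase-known ratio is 8 (LOD 0.903), so the same three children contribute more evidence once the grandparental genotypes resolve the phase.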

Mapping Human Disease Genes by Association

Designing an Association Study

An entirely different approach to identification of the genetic contribution to disease relies on finding particular alleles that are associated with the disease in a sample from the population. In contrast to linkage analysis, this approach does not depend upon there being a mendelian inheritance pattern and is therefore better suited for discovering the genetic contributions to disorders with complex inheritance (see Chapter 8). The presence of a particular allele at a locus at increased or decreased frequency in affected individuals compared with controls is known as a disease association. There are two commonly used study designs for association studies:

• Case-control studies. Individuals with the disease are selected in a population, a matching group of controls without disease are then selected, and the genotypes of individuals in the two groups are determined and used to populate a two-by-two table (see below).

• Cross-sectional or cohort studies. A random sample of the entire population is chosen and then analyzed for whether they have (cross-sectional) or, after being followed over time, develop (cohort) a particular disease; the genotypes of everyone in the study population are determined. The numbers of individuals with and without disease and with and without an allele (or genotype or haplotype) of interest are used to fill out the cells of a two-by-two table.

Odds Ratios and Relative Risks

The two different types of association studies report the strength of the association, using either the odds ratio or relative risk.

In a case-control study, the frequency of a particular allele or haplotype (e.g., for a human leukocyte antigen [HLA] haplotype or a particular SNP allele or SNP haplotype) is compared between the selected affected and unaffected individuals, and an association between disease and genotype is then calculated by an odds ratio (OR).

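| | Affected with Disease | Unaffected | Total |
|---|---|---|---|
| Genetic marker* present | a | b | a + b |
| Genetic marker* absent | c | d | c + d |
| Total | a + c | b + d | |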

*A genetic marker can be an allele, a genotype, or a haplotype.

Using the two-by-two table, the odds of an allele carrier developing the disease is the ratio (a/b) of the number of allele carriers who develop the disease (a) to the number of allele carriers who do not develop the disease (b). Similarly, the odds of a noncarrier developing the disease is the ratio (c/d) of noncarriers who develop the disease (c) divided by the number of noncarriers who do not develop the disease (d). The disease odds ratio is then the ratio of these odds.

$$\text{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}$$

An OR that differs from 1 means there is an association of disease risk with the genetic marker, whereas OR = 1 means there is no association.

Alternatively, if the association study was designed as a cross-sectional or cohort study, the strength of an association can be measured by the relative risk (RR). The RR is the ratio of the proportion of allele carriers who have the disease ([a/(a + b)]) to the proportion of noncarriers who have the disease ([c/(c + d)]).

$$\text{RR} = \frac{a/(a+b)}{c/(c+d)}$$

Again, an RR that differs from 1 means there is an association of disease risk with the genetic marker, whereas RR = 1 means there is no association. (The relative risk RR introduced here should not be confused with λr, the risk ratio in relatives, which was discussed in Chapter 8. λr is the prevalence of a particular disease phenotype in an affected individual's relatives versus that in the general population.)

For diseases that are rare (i.e., a ≪ b and c ≪ d), a case-control design with calculation of the OR is best, because any random sample of a population is unlikely to contain sufficient numbers of affected individuals to be suitable for a cross-sectional or cohort study design. Note, however, that when a disease is rare and calculating an OR in a case-control study is the only practical approach, OR is a good approximation for an RR. (Examine the formula for RR and convince yourself that, when a ≪ b and c ≪ d, (a + b) ≈ b and (c + d) ≈ d, and thus RR ≈ OR.)
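The approximation RR ≈ OR for rare diseases can be seen by comparing the two measures on a pair of two-by-two tables. The counts below are hypothetical, chosen only so that the RR is the same in both tables while the disease is rare in one and common in the other.

```python
def odds_ratio(a: float, b: float, c: float, d: float) -> float:
    """OR = (a/b) / (c/d), using the cell labels of the two-by-two table."""
    return (a / b) / (c / d)

def relative_risk(a: float, b: float, c: float, d: float) -> float:
    """RR = [a/(a+b)] / [c/(c+d)]."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts: (carriers affected, carriers unaffected,
#                       noncarriers affected, noncarriers unaffected)
rare_disease = (30, 9970, 10, 9990)
common_disease = (300, 700, 100, 900)

or_rare, rr_rare = odds_ratio(*rare_disease), relative_risk(*rare_disease)
or_common, rr_common = odds_ratio(*common_disease), relative_risk(*common_disease)

print(f"rare disease:   OR = {or_rare:.2f}, RR = {rr_rare:.2f}")
print(f"common disease: OR = {or_common:.2f}, RR = {rr_common:.2f}")
```

In the rare-disease table the OR (about 3.01) is nearly identical to the RR of 3.0, whereas in the common-disease table the OR (about 3.86) overstates the same RR.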

The information obtained in an association study comes in two parts. The first is the magnitude of the association itself: the further the RR or OR diverges from 1, the greater is the effect of the genetic variant on the association. However, an OR or RR for an association is a statistical measure and requires a test of statistical significance. The significance of any association can be assessed by simply asking with a chi-square test if the frequencies of the allele (a, b, c, and d in the two-by-two table) differ significantly from what would be expected if there were no association (i.e., if the OR or RR were equal to 1.0). A common way of expressing whether there is statistical significance to an estimate of OR or RR is to provide a 95% (or 99%) confidence interval. The confidence interval is the range within which one would expect the OR or RR to fall 95% (or 99%) of the time by chance alone in a sample taken from the population. If a confidence interval excludes the value 1.0, then the OR or RR deviates significantly from what would be expected if there were no association with the marker locus being tested, and the null hypothesis of no association can be rejected at the corresponding significance level. (Later in this chapter we will explain why a level of 0.05 or 0.01 is inadequate for assessing statistical significance when multiple marker loci in the genome are tested simultaneously for association.)

To illustrate these approaches, we first consider a case-control study of cerebral vein thrombosis (CVT), which we introduced in Chapter 8. In this study, suppose a group of 120 patients with CVT and 120 matched controls were genotyped for the 20210G>A allele in the prothrombin gene (see Chapter 8).

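| | Patients with CVT | Controls without CVT | Total |
|---|---|---|---|
| 20210G>A allele present | 23 | 4 | 27 |
| 20210G>A allele absent | 97 | 116 | 213 |
| Total | 120 | 120 | 240 |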

CVT, Cerebral vein thrombosis.

Because this is a case-control study, we will calculate an odds ratio: OR = (23/4)/(97/116) ≈ 6.9 with 95% confidence limits of 2.3 to 20.6. There is clearly a substantial effect size of 6.9 and 95% confidence limits that exclude 1.0, thereby demonstrating that there is a strong and statistically significant association between the 20210G>A allele and CVT. Stated simply, individuals carrying the prothrombin 20210G>A allele have nearly seven times greater odds of having the disease than those who do not carry this allele.
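The figures in this example can be reproduced from the four cell counts. The text does not say how the confidence limits were obtained; the sketch below assumes the common log-odds (Woolf) approximation for the interval and an uncorrected Pearson chi-square test with one degree of freedom, both of which are assumptions rather than the study's stated method.

```python
import math

# Cell counts implied by OR = (23/4)/(97/116) with 120 cases and 120 controls
a, b = 23, 4      # allele carriers: cases, controls
c, d = 97, 116    # noncarriers: cases, controls

odds_ratio = (a / b) / (c / d)

# Assumed method: Woolf (log-odds) 95% confidence interval
standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * standard_error)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * standard_error)

# Assumed method: Pearson chi-square for a 2x2 table (1 degree of freedom)
n = a + b + c + d
chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
# For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"OR = {odds_ratio:.1f}, 95% CI {ci_low:.1f} to {ci_high:.1f}")
print(f"chi-square = {chi_square:.2f}, P = {p_value:.1e}")
```

Under these assumptions the interval comes out at roughly 2.3 to 20.6, consistent with the limits quoted above, and the chi-square test gives a P value well below 0.001.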

To illustrate a longitudinal cohort study in which RR, instead of an OR, can be calculated, consider statin-induced myopathy, a rare but well-recognized adverse drug reaction that can develop in some individuals during statin therapy to lower cholesterol. In one study, subjects enrolled in a cardiac protection study were randomized to receive 40 mg of the statin drug simvastatin or placebo. Over 16,600 participants exposed to the statin were genotyped for a variant (Val174Ala) in the SLCO1B1 gene, which encodes a hepatic drug transporter, and were watched for development of the adverse drug response. Out of the entire genotyped group exposed to the statin, 21 developed myopathy. Examination of their genotypes showed that the RR for developing myopathy associated with the presence of the Val174Ala allele is approximately 2.6, with 95% confidence limits of 1.3 to 5.1. Thus here there is a statistically significant association between the Val174Ala allele and statin-induced myopathy; those carrying this allele are at moderately increased risk for developing this adverse drug reaction relative to those who do not carry this allele.

One common misconception concerning an association study is that the more significant the P value, the stronger is the association. In fact, a significant P value for an association does not provide information concerning the magnitude of the effect of an associated allele on disease susceptibility. Significance is a statistical measure that describes how likely it is that the population sample used for the association study could have yielded an observed OR or RR that differs from 1.0 simply by chance alone. In contrast, the actual magnitude of the OR or RR—how far it diverges from 1.0—is a measure of the impact a particular variant (or genotype or haplotype) has on increasing or decreasing disease susceptibility.

Genome-Wide Association Studies

The Haplotype Map (HapMap)

For many years, association studies for human disease genes were limited to particular sets of variants in restricted sets of genes chosen either for convenience or because they were thought to be involved in a pathophysiological pathway relevant to a disease and thus appeared to be logical candidate genes for the disease under investigation. Thus many such association studies were undertaken before the Human Genome Project era with use of the HLA or blood group loci, for example, because these loci were highly polymorphic and easily genotyped in case-control studies. Ideally, however, one would like to be able to test systematically for an association between any disease of interest and every one of the tens of millions of rare and common alleles in the genome in an unbiased fashion without any preconception of what genes and genetic variants might be contributing to the disease.

Association analyses on a genome scale are referred to as genome-wide association studies, known by their acronym GWAS. Such an undertaking for all known variants is impractical for many reasons but can be approximated by genotyping cases and controls for a mere 300,000 to 1 million individual variants located throughout the genome to search for association with the disease or trait in question. The success of this approach depends on exploiting LD because, as long as a variant responsible for altering disease susceptibility is in LD with one or more of the genotyped variants within an LD block, a positive association should be detectable between that disease and the alleles in the LD block.

Developing such a set of markers led to the launch of the Haplotype Mapping (HapMap) Project, one of the biggest human genomics efforts to follow completion of the Human Genome Project. The HapMap Project began in four geographically distinct groups (a primarily European population, a West African population, a Han Chinese population, and a population from Japan) and included collecting and characterizing millions of SNP loci and developing methods to genotype them rapidly and inexpensively. Since that time, whole-genome sequencing has been applied to many populations in what is referred to as the 1000 Genomes Project, resulting in a massive expansion in the database of DNA variants available for GWAS with different populations around the globe.

Gene Mapping by Genome-Wide Association Studies in Complex Traits

The purpose of the HapMap was not just to gather basic information about the distribution of LD across the human genome. Its primary purpose was to provide a powerful new tool for finding the genetic variants that contribute to human disease and other traits by making possible an approximation to an idealized, full-scale, genome-wide association. The driving principle behind this approach is a straightforward one: detecting an association with alleles within an LD block pinpoints the genomic region within the block as being likely to contain the disease-associated allele. Consequently, although the approach does not typically pinpoint the actual variant that is responsible functionally for the association with disease, this region will be the place to focus additional studies to find the allelic variant that is functionally involved in the disease process itself.

Historically, detailed analysis of conditions associated with high-density variants in the class I and class II HLA regions (see Fig. 8-10) has exemplified this approach (see Box). However, with the tens of millions of variants now available in different populations, this approach can be broadened to examine the genetic basis of virtually any complex disease or trait. Indeed, to date, thousands of GWAS have uncovered an enormous number of naturally occurring variants associated with a variety of genetically complex multifactorial diseases, ranging from diabetes and inflammatory bowel disease to rheumatoid arthritis and cancer, as well as for traits such as stature and pigmentation. Research to uncover the underlying biological basis for these associations will be ongoing for years to come.

Human Leukocyte Antigen and Disease Association

Among more than a thousand genome-trait or genome-disease associations from around the genome, the region with the highest concentration of associations to different phenotypes is the human leukocyte antigen (HLA) region. In addition to the association of specific alleles and haplotypes to type 1 diabetes discussed in Chapter 8, association of various HLA polymorphisms has been demonstrated for a wide range of conditions, most but not all of which are autoimmune, that is, associated with an abnormal immune response apparently directed against one or more self-antigens. These associations are thought to be related to variation in the immune response resulting from polymorphism in immune response genes.

The functional basis of most HLA-disease associations is unknown. HLA molecules are integral to T-cell recognition of antigens. Different polymorphic HLA alleles are thought to result in structural variation in these cell surface molecules, leading to differences in the capacity of the proteins to interact with antigen and the T-cell receptor in the initiation of an immune response, thereby affecting such critical processes as immunity against infections and self-tolerance to prevent autoimmunity.

Ankylosing spondylitis, a chronic inflammatory disease of the spine and sacroiliac joints, is one example. More than 95% of those with ankylosing spondylitis are HLA-B27-positive; the risk for developing ankylosing spondylitis is at least 150 times higher for people who have certain HLA-B27 alleles than for those who do not. These alleles lead to HLA-B27 heavy chain misfolding and inefficient antigen presentation.

In other disorders, the association between a particular HLA allele or haplotype and a disease is not due to functional differences in immune response genes themselves. Instead, the association is due to a particular allele being present at a very high frequency on chromosomes that also happen to contain disease-causing mutations in another gene within the major histocompatibility complex region. One example is hemochromatosis, a common disorder of iron overload. More than 80% of patients with hemochromatosis are homozygous for a common mutation, Cys282Tyr, in the hemochromatosis gene (HFE) and have HLA-A*0301 alleles at their HLA-A locus. The association is not the result of HLA-A*0301, however. HFE is involved with iron transport or metabolism in the intestine; HLA-A, as a class I immune response gene, has no effect on iron transport. The association is due to proximity of the two loci and LD between the Cys282Tyr HFE mutation and the A*0301 allele at HLA-A.

Pitfalls in Design and Analysis of GWAS

Association methods are powerful tools for pinpointing precisely the genes that contribute to genetic disease by demonstrating not only the genes but also the particular alleles responsible. They are also relatively easy to perform because one needs samples only from a set of unrelated affected individuals and controls and does not have to carry out laborious family studies and collection of samples from many members of a pedigree.

Association studies must be interpreted with caution, however. One serious limitation of association studies is the problem of totally artifactual association caused by population stratification (see Chapter 9). If a population is stratified into separate subpopulations (e.g., by ethnicity or religion) and members of one subpopulation rarely mate with members of other subpopulations, then a disease that happens to be more common in one subpopulation for whatever reason can appear (incorrectly) to be associated with any alleles that also happen to be more common in that subpopulation than in the population as a whole. Factitious association due to population stratification can be minimized, however, by careful selection of matched controls. In particular, one form of quality control is to make sure the cases and controls have similar frequencies of alleles whose frequencies are known to differ markedly between populations (ancestry informative markers, as we discussed in Chapter 9). If the frequencies seen in cases and controls are similar, then unsuspected or cryptic stratification is unlikely.
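A small numerical illustration shows how stratification alone can manufacture an association. The two subpopulations below are entirely hypothetical: within each one, carrying the marker allele has no effect on disease risk, but both the allele frequency and the disease prevalence differ between them.

```python
def expected_counts(n: int, carrier_freq: float, prevalence: float):
    """Expected 2x2 cells (a, b, c, d) when marker and disease are independent."""
    carriers = n * carrier_freq
    noncarriers = n - carriers
    return (
        carriers * prevalence,           # a: carriers with disease
        carriers * (1 - prevalence),     # b: carriers without disease
        noncarriers * prevalence,        # c: noncarriers with disease
        noncarriers * (1 - prevalence),  # d: noncarriers without disease
    )

def odds_ratio(a: float, b: float, c: float, d: float) -> float:
    return (a * d) / (b * c)

# Hypothetical subpopulations that rarely intermarry
subpop_1 = expected_counts(1000, carrier_freq=0.8, prevalence=0.10)
subpop_2 = expected_counts(1000, carrier_freq=0.2, prevalence=0.02)

# Ignoring the population structure means adding the tables cell by cell
pooled = tuple(x + y for x, y in zip(subpop_1, subpop_2))

print(f"OR within subpopulation 1: {odds_ratio(*subpop_1):.2f}")
print(f"OR within subpopulation 2: {odds_ratio(*subpop_2):.2f}")
print(f"OR in the pooled sample:   {odds_ratio(*pooled):.2f}")
```

The OR is 1 within each subpopulation, yet the pooled sample yields an OR of about 2.5, purely because the allele and the disease are both more common in the first subpopulation.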

In addition to the problem of stratification producing false-positive associations, false-positive results in GWAS can also arise if an inappropriately lax test for statistical significance is applied. This is because as the number of alleles being tested for a disease association increases, the chance of finding associations by chance alone also increases, a concept in statistics known as the problem of multiple hypothesis testing. To understand why the cut-off for statistical significance must be much more stringent when multiple hypotheses are being tested, imagine flipping a coin 50 times and having it come up heads 40 times. Such a highly unusual result has a probability of occurring of only once in approximately 100,000 times. However, if the same experiment were repeated a million times, chances are greater than 99.999% that at least one coin flip experiment out of the million performed will result in 40 or more heads! Thus even rare events that occur by chance alone in an experiment become frequent when the experiment is repeated over and over again. This is why when testing for an association with hundreds of thousands to millions of variants across the genome, tens of thousands of variants could appear associated with P < 0.05 by chance alone, making a typical cutoff for statistical significance of P < 0.05 far too lenient to point to a true association. Instead, a significance level of $P < 5 \times 10^{-8}$ is considered to be more appropriate for GWAS that test hundreds of thousands to millions of variants. Even with appropriately stringent cutoffs for genome-wide significance, however, false-positive results due to chance alone will still occur. To take this into account, a properly performed GWAS usually includes a replication study in a different, completely independent group of individuals to show that alleles near the same locus are associated. A caveat, however, is that alleles that show association may be different in different ethnic groups.
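The coin-flip argument and the genome-wide threshold can both be checked with a few lines of arithmetic. The sketch below computes the exact binomial tail for a fair coin; the final step uses a Bonferroni correction for one million independent tests, which is one common rationale for the genome-wide threshold but is an assumption here, because the text does not name the correction used.

```python
import math

flips = 50
# Exact probability of 40 or more heads in 50 flips of a fair coin
p_tail = sum(math.comb(flips, k) for k in range(40, flips + 1)) / 2 ** flips

# Probability that at least one of a million repeated experiments
# produces 40 or more heads
experiments = 1_000_000
p_at_least_once = 1 - (1 - p_tail) ** experiments

# Expected number of chance "hits" at P < 0.05 across one million tests
tests = 1_000_000
expected_false_positives = round(tests * 0.05)

# Assumed rationale: Bonferroni correction, alpha divided by number of tests
bonferroni_threshold = 0.05 / tests

print(f"P(>= 40 heads in 50 flips) = {p_tail:.2e}")
print(f"P(at least once in a million experiments) = {p_at_least_once:.6f}")
print(f"expected false positives at P < 0.05: {expected_false_positives}")
print(f"Bonferroni threshold: {bonferroni_threshold:.0e}")
```

The tail probability is on the order of 1 in 100,000, the chance of seeing it at least once in a million repetitions exceeds 99.999%, and dividing 0.05 by one million tests gives the 5 × 10^-8 level cited for genome-wide significance.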

Finally, it is important to emphasize that if an association is found between a disease and a polymorphic marker allele that is part of a dense haplotype map, one cannot infer there is a functional role for that marker allele in increasing disease susceptibility. Because of the nature of LD, all alleles in LD with an allele at a locus involved in the disease will show an apparently positive association, whether or not they have any functional relevance in disease predisposition. An association based on LD is still quite useful, however, because in order for the polymorphic marker alleles to appear associated, the associated polymorphic marker alleles would likely sit within an LD block that also harbors the actual disease locus.

A comparison of the characteristics, strengths, and weaknesses of linkage and association methods for disease gene mapping is summarized in the Box.

Comparison of Linkage and Association Methods

image

From Gene Mapping to Gene Identification

The application of gene mapping to medical genetics using the approaches outlined in the previous section has met with many spectacular successes. This strategy has led to the identification of the genes associated with thousands of mendelian disorders and a growing number of genes and alleles associated with genetically complex disorders. The power of these approaches has increased enormously with the introduction of highly efficient and less expensive technologies for genome analysis.

In this section, we describe how genetic and genomic methods led to the identification of the genes involved in two disorders. The first example used linkage analysis and LD to narrow down the location of the gene responsible for the common autosomal recessive disease cystic fibrosis (CF) (Case 12). The second used GWAS to find multiple allelic variants in genes that increase susceptibility to age-related macular degeneration (AMD) (Case 3), a devastating disorder that robs older adults of their vision.

Gene Finding in a Common Mendelian Disorder by Linkage Mapping

Example: Cystic Fibrosis

Because of its relatively high frequency, particularly in white populations, and the nearly total lack of understanding of the abnormalities underlying its pathogenesis, CF represented a prime candidate for identifying the gene responsible by using linkage to find the gene's location, rather than using any information on the disease process itself. DNA samples from nearly 50 CF families were analyzed for linkage between CF and hundreds of DNA markers throughout the genome until linkage of CF to markers on the long arm of chromosome 7 was finally identified. Linkage to additional DNA markers in 7q31-q32 narrowed the localization of the CF gene to an approximately 500-kb region of chromosome 7.

Linkage Disequilibrium in Cystic Fibrosis.

At this point, however, an important feature of CF genetics emerged: even though the closest linked markers were still some distance from the CF gene, it became clear that there was significant LD between the disease locus and a particular haplotype at loci tightly linked to the disease. Regions with the greatest degree of LD were analyzed for gene sequences, leading to the isolation of the CF gene in 1989. As described in detail in Chapter 12, the gene responsible, which was named the cystic fibrosis transmembrane conductance regulator (CFTR), showed an interesting spectrum of mutations. A 3-bp deletion (ΔF508) that removed a phenylalanine at position 508 in the protein was found in approximately 70% of all mutant CF alleles in northern European populations but never among normal alleles at this locus. Although subsequent studies have demonstrated many hundreds of mutant CFTR alleles worldwide, it was the high frequency of the ΔF508 mutation in the families used to map the CF gene and the LD between it and alleles at polymorphic marker loci nearby that proved so helpful in the ultimate identification of the CFTR gene.

Mapping of the CF locus and cloning of the CFTR gene made possible a wide range of research advances and clinical applications, from basic pathophysiology to molecular diagnosis for genetic counseling, prenatal diagnosis, animal models, and finally current ongoing attempts to treat the disorder (see Chapter 12).

Finding the Genes Contributing to a Complex Disease by Genome-Wide Association

Example: Age-Related Macular Degeneration

AMD is a progressive degenerative disease of the portion of the retina responsible for central vision. It causes blindness in 1.75 million Americans older than 50 years. The disease is characterized by the presence of drusen, which are clinically visible, discrete extracellular deposits of protein and lipids behind the retina in the region of the macula (Case 3). Although there is ample evidence for a genetic contribution to the disease, most individuals with AMD are not in families in which there is a likely mendelian pattern of inheritance. Environmental contributions are also important, as shown by the increased risk for AMD in cigarette smokers compared with nonsmokers.

Initial case-control GWAS of AMD revealed association with two common SNPs near the complement factor H (CFH) gene. The most frequent at-risk haplotype containing these alleles was seen in 50% of cases versus only 29% of controls (OR = 2.46; 95% confidence interval [CI], 1.95 to 3.11). Homozygosity for this haplotype was found in 24.2% of cases, compared to only 8.3% of the controls (OR = 3.51; 95% CI, 2.13 to 5.78). A search through the SNPs within the LD block containing the AMD-associated haplotype revealed a nonsynonymous SNP in the CFH gene that substituted a histidine for tyrosine at position 402 of the CFH protein (Tyr402His). The Tyr402His alteration, which has an allele frequency of 26% to 29% in white and African populations, showed an even stronger association with AMD than did the two SNPs that showed an association in the original GWAS.
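As a rough check on these figures, an odds ratio can be approximated from the reported percentages alone. This sketch assumes the percentages are simple proportions of cases and controls. Because they are rounded and the underlying counts are not given here, the results are close to, but not identical with, the published ORs, and the confidence intervals cannot be reproduced.

```python
def odds_ratio_from_freqs(freq_cases, freq_controls):
    """Odds ratio from the proportion of cases and of controls carrying a factor."""
    odds_cases = freq_cases / (1 - freq_cases)
    odds_controls = freq_controls / (1 - freq_controls)
    return odds_cases / odds_controls

# Percentages quoted in the text for the CFH at-risk haplotype.
or_haplotype = odds_ratio_from_freqs(0.50, 0.29)     # reported OR: 2.46
or_homozygous = odds_ratio_from_freqs(0.242, 0.083)  # reported OR: 3.51

print(f"Approximate OR, at-risk haplotype: {or_haplotype:.2f}")
print(f"Approximate OR, homozygosity:      {or_homozygous:.2f}")
```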

Given that drusen contain complement factors and that CFH is found in retinal tissues around drusen, it is believed that the Tyr402His variant is less protective against the inflammation that is thought to be responsible for drusen formation and retinal damage. Thus Tyr402His is likely to be the variant at the CFH locus responsible for increasing the risk for AMD.

More recent GWAS of AMD using more than 7600 cases and more than 50,000 controls and millions of variants genome-wide have revealed that alleles at a minimum of 19 loci are associated with AMD, with genome-wide significance of P < 5 × 10⁻⁸. A popular way to summarize GWAS in graphic form is to plot the −log₁₀ significance levels for each associated variant in what is referred to as a “Manhattan plot,” because it is thought to bear a somewhat fanciful similarity to the skyline of New York City (Fig. 10-11). The ORs for AMD of these variants range from a high of 2.76 for a gene of unknown function, ARMS2, and 2.48 for CFH to 1.1 for many other genes involved in multiple pathways, including the complement system, atherosclerosis, blood vessel formation, and others.
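The transformation behind the y-axis of a Manhattan plot is simple enough to sketch. The variant names and P values in this example are hypothetical, invented only to show how the −log₁₀ scale turns very small P values into tall points and how the genome-wide threshold appears as a horizontal line.

```python
from math import log10

def neg_log10(p_value):
    """Transform a P value to the -log10 scale used on a Manhattan plot's y-axis."""
    return -log10(p_value)

GENOME_WIDE_P = 5e-8
threshold_height = neg_log10(GENOME_WIDE_P)  # a little above 7.3

# Hypothetical P values, for illustration only.
example_p_values = {"variant_1": 0.03, "variant_2": 4e-9, "variant_3": 1e-16}

# A variant reaches genome-wide significance when its point lies above the line.
significant = [name for name, p in example_p_values.items()
               if neg_log10(p) > threshold_height]

for name, p in example_p_values.items():
    print(f"{name}: P = {p:g}, height = {neg_log10(p):.2f}")
print(f"Threshold height: {threshold_height:.2f}")
print("Genome-wide significant:", significant)
```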

image

FIGURE 10-11 “Manhattan plot” of genome-wide association studies (GWAS) of age-related macular degeneration using approximately 1 million genome-wide single nucleotide polymorphism (SNP) alleles located along all 22 autosomes on the x-axis. Each blue dot represents the statistical significance (expressed as −log₁₀(P) and plotted on the y-axis) of a variant confirming a previously known association; green dots show the statistical significance of novel associations. The discontinuity in the y-axis is needed because some of the associations have extremely small P values (< 1 × 10⁻¹⁶). See Sources & Acknowledgments.

In this example of AMD, a complex disease, GWAS led to the identification of strongly associated, common SNPs that in turn were in LD with a common coding SNP in the gene that appears to be the functional variant involved in the disease. This discovery in turn led to the identification of other SNPs in the complement cascade and elsewhere that can also predispose to or protect against the disease. Taken together, these results give important clues to the pathogenesis of AMD and suggest that the complement pathway might be a fruitful target for novel therapies. Equally interesting is that GWAS revealed that a novel gene of unknown function, ARMS2, is also involved, thereby opening up an entirely new line of research into the pathogenesis of AMD.

Importance of Associations Discovered with GWAS

There is vigorous debate regarding the interpretation of GWAS results and their value as a tool for human genetic studies. The debate arises primarily from a misunderstanding of what an OR or RR means. It is true that many properly executed GWAS yield significant associations, but of very modest effect size (similar to the OR of 1.1 just mentioned for AMD). In fact, significant associations of smaller and smaller effect size have become more common as larger and larger sample sizes allow detection of statistically significant genome-wide associations with smaller and smaller ORs or RRs. This has led to the suggestion that GWAS are of little value because the effect size of the association, as measured by OR or RR, is too small for the gene and pathway implicated by that variant to be important in the pathogenesis of the disease. This is faulty reasoning on two counts.

First, ORs are a measure of the impact of a specific allele (e.g., the CFH Tyr402His allele for AMD) on complex pathogenetic pathways, such as the alternative complement pathway of which CFH is a component. The subtlety of that impact is determined by how that allele perturbs the biological function of the gene in which it is located, and not by whether the gene harboring that allele might be important in disease pathogenesis. In autoimmune disorders, for example, studies of patients with a number of different autoimmune disorders, such as rheumatoid arthritis, systemic lupus erythematosus, and Crohn disease, reveal modest associations, but with some of the same variants, suggesting there are common pathways leading to these distinct but related diseases that will likely be quite illuminating in studies of their pathogenesis (see Box).

Second, even if the effect size of any one variant is small, GWAS demonstrate that many of these disorders are indeed extremely polygenic, even more polygenic than previously suspected, with thousands of variants, most of which contribute only a little (ORs between 1.01 and 1.1) to disease susceptibility by themselves but, in the aggregate, account for a substantial fraction of the observed clustering of these diseases within certain families (see Chapter 8).

Although the observation of modest effect size for most alleles found by GWAS is correct, it misses a critical and perhaps most fundamental finding of GWAS: the genetic architecture of some of the most common complex diseases studied to date may involve hundreds to thousands of loci harboring variants of small effect in many genes and pathways. These genes and pathways are important to our understanding of how complex diseases occur, even if each allele exerts only subtle effects on gene regulation or protein function and has only a modest effect on disease susceptibility on a per allele basis.

Thus GWAS remain an important human genetics research tool for dissecting the many contributions to complex disease, regardless of whether or not the individual variants found to be associated with the disease substantially raise the risk for the disease in individuals carrying those alleles (see Chapter 16). We expect that many more genetic variants responsible for complex diseases will be successfully identified by genome-wide association and that deep sequencing of the regions showing disease associations should uncover the variants or collections of variants functionally responsible for disease associations. Such findings should provide us with powerful insights and potential therapeutic targets for many of the common diseases that cause so much morbidity and mortality in the population.

From GWAS to PheWAS

In genome-wide association studies (GWAS), one explores the genetic basis for a given phenotype, disease, or trait by searching for associations with large, unbiased collections of DNA markers from the entire genome. But can one do the reverse? Can one uncover the potential phenotypic links associated with genome variants by searching for associations with large, unbiased collections of phenotypes from the entire “phenome”? Thus far, the results of this approach appear to be highly promising.

In an approach dubbed phenome-wide association studies (PheWAS), genetic variants are tested for association, not just with a particular phenotype of interest (say, rheumatoid arthritis or systolic blood pressure above 160 mm Hg), but with all medically relevant phenotypes and laboratory values found in electronic medical records (EMRs). In this way, one can seek novel and unanticipated associations in an unbiased manner, using search algorithms, billing codes, and open text mining to query all electronic entries, which are fast becoming available for health records in many countries.

As an illustration of this approach, SNPs for a major class II HLA-DRB1 haplotype (as described in Chapter 8) were screened against over 4800 phenotypes in EMRs from over 4000 patients; this PheWAS detected association not only with multiple sclerosis (as expected from previous studies), but also with alcohol-induced cirrhosis of the liver, erythematous conditions such as rosacea, various benign neoplasms, and several dozen other phenotypes.

Although the potential of PheWAS is just being realized, such unbiased interrogation of vast clinical data sets may allow discovery of previously unappreciated comorbidities and/or less common side effects or drug-drug interactions in patients receiving prescribed drugs.
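The logic of a PheWAS can be sketched as a loop over phenotypes for a single variant. Everything in this example is an assumption made for illustration: the phenotype names and counts are invented, Fisher's exact test stands in for whatever association test a real study would use, and a Bonferroni correction over the number of phenotypes tested stands in for the study's actual multiple-testing procedure.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every possible table that is no more probable than the observed one.
    p_value = sum(p for p in map(prob, range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical counts for one variant, per phenotype:
# (carriers with phenotype, carriers without,
#  noncarriers with phenotype, noncarriers without)
phenome = {
    "phenotype_A": (30, 70, 10, 90),
    "phenotype_B": (12, 88, 10, 90),
    "phenotype_C": (5, 95, 5, 95),
}

# Bonferroni-style threshold across all phenotypes tested (an assumption).
alpha = 0.05 / len(phenome)

p_values = {name: fisher_exact_two_sided(*table) for name, table in phenome.items()}
hits = [name for name, p in p_values.items() if p < alpha]

for name, p in p_values.items():
    print(f"{name}: P = {p:.3g}")
print("Associated after correction:", hits)
```

With thousands of phenotypes instead of three, the per-phenotype threshold becomes correspondingly stricter, for the same multiple-testing reason that applies to GWAS.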

Finding Genes Responsible for Disease by Genome Sequencing

Thus far in this chapter, we have focused on two approaches to map and then identify genes involved in disease, linkage analysis and GWAS. Now we turn to a third approach, involving direct genome sequencing of affected individuals and their parents and/or other individuals in the family or population.

The development of vastly improved methods of DNA sequencing, which has cut the cost of sequencing by six orders of magnitude from what was spent generating the Human Genome Project's reference sequence, has opened up new possibilities for discovering the genes and mutations responsible for disease, particularly in the case of rare mendelian disorders. As introduced in Chapter 4, these new technologies make it possible to generate a whole-genome sequence (WGS) or, in what may be a cost-effective compromise, the sequence of only the approximately 2% of the genome containing the exons of genes, referred to as a whole-exome sequence (WES).

Filtering Whole-Genome Sequence or Whole-Exome Sequence Data to Find Potential Causative Variants

As an example of what is now possible, consider a family “trio” consisting of a child affected with a rare disorder and his parents. WGS is performed for all three, yielding typically over 4 million differences compared to the human genome reference sequence (see Chapter 4). Which of these variants is responsible for the disease? Extracting useful information from this massive amount of data relies on creating a variant filtering scheme based on a variety of reasonable assumptions about which variants are more likely to be responsible for the disease.

One example of a filtering scheme that can be used to sort through these variants is shown in Figure 10-12.

1. Location with respect to protein-coding genes. Keep variants that are within or near exons of protein-coding genes, and discard variants deep within introns or intergenic regions. It is possible, of course, that the responsible mutation might lie in a noncoding RNA gene or in regulatory sequences located some distance from a gene, as introduced in Chapter 3. However, these are currently more difficult to assess, and thus, as a simplifying assumption, it is reasonable to focus initially on protein-coding genes.

2. Population frequency. Keep rare variants from step 1, and discard common variants with allele frequencies greater than 0.05 (or some other arbitrary number between 0.01 and 0.1), because common variants are highly unlikely to be responsible for a disease whose population prevalence is much less than the q² predicted by Hardy-Weinberg equilibrium (see Chapter 9).

3. Deleterious nature of the mutation. Keep variants from step 2 that cause nonsense or nonsynonymous changes in codons within exons, cause frameshift mutations, or alter highly conserved splice sites, and discard synonymous changes that have no predicted effect on gene function.

4. Consistency with likely inheritance pattern. If the disorder is considered most likely to be autosomal recessive, keep any variants from step 3 that are found in both copies of a gene in an affected child. The child need not be homozygous for the same deleterious variant but could be a compound heterozygote for two different deleterious mutations in the same gene (see Chapter 7). If the hypothesized mode of inheritance is correct, then the parents should both be heterozygous for the variants. If there were consanguinity in the parents, the candidate genes and variants might be further filtered by requiring that the child be a true homozygote for the same mutation derived from a single common ancestor (see Chapter 9). If the disorder is severe and seems more likely to be a new dominant mutation, because unaffected parents rarely if ever have more than one affected child, keep variants from step 3 that are de novo changes in the child and are not present in either parent.
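The four filtering steps can be sketched in code for a single trio. This is a simplified illustration under stated assumptions, not a production pipeline: the `Variant` record, its field names, the consequence labels, and the gene names are all hypothetical, genotypes are coded as the number of variant alleles (0, 1, or 2), and real analyses must also deal with genotype quality, sex chromosomes, and annotation uncertainty.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    gene: str
    near_exon: bool      # step 1: within or near an exon of a protein-coding gene
    allele_freq: float   # step 2: population allele frequency
    consequence: str     # step 3: predicted effect (hypothetical labels)
    child_gt: int        # step 4: number of variant alleles in the child
    mother_gt: int
    father_gt: int

DELETERIOUS = {"nonsense", "nonsynonymous", "frameshift", "splice_site"}

def passes_steps_1_to_3(v, max_af=0.05):
    """Keep rare, potentially deleterious variants in or near exons."""
    return v.near_exon and v.allele_freq <= max_af and v.consequence in DELETERIOUS

def recessive_candidates(variants):
    """Genes in which the child carries two deleterious alleles, one from each parent."""
    by_gene = {}
    for v in variants:
        if passes_steps_1_to_3(v) and v.child_gt > 0:
            by_gene.setdefault(v.gene, []).append(v)
    genes = set()
    for gene, vs in by_gene.items():
        homozygous = any(v.child_gt == 2 and v.mother_gt == 1 and v.father_gt == 1
                         for v in vs)
        from_mother = any(v.child_gt == 1 and v.mother_gt == 1 and v.father_gt == 0
                          for v in vs)
        from_father = any(v.child_gt == 1 and v.father_gt == 1 and v.mother_gt == 0
                          for v in vs)
        # Either a true homozygote or a compound heterozygote.
        if homozygous or (from_mother and from_father):
            genes.add(gene)
    return genes

def de_novo_candidates(variants):
    """Genes with a deleterious heterozygous variant absent from both parents."""
    return {v.gene for v in variants
            if passes_steps_1_to_3(v)
            and v.child_gt == 1 and v.mother_gt == 0 and v.father_gt == 0}

# Hypothetical trio data, invented for illustration.
trio_variants = [
    Variant("GENE_A", True, 0.001, "nonsense", 1, 1, 0),
    Variant("GENE_A", True, 0.002, "nonsynonymous", 1, 0, 1),
    Variant("GENE_B", True, 0.30, "nonsynonymous", 2, 1, 1),   # too common
    Variant("GENE_C", True, 0.0, "frameshift", 1, 0, 0),       # de novo
    Variant("GENE_D", False, 0.001, "intronic", 2, 1, 1),      # deep intronic
    Variant("GENE_E", True, 0.001, "synonymous", 2, 1, 1),     # no predicted effect
]

recessive_genes = recessive_candidates(trio_variants)
de_novo_genes = de_novo_candidates(trio_variants)
print("Autosomal recessive candidates:", sorted(recessive_genes))
print("De novo dominant candidates:   ", sorted(de_novo_genes))
```

Requiring one allele from each parent in the compound heterozygote case ensures that the two variants lie on different copies of the gene, which is what the recessive model demands.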

image

FIGURE 10-12 Representative filtering scheme for reducing the millions of variants detected in whole-genome sequencing of a family consisting of two unaffected parents and an affected child to a small number that can be assessed for biological and disease relevance. The initial enormous collection of variants is winnowed down into smaller and smaller bins by applying filters that remove variants that are unlikely to be causative based on assuming that variants of interest are likely to be located near a gene, will disrupt its function, and are rare. Each remaining candidate gene is then assessed for whether the variants in that gene are inherited in a manner that fits the most likely inheritance pattern of the disease, whether a variant occurs in a candidate gene that makes biological sense given the phenotype in the affected child, and whether other affected individuals also have mutations in that gene. AR, Autosomal recessive; mRNA, messenger RNA.

In the end, millions of variants can be filtered down to a handful occurring in a small number of genes. Once the filtering reduces the number of genes and alleles to a manageable number, they can be assessed for other characteristics. First, do any of the genes have a known function or tissue expression pattern that would be expected if it were the potential disease gene? Is the gene involved in other disease phenotypes, or does it have a role in pathways with other genes in which mutations can cause similar or different phenotypes? Finally, is this same gene mutated in other patients with the disease? Finding mutations in one of these genes in other patients would then confirm this was the responsible gene in the original trio.

In some cases, one gene from the list in step 4 may rise to the top as a candidate because its involvement makes biological or genetic sense or it is known to be mutated in other affected individuals. In other cases, however, the gene responsible may turn out to be entirely unanticipated on biological grounds or may not be mutated in other affected individuals because of locus heterogeneity (i.e., mutations in other as yet undiscovered genes can cause a similar disease).

Such variant assessments require extensive use of public genomic databases and software tools. These include the human genome reference sequence, databases of allele frequencies, software that assesses how deleterious an amino acid substitution might be to gene function, collections of known disease-causing mutations, and databases of functional networks and biological pathways. The enormous expansion of this information over the past few years has played a crucial role in facilitating gene discovery of rare mendelian disorders.

Example: Identification of the Gene Mutated in Postaxial Acrofacial Dysostosis

The WGS approach just outlined was used in the study of a family in which two siblings affected with a rare congenital malformation known as postaxial acrofacial dysostosis (POAD) were born to two unaffected, unrelated parents. Patients with this disorder have small jaws, missing or poorly developed digits on the ulnar sides of their hands, underdevelopment of the ulna, cleft lip, and clefts (colobomas) of the eyelids. The disorder was thought to be autosomal recessive because the parents of an affected child in some other families are consanguineous, and there are a few families, like the one here, with multiple affected siblings born to unaffected parents—both findings that are hallmarks of recessive inheritance (see Chapter 7). This small family alone was clearly inadequate for linkage analysis. Instead, all four members of the family had their entire genomes sequenced and analyzed.

From an initial list of more than 4 million variants and assuming autosomal recessive inheritance of the disorder in both affected children, a filtering scheme similar to that described earlier yielded only four possible genes. One of these, DHODH, was also shown to be mutated in two other unrelated patients with POAD, thereby confirming this gene was responsible for the disorder in these families. DHODH encodes dihydroorotate dehydrogenase, a mitochondrial enzyme involved in pyrimidine biosynthesis, and was not suspected on biological grounds to be the gene responsible for this malformation syndrome.

Applications of Whole-Genome Sequence or Whole-Exome Sequence in Clinical Settings

Since the application of WGS or WES to rare mendelian disorders was first described in 2009, many hundreds of such disorders have been studied and the causative mutations found in over 300 previously unrecognized disease genes. Although the genome sequencing approach may miss certain categories of mutation that are difficult to detect routinely by sequencing alone (e.g., deletions or copy number variants) or that are difficult or impossible to recognize with our current understanding (e.g., noncoding mutations or regulatory mutations in intergenic regions), many groups report up to 25% to 40% success rates in identifying a causative mutation. These discoveries not only provide information useful for genetic counseling in the families involved, but also may inform clinical management and the potential development of effective treatments.

It is anticipated that the success rate of this approach will only increase as the costs of sequencing continue to fall and as our ability to interpret the likely functional consequences of sequence changes in the genome improves.

Problems

1. The Huntington disease (HD) locus was found to be tightly linked to a DNA polymorphism on chromosome 4. In the same study, however, linkage was ruled out between HD and the locus for the MNSs blood group polymorphism, which also maps to chromosome 4. What is the explanation?

2. LOD scores (Z) between a polymorphism in the α-globin locus on the short arm of chromosome 16 and an autosomal dominant disease were analyzed in a series of British and Dutch families, with the following data:

image

Zmax = 25.85 at θmax = 0.05

How would you interpret these data? Why is the value of Z given as −∞ at θ = 0?

In a subsequent study, a large family from Sicily with what looks like the same disease was also investigated for linkage to α-globin, with the following results:

image

How would you interpret the data in this second study?

3. This pedigree was obtained in a study designed to determine whether a mutation in a gene for γ-crystallin, one of the major proteins of the eye lens, may be responsible for an autosomal dominant form of cataract. The filled-in symbols in the pedigree indicate family members with cataracts. The letters indicate three alleles at the polymorphic γ-crystallin locus on chromosome 2. If you examine each affected person who has passed on the cataract to his or her children, how many of these represent a meiosis that is informative for linkage between the cataract and γ-crystallin? In which individuals is the phase known between the cataract mutation and the γ-crystallin alleles? Are there any meioses in which a crossover must have occurred to explain the data? What would you conclude about linkage between the cataract and γ-crystallin from this study? What additional studies might be performed to confirm or reject the hypothesis?

image

4. The following pedigree shows an example of molecular diagnosis in Wiskott-Aldrich syndrome, an X-linked immunodeficiency, by use of a linked DNA polymorphism with a map distance of approximately 5 cM between the polymorphic locus and the Wiskott-Aldrich syndrome gene.

a. What is the likely phase in the carrier mother? How did you determine this? What diagnosis would you make regarding the current prenatal diagnosis if it were a male fetus?

b. The maternal grandfather now becomes available for DNA testing and shows allele B at the linked locus. How does this finding affect your determination of phase in the mother? What diagnosis would you make now in regard to the current prenatal diagnosis?

image

5. Review the pedigree in Figure 10-10B. If the unaffected grandmother, I-2, had been an A/a heterozygote, would it be possible to determine the phase in the affected parent, individual II-2?

6. In the pedigree below, showing a family with X-linked hemophilia A, can you determine the phase of the mutant factor VIII gene (h) and the normal allele (H) with respect to polymorphic alleles M and m in the mother of the two affected boys?

image

Pedigree of X-linked hemophilia. The affected grandfather in the first generation has the disease (mutant allele h) and allele M at a polymorphic locus on the X chromosome.

7. Calculate D′ for the three scenarios listed in Figure 10-7.

8. Relative risk calculations are used for cohort studies and not case-control studies. To demonstrate why this is the case, imagine a case-control study for the effect of a genetic variant on disease susceptibility. The investigator has ascertained as many affected individuals (a + c) as possible and then arbitrarily chooses a set of (b + d) controls. They are genotyped as to whether a variant is present: a/(a + c) of the affected have the variant, whereas b/(b + d) of the controls have the variant.

 

                   Disease Present    Disease Absent
Variant present    a                  b
Variant absent     c                  d
Total              a + c              b + d

Calculate the odds ratio and relative risk for the association between the variant being present and the disease being present.

Now, imagine the investigator arbitrarily decided to use three times as many unaffected individuals, 3 × (b + d), as controls. The investigator has every right to do so because it is a case-control study and the numbers of affected and unaffected are not determined by the prevalence of the disease in the population being studied, as they would be in a cohort study. Assume the distribution of the variant remains the same in this control group as in the smaller control group; that is, 3b/[3 × (b + d)] = b/(b + d) of the controls carry the variant.

 

                   Disease Present    Disease Absent
Variant present    a                  3b
Variant absent     c                  3d
Total              a + c              3 × (b + d)

Recalculate the OR and RR with this new control group. Do the same when an arbitrary control group is an n-tuple of the original control group; that is, the size of the control group is n × (b + d).

Which of these measures, OR or RR, does not change when different, arbitrarily sized control groups are used?