Mutation and Polymorphism
The study of genetic and genomic variation is the conceptual cornerstone for genetics in medicine and for the broader field of human genetics. During the course of evolution, the steady influx of new nucleotide variation has ensured a high degree of genetic diversity and individuality, and this theme extends through all fields in human and medical genetics. Genetic diversity may manifest as differences in the organization of the genome, as nucleotide changes in the genome sequence, as variation in the copy number of large segments of genomic DNA, as alterations in the structure or amount of proteins found in various tissues, or as any of these in the context of clinical disease.
This chapter is one of several in which we explore the nature of genetically determined differences among individuals. The sequence of nuclear DNA is approximately 99.5% identical between any two unrelated humans. Yet it is precisely the small fraction of DNA sequence difference among individuals that is responsible for the genetically determined variability that is evident both in one's daily existence and in clinical medicine. Many DNA sequence differences have little or no effect on outward appearance, whereas other differences are directly responsible for causing disease. Between these two extremes is the variation responsible for genetically determined variability in anatomy, physiology, dietary intolerances, susceptibility to infection, predisposition to cancer, therapeutic responses or adverse reactions to medications, and perhaps even variability in various personality traits, athletic aptitude, and artistic talent.
One of the important concepts of human and medical genetics is that diseases with a clearly inherited component are only the most obvious and often the most extreme manifestation of genetic differences, one end of a continuum of variation that extends from rare deleterious variants that cause illness, through more common variants that can increase susceptibility to disease, to the most common variation in the population that is of uncertain relevance with respect to disease.
The Nature of Genetic Variation
As described in Chapter 2, a segment of DNA occupying a particular position or location on a chromosome is a locus (plural loci). A locus may be large, a segment of DNA that contains many genes, as is the major histocompatibility complex locus involved in the response of the immune system to foreign substances; it may be a single gene, such as the β-globin locus we introduced in Chapter 3; or it may even be just a single base in the genome, as in the case of a single nucleotide variant (see Fig. 2-6 and later in this chapter). Alternative versions of the DNA sequence at a locus are called alleles. For many genes, there is a single prevailing allele, usually present in more than half of the individuals in a population, that geneticists call the wild-type or common allele. (In lay parlance, this is sometimes referred to as the “normal” allele. However, because genetic variation is itself very much “normal,” the existence of different alleles in “normal” individuals is commonplace. Thus one should avoid using “normal” to designate the most common allele.) The other versions of the gene are variant (or mutant) alleles that differ from the wild-type allele because of the presence of a mutation, a permanent change in the nucleotide sequence or arrangement of DNA. Note that the terms mutation and mutant refer to DNA, but not to the human beings who carry mutant alleles. The terms denote a change in sequence but otherwise do not carry any connotation with respect to the function or fitness of that change.
The frequency of different variants can vary widely in different populations around the globe, as we will explore in depth in Chapter 9. If there are two or more relatively common alleles (defined by convention as having an allele frequency > 1%) at a locus in a population, that locus is said to exhibit polymorphism (literally “many forms”) in that population. Most variant alleles, however, are not frequent enough in a population to be considered polymorphisms; some are so rare as to be found in only a single family and are known as “private” alleles.
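The 1% convention described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names and the sample counts are hypothetical, and the only value taken from the text is the 1% threshold.

```python
from collections import Counter

POLYMORPHISM_THRESHOLD = 0.01  # conventional 1% allele-frequency cutoff from the text


def allele_frequencies(observed_alleles):
    """Return {allele: frequency} for a list of alleles sampled at one locus."""
    counts = Counter(observed_alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_polymorphic(observed_alleles, threshold=POLYMORPHISM_THRESHOLD):
    """A locus is polymorphic if two or more alleles each exceed the threshold."""
    freqs = allele_frequencies(observed_alleles)
    common = [allele for allele, f in freqs.items() if f > threshold]
    return len(common) >= 2


# Hypothetical samples of 1000 chromosomes (500 diploid individuals) each.
common_snp = ["T"] * 900 + ["C"] * 100    # minor allele at 10%
rare_variant = ["T"] * 998 + ["C"] * 2    # minor allele at 0.2%

print(is_polymorphic(common_snp))    # locus with two common alleles
print(is_polymorphic(rare_variant))  # locus with one rare variant allele
```

Because the definition is population specific, the same locus could be classified differently if the alleles were sampled from a different population.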
The Concept of Mutation
In this chapter, we begin by exploring the nature of mutation, ranging from the change of a single nucleotide to alterations of an entire chromosome. Recognizing a change requires a “gold standard” against which the variant can be compared. As we saw in Chapter 2, there is no single individual whose genome sequence could serve as such a standard for the human species, and thus one arbitrarily designates the most common sequence or arrangement in a population at any one position in the genome as the so-called reference sequence (see Fig. 2-6). As more and more genomes from individuals around the globe are sampled (and thus as more and more variation is detected among the currently 7 billion genomes that make up our species), this reference genome is subject to constant evaluation and change. Indeed, a number of international collaborations share and update data on the nature and frequency of DNA variation in different populations in the context of the reference human genome sequence and make the data available through publicly accessible databases that serve as essential resources for scientists, physicians, and other health care professionals (Table 4-1).
TABLE 4-1 Useful Databases of Information on Human Genetic Diversity
The Human Genome Project, completed in 2003, was an international collaboration to sequence and map the genome of our species. The draft sequence of the genome was released in 2001, and the “essentially complete” reference genome assembly was published in 2004.
The Single Nucleotide Polymorphism Database (dbSNP) and the Structural Variation Database (dbVar) are databases of small-scale and large-scale variations, including single nucleotide variants, microsatellites, indels, and CNVs.
The 1000 Genomes Project is sequencing the genomes of a large number of individuals to provide a comprehensive resource on genetic variation in our species. All data are publicly available.
The Human Gene Mutation Database is a comprehensive collection of germline mutations associated with or causing human inherited disease (currently including over 120,000 mutations in 4400 genes).
The Database of Genomic Variants is a curated catalogue of structural variation in the human genome. As of 2012, the database contains over 400,000 entries, including over 200,000 CNVs, 1000 inversions, and 34,000 indels.
The Japanese Single Nucleotide Polymorphisms Database (JSNP Database) reports SNPs discovered as part of the Millennium Genome Project.
CNV, Copy number variant; SNP, single nucleotide polymorphism.
Updated from Willard HF: The human genome: a window on human genetics, biology and medicine. In Ginsburg GS, Willard HF, editors: Genomic and personalized medicine, ed 2, New York, 2013, Elsevier.
Mutations are sometimes classified by the size of the altered DNA sequence and, at other times, by the functional effect of the mutation on gene expression. Although classification by size is somewhat arbitrary, it can be helpful conceptually to distinguish among mutations at three different levels:
• Mutations that leave chromosomes intact but change the number of chromosomes in a cell (chromosome mutations)
• Mutations that change only a portion of a chromosome and might involve a change in the copy number of a subchromosomal segment or a structural rearrangement involving parts of one or more chromosomes (regional or subchromosomal mutations)
• Alterations of the sequence of DNA, involving the substitution, deletion, or insertion of DNA, ranging from a single nucleotide up to an arbitrarily set limit of approximately 100 kb (gene or DNA mutations)
The basis for and consequences of this third type of mutation are the principal focus of this chapter, whereas both chromosome and regional mutations will be presented at length in Chapters 5 and 6.
The functional consequences of DNA mutations, even those that change a single base pair, run the gamut from being completely innocuous to causing serious illness, all depending on the precise location, nature, and size of the mutation. For example, even a mutation within a coding exon of a gene may have no effect on how a gene is expressed if the change does not alter the primary amino acid sequence of the polypeptide product; even if it does, the resulting change in the encoded amino acid sequence may not alter the functional properties of the protein. Not all mutations, therefore, are manifest in an individual.
The Concept of Genetic Polymorphism
The DNA sequence of a given region of the genome is remarkably similar among chromosomes carried by many different individuals from around the world. In fact, any randomly chosen segment of human DNA approximately 1000 bp in length contains, on average, only one base pair that is different between the two homologous chromosomes inherited from that individual's parents (assuming the parents are unrelated). However, across all human populations, many tens of millions of single nucleotide differences and over a million more complex variants have been identified and catalogued. Because of limited sampling, these figures are likely to underestimate the true extent of genetic diversity in our species. Many populations around the globe have yet to be studied, and, even in the populations that have been studied, the number of individuals examined is too small to reveal most variants with minor allele frequencies below 1% to 2%. Thus, as more people are included in variant discovery projects, additional (and rarer) variants will certainly be uncovered.
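The sampling argument above can be made concrete with a short calculation. The sketch below assumes that chromosomes are sampled independently and at random from a population in which the variant has a fixed allele frequency; the function name and the sample sizes are illustrative and are not taken from any particular study.

```python
def prob_variant_observed(allele_freq, n_individuals):
    """Chance that at least one copy of an allele appears among 2n sampled chromosomes.

    Assumes independent random sampling of chromosomes (a simplification).
    """
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


# Illustrative sample sizes and minor allele frequencies.
for n in (10, 50, 500):
    for freq in (0.05, 0.01, 0.001):
        p = prob_variant_observed(freq, n)
        print(f"n={n:4d} individuals, allele frequency={freq:.3f}: P(observed)={p:.3f}")
```

Under these assumptions, a variant with a 1% allele frequency is missed roughly one time in three when only 50 individuals are sequenced, which is the sense in which small studies fail to reveal most rare variants.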
Whether a variant is formally considered a polymorphism or not depends entirely on whether its frequency in a population exceeds 1% of the alleles in that population, and not on what kind of mutation caused it, how large a segment of the genome is involved, or whether it has a demonstrable effect on the individual. The location of a variant with respect to a gene also does not determine whether the variant is a polymorphism. Although most sequence polymorphisms are located between genes or within introns and are inconsequential to the functioning of any gene, others may be located in the coding sequence of genes themselves and result in different protein variants that may lead in turn to distinctive differences in human populations. Still others are in regulatory regions and may also have important effects on transcription or RNA stability.
One might expect that deleterious mutations that cause rare monogenic diseases are likely to be too rare to achieve the frequency necessary to be considered a polymorphism. Although it is true that the alleles responsible for most clearly inherited clinical conditions are rare, some alleles that have a profound effect on health—such as alleles of genes encoding enzymes that metabolize drugs (for example, sensitivity to abacavir in some individuals infected with human immunodeficiency virus [HIV]) (Case 1), or the sickle cell mutation in African and African American populations (see Chapter 11) (Case 42)—are relatively common. Nonetheless, these are exceptions, and, as more and more genetic variation is discovered and catalogued, it is clear that the vast majority of variants in the genome, whether common or rare, reflect differences in DNA sequence that have no known significance to health.
Polymorphisms are key elements for the study of human and medical genetics. The ability to distinguish different inherited forms of a gene or different segments of the genome provides critical tools for a wide array of applications, both in research and in clinical practice (see Box).
Polymorphisms and Inherited Variation in Human and Medical Genetics
Allelic variants can be used as “markers” for tracking the inheritance of the corresponding segment of the genome in families and in populations. Such variants can be used as follows:
• As powerful research tools for mapping a gene to a particular region of a chromosome by linkage analysis or by allelic association (see Chapter 10)
• For prenatal diagnosis of genetic disease and for detection of carriers of deleterious alleles (see Chapter 17), as well as in blood banking and tissue typing for transfusions and organ transplantation
• In forensic applications such as identity testing for determining paternity, identifying remains of crime victims, or matching a suspect's DNA to that of the perpetrator (this chapter)
• In the ongoing efforts to provide genomic-based personalized medicine (see Chapter 18) in which one tailors an individual's medical care to whether or not he or she carries variants that increase or decrease the risk for common adult disorders (such as coronary heart disease, cancer, and diabetes; see Chapter 8) or that influence the efficacy or safety of particular medications
Inherited Variation and Polymorphism in DNA
The original Human Genome Project and the subsequent study of now many thousands of individuals worldwide have provided a vast amount of DNA sequence information. With this information in hand, one can begin to characterize the types and frequencies of polymorphic variation found in the human genome and to generate catalogues of human DNA sequence diversity around the globe. DNA polymorphisms can be classified according to how the DNA sequence varies between the different alleles (Table 4-2 and Figs. 4-1 and 4-2).
TABLE 4-2 Common Variation in the Human Genome
bp, Base pair; kb, kilobase pair; Mb, megabase pair.
FIGURE 4-1 Three polymorphisms in genomic DNA from the segment of the human genome reference assembly shown at the top (see also Fig. 2-6). The single nucleotide polymorphism (SNP) at position 8 has two alleles, one with a T (corresponding to the reference sequence) and one with a C. There are two indels in this region. At indel A, allele 2 has an insertion of a G between positions 11 and 12 in the reference sequence (allele 1). At indel B, allele 2 has a 2-bp deletion of positions 5 and 6 in the reference sequence.
FIGURE 4-2 Examples of polymorphism in the human genome larger than SNPs. Clockwise from upper right: The microsatellite locus has three alleles, with four, five, or six copies of a CAA trinucleotide repeat. The inversion polymorphism has two alleles corresponding to the two orientations (indicated by the arrows) of the genomic segment shown in green; such inversions can involve regions up to many megabases of DNA. Copy number variants involve deletion or duplication of hundreds of kilobase pairs to over a megabase of genomic DNA. In the example shown, allele 1 contains a single copy, whereas allele 2 contains three copies of the chromosomal segment containing the F and G genes; other possible alleles with zero, two, four, or more copies of F and G are not shown. The mobile element insertion polymorphism has two alleles, one with and one without insertion of an approximately 6 kb LINE repeated retroelement; the insertion of the mobile element changes the spacing between the two genes and may alter gene expression in the region.
Single Nucleotide Polymorphisms
The simplest and most common of all polymorphisms are single nucleotide polymorphisms (SNPs). A locus characterized by a SNP usually has only two alleles, corresponding to the two different bases occupying that particular location in the genome (see Fig. 4-1). As mentioned previously, SNPs are common and are observed on average once every 1000 bp in the genome. However, the distribution of SNPs is uneven around the genome; many more SNPs are found in noncoding parts of the genome, in introns and in sequences that are some distance from known genes. Nonetheless, there is still a significant number of SNPs that do occur in genes and other known functional elements in the genome. For the set of protein-coding genes, over 100,000 exonic SNPs have been documented to date. Approximately half of these do not alter the predicted amino acid sequence of the encoded protein and are thus termed synonymous, whereas the other half do alter the amino acid sequence and are said to be nonsynonymous. Other SNPs introduce or change a stop codon (see Table 3-1), and yet others alter a known splice site; such SNPs are candidates to have significant functional consequences.
The significance for health of the vast majority of SNPs is unknown and is the subject of ongoing research. The fact that SNPs are common does not mean that they are without effect on health or longevity. What it does mean is that any effect of common SNPs is likely to involve a relatively subtle altering of disease susceptibility rather than a direct cause of serious illness.
Insertion-Deletion Polymorphisms
A second class of polymorphism is the result of variations caused by insertion or deletion (in/dels or simply indels) of anywhere from a single base pair up to approximately 1000 bp, although larger indels have been documented as well. Over a million indels have been described, numbering in the hundreds of thousands in any one individual's genome. Approximately half of all indels are referred to as “simple” because they have only two alleles, that is, the presence or absence of the inserted or deleted segment (see Fig. 4-1).
Other indels, however, are multiallelic due to variable numbers of the segment of DNA that is inserted in tandem at a particular location, thereby constituting what is referred to as a microsatellite. Microsatellites consist of stretches of DNA composed of units of two, three, or four nucleotides, such as TGTGTG, CAACAACAA, or AAATAAATAAAT, repeated between one and a few dozen times at a particular site in the genome (see Fig. 4-2). The different alleles in a microsatellite polymorphism are the result of differing numbers of repeated nucleotide units contained within any one microsatellite and are therefore sometimes also referred to as short tandem repeat (STR) polymorphisms. A microsatellite locus often has many alleles (repeat lengths) that can be rapidly evaluated by standard laboratory procedures to distinguish different individuals and to infer familial relationships (Fig. 4-3). Many tens of thousands of microsatellite polymorphic loci are known throughout the human genome.
FIGURE 4-3 A schematic of a hypothetical microsatellite marker in human DNA. The different-sized alleles (numbered 1 to 7) correspond to fragments of genomic DNA containing different numbers of copies of a microsatellite repeat, and their relative lengths are determined by separating them by gel electrophoresis. The shortest allele (allele 1) migrates toward the bottom of the gel, whereas the longest allele (allele 7) remains closest to the top. Left, For this multiallelic microsatellite, each of the six unrelated individuals has two different alleles. Right, Within a family, the inheritance of alleles can be followed from each parent to each of the three children.
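Because microsatellite alleles differ only in the number of tandem copies of the repeat unit, an allele can be named by that count. The following sketch counts CAA repeats in three hypothetical alleles like those of Fig. 4-2; the flanking sequences and the function name are invented for illustration, and in practice alleles are sized in the laboratory rather than read from strings.

```python
import re


def longest_repeat_run(sequence, unit):
    """Return the largest number of back-to-back copies of `unit` found in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(match.group(0)) // len(unit) for match in pattern.finditer(sequence)]
    return max(runs, default=0)


# Hypothetical flanking sequence around a CAA trinucleotide repeat.
LEFT_FLANK, RIGHT_FLANK = "GATTC", "TGGCA"
alleles = {copies: LEFT_FLANK + "CAA" * copies + RIGHT_FLANK for copies in (4, 5, 6)}

for copies, sequence in alleles.items():
    print(copies, sequence, longest_repeat_run(sequence, "CAA"))
```

An individual's genotype at the locus is then simply the pair of repeat counts carried on the two homologous chromosomes.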
Microsatellites are a particularly useful group of indels. Determining the alleles at multiple microsatellite loci is currently the method of choice for DNA fingerprinting used for identity testing. For example, the Federal Bureau of Investigation (FBI) in the United States currently uses the collection of alleles at 13 such loci for its DNA fingerprinting panel. Two individuals (other than monozygotic twins) are so unlikely to have exactly the same alleles at all 13 loci that the panel will allow definitive determination of whether two samples came from the same individual. The information is stored in the FBI's Combined DNA Index System (CODIS), which has grown as of December 2014 to include over 11,548,700 offender profiles, 1,300,000 arrestee profiles, and 601,600 forensic profiles (material obtained at crime scenes). Many states and the U.S. Department of Defense have similar databases of DNA fingerprints, as do corresponding units in other countries.
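The reason a multilocus panel is so discriminating can be sketched with the product rule. The per-locus probability used below is purely hypothetical and chosen only for easy arithmetic; the sketch also assumes that the loci are inherited independently, which is a simplification and not a description of how any forensic laboratory computes its statistics.

```python
import math


def random_match_probability(per_locus_match_probs):
    """Multiply per-locus match probabilities (product rule; assumes independent loci)."""
    return math.prod(per_locus_match_probs)


# Hypothetical: a 1-in-10 chance of a coincidental genotype match at each of 13 loci.
hypothetical_probs = [0.1] * 13
rmp = random_match_probability(hypothetical_probs)
print(f"Chance that two unrelated people match at all 13 loci: {rmp:.1e}")
```

Even with a modest amount of variation at each locus, the combined probability of a coincidental match across the whole panel becomes vanishingly small.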
Mobile Element Insertion Polymorphisms
Nearly half of the human genome consists of families of repetitive elements that are dispersed around the genome (see Chapter 2). Although most of the copies of these repeats are stationary, some of them are mobile and contribute to human genetic diversity through the process of retrotransposition, a process that involves transcription into an RNA, reverse transcription into a DNA sequence, and insertion (i.e., transposition) into another site in the genome, as we introduced in Chapter 3 in the context of processed pseudogenes. The two most common mobile element families are the Alu and LINE families of repeats, and nearly 10,000 mobile element insertion polymorphisms have been described in different populations. Each polymorphic locus consists of two alleles, one with and one without the inserted mobile element (see Fig. 4-2). Mobile element polymorphisms are found on all human chromosomes; although most are found in nongenic regions of the genome, a small proportion of them are found within genes. At least 5000 of these polymorphic loci have an insertion frequency of greater than 10% in various populations.
Copy Number Variants
Another important type of human polymorphism includes copy number variants (CNVs). CNVs are conceptually related to indels and microsatellites but consist of variation in the number of copies of larger segments of the genome, ranging in size from 1000 bp to many hundreds of kilobase pairs. Variants larger than 500 kb are found in 5% to 10% of individuals in the general population, whereas variants encompassing more than 1 Mb are found in 1% to 2%. The largest CNVs are sometimes found in regions of the genome characterized by repeated blocks of homologous sequences called segmental duplications (or segdups). Their importance in mediating duplication and deletion of the corresponding segments is discussed further in Chapter 6 in the context of various chromosomal syndromes.
Smaller CNVs in particular may have only two alleles (i.e., the presence or absence of a segment), similar to indels in that regard. Larger CNVs tend to have multiple alleles due to the presence of different numbers of copies of a segment of DNA in tandem (see Fig. 4-2). In terms of genome diversity between individuals, the amount of DNA involved in CNVs vastly exceeds the amount that differs because of SNPs. The content of any two human genomes can differ by as much as 50 to 100 Mb because of copy number differences at CNV loci.
Notably, the variable segment at many CNV loci can include one to as many as several dozen genes, and thus CNVs are frequently implicated in traits that involve altered gene dosage. When a CNV is frequent enough to be polymorphic, it represents a background of common variation that must be understood if alterations in copy number observed in patients are to be interpreted properly. As with all DNA polymorphism, the significance of different CNV alleles in health and disease susceptibility is the subject of intensive investigation.
Inversion Polymorphisms
A final group of polymorphisms to be discussed is inversions, which differ in size from a few base pairs to large regions of the genome (up to several megabase pairs) that can be present in either of two orientations in the genomes of different individuals (see Fig. 4-2). Most inversions are characterized by regions of sequence homology at the edges of the inverted segment, implicating a process of homologous recombination in the origin of the inversions. In their balanced form, inversions, regardless of orientation, do not involve a gain or loss of DNA, and the inversion polymorphisms (with two alleles corresponding to the two orientations) can achieve substantial frequencies in the general population. However, anomalous recombination can result in the duplication or deletion of DNA located between the regions of homology, associated with clinical disorders that we will explore further in Chapters 5 and 6.
The Origin and Frequency of Different Types of Mutations
Along the spectrum of diversity from rare variants to more common polymorphisms, the different kinds of mutations arise in the context of such fundamental processes of cell division as DNA replication, DNA repair, DNA recombination, and chromosome segregation in mitosis or meiosis. The frequency of mutations per locus per cell division is a basic measure of how error prone these processes are, which is of fundamental importance for genome biology and evolution. However, of greatest importance to medical geneticists is the frequency of mutations per disease locus per generation, rather than the overall mutation rate across the genome per cell division. Measuring disease-causing mutation rates can be difficult, however, because many mutations cause early embryonic lethality before the mutation can be recognized in a fetus or newborn, or because some people with a disease-causing mutation may manifest the condition only late in life or may never show signs of the disease. Despite these limitations, we have made great progress in determining the overall frequency (sometimes referred to as the genetic load) of all mutations affecting the human species.
The major types of mutation briefly introduced earlier occur at appreciable frequencies in many different cells in the body. In the practice of genetics, we are principally concerned with inherited genome variation; however, all such variation had to originate as a new (de novo) change occurring in germ cells. At that point, such a variant would be quite rare in the population (occurring just once), and its ultimate frequency in the population over time depends on chance and on the principles of inheritance and population genetics (see Chapters 7 and 9). Although the original mutation would have occurred only in the DNA of cells in the germline, anyone who inherits that mutation would then carry it as a constitutional mutation in all the cells of the body.
In contrast, somatic mutations occur throughout the body but cannot be transmitted to the next generation. Given the rate of mutation (see later in this section), one would predict that, in fact, every cell in an individual has a slightly different version of his or her genome, depending on the number of cell divisions that have occurred since conception to the time of sample acquisition. In highly proliferative tissues, such as intestinal epithelial cells or hematopoietic cells, such genomic heterogeneity is particularly likely to be apparent. However, most such mutations are not typically detected, because, in clinical testing, one usually sequences DNA from collections of many millions of cells; in such a collection, the most prevalent base at any position in the genome will be the one present at conception, and rare somatic mutations will be largely invisible and unascertained. Such mutations can be of clinical importance, however, in disorders caused by mutation in only a subset of cells in certain tissues, leading to somatic mosaicism (see Chapter 7).
The major exception to the expectation that somatic mutations will be typically undetected within any multicell DNA sample is in cancer, in which the mutational basis for the origins of cancer and the clonal nature of tumor evolution drive certain somatic changes to be present in essentially all the cells of a tumor. Indeed, 1000 to 10,000 somatic mutations (and sometimes many more) are readily found in the genomes of most adult cancers, with mutation frequencies and patterns specific to different cancer types (see Chapter 15).
Mutations that produce a change in chromosome number because of chromosome missegregation are among the most common mutations seen in humans, with a rate of one mutation per 25 to 50 meiotic cell divisions. This estimate is clearly a minimal one because the developmental consequences of many such events are likely so severe that the resulting fetuses are aborted spontaneously shortly after conception without being detected (see Chapters 5 and 6).
Mutations affecting the structure or regional organization of chromosomes can arise in a number of different ways. Duplications, deletions, and inversions of a segment of a single chromosome are predominantly the result of homologous recombination between DNA segments with high sequence homology located at more than one site in a region of a chromosome. Not all structural mutations are the result of homologous recombination, however. Others, such as chromosome translocations and some inversions, can occur at the sites of spontaneous double-stranded DNA breaks. Once breakage occurs at two places anywhere in the genome, the two broken ends can be joined together even without any obvious homology in the sequence between the two ends (a process termed nonhomologous end-joining repair). Examples of such mutations will be discussed in depth in Chapter 6.
Gene or DNA mutations, including base pair substitutions, insertions, and deletions (Fig. 4-4), can originate by either of two basic mechanisms: errors introduced during DNA replication or mutations arising from a failure to properly repair DNA after damage. Many such mutations are spontaneous, arising during the normal (but imperfect) processes of DNA replication and repair, whereas others are induced by physical or chemical agents called mutagens.
FIGURE 4-4 Examples of mutations in a portion of a hypothetical gene with five codons shown (delimited by the dotted lines). The first base pair of the second codon in the reference sequence (shaded in blue) is mutated by a base substitution, deletion, or insertion. The base substitution of a G for the T at this position leads to a codon change (shaded in green) and, assuming that the upper strand is the sense or coding strand, a predicted nonsynonymous change from a serine to an alanine in the encoded protein (see genetic code in Table 3-1); all other codons remain unchanged. Both the single base pair deletion and insertion lead to a frameshift mutation in which the translational reading frame is altered for all subsequent codons (shaded in green), until a termination codon is reached.
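The three kinds of change described in Fig. 4-4 can be reproduced on a toy sequence. The five-codon sequence below is hypothetical (the figure's actual sequence is not reproduced here) and was chosen so that the second codon encodes serine and begins with T; the function names and the effect labels are likewise illustrative. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate complete in-frame codons, stopping after the first stop codon (*)."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


def classify_substitution(dna, position, new_base):
    """Label a single-base substitution by its effect on the encoded amino acid."""
    start = position - position % 3                 # first base of the affected codon
    mutated = dna[:position] + new_base + dna[position + 1:]
    old_aa = CODON_TABLE[dna[start:start + 3]]
    new_aa = CODON_TABLE[mutated[start:start + 3]]
    if new_aa == old_aa:
        return "synonymous"
    if new_aa == "*":
        return "stop codon introduced"
    return "nonsynonymous"


# Hypothetical reference: ATG TCA GGC CTT AAG (Met-Ser-Gly-Leu-Lys).
reference = "ATGTCAGGCCTTAAG"
pos = 3  # 0-based index of the first base of the second codon

substitution = reference[:pos] + "G" + reference[pos + 1:]   # T -> G
deletion = reference[:pos] + reference[pos + 1:]             # remove the T
insertion = reference[:pos] + "A" + reference[pos:]          # add an A before the T

for label, seq in [("reference", reference), ("substitution", substitution),
                   ("deletion", deletion), ("insertion", insertion)]:
    print(f"{label:12s} {seq:17s} {translate(seq)}")
```

In this toy case the substitution changes only the second amino acid (serine to alanine), whereas the deletion and the insertion shift the reading frame and change every downstream codon; the insertion happens to create a stop codon in the new frame.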
DNA Replication Errors
The process of DNA replication (see Fig. 2-4) is typically highly accurate; the majority of replication errors (i.e., inserting a base other than the complementary base that would restore the base pair at that position in the double helix) are rapidly removed from the DNA and corrected by a series of DNA repair enzymes that first recognize which strand in the newly synthesized double helix contains the incorrect base and then replace it with the proper complementary base, a process termed DNA proofreading. DNA replication needs to be a remarkably accurate process; otherwise, the burden of mutation on the organism and the species would be intolerable. The enzyme DNA polymerase faithfully duplicates the two strands of the double helix based on strict base-pairing rules (A pairs with T, C with G) but introduces one error every 10 million bp. Additional proofreading then corrects more than 99.9% of these errors of DNA replication. Thus the overall mutation rate per base as a result of replication errors is a remarkably low 1 × 10⁻¹⁰ per cell division, or fewer than one mutation per genome per cell division.
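The arithmetic behind this estimate can be checked directly. The sketch below uses the two figures given in the text; the diploid genome size of about 6 billion bp is an assumption added here (twice the roughly 3 billion bp haploid genome), and the constant and function names are illustrative.

```python
POLYMERASE_ERROR_RATE = 1 / 10_000_000   # one misincorporation per 10 million bp (from the text)
PROOFREADING_EFFICIENCY = 0.999          # text says "more than 99.9%", so treat as a lower bound
DIPLOID_GENOME_BP = 6e9                  # assumption: about 6 billion bp copied per division


def residual_error_rate(error_rate=POLYMERASE_ERROR_RATE,
                        efficiency=PROOFREADING_EFFICIENCY):
    """Errors per base that survive proofreading, per cell division."""
    return error_rate * (1.0 - efficiency)


def errors_per_division(genome_bp=DIPLOID_GENOME_BP):
    """Expected number of uncorrected replication errors per genome per division."""
    return residual_error_rate() * genome_bp


print(f"per base:   {residual_error_rate():.1e}")
print(f"per genome: {errors_per_division():.2f}")
```

Because proofreading corrects more than 99.9% of errors, both values are upper bounds under these assumptions, consistent with fewer than one replication-derived mutation per genome per cell division.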
Repair of DNA Damage
It is estimated that, in addition to replication errors, between 10,000 and 1 million nucleotides are damaged per human cell per day by spontaneous chemical processes such as depurination, demethylation, or deamination; by reaction with chemical mutagens (natural or otherwise) in the environment; and by exposure to ultraviolet or ionizing radiation. Some but not all of this damage is repaired. Even if the damage is recognized and excised, the repair machinery may create mutations by introducing incorrect bases. Thus, in contrast to replication-related DNA changes, which are usually corrected through proofreading mechanisms, nucleotide changes introduced by DNA damage and repair often result in permanent mutations.
A particularly common spontaneous mutation is the substitution of T for C (or A for G on the other strand). The explanation for this observation comes from considering the major form of epigenetic modification in the human genome, DNA methylation, introduced in Chapter 3. Spontaneous deamination of 5-methylcytosine to thymine (compare the structures of cytosine and thymine in Fig. 2-2) in the CpG doublet gives rise to C to T or G to A mutations (depending on which strand the 5-methylcytosine is deaminated). Such spontaneous mutations may not be recognized by the DNA repair machinery and thus become established in the genome after the next round of DNA replication. More than 30% of all single nucleotide substitutions are of this type, and they occur at a rate 25 times greater than those of any other single nucleotide mutations. Thus the CpG doublet represents a true “hot spot” for mutation in the human genome.
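The strand dependence of this mutation can be made concrete with a small sketch. The function names and the example sequence below are hypothetical, and the model assumes, for simplicity, that every CpG cytosine is methylated; it is not a description of any particular analysis tool.

```python
def cpg_sites(seq):
    """Return 0-based positions of the C in every CpG doublet on the given strand."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def deaminate(seq, cpg_pos, strand="top"):
    """Model deamination of a methylated C at one CpG, reported on the top strand.

    strand="top":    the C at cpg_pos becomes T (a C to T change).
    strand="bottom": the C paired with the G at cpg_pos + 1 becomes T,
                     which reads as a G to A change on the top strand.
    """
    seq = seq.upper()
    if seq[cpg_pos:cpg_pos + 2] != "CG":
        raise ValueError("position is not the C of a CpG doublet")
    bases = list(seq)
    if strand == "top":
        bases[cpg_pos] = "T"
    elif strand == "bottom":
        bases[cpg_pos + 1] = "A"
    else:
        raise ValueError("strand must be 'top' or 'bottom'")
    return "".join(bases)


example = "ATCGGACGTT"  # hypothetical sequence with two CpG doublets
print(cpg_sites(example))
print(deaminate(example, 2, "top"))      # CG becomes TG
print(deaminate(example, 2, "bottom"))   # CG becomes CA
```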
Overall Rate of DNA Mutations
Although the rate of DNA mutations at specific loci has been estimated using a variety of approaches over the past 50 years, the overall impact of replication and repair errors on the occurrence of new mutations throughout the genome can now be determined directly by whole-genome sequencing of trios consisting of a child and both parents, looking for new mutations in the child that are not present in the genome sequence of either parent. The overall rate of new mutations averaged between maternal and paternal gametes is approximately 1.2 × 10⁻⁸ mutations per base pair per generation. Thus every person is likely to receive approximately 75 new mutations in his or her genome from one or the other parent. This rate, however, varies from gene to gene around the genome and perhaps from population to population or even individual to individual. Overall, this rate, combined with considerations of population growth and dynamics, predicts that there must be an enormous number of relatively new (and thus very rare) mutations in the current worldwide population of 7 billion individuals.
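The logic of the trio approach can be sketched as a simple set comparison. This is a minimal illustration with hypothetical variant calls, not a real variant-calling pipeline; in practice sequencing errors, genotype quality, and mosaicism would have to be handled. The genome size used for the expected count is an assumption (about 3.1 × 10⁹ bp per haploid genome) that is not given in this section.

```python
def de_novo_variants(child, mother, father):
    """Return the child's variants that are absent from both parents.

    Each argument is a set of (chromosome, position, ref, alt) tuples.
    """
    return child - mother - father


# Hypothetical toy call sets for one trio.
mother = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")}
father = {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")}
child = {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A"), ("chr7", 9000, "C", "T")}

print(de_novo_variants(child, mother, father))

# Expected number of new mutations per person from the rate in the text.
MUTATION_RATE = 1.2e-8       # per base pair per generation (from the text)
DIPLOID_GENOME_BP = 6.2e9    # assumption: about 3.1e9 bp per haploid genome
expected_new = MUTATION_RATE * DIPLOID_GENOME_BP
print(round(expected_new))
```

With the assumed genome size, the expected count comes out close to the figure of approximately 75 quoted above.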
As might be predicted, the vast majority of these mutations will be single nucleotide changes in noncoding portions of the genome and will probably have little or no functional significance. Nonetheless, at the level of populations, the potential collective impact of these new mutations on genes of medical importance should not be overlooked. In the United States, for example, with over 4 million live births each year, approximately 6 million new mutations will occur in coding sequences; thus, even for a single protein-coding gene of average size, we can anticipate several hundred newborns each year with a new mutation in the coding sequence of that gene.
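These population-level figures follow from simple multiplication, sketched below. The number of births, the number of new mutations per person, and the number of coding mutations are taken from the text; the count of roughly 20,000 protein-coding genes is an assumption introduced here only to illustrate the "several hundred per gene" estimate.

```python
LIVE_BIRTHS_PER_YEAR = 4_000_000        # from the text ("over 4 million")
NEW_MUTATIONS_PER_PERSON = 75           # from the text
CODING_MUTATIONS_PER_YEAR = 6_000_000   # from the text
ASSUMED_GENE_COUNT = 20_000             # assumption, not stated in this section

# Total new mutations entering the population each year.
total_new = LIVE_BIRTHS_PER_YEAR * NEW_MUTATIONS_PER_PERSON

# Fraction of new mutations that must fall in coding sequence
# for the text's figure of 6 million to hold.
implied_coding_fraction = CODING_MUTATIONS_PER_YEAR / total_new

# Average number of newborns per year with a new coding mutation in a given gene,
# treating all genes as equal in size (a simplification).
per_gene = CODING_MUTATIONS_PER_YEAR / ASSUMED_GENE_COUNT

print(f"total new mutations per year: {total_new:,}")
print(f"implied coding fraction: {implied_coding_fraction:.1%}")
print(f"new coding mutations per gene per year: {per_gene:.0f}")
```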
Conceptually similar studies have determined the rate of mutations in CNVs, where the generation of a new length variant depends on recombination, rather than on errors in DNA synthesis to generate a new base pair. The measured rate of formation of new CNVs (≈1.2 × 10⁻² per locus per generation) is orders of magnitude higher than that of base substitutions.
Rate of Disease-Causing Gene Mutations
The most direct way of estimating the rate of disease-causing mutations per locus per generation is to measure the incidence of new cases of a genetic disease that is not present in either parent and is caused by a single mutation producing a condition that is clearly recognizable in all neonates who carry it. Achondroplasia, a condition of reduced bone growth leading to short stature (Case 2), meets these requirements. In one study, seven achondroplastic children were born in a series of 242,257 consecutive births. All seven were born to parents of normal stature, and, because achondroplasia always manifests when a mutation is present, all were considered to represent new mutations. The new mutation rate at this locus can be calculated to be seven new mutations in a total of 2 × 242,257 copies of the relevant gene, or approximately 1.4 × 10⁻⁵ disease-causing mutations per locus per generation. This high mutation rate is particularly striking because it has been found that virtually all cases of achondroplasia are due to the identical mutation, a G to A mutation that changes a glycine codon to an arginine in the encoded protein.
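The calculation just described can be written out directly. The function name below is hypothetical; the sketch assumes, as the text does, that every affected child born to unaffected parents represents one new mutation and that each birth contributes two copies of the gene.

```python
def new_mutation_rate(affected_births, total_births, alleles_per_birth=2):
    """Direct estimate of the mutation rate per locus per generation.

    Each birth samples two copies of an autosomal gene, so the denominator
    is the number of gene copies examined, not the number of births.
    """
    return affected_births / (alleles_per_birth * total_births)


achondroplasia_rate = new_mutation_rate(7, 242_257)
print(f"{achondroplasia_rate:.2e} mutations per locus per generation")
```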
The rate of gene mutations that cause disease has been estimated for a number of other disorders in which the occurrence of a new mutation was determined by the appearance of a detectable disease (Table 4-3). The measured rates for these and other disorders vary over a 1000-fold range, from 10⁻⁴ to 10⁻⁷ mutations per locus per generation. The basis for these differences may be related to some or all of the following: the size of different genes; the fraction of all mutations in that gene that will lead to the disease; the age and sex of the parent in whom the mutation occurred; the mutational mechanism; and the presence or absence of mutational hot spots in the gene. Indeed, the high rate of the particular site-specific mutation in achondroplasia may be partially explained by the fact that the mutation on the other strand is a C to T change in a position that undergoes CpG methylation and is a hot spot for mutation by deamination, as discussed earlier.
TABLE 4-3 Estimates of Mutation Rates for Selected Human Disease Genes
Disease | Locus (Protein) | Mutation Rate*
Achondroplasia (Case 2) | FGFR3 (fibroblast growth factor receptor 3) | 1.4 × 10⁻⁵
Aniridia | PAX6 (Pax6) | 2.9-5 × 10⁻⁶
Duchenne muscular dystrophy (Case 14) | DMD (dystrophin) | 3.5-10.5 × 10⁻⁵
Hemophilia A (Case 21) | F8 (factor VIII) | 3.2-5.7 × 10⁻⁵
Hemophilia B (Case 21) | F9 (factor IX) | 2-3 × 10⁻⁶
Neurofibromatosis, type 1 (Case 34) | NF1 (neurofibromin) | 4-10 × 10⁻⁵
Polycystic kidney disease, type 1 (Case 37) | PKD1 (polycystin) | 6.5-12 × 10⁻⁵
Retinoblastoma (Case 39) | RB1 (Rb1) | 5-12 × 10⁻⁶
*Expressed as mutations per locus per generation.
Based on data in Vogel F, Motulsky AG: Human genetics, ed 3, Berlin, 1997, Springer-Verlag.
Notwithstanding this range of rates among different genes, the median gene mutation rate is approximately 1 × 10⁻⁶. Given that there are at least 5000 genes in the human genome in which mutations are currently known to cause a discernible disease or other trait (see Chapter 7), approximately 1 in 200 persons is likely to receive a new mutation in a known disease-associated gene from one or the other parent.
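The figure of 1 in 200 corresponds to multiplying the two numbers given in the text. The sketch below simply reproduces that arithmetic and compares it with the exact probability of at least one new mutation; it assumes that mutations at different loci arise independently and that every gene mutates at the median rate, both of which are simplifications.

```python
MEDIAN_RATE = 1e-6           # mutations per locus per generation (from the text)
KNOWN_DISEASE_GENES = 5000   # from the text ("at least 5000 genes")

# Simple approximation: sum the small per-gene probabilities.
approx = MEDIAN_RATE * KNOWN_DISEASE_GENES

# Exact probability of at least one new mutation among independent loci.
exact = 1 - (1 - MEDIAN_RATE) ** KNOWN_DISEASE_GENES

print(f"approximate: about 1 in {round(1 / approx)}")
print(f"exact:       about 1 in {round(1 / exact)}")
```

Because the per-gene rates are so small, the two values are nearly identical, which is why the simple product is adequate here.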
Sex Differences and Age Effects on Mutation Rates
Because the DNA in sperm has undergone far more replication cycles than has the DNA in ova (see Chapter 2), there is greater opportunity for errors to occur; one might predict, then, that many mutations will be more often paternal than maternal in origin. Indeed, where this has been explored, new mutations responsible for certain conditions (e.g., achondroplasia, as we just discussed) are usually missense mutations that arise nearly always in the paternal germline. Furthermore, the older a man is, the more rounds of replication have preceded the meiotic divisions, and thus the frequency of paternal new mutations might be expected to increase with the age of the father. In fact, increasing paternal age has been observed to correlate with the incidence of gene mutations for a number of disorders (including achondroplasia) and with the incidence of regional mutations involving CNVs in autism spectrum disorders (Case 5). In other diseases, however, the parent-of-origin and age effects on mutational spectra are, for unknown reasons, not as striking.
Types of Mutations and Their Consequences
In this section, we consider the nature of different mutations and their effect on the genes involved. Each type of mutation discussed here is illustrated by one or more disease examples. Notably, the specific mutation found in almost all cases of achondroplasia is the exception rather than the rule, and the mutations that underlie a single genetic disease are more typically heterogeneous among a group of affected individuals. Different cases of a particular disorder will therefore usually be caused by different underlying mutations (Table 4-4). In Chapters 11 and 12, we will turn to the ways in which mutations in specific disease genes cause these diseases.
Types of Mutation in Human Genetic Disease
A single nucleotide substitution (or point mutation) in a gene sequence, such as that observed in the example of achondroplasia just described, can alter the code in a triplet of bases and cause the nonsynonymous replacement of one amino acid by another in the gene product (see the genetic code in Table 3-1 and the example in Fig. 4-4). Such mutations are called missense mutations because they alter the coding (or “sense”) strand of the gene to specify a different amino acid. Although not all missense mutations lead to an observable change in the function of the protein, the resulting protein may fail to work properly, may be unstable and rapidly degraded, or may fail to localize in its proper intracellular position. In many disorders, such as β-thalassemia (Case 44), most of the mutations detected in different patients are missense mutations (see Chapter 11).
Point mutations in a DNA sequence that cause the replacement of the normal codon for an amino acid by one of the three termination (or “stop”) codons are called nonsense mutations. Because translation of messenger RNA (mRNA) ceases when a termination codon is reached (see Chapter 3), a mutation that converts a codon within a coding exon into a termination codon causes translation to stop partway through the coding sequence of the mRNA. The consequences of premature termination mutations are twofold. First, the mRNA carrying a premature termination codon is often targeted for rapid degradation (through a cellular process known as nonsense-mediated mRNA decay), and no translation is possible. Second, even if the mRNA is stable enough to be translated, the truncated protein is usually so unstable that it is rapidly degraded within the cell (see Chapter 12 for examples).
Whereas some point mutations create a premature termination codon, others may destroy the normal termination codon and thus permit translation to continue until another termination codon in the mRNA is reached further downstream. Such a mutation will lead to an abnormal protein product with additional amino acids at its carboxyl terminus, and may also disrupt regulatory functions normally provided by the 3′ untranslated region downstream from the normal stop codon.
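The distinctions drawn in the preceding paragraphs can be summarized by comparing the amino acid encoded before and after a substitution. The sketch below uses the standard genetic code on the coding strand; the function name and the label "stop-loss" for a mutation that destroys the normal termination codon are our own choices for illustration, and the example codons are hypothetical apart from the serine-to-alanine change shown in Fig. 4-4.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order
# for each codon position; "*" marks a termination codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a single-codon substitution on the coding strand."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"      # amino acid codon becomes a stop codon
    if ref_aa == "*":
        return "stop-loss"     # normal stop codon now encodes an amino acid
    return "missense"          # one amino acid replaced by another


print(classify_substitution("TCA", "GCA"))  # Ser to Ala, as in Fig. 4-4
print(classify_substitution("TCA", "TCG"))  # Ser to Ser
print(classify_substitution("TCA", "TGA"))  # Ser to stop
print(classify_substitution("TGA", "TCA"))  # stop to Ser
```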
Mutations Affecting RNA Transcription, Processing, and Translation
The normal mechanism by which initial RNA transcripts are made and then converted into mature mRNAs (or final versions of noncoding RNAs) requires a series of modifications, including transcription factor binding, 5′ capping, polyadenylation, and splicing (see Chapter 3). All of these steps in RNA maturation depend on specific sequences within the RNA. In the case of splicing, two general classes of splicing mutations have been described. Excising introns from unprocessed RNA and splicing the exons together to form a mature RNA requires particular nucleotide sequences located at or near the exon-intron (5′ donor site) or the intron-exon (3′ acceptor site) junctions. Mutations that affect these required bases at either the splice donor or acceptor site interfere with (and in some cases abolish) normal RNA splicing at that site. A second class of splicing mutations involves base substitutions that do not affect the donor or acceptor site sequences themselves but instead create alternative donor or acceptor sites that compete with the normal sites during RNA processing. Thus at least a proportion of the mature mRNA or noncoding RNA in such cases may contain improperly spliced intron sequences. Examples of both types of mutation are presented in Chapter 11.
For protein-coding genes, even if the mRNA is made and is stable, point mutations in the 5′ and 3′-untranslated regions can also contribute to disease by changing mRNA stability or translation efficiency, thereby reducing the amount of protein product that is made.
Deletions, Insertions, and Rearrangements
Mutations can also be caused by the insertion, deletion, or rearrangement of DNA sequences. Some deletions and insertions involve only a few nucleotides and are generally most easily detected by direct sequencing of that part of the genome. In other cases, a substantial segment of a gene or an entire gene is deleted, duplicated, inverted, or translocated to create a novel arrangement of gene sequences. Depending on the exact nature of the deletion, insertion, or rearrangement, a variety of different laboratory approaches can be used to detect the genomic alteration.
Some deletions and insertions affect only a small number of base pairs. When such a mutation occurs in a coding sequence and the number of bases involved is not a multiple of three (i.e., is not an integral number of codons), the reading frame will be altered beginning at the point of the insertion or deletion. The resulting mutations are called frameshift mutations (see Fig. 4-4). From the point of the insertion or deletion, a different sequence of codons is thereby generated that encodes incorrect amino acids followed by a termination codon in the shifted frame, typically leading to a functionally altered protein product. In contrast, if the number of base pairs inserted or deleted is a multiple of three, then no frameshift occurs and there will be a simple insertion or deletion of the corresponding amino acids in the otherwise normally translated gene product. Larger insertions or deletions, ranging from approximately 100 to more than 1000 bp, are typically referred to as “indels,” as we saw in the case of polymorphisms earlier. They can affect multiple exons of a gene and cause major disruptions of the coding sequence.
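The effect of a small insertion or deletion on the reading frame can be illustrated by translating a short sequence before and after the change. The sketch below uses the standard genetic code and a hypothetical 18-bp coding sequence; the function names are our own and the example is not drawn from any real gene.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a termination codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding-strand sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def indel_effect(length_change):
    """Report whether a change in length (in bp) preserves the reading frame."""
    return "in-frame" if length_change % 3 == 0 else "frameshift"


reference = "ATGTCAGGCCTGAAATAG"                 # hypothetical: Met Ser Gly Leu Lys stop
deletion = reference[:3] + reference[4:]          # delete first base of codon 2
insertion = reference[:3] + "G" + reference[3:]   # insert one base before codon 2
in_frame = reference[:3] + reference[6:]          # delete all three bases of codon 2

for label, seq in [("reference", reference), ("1-bp deletion", deletion),
                   ("1-bp insertion", insertion), ("3-bp deletion", in_frame)]:
    change = len(seq) - len(reference)
    print(f"{label:15s} {indel_effect(change):10s} {translate(seq)}")
```

In this toy example the single-base deletion happens to create a stop codon in the shifted frame after only two altered codons, whereas the three-base deletion removes one amino acid and leaves the rest of the product unchanged.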
One type of insertion mutation involves insertion of a mobile element, such as those belonging to the LINE family of repetitive DNA. It is estimated that, in any individual, approximately 100 copies of a particular subclass of the LINE family in the genome are capable of movement by retrotransposition, introduced earlier. Such movement not only generates genetic diversity in our species (see Fig. 4-2) but can also cause disease by insertional mutagenesis. For example, in some patients with the severe bleeding disorder hemophilia A (Case 21), LINE sequences several kilobase pairs long are found to be inserted into an exon in the factor VIII gene, interrupting the coding sequence and inactivating the gene. LINE insertions throughout the genome are also common in colon cancer, reflecting retrotransposition in somatic cells (see Chapter 15).
As we discussed in the context of polymorphisms earlier in this chapter, duplications, deletions, and inversions of a larger segment of a single chromosome are predominantly the result of homologous recombination between DNA segments with high sequence homology (Fig. 4-5). Disorders arising as a result of such exchanges can be due to a change in the dosage of otherwise wild-type gene products when the homologous segments lie outside the genes themselves (see Chapter 6). Alternatively, such mutations can lead to a change in the nature of the encoded protein itself when recombination occurs between different genes within a gene family (see Chapter 11) or between genes on different chromosomes (see Chapter 15). Abnormal pairing and recombination between two similar sequences in opposite orientation on a single strand of DNA leads to inversion. For example, nearly half of all cases of hemophilia A are due to recombination that inverts a number of exons, thereby disrupting gene structure and rendering the gene incapable of encoding a normal gene product (see Fig. 4-5).
FIGURE 4-5 Inverted homologous sequences, labeled A and B, located 500 kb apart on the X chromosome, one upstream of the factor VIII gene, the other in an intron between exons 22 and 23 of the gene. Intrachromosomal mispairing and recombination results in inversion of exons 1 through 22 of the gene, thereby disrupting the gene and causing severe hemophilia.
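The rearrangement shown in Fig. 4-5 can be modeled schematically. In the sketch below, the locus is a list of named elements with an orientation; the element names and the grouping of exons are hypothetical labels chosen for illustration. Recombination between the two inverted repeats is modeled as reversing the order and flipping the orientation of everything lying between them.

```python
def flip(orientation):
    """Return the opposite strand orientation."""
    return "-" if orientation == "+" else "+"


def invert_between(elements, left, right):
    """Invert the elements lying strictly between two named repeats.

    elements is a list of (name, orientation) tuples in chromosomal order.
    The intervening elements are reversed in order and flipped in orientation;
    the repeats themselves and everything outside them are left unchanged.
    """
    names = [name for name, _ in elements]
    i, j = names.index(left), names.index(right)
    if i > j:
        i, j = j, i
    inner = [(name, flip(o)) for name, o in reversed(elements[i + 1:j])]
    return elements[:i + 1] + inner + elements[j:]


# Schematic factor VIII locus: repeat A upstream, repeat B within the intron
# between exons 22 and 23, in opposite orientation to A.
locus = [
    ("A", "+"),
    ("exon_1", "+"),
    ("exons_2_21", "+"),
    ("exon_22", "+"),
    ("B", "-"),
    ("exons_23_on", "+"),
]

for name, orientation in invert_between(locus, "A", "B"):
    print(name, orientation)
```

After the inversion, exons 1 through 22 lie in the opposite orientation to the remaining exons, so no continuous transcript of the normal gene can be produced.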
The mutations in some disorders involve amplification of a simple nucleotide repeat sequence. For example, simple repeats such as (CCG)n, (CAG)n, or (CCTG)n located in the coding portion of an exon, in an untranslated region of an exon, or even in an intron may expand during gametogenesis, in what is referred to as a dynamic mutation, and interfere with normal gene expression or protein function. An expanded repeat in the coding region will generate an abnormal protein product, whereas repeat expansion in the untranslated regions or introns of a gene may interfere with transcription, mRNA processing, or translation. How dynamic mutations occur is not completely understood; they are conceptually similar to microsatellite polymorphisms but expand at a rate much higher than typically seen for microsatellite loci.
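The size of a simple repeat tract can be measured by counting uninterrupted copies of the repeat unit. The sketch below is a minimal illustration with a hypothetical sequence and function name; it does not encode any disease-specific threshold, because the repeat lengths that separate normal from expanded alleles differ from disorder to disorder and are not given in this section.

```python
import re


def longest_repeat_run(seq, unit):
    """Return the largest n such that (unit)n occurs uninterrupted in seq."""
    pattern = re.compile(f"(?:{re.escape(unit.upper())})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(seq.upper())]
    return max(runs, default=0)


# Hypothetical allele: 12 CAG repeats, one interrupting CAA, then 3 more CAG repeats.
allele = "TT" + "CAG" * 12 + "CAA" + "CAG" * 3 + "GG"

print(longest_repeat_run(allele, "CAG"))
print(longest_repeat_run(allele, "CCTG"))
```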
The involvement of simple nucleotide repeat expansions in disease is discussed further in Chapters 7 and 12. In disorders caused by dynamic mutations, marked parent-of-origin effects are well known and appear characteristic of the specific disease and/or the particular simple nucleotide repeat involved (see Chapter 12). Such differences may be due to fundamental biological differences between oogenesis and spermatogenesis but may also result from selection against gametes carrying certain repeat expansions.
Variation in Individual Genomes
The most extensive current inventory of the amount and type of variation to be expected in any given genome comes from the direct analysis of individual diploid human genomes. The first of such genome sequences, that of a male individual, was reported in 2007. Now, tens of thousands of individual genomes have been sequenced, some as part of large international research consortia exploring human genetic diversity in health and disease, and others in the context of clinical sequencing to determine the underlying basis of a disorder in particular patients.
What degree of genome variation does one detect in such studies? Individual human genomes typically carry 5 to 10 million SNPs, of which—depending in part on the population—as many as a quarter to a third are novel (see Box). This suggests that the number of SNPs described for our species is still incomplete, although presumably the fraction of such novel SNPs will decrease as more and more genomes from more and more populations are sequenced.
Within this variation lie variants with known, likely, or suspected clinical impact. Based on studies to date, each genome carries 50 to 100 variants that have previously been implicated in known inherited conditions. In addition, each genome carries thousands of nonsynonymous SNPs in protein-coding genes around the genome, some of which would be predicted to alter protein function. Each genome also carries approximately 200 to 300 likely loss-of-function mutations, some of which are present at both alleles of genes in that individual. Within the clinical setting, this realization has important implications for the interpretation of genome sequence data from patients, particularly when trying to predict the impact of mutations in genes of currently unknown function (see Chapter 16).
An interesting and unanticipated aspect of individual genome sequencing is that the reference human genome assembly still lacks considerable amounts of undocumented and unannotated DNA that are discovered in literally every individual genome being sequenced. These “new” sequences are revealed only as additional genomes are sequenced. Thus the complete collection of all human genome sequences to be found in our current population of 7 billion individuals, estimated to be 20 to 40 Mb larger than the extant reference assembly, still remains to be fully elucidated.
As impressive as the current inventory of human genetic diversity is, it is clear that we are still in a mode of discovery; no doubt millions of additional SNPs and other variants remain to be uncovered, as does the degree to which any of them might affect an individual's clinical status in the context of wellness and health care.
Variation Detected in a Typical Human Genome
Individuals vary greatly in a wide range of biological functions, determined in part by variation among their genomes. Any individual genome will contain the following:
• ≈5-10 million SNPs (varies by population)
• 25,000-50,000 rare variants (private mutations or seen previously in < 0.5% of individuals tested)
• ≈75 new base pair mutations not detected in parental genomes
• 3-7 new CNVs involving ≈500 kb of DNA
• 200,000-500,000 indels (1-50 bp) (varies by population)
• 500-1000 deletions 1-45 kb, overlapping ≈200 genes
• ≈150 in-frame indels
• ≈200-250 shifts in reading frame
• 10,000-12,000 synonymous SNPs
• 8,000-11,000 nonsynonymous SNPs in 4,000-5,000 genes
• 175-500 rare nonsynonymous variants
• 1 new nonsynonymous mutation
• ≈100 premature stop codons
• 40-50 splice site-disrupting variants
• 250-300 genes with likely loss-of-function variants
• ≈25 genes predicted to be completely inactivated
Clinical Sequencing Studies
In the context of genomic medicine, a key question is to what extent variation in the sequence and/or expression of one's genome influences the likelihood of disease onset, determines or signals the natural history of disease, and/or provides clues relevant to the management of disease. As just discussed, variation in one's constitutional genome can have a number of different direct or indirect effects on gene function.
Sequencing of entire genomes (so-called whole-genome sequencing) or of the subset of genomes that include all of the known coding exons (so-called whole-exome sequencing) has been introduced in a number of clinical settings, as will be discussed in greater detail in Chapter 16. Both whole-exome and whole-genome sequencing have been used to detect de novo mutations (both point mutations and CNVs) in a variety of conditions of complex and/or unknown etiology, including, for example, various neurodevelopmental or neuropsychiatric conditions, such as autism, schizophrenia, epilepsy, or intellectual disability and developmental delay.
Clinical sequencing studies can target either germline or somatic variants. In cancer, especially, various strategies have been used to search for somatic mutations in tumor tissue to identify genes potentially relevant to cancer progression (see Chapter 15).
Personal Genomics and the Role of the Consumer
The increasing ability to sequence individual genomes is not only enabling research and clinical laboratories, but also spawning a social and information revolution among consumers in the context of direct-to-consumer (DTC) genomics, in which testing of polymorphisms genome-wide and even sequencing of entire genomes is offered directly to potential customers, bypassing health professionals.
It is still largely unclear what degree of genome surveillance will be most useful for routine clinical practice, and this is likely to evolve rapidly in the case of specific conditions, as our knowledge increases, as professional practice guidelines are adopted, and as insurance companies react. Some groups have raised substantial concerns about privacy and about the need to regulate the industry. At the same time, however, other individuals are willing to make genome sequence data (and even medical information) available more or less publicly.
Attitudes in this area vary widely among professionals and the general public alike, depending on whether one views knowing the sequence of one's genome to be a fundamentally medical or personal activity. Critics of DTC testing and policymakers, in both the health industry and government, focus on issues of clinical utility, regulatory standards, medical oversight, availability of genetic counseling, and privacy. Proponents of DTC testing and even consumers themselves, on the other hand, focus more on freedom of information, individual rights, social and personal awareness, public education, and consumer empowerment.
The availability of individual genome information is increasingly a commercial commodity and a personal reality. In that sense, and without minimizing the significant scientific, ethical, and clinical issues that lie ahead, it is certain that individual genome sequences will be an active part of medical practice for today's students.
Impact of Mutation and Polymorphism
Although it will be self-evident to students of human genetics that new deleterious mutations or rare variants in the population may have clinical consequences, it may appear less obvious that common polymorphic variants can be medically relevant. For the proportion of polymorphic variation that occurs in the genes themselves, such loci can be studied by examining variation in the proteins encoded by the different alleles. It has long been estimated that any one individual is likely to carry two distinct alleles determining structurally differing polypeptides at approximately 20% of all protein-coding loci; when individuals from different geographic or ethnic groups are compared, an even greater fraction of proteins has been found to exhibit detectable polymorphism. In addition, even when the gene product is identical, the levels of expression of that product may be very different among different individuals, determined by a combination of genetic and epigenetic variation, as we saw in Chapter 3.
Thus a striking degree of biochemical individuality exists within the human species in its makeup of enzymes and other gene products. Furthermore, because the products of many of the encoded biochemical and regulatory pathways interact in functional and physiological networks, one may plausibly conclude that each individual, regardless of his or her state of health, has a unique, genetically determined chemical makeup and thus responds in a unique manner to environmental, dietary, and pharmacological influences. This concept of chemical individuality, first put forward over a century ago by Garrod, the remarkably prescient British physician introduced in Chapter 1, remains true today. The broad question of what is normal—an essential concept in human biology and in clinical medicine—remains very much an open one when it comes to the human genome.
The following chapters will explore this concept in detail, first in the context of genome and chromosome mutations (Chapters 5 and 6) and then in terms of gene mutations and polymorphisms that determine the inheritance of genetic disease (Chapter 7) and influence its likelihood in families and populations (Chapters 8 and 9).
1. Polymorphism can arise from a variety of mechanisms, with different consequences. Describe and contrast the types of polymorphism that can have the following effects:
a. A change in dosage of a gene or genes
b. A change in the sequence of multiple amino acids in the product of a protein-coding gene
c. A change in the final structure of an RNA produced from a gene
d. A change in the order of genes in a region of a chromosome
e. No obvious effect
2. Aniridia is an eye disorder characterized by the complete or partial absence of the iris and is always present when a mutation occurs in the responsible gene. In one population, 41 children diagnosed with aniridia were born to parents of normal vision among 4.5 million births during a period of 40 years. Assuming that these cases were due to new mutations, what is the estimated mutation rate at the aniridia locus? On what assumptions is this estimate based, and why might this estimate be either too high or too low?
3. Which of the following types of polymorphism would be most effective for distinguishing two individuals from the general population: a SNP, a simple indel, or a microsatellite? Explain your reasoning.
4. Consider two cell lineages that differ from one another by a series of 100 cell divisions. Given the rate of mutation for different types of variation, how different would the genomes of those lineages be?
5. Compare the likely impact of each of the following on the overall rate of mutation detected in any given genome: age of the parents, hot spots of mutation, intrachromosomal homologous recombination, genetic variation in the parental genomes.