Gene Structure and Function
Over the past three decades, remarkable progress has been made in our understanding of the structure and function of genes and chromosomes. These advances have been aided by the applications of molecular genetics and genomics to many clinical problems, thereby providing the tools for a distinctive new approach to medical genetics. In this chapter, we present an overview of gene structure and function and the aspects of molecular genetics required for an understanding of the genetic and genomic approach to medicine. To supplement the information discussed here and in subsequent chapters, we provide additional material online to detail many of the experimental approaches of modern genetics and genomics that are becoming critical to the practice and understanding of human and medical genetics.
The increased knowledge of genes and of their organization in the genome has had an enormous impact on medicine and on our perception of human physiology. As 1980 Nobel laureate Paul Berg stated presciently at the dawn of this new era:
Just as our present knowledge and practice of medicine relies on a sophisticated knowledge of human anatomy, physiology, and biochemistry, so will dealing with disease in the future demand a detailed understanding of the molecular anatomy, physiology, and biochemistry of the human genome.… We shall need a more detailed knowledge of how human genes are organized and how they function and are regulated. We shall also have to have physicians who are as conversant with the molecular anatomy and physiology of chromosomes and genes as the cardiac surgeon is with the structure and workings of the heart.
Information Content of the Human Genome
How does the 3-billion-letter digital code of the human genome guide the intricacies of human anatomy, physiology, and biochemistry to which Berg referred? The answer lies in the enormous amplification and integration of information content that occurs as one moves from genes in the genome to their products in the cell and to the observable expression of that genetic information as cellular, morphological, clinical, or biochemical traits—what is termed the phenotype of the individual. This hierarchical expansion of information from the genome to phenotype includes a wide range of structural and regulatory RNA products, as well as protein products that orchestrate the many functions of cells, organs, and the entire organism, in addition to their interactions with the environment. Even with the essentially complete sequence of the human genome in hand, we still do not know the precise number of genes in the genome. Current estimates are that the genome contains approximately 20,000 protein-coding genes (see Box in Chapter 2), but this figure only begins to hint at the levels of complexity that emerge from the decoding of this digital information (Fig. 3-1).
FIGURE 3-1 The amplification of genetic information from genome to gene products to gene networks and ultimately to cellular function and phenotype. The genome contains both protein-coding genes (blue) and noncoding RNA (ncRNA) genes (red). Many genes in the genome use alternative coding information to generate multiple different products. Both small and large ncRNAs participate in gene regulation. Many proteins participate in multigene networks that respond to cellular signals in a coordinated and combinatorial manner, thus further expanding the range of cellular functions that underlie organismal phenotypes.
As introduced briefly in Chapter 2, the product of protein-coding genes is a protein whose structure ultimately determines its particular functions in the cell. But if there were a simple one-to-one correspondence between genes and proteins, we could have at most approximately 20,000 different proteins. This number seems insufficient to account for the vast array of functions that occur in human cells over the life span. The answer to this dilemma is found in two features of gene structure and function. First, many genes are capable of generating multiple different products, not just one (see Fig. 3-1). This process, discussed later in this chapter, is accomplished through the use of alternative coding segments in genes and through the subsequent biochemical modification of the encoded protein; these two features of complex genomes result in a substantial amplification of information content. Indeed, it has been estimated that in this way, these 20,000 human genes can encode many hundreds of thousands of different proteins, collectively referred to as the proteome. Second, individual proteins do not function by themselves. They form elaborate networks, involving many different proteins and regulatory RNAs that respond in a coordinated and integrated fashion to many different genetic, developmental, or environmental signals. The combinatorial nature of protein networks results in an even greater diversity of possible cellular functions.
Genes are located throughout the genome but tend to cluster in particular regions on particular chromosomes and to be relatively sparse in other regions or on other chromosomes. For example, chromosome 11, an approximately 135 million-bp (megabase pairs [Mb]) chromosome, is relatively gene-rich with approximately 1300 protein-coding genes (see Fig. 2-7). These genes are not distributed randomly along the chromosome, and their localization is particularly enriched in two chromosomal regions with gene density as high as one gene every 10 kb (Fig. 3-2). Some of the genes belong to families of related genes, as we will describe more fully later in this chapter. Other regions are gene-poor, and there are several so-called gene deserts of a million base pairs or more without any known protein-coding genes. Two caveats here: first, the process of gene identification and genome annotation remains very much an ongoing challenge; despite the apparent robustness of recent estimates, it is virtually certain that there are some genes, including clinically relevant genes, that are currently undetected or that display characteristics that we do not currently recognize as being associated with genes. And second, as mentioned in Chapter 2, many genes are not protein-coding; their products are functional RNA molecules (noncoding RNAs or ncRNAs; see Fig. 3-1) that play a variety of roles in the cell, many of which are only just being uncovered.
FIGURE 3-2 Gene content on chromosome 11, which consists of 135 Mb of DNA. A, The distribution of genes is indicated along the chromosome and is high in two regions of the chromosome and low in other regions. B, An expanded region from 5.15 to 5.35 Mb (measured from the short-arm telomere), which contains 10 known protein-coding genes, five belonging to the olfactory receptor (OR) gene family and five belonging to the globin gene family. C, The five β-like globin genes expanded further. See Sources & Acknowledgments.
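The contrast between gene-rich regions and the chromosome as a whole can be made concrete with a little arithmetic. The sketch below uses only the approximate figures quoted in the text for chromosome 11; the variable names are illustrative.

```python
# Approximate figures quoted in the text for chromosome 11.
CHROMOSOME_LENGTH_BP = 135_000_000   # about 135 Mb
PROTEIN_CODING_GENES = 1300          # approximate gene count

# Chromosome-wide average spacing between protein-coding genes, in kb.
average_kb_per_gene = CHROMOSOME_LENGTH_BP / PROTEIN_CODING_GENES / 1000

# Spacing in the densest regions, as quoted in the text (one gene every 10 kb).
DENSE_REGION_KB_PER_GENE = 10

# How much denser the gene-rich regions are than the chromosome-wide average.
enrichment = average_kb_per_gene / DENSE_REGION_KB_PER_GENE

print(f"average spacing: about {average_kb_per_gene:.0f} kb per gene")
print(f"densest regions: about {enrichment:.0f}-fold above the average density")
```

On these figures, the average works out to roughly one gene per 104 kb, so the densest regions carry genes at about ten times the chromosome-wide density.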
For genes located on the autosomes, there are two copies of each gene, one on the chromosome inherited from the mother and one on the chromosome inherited from the father. For most autosomal genes, both copies are expressed and generate a product. There are, however, a growing number of genes in the genome that are exceptions to this general rule and are expressed at characteristically different levels from the two copies, including some that, at the extreme, are expressed from only one of the two homologues. These examples of allelic imbalance are discussed in greater detail later in this chapter, as well as in Chapters 6 and 7.
The Central Dogma: DNA → RNA → Protein
How does the genome specify the functional complexity and diversity evident in Figure 3-1? As we saw in the previous chapter, genetic information is contained in DNA in the chromosomes within the cell nucleus. However, protein synthesis, the process through which information encoded in the genome is actually used to specify cellular functions, takes place in the cytoplasm. This compartmentalization reflects the fact that the human organism is a eukaryote. This means that human cells have a nucleus containing the genome, which is separated by a nuclear membrane from the cytoplasm. In contrast, in prokaryotes like the intestinal bacterium Escherichia coli, DNA is not enclosed within a nucleus. Because of the compartmentalization of eukaryotic cells, information transfer from the nucleus to the cytoplasm is a complex process that has been a focus of much attention among molecular and cellular biologists.
The molecular link between these two related types of information—the DNA code of genes and the amino acid code of protein—is ribonucleic acid (RNA). The chemical structure of RNA is similar to that of DNA, except that each nucleotide in RNA has a ribose sugar component instead of a deoxyribose; in addition, uracil (U) replaces thymine as one of the pyrimidine bases of RNA (Fig. 3-3). An additional difference between RNA and DNA is that RNA in most organisms exists as a single-stranded molecule, whereas DNA, as we saw in Chapter 2, exists as a double helix.
FIGURE 3-3 The pyrimidine uracil and the structure of a nucleotide in RNA. Note that the sugar ribose replaces the sugar deoxyribose of DNA. Compare with Figure 2-2.
The informational relationships among DNA, RNA, and protein are intertwined: genomic DNA directs the synthesis and sequence of RNA, RNA directs the synthesis and sequence of polypeptides, and specific proteins are involved in the synthesis and metabolism of DNA and RNA. This flow of information is referred to as the central dogma of molecular biology.
Genetic information is stored in the DNA of the genome by means of a code (the genetic code, discussed later) in which the sequence of adjacent bases ultimately determines the sequence of amino acids in the encoded polypeptide. First, RNA is synthesized from the DNA template through a process known as transcription. The RNA, carrying the coded information in a form called messenger RNA (mRNA), is then transported from the nucleus to the cytoplasm, where the RNA sequence is decoded, or translated, to determine the sequence of amino acids in the protein being synthesized. The process of translation occurs on ribosomes, which are cytoplasmic organelles with binding sites for all of the interacting molecules, including the mRNA, involved in protein synthesis. Ribosomes are themselves made up of many different structural proteins in association with specialized types of RNA known as ribosomal RNA (rRNA). Translation involves yet a third type of RNA, transfer RNA (tRNA), which provides the molecular link between the code contained in the base sequence of each mRNA and the amino acid sequence of the protein encoded by that mRNA.
Because of the interdependent flow of information represented by the central dogma, one can begin discussion of the molecular genetics of gene expression at any of its three informational levels: DNA, RNA, or protein. We begin by examining the structure of genes in the genome as a foundation for discussion of the genetic code, transcription, and translation.
Gene Organization and Structure
In its simplest form, a protein-coding gene can be visualized as a segment of a DNA molecule containing the code for the amino acid sequence of a polypeptide chain and the regulatory sequences necessary for its expression. This description, however, is inadequate for genes in the human genome (and indeed in most eukaryotic genomes) because few genes exist as continuous coding sequences. Rather, in the majority of genes, the coding sequences are interrupted by one or more noncoding regions (Fig. 3-4). These intervening sequences, called introns, are initially transcribed into RNA in the nucleus but are not present in the mature mRNA in the cytoplasm, because they are removed (“spliced out”) by a process we will discuss later. Thus information from the intronic sequences is not normally represented in the final protein product. Introns alternate with exons, the segments of genes that ultimately determine the amino acid sequence of the protein. In addition, the collection of coding exons in any particular gene is flanked by additional sequences that are transcribed but untranslated, called the 5′ and 3′ untranslated regions (see Fig. 3-4). Although a few genes in the human genome have no introns, most genes contain at least one and usually several introns. In many genes, the cumulative length of the introns makes up a far greater proportion of a gene's total length than do the exons. Whereas some genes are only a few kilobase pairs in length, others stretch on for hundreds of kilobase pairs. Also, a few genes are exceptionally large; for example, the dystrophin gene on the X chromosome (mutations in which lead to Duchenne muscular dystrophy [Case 14]) spans more than 2 Mb, of which, remarkably, less than 1% consists of coding exons.
FIGURE 3-4 A, General structure of a typical human gene. Individual labeled features are discussed in the text. B, Examples of three medically important human genes. Different mutations in the β-globin gene, with three exons, cause a variety of important disorders of hemoglobin (Cases 42 and 44). Mutations in the BRCA1 gene (24 exons) are responsible for many cases of inherited breast or breast and ovarian cancer (Case 7). Mutations in the β-myosin heavy chain (MYH7) gene (40 exons) lead to inherited hypertrophic cardiomyopathy.
Structural Features of a Typical Human Gene
A range of features characterize human genes (see Fig. 3-4). In Chapters 1 and 2, we briefly defined gene in general terms. At this point, we can provide a molecular definition of a gene as a sequence of DNA that specifies production of a functional product, be it a polypeptide or a functional RNA molecule. A gene includes not only the actual coding sequences but also adjacent nucleotide sequences required for the proper expression of the gene—that is, for the production of normal mRNA or other RNA molecules in the correct amount, in the correct place, and at the correct time during development or during the cell cycle.
The adjacent nucleotide sequences provide the molecular “start” and “stop” signals for the synthesis of mRNA transcribed from the gene. Because the primary RNA transcript is synthesized in a 5′ to 3′ direction, the transcriptional start is referred to as the 5′ end of the transcribed portion of a gene (see Fig. 3-4). By convention, the genomic DNA that precedes the transcriptional start site in the 5′ direction is referred to as the “upstream” sequence, whereas DNA sequence located in the 3′ direction past the end of a gene is referred to as the “downstream” sequence. At the 5′ end of each gene lies a promoter region that includes sequences responsible for the proper initiation of transcription. Within this region are several DNA elements whose sequence is often conserved among many different genes; this conservation, together with functional studies of gene expression, indicates that these particular sequences play an important role in gene regulation. Only a subset of genes in the genome is expressed in any given tissue or at any given time during development. Several different types of promoter are found in the human genome, with different regulatory properties that specify the patterns as well as the levels of expression of a particular gene in different tissues and cell types, both during development and throughout the life span. Some of these properties are encoded in the genome, whereas others are specified by features of chromatin associated with those sequences, as discussed later in this chapter. Both promoters and other regulatory elements (located either 5′ or 3′ of a gene or in its introns) can be sites of mutation in genetic disease that can interfere with the normal expression of a gene. These regulatory elements, including enhancers, insulators, and locus control regions, are discussed more fully later in this chapter. 
Some of these elements lie a significant distance away from the coding portion of a gene, thus reinforcing the concept that the genomic environment in which a gene resides is an important feature of its evolution and regulation.
The 3′ untranslated region contains a signal for the addition of a sequence of adenosine residues (the so-called polyA tail) to the end of the mature RNA. Although it is generally accepted that such closely neighboring regulatory sequences are part of what is called a gene, the precise dimensions of any particular gene will remain somewhat uncertain until the potential functions of more distant sequences are fully characterized.
Many genes belong to gene families, which share closely related DNA sequences and encode polypeptides with closely related amino acid sequences.
Members of two such gene families are located within a small region on chromosome 11 (see Fig. 3-2) and illustrate a number of features that characterize gene families in general. One small and medically important gene family is composed of genes that encode the protein chains found in hemoglobins. The β-globin gene cluster on chromosome 11 and the related α-globin gene cluster on chromosome 16 are believed to have arisen by duplication of a primitive precursor gene approximately 500 million years ago. These two clusters contain multiple genes coding for closely related globin chains expressed at different developmental stages, from embryo to adult. Each cluster is believed to have evolved by a series of sequential gene duplication events within the past 100 million years. The exon-intron patterns of the functional globin genes have been remarkably conserved during evolution; each of the functional globin genes has two introns at similar locations (see the β-globin gene in Fig. 3-4), although the sequences contained within the introns have accumulated far more nucleotide base changes over time than have the coding sequences of each gene. The control of expression of the various globin genes, in the normal state as well as in the many inherited disorders of hemoglobin, is considered in more detail both later in this chapter and in Chapter 11.
The second gene family shown in Figure 3-2 is the family of olfactory receptor (OR) genes. There are estimated to be as many as 1000 OR genes in the genome. ORs are responsible for our acute sense of smell that can recognize and distinguish thousands of structurally diverse chemicals. OR genes are found throughout the genome on nearly every chromosome, although more than half are found on chromosome 11, including a number of family members near the β-globin cluster.
Within both the β-globin and OR gene families are sequences that are related to the functional globin and OR genes but that do not produce any functional RNA or protein product. DNA sequences that closely resemble known genes but are nonfunctional are called pseudogenes, and there are tens of thousands of pseudogenes related to many different genes and gene families located all around the genome. Pseudogenes are of two general types, processed and nonprocessed. Nonprocessed pseudogenes are thought to be byproducts of evolution, representing “dead” genes that were once functional but are now vestigial, having been inactivated by mutations in critical coding or regulatory sequences. In contrast to nonprocessed pseudogenes, processed pseudogenes are pseudogenes that have been formed, not by mutation, but by a process called retrotransposition, which involves transcription, generation of a DNA copy of the mRNA (a so-called cDNA) by reverse transcription, and finally integration of such DNA copies back into the genome at a location usually quite distant from the original gene. Because such pseudogenes are created by retrotransposition of a DNA copy of processed mRNA, they lack introns and are not necessarily or usually on the same chromosome (or chromosomal region) as their progenitor gene. In many gene families, there are as many or even more pseudogenes as there are functional gene members.
Noncoding RNA Genes
As just discussed, many genes are protein coding and are transcribed into mRNAs that are ultimately translated into their respective proteins; their products comprise the enzymes, structural proteins, receptors, and regulatory proteins that are found in various human tissues and cell types. However, as introduced briefly in Chapter 2, there are additional genes whose functional product appears to be the RNA itself (see Fig. 3-1). These so-called noncoding RNAs (ncRNAs) have a range of functions in the cell, although many do not as yet have any identified function. By current estimates, there are some 20,000 to 25,000 ncRNA genes in addition to the approximately 20,000 protein-coding genes that we introduced earlier. Thus the collection of ncRNAs represents approximately half of all identified human genes. Chromosome 11, for example, in addition to its 1300 protein-coding genes, has an estimated 1000 ncRNA genes.
Some of the types of ncRNA play largely generic roles in cellular infrastructure, including the tRNAs and rRNAs involved in translation of mRNAs on ribosomes, other RNAs involved in control of RNA splicing, and small nucleolar RNAs (snoRNAs) involved in modifying rRNAs. Additional ncRNAs can be quite long (thus sometimes called long ncRNAs, or lncRNAs) and play roles in gene regulation, gene silencing, and human disease, as we explore in more detail later in this chapter.
A particular class of small RNAs of growing importance are the microRNAs (miRNAs), ncRNAs of only approximately 22 bases in length that suppress translation of target genes by binding to their respective mRNAs and regulating protein production from the target transcript(s). Well over 1000 miRNA genes have been identified in the human genome; some are evolutionarily conserved, whereas others appear to be of quite recent origin during evolution. Some miRNAs have been shown to down-regulate hundreds of mRNAs each, with different combinations of target RNAs in different tissues; combined, the miRNAs are thus predicted to control the activity of as many as 30% of all protein-coding genes in the genome.
Although this is a fast-moving area of genome biology, mutations in several ncRNA genes have already been implicated in human diseases, including cancer, developmental disorders, and various diseases of both early and adult onset (see Box).
Noncoding RNAs and Disease
The importance of various types of ncRNAs for medicine is underscored by their roles in a range of human diseases, from early developmental syndromes to adult-onset disorders.
• Deletion of a cluster of miRNA genes on chromosome 13 leads to a form of Feingold syndrome, a developmental syndrome of skeletal and growth defects, including microcephaly, short stature, and digital anomalies.
• Mutations in the miRNA gene MIR96, in the region of the gene critical for the specificity of recognition of its target mRNA(s), can result in progressive hearing loss in adults.
• Aberrant levels of certain classes of miRNAs have been reported in a wide variety of cancers, central nervous system disorders, and cardiovascular disease (see Chapter 15).
• Deletion of clusters of snoRNA genes on chromosome 15 results in Prader-Willi syndrome, a disorder characterized by obesity, hypogonadism, and cognitive impairment (see Chapter 6).
• Abnormal expression of a specific lncRNA on chromosome 12 has been reported in patients with a pregnancy-associated disease called HELLP syndrome.
• Deletion, abnormal expression, and/or structural abnormalities in different lncRNAs with roles in long-range regulation of gene expression and genome function underlie a variety of disorders involving telomere length maintenance, monoallelic expression of genes in specific regions of the genome, and X chromosome dosage (see Chapter 6).
Fundamentals of Gene Expression
For genes that encode proteins, the flow of information from gene to polypeptide involves several steps (Fig. 3-5). Initiation of transcription of a gene is under the influence of promoters and other regulatory elements, as well as specific proteins known as transcription factors, which interact with specific sequences within these regions and determine the spatial and temporal pattern of expression of a gene. Transcription of a gene is initiated at the transcriptional “start” site on chromosomal DNA at the beginning of a 5′ transcribed but untranslated region (called the 5′ UTR), just upstream from the coding sequences, and continues along the chromosome for anywhere from several hundred base pairs to more than a million base pairs, through both introns and exons and past the end of the coding sequences. After modification at both the 5′ and 3′ ends of the primary RNA transcript, the portions corresponding to introns are removed, and the segments corresponding to exons are spliced together, a process called RNA splicing. After splicing, the resulting mRNA (containing a central segment that is now colinear with the coding portions of the gene) is transported from the nucleus to the cytoplasm, where the mRNA is finally translated into the amino acid sequence of the encoded polypeptide. Each of the steps in this complex pathway is subject to error, and mutations that interfere with the individual steps have been implicated in a number of inherited disorders (see Chapters 11 and 12).
FIGURE 3-5 Flow of information from DNA to RNA to protein for a hypothetical gene with three exons and two introns. Within the exons, purple indicates the coding sequences. Steps include transcription, RNA processing and splicing, RNA transport from the nucleus to the cytoplasm, and translation.
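The splicing step can be sketched for a hypothetical gene with three exons and two introns, like the one in Figure 3-5. The sequences below are invented for illustration (exons in upper case, introns in lower case), and the helper name `splice` is an assumption of this sketch. Capping and polyadenylation are left out to keep the focus on intron removal.

```python
def splice(primary_transcript, exons):
    """Join exon segments in order, discarding the introns between them.

    `exons` is a list of (start, end) pairs, 0-based and end-exclusive,
    listed in 5' to 3' order along the primary transcript.
    """
    return "".join(primary_transcript[start:end] for start, end in exons)


# Hypothetical primary transcript, written as alternating exons and introns.
segments = [
    ("exon", "AUGGCC"),
    ("intron", "guaaguuucag"),
    ("exon", "UUCGAA"),
    ("intron", "guccuuacag"),
    ("exon", "GGCUAA"),
]
primary_transcript = "".join(seq for _, seq in segments)

# Record where each exon lies within the primary transcript.
exons = []
position = 0
for kind, seq in segments:
    if kind == "exon":
        exons.append((position, position + len(seq)))
    position += len(seq)

mature = splice(primary_transcript, exons)
print(exons)    # exon coordinates in the primary transcript
print(mature)   # exon sequences joined, introns absent
```

The spliced product is colinear with the exons only: every intronic base is transcribed but none survives into the mature RNA.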
Transcription of protein-coding genes by RNA polymerase II (one of several classes of RNA polymerases) is initiated at the transcriptional start site, the point in the 5′ UTR that corresponds to the 5′ end of the final RNA product (see Figs. 3-4 and 3-5). Synthesis of the primary RNA transcript proceeds in a 5′ to 3′ direction, whereas the strand of the gene that is transcribed and that serves as the template for RNA synthesis is actually read in a 3′ to 5′ direction with respect to the direction of the deoxyribose phosphodiester backbone (see Fig. 2-3). Because the RNA synthesized corresponds both in polarity and in base sequence (substituting U for T) to the 5′ to 3′ strand of DNA, this 5′ to 3′ strand of nontranscribed DNA is sometimes called the coding, or sense, DNA strand. The 3′ to 5′ strand of DNA that is used as a template for transcription is then referred to as the noncoding, or antisense, strand. Transcription continues through both intronic and exonic portions of the gene, beyond the position on the chromosome that eventually corresponds to the 3′ end of the mature mRNA. Whether transcription ends at a predetermined 3′ termination point is unknown.
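The relationship between the two DNA strands and the RNA product can be checked with a short sketch. The 18-base sequence is hypothetical, and the function names are illustrative. The template strand is written 3′ to 5′ so that it lines up base for base under the coding strand.

```python
# Base pairing within the DNA double helix.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# RNA base incorporated opposite each base of the DNA template.
RNA_PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def template_strand(coding_5to3):
    """Return the template (antisense) strand, written 3' to 5'."""
    return "".join(DNA_COMPLEMENT[base] for base in coding_5to3)


def transcribe(template_3to5):
    """Build the RNA 5' to 3' by pairing with a template read 3' to 5'."""
    return "".join(RNA_PAIRING[base] for base in template_3to5)


coding = "ATGGCCTTCGAAGGCTAA"        # hypothetical coding (sense) strand, 5' to 3'
template = template_strand(coding)   # antisense strand, 3' to 5'
rna = transcribe(template)           # transcript, 5' to 3'

print("coding   5'", coding, "3'")
print("template 3'", template, "5'")
print("RNA      5'", rna, "3'")
```

Although only the template strand is read by the polymerase, the transcript matches the coding strand with U in place of T, which is why the coding strand is the one usually reported.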
The primary RNA transcript is processed by addition of a chemical “cap” structure to the 5′ end of the RNA and cleavage of the 3′ end at a specific point downstream from the end of the coding information. This cleavage is followed by addition of a polyA tail to the 3′ end of the RNA; the polyA tail appears to increase the stability of the resulting polyadenylated RNA. The location of the polyadenylation point is specified in part by the sequence AAUAAA (or a variant of this), usually found in the 3′ untranslated portion of the RNA transcript. All of these post-transcriptional modifications take place in the nucleus, as does the process of RNA splicing. The fully processed RNA, now called mRNA, is then transported to the cytoplasm, where translation takes place (see Fig. 3-5).
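The 3′ end processing described above can be sketched as a search for the AAUAAA signal followed by cleavage and tail addition. The transcript is invented, and the `offset` and `tail_length` parameters are placeholders chosen for illustration, since the text does not specify the distance from the signal to the cut or the length of the tail. The 5′ cap is a chemical modification and is not modeled.

```python
POLYA_SIGNAL = "AAUAAA"


def cleave_and_polyadenylate(transcript, offset=15, tail_length=20):
    """Cleave downstream of the first AAUAAA and append a polyA tail.

    `offset` is the number of bases kept after the signal before the cut;
    `tail_length` is the number of adenosines added. Both are illustrative
    placeholders. Returns None if the transcript has no AAUAAA signal.
    """
    signal_start = transcript.find(POLYA_SIGNAL)
    if signal_start == -1:
        return None
    cut = min(signal_start + len(POLYA_SIGNAL) + offset, len(transcript))
    return transcript[:cut] + "A" * tail_length


# Hypothetical 3' end of a primary transcript: end of the coding region,
# untranslated sequence containing the signal, then sequence past the cut.
transcript = "GGCUAA" + "CUGCC" + "AAUAAA" + "GCAUUUAUUCGCAUGGCUUUGACC"
processed = cleave_and_polyadenylate(transcript)

print(processed)
```

Sequence transcribed beyond the cleavage point is discarded, which fits the observation that transcription runs past the position corresponding to the 3′ end of the mature mRNA.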
Translation and the Genetic Code
In the cytoplasm, mRNA is translated into protein by the action of a variety of short RNA adaptor molecules, the tRNAs, each specific for a particular amino acid. These remarkable molecules, each only 70 to 100 nucleotides long, have the job of bringing the correct amino acids into position along the mRNA template, to be added to the growing polypeptide chain. Protein synthesis occurs on ribosomes, macromolecular complexes made up of rRNA (encoded by the 18S and 28S rRNA genes), and several dozen ribosomal proteins (see Fig. 3-5).
The key to translation is a code that relates specific amino acids to combinations of three adjacent bases along the mRNA. Each set of three bases constitutes a codon, specific for a particular amino acid (Table 3-1). In theory, almost infinite variations are possible in the arrangement of the bases along a polynucleotide chain. At any one position, there are four possibilities (A, T, C, or G); thus, for three bases, there are 4³, or 64, possible triplet combinations. These 64 codons constitute the genetic code.
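The count of 64 follows from choosing one of four bases at each of three positions, and the triplets can be enumerated directly. This sketch writes the bases as they appear in mRNA, with U in place of T.

```python
from itertools import product

BASES = "UCAG"  # the four bases of mRNA

# Every ordered choice of three bases, one per codon position.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(codons))   # 4 * 4 * 4 triplets
print(codons[:4])    # the first few, varying the third position fastest
```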
TABLE 3-1 The Genetic Code
Stop, Termination codon.
Codons are shown in terms of mRNA, which are complementary to the corresponding DNA codons.
Because there are only 20 amino acids and 64 possible codons, most amino acids are specified by more than one codon; hence the code is said to be degenerate. For instance, the base in the third position of the triplet can often be either purine (A or G) or either pyrimidine (T or C) or, in some cases, any one of the four bases, without altering the coded message (see Table 3-1). Leucine and arginine are each specified by six codons. Only methionine and tryptophan are each specified by a single, unique codon. Three of the codons are called stop (or nonsense) codons because they designate termination of translation of the mRNA at that point.
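The degeneracy of the code can be tallied directly from the standard codon assignments. In this sketch the standard genetic code is written as a 64-letter string of one-letter amino acid symbols, one per codon in UCAG order, with `*` marking stop codons. That compact encoding and the variable names are conveniences of the sketch, not notation from the text.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code: one symbol per codon, codons ordered UUU, UUC, UUA, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(triplet): residue
    for triplet, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons assigned to each amino acid (and to "stop").
codons_per_residue = Counter(CODON_TABLE.values())

single_codon = sorted(r for r, n in codons_per_residue.items() if n == 1)
six_codons = sorted(r for r, n in codons_per_residue.items() if n == 6)

print("stop codons:", codons_per_residue["*"])
print("amino acids with a single codon:", single_codon)   # M = Met, W = Trp
print("amino acids with six codons:", six_codons)         # L = Leu, R = Arg, S = Ser
```

The tally agrees with the text for methionine, tryptophan, leucine, and arginine, and shows that serine is likewise specified by six codons.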
Translation of a processed mRNA is always initiated at a codon specifying methionine. Methionine is therefore the first encoded (amino-terminal) amino acid of each polypeptide chain, although it is usually removed before protein synthesis is completed. The codon for methionine (the initiator codon, AUG) establishes the reading frame of the mRNA; each subsequent codon is read in turn to predict the amino acid sequence of the protein.
The molecular links between codons and amino acids are the specific tRNA molecules. A particular site on each tRNA forms a three-base anticodon that is complementary to a specific codon on the mRNA. Bonding between the codon and anticodon brings the appropriate amino acid into the next position on the ribosome for attachment, by formation of a peptide bond, to the carboxyl end of the growing polypeptide chain. The ribosome then slides along the mRNA exactly three bases, bringing the next codon into line for recognition by another tRNA with the next amino acid. Thus proteins are synthesized from the amino terminus to the carboxyl terminus, which corresponds to translation of the mRNA in a 5′ to 3′ direction.
As mentioned earlier, translation ends when a stop codon (UGA, UAA, or UAG) is encountered in the same reading frame as the initiator codon. (Stop codons in either of the other unused reading frames are not read, and therefore have no effect on translation.) The completed polypeptide is then released from the ribosome, which becomes available to begin synthesis of another protein.
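The role of the reading frame can be illustrated with a minimal translation routine. As a simplifying assumption, it starts at the first AUG it finds and reads successive triplets until an in-frame stop codon. The mRNA is hypothetical and was built so that a UGA triplet appears out of frame inside the coding region.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG codon order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): residue
    for triplet, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon.

    Returns one-letter amino acid symbols, amino terminus first.
    Returns an empty string if the mRNA contains no AUG.
    """
    start = mrna.find("AUG")
    if start == -1:
        return ""
    residues = []
    # Step through the message three bases at a time, in the frame set by AUG.
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        residues.append(residue)
    return "".join(residues)


# Hypothetical mRNA: 5' UTR, coding region (AUG GCU GAA GGC UAA), 3' UTR.
mrna = "GCACC" + "AUGGCUGAAGGCUAA" + "CUGCC"

print(translate(mrna))   # Met-Ala-Glu-Gly in one-letter symbols
```

The UGA formed across the GCU and GAA codons lies in a different frame from the initiator AUG, so it is never read as a stop. Translation ends only at the in-frame UAA.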
Increasing Functional Diversity of Proteins
Many proteins undergo extensive post-translational packaging and processing as they adopt their final functional state (see Chapter 12). The polypeptide chain that is the primary translation product folds on itself and forms intramolecular bonds to create a specific three-dimensional structure that is determined by the amino acid sequence itself. Two or more polypeptide chains, products of the same gene or of different genes, may combine to form a single multiprotein complex. For example, two α-globin chains and two β-globin chains associate noncovalently to form a tetrameric hemoglobin molecule (see Chapter 11). The protein products may also be modified chemically by, for example, addition of methyl groups, phosphates, or carbohydrates at specific sites. These modifications can have significant influence on the function or abundance of the modified protein. Other modifications may involve cleavage of the protein, either to remove specific amino-terminal sequences after they have functioned to direct a protein to its correct location within the cell (e.g., proteins that function within mitochondria) or to split the molecule into smaller polypeptide chains. For example, the two chains that make up mature insulin, one 21 and the other 30 amino acids long, are originally part of an 82–amino acid primary translation product called proinsulin.
Transcription of the Mitochondrial Genome
The previous sections described fundamentals of gene expression for genes contained in the nuclear genome. The mitochondrial genome has its own transcription and protein-synthesis system. A specialized RNA polymerase, encoded in the nuclear genome, is used to transcribe the 16-kb mitochondrial genome, which contains two related promoter sequences, one for each strand of the circular genome. Each strand is transcribed in its entirety, and the mitochondrial transcripts are then processed to generate the various individual mitochondrial mRNAs, tRNAs, and rRNAs.
Gene Expression in Action
The flow of information outlined in the preceding sections can best be appreciated by reference to a particular well-studied gene, the β-globin gene. The β-globin chain is a 146–amino acid polypeptide, encoded by a gene that occupies approximately 1.6 kb on the short arm of chromosome 11. The gene has three exons and two introns (see Fig. 3-4). The β-globin gene, as well as the other genes in the β-globin cluster (see Fig. 3-2), is transcribed in a centromere-to-telomere direction. The orientation, however, is different for different genes in the genome and depends on which strand of the chromosomal double helix is the coding strand for a particular gene.
DNA sequences required for accurate initiation of transcription of the β-globin gene are located in the promoter within approximately 200 bp upstream from the transcription start site. The double-stranded DNA sequence of this region of the β-globin gene, the corresponding RNA sequence, and the translated sequence of the first 10 amino acids are depicted in Figure 3-6 to illustrate the relationships among these three information levels. As mentioned previously, it is the 3′ to 5′ strand of the DNA that serves as the template and is actually transcribed, but it is the 5′ to 3′ strand of DNA that directly corresponds to the 5′ to 3′ sequence of the mRNA (and, in fact, is identical to it except that U is substituted for T). Because of this correspondence, the 5′ to 3′ DNA strand of a gene (i.e., the strand that is not transcribed) is the strand generally reported in the scientific literature or in databases.
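The relationship among the two DNA strands and the mRNA can be made concrete with a short sketch. The 12-base sequence below is a toy example chosen for illustration, and the function names are hypothetical. The code simply applies base-pairing rules, and the final check confirms the point made above: the mRNA matches the nontranscribed 5′ to 3′ strand except that U replaces T.

```python
# Base-pairing rules for DNA:DNA and for DNA template:RNA product.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def template_strand(coding_5to3):
    """Return the template strand written 3'->5', i.e. the base-by-base
    complement aligned under the 5'->3' strand."""
    return "".join(DNA_PAIR[base] for base in coding_5to3)


def transcribe(template_3to5):
    """Build the mRNA 5'->3' by pairing a ribonucleotide with each
    template base, reading the template 3'->5'."""
    return "".join(RNA_PAIR[base] for base in template_3to5)


coding = "ATGGTGCACCTG"              # toy 5'->3' strand (the one reported in databases)
template = template_strand(coding)   # 3'->5' strand that is actually transcribed
mrna = transcribe(template)

print("5'", coding, "3'")
print("3'", template, "5'")
print("5'", mrna, "3'  (mRNA)")
```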
FIGURE 3-6 Structure and nucleotide sequence of the 5′ end of the human β-globin gene on the short arm of chromosome 11. Transcription of the 3′ to 5′ (lower) strand begins at the indicated start site to produce β-globin messenger RNA (mRNA). The translational reading frame is determined by the AUG initiator codon; subsequent codons specifying amino acids are indicated in blue. The other two potential frames are not used.
In accordance with this convention, the complete sequence of approximately 2.0 kb of chromosome 11 that includes the β-globin gene is shown in Figure 3-7. (It is sobering to reflect that a printout of the entire human genome at this scale would require over 300 books the size of this textbook!) Within these 2.0 kb are contained most, but not all, of the sequence elements required to encode and regulate the expression of this gene. Indicated in Figure 3-7 are many of the important structural features of the β-globin gene, including conserved promoter sequence elements, intron and exon boundaries, 5′ and 3′ UTRs, RNA splice sites, the initiator and termination codons, and the polyadenylation signal, all of which are known to be mutated in various inherited defects of the β-globin gene (see Chapter 11).
FIGURE 3-7 Nucleotide sequence of the complete human β-globin gene. The sequence of the 5′ to 3′ strand of the gene is shown. Tan areas with capital letters represent exonic sequences corresponding to mature mRNA. Lowercase letters indicate introns and flanking sequences. The CAT and TATA box sequences in the 5′ flanking region are indicated in brown. The GT and AG dinucleotides important for RNA splicing at the intron-exon junctions and the AATAAA signal important for addition of a polyA tail also are highlighted. The ATG initiator codon (AUG in mRNA) and the TAA stop codon (UAA in mRNA) are shown in red letters. The amino acid sequence of β-globin is shown above the coding sequence; the three-letter abbreviations in Table 3-1 are used here. See Sources & Acknowledgments.
Initiation of Transcription
The β-globin promoter, like many other gene promoters, consists of a series of relatively short functional elements that interact with specific regulatory proteins (generically called transcription factors) that control transcription, including, in the case of the globin genes, those proteins that restrict expression of these genes to erythroid cells, the cells in which hemoglobin is produced. There are well over a thousand sequence-specific, DNA-binding transcription factors in the genome, some of which are ubiquitous in their expression, whereas others are cell type– or tissue-specific.
One important promoter sequence found in many, but not all, genes is the TATA box, a conserved region rich in adenines and thymines that is approximately 25 to 30 bp upstream of the start site of transcription (see Figs. 3-4 and 3-7). The TATA box appears to be important for determining the position of the start of transcription, which in the β-globin gene is approximately 50 bp upstream from the translation initiation site (see Fig. 3-6). Thus in this gene, there are approximately 50 bp of sequence at the 5′ end that are transcribed but are not translated; in other genes, the 5′ UTR can be much longer and can even be interrupted by one or more introns. A second conserved region, the so-called CAT box (actually CCAAT), is a few dozen base pairs farther upstream (see Fig. 3-7). Both experimentally induced and naturally occurring mutations in either of these sequence elements, as well as in other regulatory sequences even farther upstream, lead to a sharp reduction in the level of transcription, thereby demonstrating the importance of these elements for normal gene expression. Many mutations in these regulatory elements have been identified in patients with the hemoglobin disorder β-thalassemia (see Chapter 11).
Not all gene promoters contain the two specific elements just described. In particular, genes that are constitutively expressed in most or all tissues (so-called housekeeping genes) often lack the CAT and TATA boxes, which are more typical of tissue-specific genes. Promoters of many housekeeping genes contain a high proportion of cytosines and guanines in relation to the surrounding DNA (see the promoter of the BRCA1 breast cancer gene in Fig. 3-4). Such CG-rich promoters are often located in regions of the genome called CpG islands, so named because of the unusually high concentration of the dinucleotide 5′-CpG-3′ (the p representing the phosphate group between adjacent bases; see Fig. 2-3) that stands out from the more general AT-rich genomic landscape. Some of the CG-rich sequence elements found in these promoters are thought to serve as binding sites for specific transcription factors. CpG islands are also important because they are targets for DNA methylation. Extensive DNA methylation at CpG islands is usually associated with repression of gene transcription, as we will discuss further later in the context of chromatin and its role in the control of gene expression.
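One way to see what "unusually high concentration" of CpG means is to compare the number of CpG dinucleotides observed in a sequence with the number expected from its C and G content alone. The sketch below computes GC content and this observed/expected ratio, (number of CpG × length) / (number of C × number of G). The two sequences are invented toy examples, far shorter than real islands, and the function names are hypothetical. The cutoffs often quoted for calling a CpG island (for example, GC content above 50% and a ratio above 0.6 over at least 200 bp) are a common heuristic that does not come from this chapter.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G).
    Returns 0.0 when the sequence has no C or no G."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (n_c * n_g)


# Toy sequences (assumptions for illustration only).
cg_rich = "CGCGGCGCCGCGTACGCG"
at_rich = "ATTAGCATTTACAGTTAA"

for name, seq in [("CG-rich", cg_rich), ("AT-rich", at_rich)]:
    print(name, round(gc_content(seq), 2), round(cpg_obs_exp(seq), 2))
```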
Transcription by RNA polymerase II (RNA pol II) is subject to regulation at multiple levels, including binding to the promoter, initiation of transcription, unwinding of the DNA double helix to expose the template strand, and elongation as RNA pol II moves along the DNA. Although some silenced genes are devoid of RNA pol II binding altogether, consistent with their inability to be transcribed in a given cell type, others have RNA pol II poised bidirectionally at the transcriptional start site, perhaps as a means of fine-tuning transcription in response to particular cellular signals.
In addition to the sequences that constitute a promoter itself, there are other sequence elements that can markedly alter the efficiency of transcription. The best characterized of these “activating” sequences are called enhancers. Enhancers are sequence elements that can act at a distance from a gene (often several or even hundreds of kilobases away) to stimulate transcription. Unlike promoters, enhancers are both position and orientation independent and can be located either 5′ or 3′ of the transcription start site. Specific enhancer elements function only in certain cell types and thus appear to be involved in establishing the tissue specificity or level of expression of many genes, in concert with one or more transcription factors. In the case of the β-globin gene, several tissue-specific enhancers are present both within the gene itself and in its flanking regions. The interaction of enhancers with specific regulatory proteins leads to increased levels of transcription.
Normal expression of the β-globin gene during development also requires a more distant element called the locus control region (LCR), located upstream of the ε-globin gene (see Fig. 3-2), which is required for establishing the proper chromatin context needed for appropriate high-level expression. As expected, mutations that disrupt or delete either enhancer or LCR sequences interfere with or prevent β-globin gene expression (see Chapter 11).
The primary RNA transcript of the β-globin gene contains two introns, approximately 100 and 850 bp in length, that need to be removed and the remaining RNA segments joined together to form the mature mRNA. The process of RNA splicing, described generally earlier, is typically an exact and highly efficient one; 95% of β-globin transcripts are thought to be accurately spliced to yield functional globin mRNA. The splicing reactions are guided by specific sequences in the primary RNA transcript at both the 5′ and the 3′ ends of introns. The 5′ sequence consists of nine nucleotides, of which two (the dinucleotide GT [GU in the RNA transcript] located in the intron immediately adjacent to the splice site) are virtually invariant among splice sites in different genes (see Fig. 3-7). The 3′ sequence consists of approximately a dozen nucleotides, of which, again, two—the AG located immediately 5′ to the intron-exon boundary—are obligatory for normal splicing. The splice sites themselves are unrelated to the reading frame of the particular mRNA. In some instances, as in the case of intron 1 of the β-globin gene, the intron actually splits a specific codon (see Fig. 3-7).
The medical significance of RNA splicing is illustrated by the fact that mutations within the conserved sequences at the intron-exon boundaries commonly impair RNA splicing, with a concomitant reduction in the amount of normal, mature β-globin mRNA; mutations in the GT or AG dinucleotides mentioned earlier invariably eliminate normal splicing of the intron containing the mutation. Representative splice site mutations identified in patients with β-thalassemia are discussed in detail in Chapter 11.
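The role of the GT and AG dinucleotides can be illustrated with a deliberately simplified sketch that borrows the convention of Figure 3-7: exons in capital letters, introns in lowercase. The gene below is a short invented sequence, not the β-globin sequence, and the function `splice` is hypothetical. It removes each intron only if the intron begins with gt and ends with ag. Raising an error for a mutated site is a stand-in for "normal splicing is eliminated"; in cells, the outcome is typically aberrant splicing rather than a clean failure.

```python
import re


def splice(gene):
    """Join exons (uppercase) after removing introns (lowercase).
    Each intron must start with 'gt' and end with 'ag'."""
    for intron in re.findall(r"[a-z]+", gene):
        if not (intron.startswith("gt") and intron.endswith("ag")):
            raise ValueError(f"intron lacks GT...AG boundaries: {intron}")
    return re.sub(r"[a-z]+", "", gene)


# Toy gene: the intron interrupts the codon AGG (AG | G), so the splice
# site is unrelated to the reading frame.
toy_gene = "ATGGTGCACAG" + "gtaagtctttcag" + "GCTGTAA"
mature = splice(toy_gene)
print(mature)  # exons joined; the reading frame is restored across the junction

# A single-base change in the invariant GT abolishes recognition in this model.
mutant_gene = toy_gene.replace("gtaag", "ataag", 1)
try:
    splice(mutant_gene)
    mutant_spliced = True
except ValueError:
    mutant_spliced = False
print("mutant spliced normally:", mutant_spliced)
```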
As just discussed, when introns are removed from the primary RNA transcript by RNA splicing, the remaining exons are spliced together to generate the final, mature mRNA. However, for most genes, the primary transcript can follow multiple alternative splicing pathways, leading to the synthesis of multiple related but different mRNAs, each of which can be subsequently translated to generate different protein products (see Fig. 3-1). Some of these alternative events are highly tissue- or cell type–specific, and, to the extent that such events are determined by primary sequence, they are subject to allelic variation between different individuals. Nearly all human genes undergo alternative splicing to some degree, and it has been estimated that there are an average of two or three alternative transcripts per gene in the human genome, thus greatly expanding the information content of the human genome beyond the approximately 20,000 protein-coding genes. The regulation of alternative splicing appears to play a particularly impressive role during neuronal development, where it may contribute to generating the high levels of functional diversity needed in the nervous system. Consistent with this, susceptibility to a number of neuropsychiatric conditions has been associated with shifts or disruption of alternative splicing patterns.
The mature β-globin mRNA contains approximately 130 bp of 3′ untranslated material (the 3′ UTR) between the stop codon and the location of the polyA tail (see Fig. 3-7). As in other genes, cleavage of the 3′ end of the mRNA and addition of the polyA tail is controlled, at least in part, by an AAUAAA sequence approximately 20 bp before the polyadenylation site. Mutations in this polyadenylation signal in patients with β-thalassemia document the importance of this signal for proper 3′ cleavage and polyadenylation (see Chapter 11). The 3′ UTR of some genes can be up to several kb in length. Other genes have a number of alternative polyadenylation sites, selection among which may influence the stability of the resulting mRNA and thus the steady-state level of each mRNA.
RNA Editing and RNA-DNA Sequence Differences
Recent findings suggest that the conceptual principle underlying the central dogma—that RNA and protein sequences reflect the underlying genomic sequence—may not always hold true. RNA editing to change the nucleotide sequence of the mRNA has been demonstrated in a number of organisms, including humans. This process involves deamination of adenosine at particular sites, converting an A in the DNA sequence to an inosine in the resulting RNA; this is then read by the translational machinery as a G, leading to changes in gene expression and protein function, especially in the nervous system. More widespread RNA-DNA differences involving other bases (with corresponding changes in the encoded amino acid sequence) have also been reported, at levels that vary among individuals. Although the mechanism(s) and clinical relevance of these events remain controversial, they illustrate the existence of a range of processes capable of increasing transcript and proteome diversity.
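The effect of adenosine-to-inosine editing on the decoded message can be sketched as follows. This is an illustration only: the functions and the single-codon example are hypothetical, and the letter I is used here simply as a marker for inosine. Editing the middle base of a CAG codon (glutamine) yields CIG, which is read as CGG (arginine).

```python
def edit_a_to_i(rna, position):
    """Deaminate the adenosine at a given 0-based position, marking it as inosine (I)."""
    if rna[position] != "A":
        raise ValueError("only adenosine can be edited to inosine")
    return rna[:position] + "I" + rna[position + 1:]


def as_read_in_translation(rna):
    """Inosine is read by the translational machinery as guanosine."""
    return rna.replace("I", "G")


codon = "CAG"                          # glutamine in the standard genetic code
edited = edit_a_to_i(codon, 1)         # "CIG"
decoded = as_read_in_translation(edited)
print(codon, "->", edited, "read as", decoded)  # CGG encodes arginine
```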
Epigenetic and Epigenomic Aspects of Gene Expression
Given the range of functions and fates that different cells in any organism must adopt over its lifetime, it is apparent that not all genes in the genome can be actively expressed in every cell at all times. As important as completion of the Human Genome Project has been for contributing to our understanding of human biology and disease, identifying the genomic sequences and features that direct developmental, spatial, and temporal aspects of gene expression remains a formidable challenge. Several decades of work in molecular biology have defined critical regulatory elements for many individual genes, as we saw in the previous section, and more recent attention has been directed toward performing such studies on a genome-wide scale.
In Chapter 2, we introduced general aspects of chromatin that package the genome and its genes in all cells. Here, we explore the specific characteristics of chromatin that are associated with active or repressed genes as a step toward identifying the regulatory code for expression of the human genome. Such studies focus on reversible changes in the chromatin landscape as determinants of gene function rather than on changes to the genome sequence itself and are thus called epigenetic or, when considered in the context of the entire genome, epigenomic (Greek epi-, over or upon).
The rapidly growing field of epigenetics is the study of heritable changes in cellular function or gene expression that can be transmitted from cell to cell (and even generation to generation) as a result of chromatin-based molecular signals (Fig. 3-8). Complex epigenetic states can be established, maintained, and transmitted by a variety of mechanisms: modifications to the DNA, such as DNA methylation; numerous histone modifications that alter chromatin packaging or access; and substitution of specialized histone variants that mark chromatin associated with particular sequences or regions in the genome. These chromatin changes can be highly dynamic and transient, capable of responding rapidly and sensitively to changing needs in the cell, or they can be long lasting, capable of being transmitted through multiple cell divisions or even to subsequent generations. In either instance, the key concept is that epigenetic mechanisms do not alter the underlying DNA sequence, and this distinguishes them from genetic mechanisms, which are sequence based. Together, the epigenetic marks and the DNA sequence make up the set of signals that guide the genome to express its genes at the right time, in the right place, and in the right amounts.
FIGURE 3-8 Schematic representation of chromatin and three major epigenetic mechanisms: DNA methylation at CpG dinucleotides, associated with gene repression; various modifications (indicated by different colors) on histone tails, associated with either gene expression or repression; and various histone variants that mark specific regions of the genome, associated with specific functions required for chromosome stability or genome integrity. Not to scale.
Increasing evidence points to a role for epigenetic changes in human disease in response to environmental or lifestyle influences. The dynamic and reversible nature of epigenetic changes permits a level of adaptability or plasticity that greatly exceeds the capacity of DNA sequence alone and thus is relevant both to the origins and potential treatment of disease. A number of large-scale epigenomics projects (akin to the original Human Genome Project) have been initiated to catalogue DNA methylation sites genome-wide (the so-called methylome), to evaluate CpG landscapes across the genome, to discover new histone variants and modification patterns in various tissues, and to document positioning of nucleosomes around the genome in different cell types, and in samples from both asymptomatic individuals and those with cancer or other diseases. These analyses are part of a broad effort (called the ENCODE Project, for Encyclopedia of DNA Elements) to explore epigenetic patterns in chromatin genome-wide in order to better understand control of gene expression in different tissues or disease states.
DNA methylation involves the modification of cytosine bases by methylation of the carbon at the fifth position in the pyrimidine ring (Fig. 3-9). Extensive DNA methylation is a mark of repressed genes and is a widespread mechanism associated with the establishment of specific programs of gene expression during cell differentiation and development. Typically, DNA methylation occurs on the C of CpG dinucleotides (see Fig. 3-8) and inhibits gene expression by recruitment of specific methyl-CpG–binding proteins that, in turn, recruit chromatin-modifying enzymes to silence transcription. The presence of 5-methylcytosine (5-mC) is considered to be a stable epigenetic mark that can be faithfully transmitted through cell division; however, altered methylation states are frequently observed in cancer, with hypomethylation of large genomic segments in some cases and regional hypermethylation (particularly at CpG islands) in others (see Chapter 15).
FIGURE 3-9 The modified DNA bases, 5-methylcytosine and 5-hydroxymethylcytosine. Compare to the structure of cytosine in Figure 2-2. The added methyl and hydroxymethyl groups are boxed in purple. The atoms in the pyrimidine rings are numbered 1 to 6 to indicate the 5-carbon.
Extensive demethylation occurs during germ cell development and in the early stages of embryonic development, consistent with the need to “re-set” the chromatin environment and restore totipotency or pluripotency of the zygote and of various stem cell populations. Although the details are still incompletely understood, these reprogramming steps appear to involve the enzymatic conversion of 5-mC to 5-hydroxymethylcytosine (5-hmC; see Fig. 3-9), as a likely intermediate in the demethylation of DNA. Overall, 5-mC levels are stable across adult tissues (approximately 5% of all cytosines), whereas 5-hmC levels are much lower and much more variable (0.1% to 1% of all cytosines). Interestingly, although 5-hmC is widespread in the genome, its highest levels are found in known regulatory regions, suggesting a possible role in the regulation of specific promoters and enhancers.
A second class of epigenetic signals consists of an extensive inventory of modifications to any of the core histone types, H2A, H2B, H3, and H4 (see Chapter 2). Such modifications include histone methylation, phosphorylation, acetylation, and others at specific amino acid residues, mostly located on the N-terminal “tails” of histones that extend out from the core nucleosome itself (see Fig. 3-8). These epigenetic modifications are believed to influence gene expression by affecting chromatin compaction or accessibility and by signaling protein complexes that—depending on the nature of the signal—activate or silence gene expression at that site. There are dozens of modified sites that can be experimentally queried genome-wide by using antibodies that recognize specifically modified sites—for example, histone H3 methylated at lysine position 9 (H3K9 methylation, using the one-letter abbreviation K for lysine; see Table 3-1) or histone H3 acetylated at lysine position 27 (H3K27 acetylation). The former is a repressive mark associated with silent regions of the genome, whereas the latter is a mark for activating regulatory regions.
Specific patterns of different histone modifications are associated with promoters, enhancers, or the body of genes in different tissues and cell types. The ENCODE Project, introduced earlier, examined 12 of the most common modifications in nearly 50 different cell types and integrated the individual chromatin profiles to assign putative functional attributes to well over half of the human genome. This finding implies that much more of the genome plays a role, directly or indirectly, in determining the varied patterns of gene expression that distinguish cell types than previously inferred from the fact that less than 2% of the genome is “coding” in a traditional sense.
The histone modifications just discussed involve modification of the core histones themselves, which are all encoded by multigene clusters in a few locations in the genome. In contrast, the many dozens of histone variants are products of entirely different genes located elsewhere in the genome, and their amino acid sequences are distinct from, although related to, those of the canonical histones.
Different histone variants are associated with different functions, and they replace—all or in part—the related member of the core histones found in typical nucleosomes to generate specialized chromatin structures (see Fig. 3-8). Some variants mark specific regions or loci in the genome with highly specialized functions; for example, the CENP-A histone is a histone H3-related variant that is found exclusively at functional centromeres in the genome and contributes to essential features of centromeric chromatin that mark the location of kinetochores along the chromosome fiber. Other variants are more transient and mark regions of the genome with particular attributes; for example, H2A.X is a histone H2A variant involved in the response to DNA damage to mark regions of the genome that require DNA repair.
In contrast to the impression one gets from viewing the genome as a linear string of sequence (see Fig. 3-7), the genome adopts a highly ordered and dynamic arrangement within the space of the nucleus, correlated with and likely guided by the epigenetic and epigenomic signals just discussed. This three-dimensional landscape is highly predictive of the map of all expressed sequences in any given cell type (the transcriptome) and reflects dynamic changes in chromatin architecture at different levels (Fig. 3-10). First, large chromosomal domains (up to millions of base pairs in size) can exhibit coordinated patterns of gene expression at the chromosome level, involving dynamic interactions between different intrachromosomal and interchromosomal points of contact within the nucleus. At a finer level, technical advances to map and sequence points of contact around the genome in the context of three-dimensional space have pointed to ordered loops of chromatin that position and orient genes precisely, exposing or blocking critical regulatory regions for access by RNA pol II, transcription factors, and other regulators. Lastly, specific and dynamic patterns of nucleosome positioning differ among cell types and tissues in the face of changing environmental and developmental cues (see Fig. 3-10). The biophysical, epigenomic, and/or genomic properties that facilitate or specify the orderly and dynamic packaging of each chromosome during each cell cycle, without reducing the genome to a disordered tangle within the nucleus, remain a marvel of landscape engineering.
FIGURE 3-10 Three-dimensional architecture and dynamic packaging of the genome, viewed at increasing levels of resolution. A, Within interphase nuclei, each chromosome occupies a particular territory, represented by the different colors. B, Chromatin is organized into large subchromosomal domains within each territory, with loops that bring certain sequences and genes into proximity with each other, with detectable intrachromosomal and interchromosomal interactions. C, Loops bring long-range regulatory elements (e.g., enhancers or locus-control regions) into association with promoters, leading to active transcription and gene expression. D, Positioning of nucleosomes along the chromatin fiber provides access to specific DNA sequences for binding by transcription factors and other regulatory proteins.
Gene Expression as the Integration of Genomic and Epigenomic Signals
The gene expression program of a cell encompasses the specific subset of the approximately 20,000 protein-coding genes in the genome that are actively transcribed and translated into their respective functional products, the subset of the estimated 20,000 to 25,000 ncRNA genes that are transcribed, the amount of products produced, and the particular sequence (alleles) of those products. The gene expression profile of any particular cell or cell type in a given individual at a given time (whether in the context of the cell cycle, early development, or one's entire life span) and under a given set of circumstances (as influenced by environment, lifestyle, or disease) is thus the integrated sum of several different but interrelated effects, including the following:
• The primary sequence of genes, their allelic variants, and their encoded products
• Regulatory sequences and their epigenetic positioning in chromatin
• Interactions with the thousands of transcriptional factors, ncRNAs, and other proteins involved in the control of transcription, splicing, translation, and post-translational modification
• Organization of the genome into subchromosomal domains
• Programmed interactions between different parts of the genome
• Dynamic three-dimensional chromatin packaging in the nucleus
All of these act in concert in an efficient, hierarchical, and highly programmed fashion. Disruption of any one (due to genetic variation, to epigenetic changes, and/or to disease-related processes) would be expected to alter the overall cellular program and its functional output (see Box).
The Epigenetic Landscape of the Genome and Medicine
• Different chromosomes and chromosomal regions occupy characteristic territories within the nucleus. The probability of physical proximity influences the incidence of specific chromosome abnormalities (see Chapters 5 and 6).
• The genome is organized into megabase-sized domains with locally shared characteristics of base pair composition (i.e., GC rich or AT rich), gene density, timing of replication in the S phase, and presence of particular histone modifications (see Chapter 5).
• Modules of coexpressed genes correspond to distinct anatomical or developmental stages in, for example, the human brain or the hematopoietic lineage. Such coexpression networks are revealed by shared regulatory networks and epigenetic signals, by clustering within genomic domains, and by overlapping patterns of altered gene expression in various disease states.
• Although monozygotic twins share virtually identical genomes, they can be quite discordant for certain traits, including susceptibility to common diseases. Significant changes in DNA methylation occur during the lifetime of such twins, implicating epigenetic regulation of gene expression as a source of diversity.
• The epigenetic landscape can integrate genomic and environmental contributions to disease. For example, differential DNA methylation levels correlate with underlying sequence variation at specific loci in the genome and thereby modulate genetic risk for rheumatoid arthritis.
Allelic Imbalance in Gene Expression
It was once assumed that genes present in two copies in the genome would be expressed from both homologues at comparable levels. However, it has become increasingly evident that there can be extensive imbalance between alleles, reflecting both the amount of sequence variation in the genome and the interplay between genome sequence and epigenetic patterns that were just discussed.
In Chapter 2, we introduced the general finding that any individual genome carries two different alleles at a minimum of 3 to 5 million positions around the genome, thus distinguishing by sequence the maternally and paternally inherited copies of that sequence position (see Fig. 2-6). Here, we explore ways in which those sequence differences reveal allelic imbalance in gene expression, both at autosomal loci and at X chromosome loci in females.
By determining the sequences of all the RNA products—the transcriptome—in a population of cells, one can quantify the relative level of transcription of all the genes (both protein-coding and noncoding) that are transcriptionally active in those cells. Consider, for example, the collection of protein-coding genes. Although an average cell might contain approximately 300,000 copies of mRNA in total, the abundance of specific mRNAs can differ over many orders of magnitude; among genes that are active, most are expressed at low levels (estimated to be < 10 copies of that gene's mRNA per cell), whereas others are expressed at much higher levels (several hundred to a few thousand copies of that mRNA per cell). Only in highly specialized cell types are particular genes expressed at very high levels (many tens of thousands of copies) that account for a significant proportion of all mRNA in those cells.
Now consider an expressed gene with a sequence variant that allows one to distinguish between the RNA products (whether mRNA or ncRNA) transcribed from each of two alleles, one allele with a T that is transcribed to yield RNA with an A and the other allele with a C that is transcribed to yield RNA with a G (Fig. 3-11). By sequencing individual RNA molecules and comparing the number of sequences generated that contain an A or G at that position, one can infer the ratio of transcripts from the two alleles in that sample. Although most genes show essentially equivalent levels of biallelic expression, recent analyses of this type have demonstrated widespread unequal allelic expression for 5% to 20% of autosomal genes in the genome (Table 3-2). For most of these genes, the extent of imbalance is twofold or less, although up to tenfold differences have been observed for some genes. This allelic imbalance may reflect interactions between genome sequence and gene regulation; for example, sequence changes can alter the relative binding of various transcription factors or other transcriptional regulators to the two alleles or the extent of DNA methylation observed at the two alleles (see Table 3-2).
FIGURE 3-11 Allelic expression patterns for a gene sequence with a transcribed DNA variant (here, a C or a T) to distinguish the alleles. As described in the text, the relative abundance of RNA transcripts from the two alleles (here, carrying a G or an A) demonstrates whether the gene shows balanced expression (top), allelic imbalance (center), or exclusively monoallelic expression (bottom). Different underlying mechanisms for allelic imbalance are compared in Table 3-2. SNP, Single nucleotide polymorphism.
TABLE 3-2 Allelic Imbalance in Gene Expression
Monoallelic Gene Expression
Some genes, however, show a much more complete form of allelic imbalance, resulting in monoallelic gene expression (see Fig. 3-11). Several different mechanisms have been shown to account for allelic imbalance of this type for particular subsets of genes in the genome: DNA rearrangement, random monoallelic expression, parent-of-origin imprinting, and, for genes on the X chromosome in females, X chromosome inactivation. Their distinguishing characteristics are summarized in Table 3-2.
A highly specialized form of monoallelic gene expression is observed in the genes encoding immunoglobulins and T-cell receptors, expressed in B cells and T cells, respectively, as part of the immune response. Antibodies are encoded in the germline by a relatively small number of genes that, during B-cell development, undergo a unique process of somatic rearrangement that involves the cutting and pasting of DNA sequences in lymphocyte precursor cells (but not in any other cell lineages) to rearrange genes in somatic cells to generate enormous antibody diversity. The highly orchestrated DNA rearrangements occur across many hundreds of kilobases but involve only one of the two alleles, which is chosen randomly in any given B cell (see Table 3-2). Thus expression of mature mRNAs for the immunoglobulin heavy or light chain subunits is exclusively monoallelic.
This mechanism of somatic rearrangement and random monoallelic gene expression is also observed at the T-cell receptor genes in the T-cell lineage. However, such behavior is unique to these gene families and cell lineages; the rest of the genome remains highly stable throughout development and differentiation.
Random Monoallelic Expression
In contrast to this highly specialized form of DNA rearrangement, monoallelic expression typically results from differential epigenetic regulation of the two alleles. One well-studied example of random monoallelic expression involves the OR gene family described earlier (see Fig. 3-2). In this case, only a single allele of one OR gene is expressed in each olfactory sensory neuron; the many hundreds of other copies of the OR family remain repressed in that cell. Other genes with chemosensory or immune system functions also show random monoallelic expression, suggesting that this mechanism may be a general one for increasing the diversity of responses for cells that interact with the outside world. However, this mechanism is apparently not restricted to the immune and sensory systems, because a substantial subset of all human genes (5% to 10% in different cell types) has been shown to undergo random allelic silencing; these genes are broadly distributed on all autosomes, have a wide range of functions, and vary in terms of the cell types and tissues in which monoallelic expression is observed.
For the examples just described, the choice of which allele is expressed is not dependent on parental origin; either the maternal or paternal copy can be expressed in different cells and their clonal descendants. This distinguishes random forms of monoallelic expression from genomic imprinting, in which the choice of the allele to be expressed is nonrandom and is determined solely by parental origin. Imprinting is a normal process involving the introduction of epigenetic marks (see Fig. 3-8) in the germline of one parent, but not the other, at specific locations in the genome. These lead to monoallelic expression of a gene or, in some cases, of multiple genes within the imprinted region.
Imprinting takes place during gametogenesis, before fertilization, and marks certain genes as having come from the mother or father (Fig. 3-12). After conception, the parent-of-origin imprint is maintained in some or all of the somatic tissues of the embryo and silences gene expression on allele(s) within the imprinted region; whereas some imprinted genes show monoallelic expression throughout the embryo, others show tissue-specific imprinting, especially in the placenta, with biallelic expression in other tissues. The imprinted state persists postnatally into adulthood through hundreds of cell divisions so that only the maternal or paternal copy of the gene is expressed. Yet, imprinting must be reversible: a paternally derived allele, when it is inherited by a female, must be converted in her germline so that she can then pass it on with a maternal imprint to her offspring. Likewise, an imprinted maternally derived allele, when it is inherited by a male, must be converted in his germline so that he can pass it on as a paternally imprinted allele to his offspring (see Fig. 3-12). Control over this conversion process appears to be governed by specific DNA elements called imprinting control regions or imprinting centers that are located within imprinted regions throughout the genome; although their precise mechanism of action is not known, many appear to involve ncRNAs that initiate the epigenetic change in chromatin, which then spreads outward along the chromosome over the imprinted region. 
Notably, although the imprinted region can encompass more than a single gene, this form of monoallelic expression is confined to a delimited genomic segment, typically a few hundred kilobase pairs to a few megabases in overall size; this distinguishes genomic imprinting both from the more general form of random monoallelic expression described earlier (which appears to involve individual genes under locus-specific control) and from X chromosome inactivation, described in the next section (which involves genes along the entire chromosome).
FIGURE 3-12 Genomic imprinting and conversion of maternal and paternal imprints during passage through male or female gametogenesis. Within a hypothetical imprinted region on a pair of homologous autosomes, paternally imprinted genes are indicated in blue, whereas a maternally imprinted gene is indicated in red. After fertilization, both male and female embryos have one copy of the chromosome carrying a paternal imprint and one copy carrying a maternal imprint. During oogenesis (top) and spermatogenesis (bottom), the imprints are erased by removal of epigenetic marks, and new imprints determined by the sex of the parent are established within the imprinted region. Gametes thus carry a monoallelic imprint appropriate to the parent of origin, whereas somatic cells in both sexes carry one chromosome of each imprinted type.
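The erase-and-reset logic of the imprinting cycle can be made concrete with a small toy model. The sketch below is only a bookkeeping illustration under simplifying assumptions: the class and function names are hypothetical, a single imprinted region is tracked, and the gene is assumed (for the example only) to be silenced on the maternally imprinted copy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Allele:
    name: str     # label used to trace one DNA copy across generations
    imprint: str  # "maternal" or "paternal"

@dataclass(frozen=True)
class Individual:
    sex: str        # "female" or "male"
    alleles: tuple  # the two homologous copies in somatic cells

def gametes(parent):
    """Erase inherited imprints and reset every allele by the sex of the parent."""
    new_imprint = "maternal" if parent.sex == "female" else "paternal"
    return [Allele(a.name, new_imprint) for a in parent.alleles]

def expressed(individual, silenced_imprint):
    """Names of alleles expressed when one imprint type is silenced."""
    return [a.name for a in individual.alleles if a.imprint != silenced_imprint]

# A daughter carries one maternally and one paternally imprinted copy.
daughter = Individual("female", (Allele("m1", "maternal"), Allele("p1", "paternal")))
eggs = gametes(daughter)  # both eggs now carry a maternal imprint

# Her son inherits the copy that originally came from her father ("p1").
son = Individual("male", (eggs[1], Allele("s1", "paternal")))
print(expressed(son, silenced_imprint="maternal"))  # only the sperm-derived copy
print([a.imprint for a in gametes(son)])            # reset to paternal in his germline
```

The allele labeled `p1` is paternally imprinted in the daughter's somatic cells, maternally imprinted in her egg, and paternally imprinted again when her son transmits it, which is the reversibility described in the text.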
To date, approximately 100 imprinted genes have been identified on many different autosomes. The involvement of these genes in various chromosomal disorders is described more fully in Chapter 6. For clinical conditions due to a single imprinted gene, such as Prader-Willi syndrome (Case 38) and Beckwith-Wiedemann syndrome (Case 6), the effect of genomic imprinting on inheritance patterns in pedigrees is discussed in Chapter 7.
X Chromosome Inactivation
The chromosomal basis for sex determination, introduced in Chapter 2 and discussed in more detail in Chapter 6, results in a dosage difference between typical males and females with respect to genes on the X chromosome. Here we discuss the chromosomal and molecular mechanisms of X chromosome inactivation, the most extensive example of random monoallelic expression in the genome and a mechanism of dosage compensation that results in the epigenetic silencing of most genes on one of the two X chromosomes in females.
In normal female cells, the choice of which X chromosome is to be inactivated is a random one that is then maintained in each clonal lineage. Thus females are mosaic with respect to X-linked gene expression; some cells express alleles on the paternally inherited X but not the maternally inherited X, whereas other cells do the opposite (Fig. 3-13). This mosaic pattern of gene expression distinguishes most X-linked genes from imprinted genes, whose expression, as we just noted, is determined strictly by parental origin.
FIGURE 3-13 Random X chromosome inactivation early in female development. Shortly after conception of a female embryo, both the paternally and maternally inherited X chromosomes (pat and mat, respectively) are active. Within the first week of embryogenesis, one or the other X is chosen at random to become the future inactive X, through a series of events involving the X inactivation center (black box). That X then becomes the inactive X (Xi, indicated by the shading) in that cell and its progeny and forms the Barr body in interphase nuclei. The resulting female embryo is thus a clonal mosaic of two epigenetically determined cell types: one expresses alleles from the maternal X (pink cells), whereas the other expresses alleles from the paternal X (blue cells). The ratio of the two cell types is determined randomly but varies among normal females and among females who are carriers of X-linked disease alleles (see Chapters 6 and 7).
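A short simulation can illustrate why random choice followed by clonal maintenance yields mosaic females whose cell ratios differ from one individual to the next. This is a deliberately simplified sketch: the function names, the number of founder cells, and the assumption that every cell divides synchronously are illustrative choices, not values taken from the text.

```python
import random

def simulate_x_inactivation(n_founder_cells, n_divisions, seed=None):
    """Choose the active X at random in each founder cell, then expand clonally."""
    rng = random.Random(seed)
    founders = [rng.choice(["mat", "pat"]) for _ in range(n_founder_cells)]
    cells = list(founders)
    for _ in range(n_divisions):
        # Each cell divides once; both daughters inherit the same active X.
        cells = [active_x for active_x in cells for _ in range(2)]
    return founders, cells

def fraction_maternal_active(cells):
    """Fraction of cells expressing alleles from the maternally inherited X."""
    return cells.count("mat") / len(cells)

# Three hypothetical embryos, each starting from 16 cells at the time of choice.
for seed in (1, 2, 3):
    founders, cells = simulate_x_inactivation(16, n_divisions=10, seed=seed)
    print(seed, len(cells), fraction_maternal_active(cells))
```

Because the choice is inherited unchanged by daughter cells, the proportion fixed in the small founder pool persists no matter how many divisions follow. With few founder cells, chance alone produces ratios that deviate from 50:50 in some simulated embryos.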
Although the inactive X chromosome was first identified cytologically by the presence of a heterochromatic mass (called the Barr body) in interphase cells, many epigenetic features distinguish the active and inactive X chromosomes, including DNA methylation, histone modifications, and a specific histone variant, macroH2A, that is particularly enriched in chromatin on the inactive X. As well as providing insights into the mechanisms of X inactivation, these features can be useful diagnostically for identifying inactive X chromosomes in clinical material, as we will see in Chapter 6.
Although X inactivation is clearly a chromosomal phenomenon, not all genes on the X chromosome show monoallelic expression in female cells. Extensive analysis of expression of nearly all X-linked genes has demonstrated that at least 15% of the genes show biallelic expression and are expressed from both active and inactive X chromosomes, at least to some extent; a proportion of these show significantly higher levels of mRNA production in female cells compared to male cells and are interesting candidates for a role in explaining sexually dimorphic traits.
A special subset of genes is located in the pseudoautosomal segments, which are essentially identical on the X and Y chromosomes and undergo recombination during spermatogenesis (see Chapter 2). These genes have two copies in both females (two X-linked copies) and males (one X-linked and one Y-linked copy) and thus do not undergo X inactivation; as expected, these genes show balanced biallelic expression, as one sees for most autosomal genes.
The X Inactivation Center and the XIST Gene.
X inactivation occurs very early in female embryonic development, and determination of which X will be designated the inactive X in any given cell in the embryo is a random choice under the control of a complex locus called the X inactivation center. This region contains an unusual ncRNA gene, XIST, that appears to be a key master regulatory locus for X inactivation. XIST (an acronym for inactive X [Xi]–specific transcripts) has the novel feature that it is expressed only from the allele on the inactive X; it is transcriptionally silent on the active X in both male and female cells. Although the exact mode of action of XIST is unknown, X inactivation cannot occur in its absence. The product of XIST is a long ncRNA that stays in the nucleus in close association with the inactive X chromosome.
Additional aspects and consequences of X chromosome inactivation will be discussed in Chapter 6, in the context of individuals with structurally abnormal X chromosomes or an abnormal number of X chromosomes, and in Chapter 7, in the case of females carrying deleterious mutant alleles for X-linked disease.
Variation in Gene Expression and Its Relevance to Medicine
The regulated expression of genes in the human genome involves a set of complex interrelationships among different levels of control, including proper gene dosage (controlled by mechanisms of chromosome replication and segregation), gene structure, chromatin packaging and epigenetic regulation, transcription, RNA splicing, and, for protein-coding loci, mRNA stability, translation, protein processing, and protein degradation. For some genes, fluctuations in the level of functional gene product, due either to inherited variation in the structure of a particular gene or to changes induced by nongenetic factors such as diet or the environment, are of relatively little importance. For other genes, even relatively minor changes in the level of expression can have dire clinical consequences, reflecting the importance of those gene products in particular biological pathways. The nature of inherited variation in the structure and function of chromosomes, genes, and the genome, combined with the influence of this variation on the expression of specific traits, is the very essence of medical and molecular genetics and is dealt with in subsequent chapters.
1. The following amino acid sequence represents part of a protein. The normal sequence and four mutant forms are shown. By consulting Table 3-1, determine the double-stranded sequence of the corresponding section of the normal gene. Which strand is the strand that RNA polymerase “reads”? What would the sequence of the resulting mRNA be? What kind of mutation is each mutant protein most likely to represent?
Mutant 1 -lys-arg-his-his-cys-leu-
Mutant 2 -lys-arg-ile-ile-ile-
Mutant 3 -lys-glu-thr-ser-leu-ser-
Mutant 4 -asn-tyr-leu-
2. The following items are related to each other in a hierarchical fashion: chromosome, base pair, nucleosome, kilobase pair, intron, gene, exon, chromatin, codon, nucleotide, promoter. What are these relationships?
3. Describe how mutation in each of the following might be expected to alter or interfere with normal gene function and thus cause human disease: promoter, initiator codon, splice sites at intron-exon junctions, one base pair deletion in the coding sequence, stop codon.
4. Most of the human genome consists of sequences that are not transcribed and do not directly encode gene products. For each of the following, consider ways in which these genome elements might contribute to human disease: introns, Alu or LINE repetitive sequences, locus control regions, pseudogenes.
5. Contrast the mechanisms and consequences of RNA splicing and somatic rearrangement.
6. Consider different ways in which mutations or variation in the following might lead to human disease: epigenetic modifications, DNA methylation, miRNA genes, lncRNA genes.
7. Contrast the mechanisms and consequences of genomic imprinting and X chromosome inactivation.