Gene expression differs among tissues and—in any tissue—may vary in response to external stimuli
The haploid human genome contains 20,000 to 30,000 distinct genes, but only about one third of these genes are actively translated into proteins in any individual cell. Cells from different tissues have distinct morphological appearances and functions and respond differently to external stimuli, even though their DNA content is identical. For example, although all cells of the body contain an albumin gene, only liver cells (hepatocytes) can synthesize and secrete albumin into the bloodstream. Conversely, hepatocytes cannot synthesize insulin, which pancreatic β cells produce. The explanation for these observations is that expression of genes is regulated so that some genes are active in hepatocytes and others are silent. In pancreatic β cells, a different set of genes is active; others, such as those expressed only in the liver, are silent. This phenomenon is called tissue-specific gene expression. It raises the question of how the organism programs one cell type to express liver-specific genes and another to express a set of genes appropriate for the pancreas.
A second issue is that genes in individual cells are generally not expressed at constant, unchanging levels (constitutive expression). Rather, their expression levels often vary widely in response to environmental stimuli. For example, when blood glucose levels decrease, α cells in the pancreas secrete the hormone glucagon (see pp. 1050–1053). Glucagon circulates in the blood until it reaches the liver, where it causes a 15-fold increase in expression of the gene that encodes phosphoenolpyruvate carboxykinase (PEPCK), an enzyme that catalyzes the rate-limiting step in gluconeogenesis (see p. 1051). Increased gluconeogenesis then contributes to restoration of blood glucose levels toward normal. This simple regulatory loop, which necessitates that the liver cells perceive the presence of glucagon and stimulate PEPCK gene expression, illustrates the phenomenon of inducible gene expression.
Genetic information flows from DNA to proteins
The “central dogma of molecular biology” states that genetic information flows unidirectionally from DNA to proteins. DNA is a polymer of nucleotides, each containing a nitrogenous base (adenine, A; guanine, G; cytosine, C; or thymine, T) attached to deoxyribose 5′-phosphate. The polymerized nucleotides form a polynucleotide strand in which the sequence of the nitrogenous bases constitutes the genetic information. With few exceptions, all cells in the body share the same genetic information. Hydrogen-bond formation between bases (A and T, or G and C) on the two complementary strands of DNA produces a double-helical structure.
DNA has two functions. The first is to serve as a self-renewing data repository that maintains a constant source of genetic information for the cell. This role is achieved by DNA replication, which ensures that when cells divide, the progeny cells receive exact copies of the DNA. The second purpose of DNA is to serve as a template for the translation of genetic information into proteins, which are the functional units of the cell. This second purpose is broadly defined as gene expression.
Gene expression involves two major processes (Fig. 4-1). The first process—transcription—is the synthesis of RNA from a DNA template, mediated by an enzyme called RNA polymerase II. The resultant RNA molecule is identical in sequence to one of the strands of the DNA template except that the base uracil (U) replaces thymine (T). The second process—translation—is the synthesis of protein from RNA. During translation, the genetic code in the sequence of RNA is “read” by transfer RNA (tRNA), and then amino acids carried by the tRNA are covalently linked together to form a polypeptide chain. In eukaryotic cells, transcription occurs in the nucleus, whereas translation occurs on ribosomes located in the cytoplasm. Therefore, an intermediary RNA, called messenger RNA (mRNA), is required to transport the genetic information from the nucleus to the cytoplasm. The complete process, proceeding from DNA in the nucleus to protein in the cytoplasm, constitutes gene expression.
FIGURE 4-1 Pathway from genes to proteins. Gene expression involves two major processes. First, the DNA is transcribed into RNA by RNA polymerase. Second, the RNA is translated into protein on the ribosomes.
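The two processes described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of the cellular machinery: the function names, the example sequence, and the deliberately partial codon table are all assumptions introduced here. It shows that the RNA matches the coding strand with U in place of T, and that the same RNA results from base pairing against the complementary (template) strand.

```python
# Minimal sketch of DNA -> RNA -> protein. Names and sequences are illustrative.

# Partial codon table (a hypothetical subset, just enough for the example).
CODON_TABLE = {"AUG": "Met", "GCU": "Ala", "AAA": "Lys", "UGG": "Trp"}
STOP_CODONS = {"UAG", "UAA", "UGA"}

# Base-pairing rules used when RNA is synthesized against a DNA template.
RNA_PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(coding_strand):
    """Return RNA identical to the coding strand, with U replacing T."""
    return coding_strand.upper().replace("T", "U")


def transcribe_from_template(template_strand):
    """Build the RNA (5'->3') by pairing against a template given 5'->3'."""
    # The template is read 3'->5', so walk it in reverse.
    return "".join(RNA_PAIRING[base] for base in reversed(template_strand.upper()))


def translate(mrna):
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        # Codons missing from the partial table are reported as "Xaa".
        peptide.append(CODON_TABLE.get(codon, "Xaa"))
    return peptide


coding = "ATGGCTAAATGGTAA"      # illustrative coding strand, 5'->3'
template = "TTACCATTTAGCCAT"    # its reverse complement, 5'->3'
mrna = transcribe(coding)
print(mrna)                     # AUGGCUAAAUGGUAA
print(transcribe_from_template(template) == mrna)  # True
print(translate(mrna))          # ['Met', 'Ala', 'Lys', 'Trp']
```

A complete implementation would use the full 64-codon genetic code; the subset here only keeps the example short.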
Although the central dogma of molecular biology applies to most protein-coding genes, exceptions exist. For example, RNA viruses (such as the human immunodeficiency virus [HIV] that causes acquired immunodeficiency syndrome) contain their genetic information in the sequence of an RNA genome. Upon infection with HIV, the viral enzyme reverse transcriptase “reverse transcribes” the RNA genome into double-stranded DNA that then integrates into the host DNA genome. Transcription of the virally encoded DNA by the host transcriptional machinery produces RNA molecules that become part of new HIV particles. In addition, cells transcribe some genes into RNAs that do not encode proteins. These so-called noncoding RNAs include ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs) that participate in protein translation, small nuclear RNAs (snRNAs) that are involved in RNA splicing, and microRNAs (miRNAs) that regulate mRNA abundance and translation (see pp. 99–100).
The gene consists of a transcription unit
Figure 4-2 depicts the structure of a typical eukaryotic protein-coding gene. The gene consists of a segment of DNA that is transcribed into RNA. It extends from the site of transcription initiation to the site of transcription termination. The region of DNA that is immediately adjacent to and upstream (i.e., in the 5′ direction) from the transcription initiation site is called the 5′ flanking region. The corresponding domain that is downstream (3′) to the transcription termination site is called the 3′ flanking region. (Recall that DNA strands have directionality because of the 5′ to 3′ orientation of the phosphodiester bonds in the sugar-phosphate backbone of DNA. By convention, the DNA strand that has the same sequence as the RNA is called the coding strand, and the complementary strand is called the noncoding strand. The 5′ to 3′ orientation refers to the coding strand.) Although the 5′ and 3′ flanking regions are not transcribed into RNA, they frequently contain DNA sequences, called regulatory elements, that control gene transcription. The site where transcription of the gene begins, sometimes called the cap site, may have a variant of the nucleotide sequence 5′-ACTT(T/C)TG-3′ (called the cap sequence), where T/C means T or C. The A is the transcription initiation site. Transcription proceeds to the transcription termination site, which has a less defined sequence and location in eukaryotic genes. Slightly upstream from the termination site is another sequence called the polyadenylation signal, which often has the sequence 5′-AATAAA-3′.
FIGURE 4-2 Structure of a eukaryotic gene and its products. The figure depicts a gene, a primary RNA transcript, the mature mRNA, and the resulting protein. The 5′ and 3′ numbering of the gene refers to the coding strand. ATG, AATAAA, and the like, are nucleotide sequences. m7G, 7-methyl guanosine.
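As a rough illustration of the landmarks just described, the sketch below scans a coding-strand sequence for the cap sequence 5′-ACTT(T/C)TG-3′ and for the polyadenylation signal 5′-AATAAA-3′. The function name and the example sequence are hypothetical, and the exact-match patterns are a simplification: the text notes that real genes may carry variants of these sequences, which this sketch would miss.

```python
import re

# Cap sequence from the text; [TC] encodes "T or C". Its first A is the
# transcription initiation site.
CAP_PATTERN = re.compile(r"ACTT[TC]TG")
# Polyadenylation signal from the text.
POLY_A_SIGNAL = re.compile(r"AATAAA")


def find_landmarks(coding_strand):
    """Locate the cap site and polyadenylation signal (0-based positions)."""
    seq = coding_strand.upper()
    cap = CAP_PATTERN.search(seq)
    # Look for the signal only downstream of the cap site, if one was found.
    search_from = cap.start() if cap else 0
    signal = POLY_A_SIGNAL.search(seq, search_from)
    return {
        "transcription_start": cap.start() if cap else None,
        "five_prime_flank": seq[:cap.start()] if cap else None,
        "poly_a_signal": signal.start() if signal else None,
    }


# Illustrative sequence: flank + cap sequence + short body + signal + tail.
gene = "GGCC" + "ACTTCTG" + "ATGGCTTAA" + "CC" + "AATAAA" + "GG"
print(find_landmarks(gene))
# {'transcription_start': 4, 'five_prime_flank': 'GGCC', 'poly_a_signal': 22}
```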
The RNA that is initially transcribed from a gene is called the primary transcript (see Fig. 4-2) or precursor mRNA (pre-mRNA). Before it can be translated into protein, the primary transcript must be processed into a mature mRNA in the nucleus. Most eukaryotic genes contain exons, DNA sequences that are present in the mature mRNA, alternating with introns, which are not present in the mRNA. The primary transcript is colinear with the coding strand of the gene and contains the sequences of both the exons and the introns. To produce a mature mRNA that can be translated into protein, the cell must process the primary transcript in four steps.
First, the cell adds an unusual guanosine nucleotide, which is methylated at the 7 position, via a 5′-5′ triphosphate linkage to the 5′ end of the transcript. The result is a 5′ methyl cap. The presence of the 5′ methyl cap is required for export of the mRNA from the nucleus to the cytoplasm as well as for translation of the mRNA.
Second, the cell removes the sequences of the introns from the primary transcript via a process called pre-mRNA splicing. Splicing involves the joining of the sequences of the exons in the RNA transcript and the removal of the intervening introns. As a result, mature mRNA (see Fig. 4-2) is shorter and not colinear with the coding strand of the DNA template.
The third processing step is cleavage of the RNA transcript about 20 nucleotides downstream from the polyadenylation signal, near the 3′ end of the transcript.
The fourth step is the addition of a string of 100 to 200 adenine bases at the site of the cleavage to form a poly(A) tail. This tail contributes to mRNA stability.
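The four processing steps can be sketched as string operations on an RNA sequence. This is a toy model under stated assumptions: the function names, the intron coordinates, and the example transcript are hypothetical; the cap is represented only by the text marker "m7G-"; and the 20-nucleotide cleavage offset is measured from the end of the polyadenylation signal, which is one reading of "about 20 nucleotides downstream."

```python
POLY_A_SIGNAL = "AAUAAA"  # RNA form of the 5'-AATAAA-3' signal


def splice(pre_mrna, introns):
    """Remove introns, given as 0-based half-open (start, end) intervals."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)


def cleave(transcript, offset=20):
    """Cut the transcript `offset` nucleotides past the last poly(A) signal."""
    index = transcript.rfind(POLY_A_SIGNAL)
    if index == -1:
        raise ValueError("no polyadenylation signal found")
    return transcript[:index + len(POLY_A_SIGNAL) + offset]


def process_pre_mrna(pre_mrna, introns, tail_length=150, offset=20):
    """Apply capping, splicing, cleavage, and polyadenylation in order."""
    spliced = splice(pre_mrna, introns)
    cleaved = cleave(spliced, offset)
    # "m7G-" stands in for the 5' methyl cap; the tail is 100 to 200 A's.
    return "m7G-" + cleaved + "A" * tail_length


exon1 = "AUGGCU"
intron = "GUAAGUCCCAG"
exon2 = "AAAUGGUAA" + "CC" + POLY_A_SIGNAL + "G" * 30
pre_mrna = exon1 + intron + exon2
mature = process_pre_mrna(pre_mrna, introns=[(6, 17)])
print(mature[:27])   # m7G-AUGGCUAAAUGGUAACCAAUAAA
print(len(mature))   # 197
```

The mature product is shorter than the primary transcript upstream of the tail and no longer contains the intron, which is the sense in which it is not colinear with the coding strand.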
The mature mRNA produced by RNA processing not only contains a coding region—the open-reading frame—that encodes protein but also sequences at the 5′ and 3′ ends that are not translated into protein—the 5′ and 3′ untranslated regions (UTRs), respectively. Translation of the mRNA on ribosomes always begins at the codon AUG, which encodes methionine, and proceeds until the ribosome encounters one of the three stop codons (UAG, UAA, or UGA). Thus, the 5′ end of the mRNA is the first to be translated and provides the N terminus of the protein; the 3′ end is the last to be translated and contributes the C terminus.
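The division of a mature mRNA into 5′ UTR, open-reading frame, and 3′ UTR can be sketched as follows. The function name and example sequence are hypothetical, and the sketch assumes that translation starts at the first AUG in the mRNA, a simplification that the text does not state. The stop codon is reported as the last codon of the open-reading frame, although it does not encode an amino acid.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}


def split_mrna(mrna):
    """Split an mRNA into 5' UTR, open-reading frame, and 3' UTR.

    Assumes the first AUG is the start codon. Returns None if there is no
    AUG or no in-frame stop codon.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    # Step through the sequence one codon (three nucleotides) at a time.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return {
                "five_prime_utr": mrna[:start],
                "orf": mrna[start:i + 3],       # includes the stop codon
                "three_prime_utr": mrna[i + 3:],
            }
    return None


mrna = "GGCACC" + "AUGGCUAAAUGGUAA" + "CCAAUAAAGG"
print(split_mrna(mrna))
# {'five_prime_utr': 'GGCACC', 'orf': 'AUGGCUAAAUGGUAA',
#  'three_prime_utr': 'CCAAUAAAGG'}
```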
DNA is packaged into chromatin
Although DNA is commonly depicted as linear, chromosomal DNA in the nucleus is actually organized into a higher-order structure called chromatin. This packaging is required to fit DNA with a total length of ~1 m into a nucleus with a diameter of 10⁻⁵ m. Chromatin consists of DNA associated with histones and other nuclear proteins. The basic building block of chromatin is the nucleosome (Fig. 4-3), which consists of a protein core and 147 base pairs (bp) of associated DNA. The protein core is an octamer that contains two copies each of the histones H2A, H2B, H3, and H4. DNA wraps twice around the core histones to form a solenoid-like structure. A linker histone, H1, associates with segments of DNA between nucleosomes. Regular arrays of nucleosomes have a beads-on-a-string appearance and constitute the so-called 11-nm fiber of chromatin, which can condense to form the 30-nm fiber.
FIGURE 4-3 Chromatin structure.
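A short worked calculation makes the scale of this packaging concrete. The first ratio uses only the figures quoted above. The second assumes a rise of about 0.34 nm per base pair for double-helical DNA, a standard value that the text does not give.

```latex
\frac{\text{DNA length}}{\text{nuclear diameter}}
  \approx \frac{1\ \text{m}}{10^{-5}\ \text{m}} = 10^{5}
\qquad
147\ \text{bp} \times 0.34\ \tfrac{\text{nm}}{\text{bp}} \approx 50\ \text{nm}
```

Under these assumptions, the DNA is roughly 100,000 times longer than the nucleus is wide, and each nucleosome packs about 50 nm of DNA into a particle of the 11-nm fiber.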
Chromatin exists in two general forms that can be distinguished cytologically by their different degrees of condensation. Heterochromatin is a highly condensed form of chromatin that is transcriptionally inactive. In general, highly organized chromatin structure is associated with repression of gene transcription. Heterochromatin contains mostly repetitive DNA sequences and relatively few genes. Euchromatin has a more open structure and contains genes that are actively transcribed. Even in the transcriptionally active “open” euchromatin, local chromatin structure may influence the activity of individual genes.
Gene expression may be regulated at multiple steps
Gene expression involves eight steps (Fig. 4-4):
Step 1: Chromatin remodeling. Before a gene can be transcribed, some local alteration in chromatin structure must occur so that the enzymes that mediate transcription can gain access to the genomic DNA. The alteration in chromatin structure is called chromatin remodeling, which may involve loosening of the interaction between histones and DNA, repositioning of nucleosomes, or local depletion of histones.
Step 2: Initiation of transcription. In this step, RNA polymerase is recruited to the gene promoter and begins to synthesize RNA that is complementary in sequence to one of the strands of the template DNA. For most eukaryotic genes, initiation of transcription is the critical, rate-limiting step in gene expression.
Step 3: Transcript elongation. During transcript elongation, RNA polymerase proceeds down the DNA strand and sequentially adds ribonucleotides to the elongating strand of RNA (see N4-1).
Role of Tat in Transcript Elongation
Contributed by Peter Igarashi
Regulation of elongation appears to be critical for the expression of certain genes, such as some genes of HIV-1, the causative agent of acquired immunodeficiency syndrome (AIDS). HIV-1 is a retrovirus (RNA virus) that preferentially infects cells of the immune system. After infection, the RNA viral genome is “reverse” transcribed into double-stranded DNA, which integrates into the host genome. A viral promoter that is located in the long terminal repeat of the viral genome then drives expression of the viral genes. Immediately downstream from the promoter, and within the 5′ untranslated region, is a regulatory element known as the trans-activation response element (TAR). Unlike the regulatory elements that we have discussed above, this element is active in transcribed RNA. The sequence of TAR contains an inverted repeat, and a stretch of nucleotides on one part of the TAR pairs with nucleotides on the other part to create a hairpin structure in this viral transcript (eFig. 4-1). Because the inverted repeat is imperfect, the hairpin contains a “bulge.” Elongation of transcription cannot occur unless a virally encoded protein called Tat binds to this bulge in the TAR portion of the RNA transcript. In the absence of Tat, transcription initiates but elongation does not proceed past the TAR; the resulting truncated transcripts do not encode proteins. In the presence of Tat, RNA polymerase II (Pol II) can read through the TAR and elongation proceeds normally, producing full-length RNA. It appears that the function of TAR is to recruit Tat to the promoter. Tat, in turn, associates with P-TEFb, a kinase that phosphorylates the C-terminal domain (CTD) of Pol II (see p. 85 and Fig. 4-10) and stimulates transcription elongation.
eFIGURE 4-1 Role of Tat in transcript elongation. A, The TAR (trans-activation response element) of the newly transcribed RNA forms a hairpin loop with a bulge. In the absence of Tat, transcription terminates prematurely and releases a short strand of RNA. B, Tat binds to the bulge and recruits a coactivator that allows transcript elongation to form full-length RNAs.
Step 4: Termination of transcription. After producing a full-length RNA, the enzyme halts elongation.
Step 5: RNA processing. As noted before, RNA processing involves (a) addition of a 5′ methylguanosine cap, (b) pre-mRNA splicing, (c) cleavage of the RNA strand, and (d) polyadenylation.
Step 6: Nucleocytoplasmic transport. The next step in gene expression is the export of the mature mRNA through pores in the nuclear envelope (see p. 21) into the cytoplasm. Nucleocytoplasmic transport is a regulated process that is important for mRNA quality control.
Step 7: Translation. The mRNA is translated into proteins on ribosomes. During translation, the genetic code on the mRNA is read by tRNA, and then amino acids carried by the tRNA are added to the nascent polypeptide chain.
Step 8: mRNA degradation. Finally, the mRNA is degraded in the cytoplasm by a combination of endonucleases and exonucleases.
FIGURE 4-4 Steps in gene expression. Nearly all of the eight steps in gene expression are potential targets for regulation.
Each of these steps is potentially a target for regulation (see Fig. 4-4, right panel):
1. Gene expression may be regulated by global as well as by local alterations in chromatin structure.
2. An important related alteration in chromatin structure is the state of methylation of the DNA.
3. Initiation of transcription can be regulated by transcriptional activators and transcriptional repressors.
4. Transcript elongation may be regulated by premature termination in which the polymerase falls off (or is displaced from) the template DNA strand; such termination results in the synthesis of truncated transcripts.
5. Pre-mRNA splicing may be regulated by alternative splicing, which generates different mRNA species from the same primary transcript.
6. At the step of nucleocytoplasmic transport, the cell prevents expression of aberrant transcripts, such as those with defects in mRNA processing. In addition, mutant transcripts containing premature stop codons may be degraded in the nucleus through a process called nonsense-mediated decay.
7. Control of translation of mRNA is a regulated step in the expression of certain genes, such as the transferrin receptor gene.
8. Control of mRNA stability contributes to steady-state levels of mRNA in the cytoplasm and is important for the overall expression of many genes.
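To illustrate how mRNA stability (item 8) can set steady-state mRNA levels, the sketch below uses a simple one-compartment model: constant synthesis and first-order degradation, dM/dt = synthesis_rate − k_deg × M. The model, the function name, and the numbers are assumptions chosen for illustration; they are not taken from the text.

```python
import math


def steady_state_mrna(synthesis_rate, half_life):
    """Steady-state mRNA level for dM/dt = synthesis_rate - k_deg * M.

    synthesis_rate: molecules made per hour (hypothetical units).
    half_life: mRNA half-life in hours.
    """
    # First-order decay constant derived from the half-life.
    k_deg = math.log(2) / half_life
    # At steady state dM/dt = 0, so M = synthesis_rate / k_deg.
    return synthesis_rate / k_deg


short_lived = steady_state_mrna(synthesis_rate=10, half_life=1)
long_lived = steady_state_mrna(synthesis_rate=10, half_life=4)
print(round(long_lived / short_lived, 6))  # 4.0
```

In this model, a fourfold longer half-life yields a fourfold higher steady-state level at the same synthesis rate, so changes in mRNA stability can alter expression without any change in transcription.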
Although any of these steps may be critical for regulating a particular gene, transcription initiation (step 2) is the most frequently regulated step and is the focus of this chapter. At the end of the chapter, we describe examples of epigenetic regulation of gene expression and of post-transcriptional regulation, that is, regulation at steps subsequent to the initiation of transcription.
Transcription factors are proteins that regulate gene transcription
A general principle is that gene transcription is regulated by interactions of specific proteins with specific DNA sequences. The proteins that regulate gene transcription are called transcription factors. Many transcription factors recognize and bind to specific sequences in DNA. The binding sites for these transcription factors are called regulatory elements. Because they are located on the same piece of DNA as the genes that they regulate, these regulatory elements are sometimes referred to as cis-acting elements.
Figure 4-5 illustrates the overall scheme for the regulation of gene expression. Transcription requires proteins (transcription factors) that bind to specific DNA sequences (regulatory elements) located near the genes they regulate (target genes). Once the proteins bind to DNA, they stimulate (or inhibit) transcription of the target gene. A particular transcription factor can regulate the transcription of multiple target genes. In general, regulation of gene expression can occur at the level of either transcription factors or regulatory elements. Examples of regulation at the level of transcription factors include variations in the abundance of transcription factors, their DNA-binding activities, and their ability to stimulate (or to inhibit) transcription. Examples of regulation at the level of regulatory elements include alterations in chromatin structure (which influences accessibility to transcription factors) and covalent modifications of DNA, especially methylation.
FIGURE 4-5 Regulation of transcription. Protein A, a transcription factor that is encoded by gene A (not shown), regulates another gene, gene B. Protein A binds to a DNA sequence (a regulatory element) that is upstream from gene B; this DNA sequence is a cis-acting element because it is located on the same DNA as gene B. In this example, protein A stimulates (transactivates) the transcription of gene B. Transcription factors also can inhibit transcription.
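The idea that one transcription factor can regulate several target genes through a shared regulatory element can be sketched as a sequence search. Everything in this example is hypothetical: the function name, the gene names, the flanking sequences, and the motif, which stands in for whatever sequence a given factor recognizes. Real binding sites are usually degenerate, so an exact-match search is only a first approximation.

```python
import re


def find_target_genes(motif, flanking_regions):
    """Return genes whose 5' flanking region contains the regulatory element.

    motif: regular expression for the element recognized by the factor.
    flanking_regions: dict mapping gene name -> 5' flanking sequence.
    Result maps each target gene to the 0-based positions of its elements.
    """
    pattern = re.compile(motif)
    targets = {}
    for gene, sequence in flanking_regions.items():
        positions = [m.start() for m in pattern.finditer(sequence.upper())]
        if positions:
            targets[gene] = positions
    return targets


# Illustrative data: "protein A" is assumed to recognize this element.
element = "TGACGTCA"
flanks = {
    "geneB": "CCGGTGACGTCAGGTT",
    "geneC": "AATTGGCCAATTGGCC",
    "geneD": "TGACGTCATTTTTGACGTCA",
}
print(find_target_genes(element, flanks))
# {'geneB': [4], 'geneD': [0, 12]}
```

Here geneB and geneD both carry the element in cis and would be candidate targets of the same factor, whereas geneC would not.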