BLAST (Basic Local Alignment Search Tool)

Chapter 2. Biological Sequences

Sequence similarity is a powerful tool for discovering biological function. Just as the ancient Greeks used comparative anatomy to understand the human body and linguists used the Rosetta stone to decipher Egyptian hieroglyphs, today we can use comparative sequence analysis to understand genomes, RNAs, and proteins. But why are biological sequences similar to one another in the first place? The answer to this question isn't simple and requires an understanding of molecular and evolutionary biology.

2.1 The Central Dogma of Molecular Biology

Most courses in molecular biology begin with the Central Dogma of Molecular Biology, which describes the path by which information contained in DNA is converted to protein molecules with specific functions. Stated simply, the Central Dogma is: "from DNA to RNA to protein." Figure 2-1 shows a more complete diagram of this process and will be referenced throughout this section.

Figure 2-1. The Central Dogma of Molecular Biology: DNA to RNA to protein


2.1.1 DNA

The hereditary material that carries the blueprint for an organism from one generation to the next is called deoxyribonucleic acid. It is much more commonly referred to by its acronym, DNA. Every time cells divide, the DNA is duplicated in a process called DNA replication. The entire DNA of an organism is called its genome, and genomes are sometimes called "the book of life" (especially with respect to the human genome). Reading and understanding the various books of life is one of the most important quests of the genomic age. Modern medicine, agriculture, and industry will increasingly depend on an intimate knowledge of genomes to develop individualized medicines, select and modify the most desirable traits in plants and animals, and understand the relationships among species.

The language of DNA is complicated. Over the last 50 years, scientists have begun to decipher it, but it is still largely a mystery. Although the language is elusive, the alphabet is simple, consisting of just four nucleotides: adenine, cytosine, guanine, and thymine. For simplicity, both in speech and on the computer, they are usually abbreviated as A, C, G, and T. DNA usually exists as a double-stranded molecule, but we generally talk about just one strand at a time. Here's an example of a DNA sequence that is six nucleotides (nt) long:

GAATTC


DNA has polarity, like a battery, but its ends are referred to as 5-prime (5´) and 3-prime (3´) rather than plus and minus. This nomenclature comes from the chemical structure of DNA. While it isn't necessary to understand the chemical structure, the terminology is important. For example, when people say "the 5´ end of the gene," they mean the beginning of the gene. We usually display DNA sequence as we read text, left to right, and the convention is that the left side is the 5´ end and the right side is the 3´ end.

In addition to the 4-letter alphabet, there is also a 15-letter DNA alphabet used to describe nucleotide ambiguities (Table 2-1). The most common noncanonical DNA symbol is N, which stands for an unknown nucleotide. Other common ones include R and Y.
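The ambiguity codes of Table 2-1 can be treated as sets of nucleotides, which makes them easy to handle on a computer. The following is a minimal sketch rather than a standard tool; the names IUPAC, ambiguity_to_regex, and matches are made up for this illustration, and the mapping follows the usual 15-letter nucleotide alphabet.

```python
import re

# Each symbol of the 15-letter alphabet stands for a set of nucleotides.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def ambiguity_to_regex(pattern):
    """Translate an ambiguous DNA pattern into a regular expression."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]
        # A single base is written as is; a set becomes a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def matches(pattern, sequence):
    """Return True if the whole sequence is compatible with the pattern."""
    regex = ambiguity_to_regex(pattern)
    return re.fullmatch(regex, sequence.upper()) is not None

print(ambiguity_to_regex("GRY"))   # G[AG][CT]
print(matches("GAANTC", "GAATTC"))  # True
```

Turning each symbol into a regular-expression character class is only one possible design; comparing sets of bases position by position would work just as well.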

Table 2-1. Nucleotide ambiguity codes

Symbol   Meaning            Mnemonic
R        A or G             puRine
Y        C or T             pYrimidine
W        A or T             Weak hydrogen bonds
S        G or C             Strong hydrogen bonds
K        G or T             Keto in major groove
M        A or C             aMino in major groove
B        C, G, or T         not A
D        A, G, or T         not C
H        A, C, or T         not G
V        A, C, or G         not T
N        A, C, G, or T      aNy

The pairing rule of DNA is that A pairs with T, and C pairs with G. It is very easy to determine the sequence of the complementary strand of any DNA sequence. In double-stranded form, the 6 base pairs (bp) of DNA above look like this:

5´-GAATTC-3´
3´-CTTAAG-5´



In this example, if you read the bottom strand backward, it is the same as the top strand read forward. Such palindromes are often of biological interest. This particular one is the recognition site for an enzyme called EcoRI that cuts DNA at this sequence. This is an example of how information can be gleaned simply from analyzing the primary sequence. Palindromes and other patterns often give clues to the function of small stretches of DNA.
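The pairing rule and the palindrome test are easy to express in code. Here is a small sketch in Python; the function names are invented for this example, and the sequence GAATTC is the EcoRI recognition site.

```python
# Pairing rule: A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Return the complementary strand, read in the same left-to-right order."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complementary strand read 5' to 3'."""
    return complement(seq)[::-1]

def is_palindrome(seq):
    """A DNA palindrome equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)

site = "GAATTC"                   # EcoRI recognition site
print(complement(site))           # CTTAAG
print(reverse_complement(site))   # GAATTC
print(is_palindrome(site))        # True
```

Note that a DNA palindrome isn't a palindrome in the everyday sense: the test compares a strand with its reverse complement, not with its own reversal.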

But why is DNA double stranded? The molecule is chemically more stable that way, and the double-stranded structure also allows some error correction if a base is accidentally damaged, for example by UV irradiation from too much sunlight. (This is a good reason to wear sunscreen.) DNA by itself doesn't do much. It's just a storehouse for information. For the computer scientists in the audience: think of the genome as a hard disk with RAID mirroring that stores A's, C's, G's, and T's instead of 1s and 0s.

Before we continue with the Central Dogma, we'll discuss genes. What is a gene? Like many complicated problems, this is a question for which five experts would give you six different answers. For our purposes, a gene is a functional unit of the genome (a purposefully vague definition). Most genes contain instructions for producing proteins at a certain time and in a certain space. Some genes have very narrow windows of activity, while others are ubiquitous. Not all genes code for proteins, however. Some genes produce RNAs that aren't translated into proteins and are therefore called noncoding RNAs (ncRNA). So we've already deviated from the Central Dogma. Molecular biology is filled with rules that are constantly violated. (In fact, that's one of the first rules!) Molecular biology is also filled with names and acronyms that may be new to you. To help you keep track of them, this book includes most of them in the Glossary.

2.1.2 RNA

As mentioned earlier, DNA doesn't do much on its own. The excitement starts when DNA is copied into RNA by a protein called RNA polymerase in a process called transcription. Chemically, RNA is a lot like DNA except that it uses uracil instead of thymine and is single stranded instead of double stranded. The RNA alphabet is A, C, G, and U, and an RNA molecule might look like this:


What happens to the RNA transcript from a gene? If it is a transfer RNA (tRNA), ribosomal RNA (rRNA), or other ncRNA, it may undergo some chemical modifications, but the gene product remains as an RNA molecule. RNAs corresponding to protein coding genes are called messenger RNAs (mRNA).

2.1.3 Protein

Proteins make up the "buildings" and "machines" inside a cell. They are chemically very different from DNA and RNA because they are composed of amino acids (often abbreviated aa) rather than nucleic acids. Proteins have a useful property: they can fold into very specific three-dimensional shapes that are dependent on their amino acid sequences. Thus, the amino acid sequence determines the shape of the protein and the shape determines the function. A protein shaped like a stiff rod may be used as a structural support. Collagen and keratin are such proteins and make skin and hair durable. A protein with a hook may be used as a part of a ratcheting motor. A good example of this is myosin, which is found in muscle cells. Therefore, while DNA and RNA are largely used to store and send information, proteins make things happen.

The protein alphabet commonly contains 20 symbols, A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. The names, abbreviations, and structures of the amino acids are shown in Table 2-2.

Table 2-2. Amino acids

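Amino acid       3-letter code   1-letter code   Properties
Alanine          Ala             A               Hydrophobic
Cysteine         Cys             C               Neutral; forms disulfide bridges
Aspartic acid    Asp             D               Negatively charged
Glutamic acid    Glu             E               Negatively charged
Phenylalanine    Phe             F               Hydrophobic; aromatic
Glycine          Gly             G               Neutral; smallest amino acid
Histidine        His             H               Positively charged; aromatic
Isoleucine       Ile             I               Hydrophobic
Lysine           Lys             K               Positively charged
Leucine          Leu             L               Hydrophobic
Methionine       Met             M               Hydrophobic; start amino acid
Asparagine       Asn             N               Neutral; hydrophilic
Proline          Pro             P               Hydrophobic
Glutamine        Gln             Q               Neutral; hydrophilic
Arginine         Arg             R               Positively charged
Serine           Ser             S               Neutral; hydrophilic
Threonine        Thr             T               Neutral; hydrophilic
Valine           Val             V               Hydrophobic
Tryptophan       Trp             W               Hydrophobic; aromatic
Tyrosine         Tyr             Y               Hydrophobic; aromatic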

Using one-letter symbols, a protein sequence might be written like this:


Like DNA and RNA, proteins also have polarity, and the nomenclature comes from the chemical structure. Here again, the convention is to display the sequence from left to right. In proteins, the left end is called the N-terminus and the right end is called the C-terminus. Thus, when people say, "the N-terminus is often removed after translation," they're talking about the beginning of the protein. Remember that all proteins start with the amino acid methionine (M). This is another of the universal laws of molecular biology, and like all biological laws, it is occasionally violated.

The sequences of proteins are one-dimensional, but their shapes are three-dimensional, or four-dimensional if you take into account that they're not frozen in time and can change their shape depending on their environment. It's worth remembering because most of this book talks about proteins as one-dimensional sequences and not shapes, and this approximation is frequently at odds with reality. Let's take a brief sojourn into protein folding and structure to see why this is.

First, just to make sure you get your daily dose of jargon, the sequence of amino acids is called the 1° structure of the protein (this is read as "primary structure," not "1st degree"). Proteins in aqueous solution usually have a globular structure; that is, they aren't sprawled out all over the place but adopt a compact structure. How do they get this way? Many proteins fold into their final structure by themselves because it represents the "easiest" shape they can adopt. But some proteins need a little help, and they receive assistance from other proteins in the cell called chaperones. Amino acid chemistry is beyond the scope of this chapter, but note that amino acids can be classified as hydrophobic ("fears water") or hydrophilic ("likes water"). Hydrophobic amino acids are like oils: they don't mix well with water and prefer to clump together in blobs rather than disperse. When a protein folds, the hydrophobic parts tend to aggregate. This creates a globular structure in which the inside is composed of hydrophobic amino acids, and the exterior is composed of hydrophilic amino acids. Of course, the complete story is much more complicated, but this provides a convenient way to think about protein folding and structure.

Although proteins come in many different shapes and sizes, if you look closely at the structure, you can find recurring structural themes that biologists call 2° (secondary) structure. The most common themes are the α-helix, β-sheet, and random coil. In Figure 2-2, these themes are represented as cylinders, arrows, and squiggly lines.

Figure 2-2. Structure of immunoglobulin domain


2.1.4 The Genetic Code

How is the information in DNA and RNA translated to protein sequence? A complex machine composed of proteins and ncRNAs called the ribosome reads an mRNA sequence and writes a protein sequence. The mRNA is read three nucleotides at a time. The nucleotide triplets are called codons. Each codon corresponds to a single amino acid. The mapping from codons to amino acids is called the genetic code, and its discovery is one of the great achievements in molecular biology. The genetic code is one of the universal laws of molecular biology (and, as you should expect, is sometimes broken).

Because codons are three nucleotides long and there are four possible nucleotides at each position, it follows that there are 64 (4³) possible codons. However, there are only 20 amino acids. Thus there is redundancy in the genetic code, and in turn the code is often described as degenerate. Figure 2-3 shows the standard nuclear genetic code (there are more than a dozen different genetic codes, mostly from different mitochondrial genomes). If you look closely at the redundancies, you will find patterns. For example, the third position of a codon is often insignificant; A, C, G, or T all lead to the same translation. When this isn't the case, A and G are usually synonymous, as are C and T. It so happens that A and G belong to the same chemical class, called purines, and C and T belong to another class, called pyrimidines, so this makes sense in a biochemical way. There are other neat patterns; for example, any codon with a T in the middle translates to a hydrophobic amino acid. In addition to the amino acids, there are three stop codons. When a ribosome sees a stop codon, translation terminates, and the protein is released to go about its business. As mentioned before, all proteins start with the amino acid methionine. This has only one codon, ATG, and so ATG is often called the start codon.
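The patterns just described can be checked with a few lines of code. The sketch below builds the standard genetic code from the compact 64-letter string that lists the amino acids in TCAG order; the variable names are ours, and * marks a stop codon.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons map to each amino acid (or to stop)?
codons_per_residue = Counter(CODON_TABLE.values())

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

# Two-letter prefixes for which the third position makes no difference.
fourfold = sorted(
    {c[:2] for c in CODON_TABLE
     if len({CODON_TABLE[c[:2] + b] for b in BASES}) == 1}
)

# Amino acids encoded by codons with a T in the middle.
middle_t = "".join(sorted({aa for c, aa in CODON_TABLE.items() if c[1] == "T"}))

print(len(CODON_TABLE))          # 64
print(stop_codons)               # ['TAA', 'TAG', 'TGA']
print(codons_per_residue["M"])   # 1
print(len(fourfold))             # 8
print(middle_t)                  # FILMV
```

The last line shows that a T in the middle position always gives phenylalanine, isoleucine, leucine, methionine, or valine, all of which are hydrophobic.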

Figure 2-3. Standard codon translation


Consider the following nucleotide sequence.


If you translate this from the first letter, you get the protein sequence:


But what if you translate it from the second nucleotide? You get a different protein sequence (note that the fractional codon AC at the end of the DNA translates to threonine no matter what the next nucleotide is):


Because codons are three nucleotides long, you can translate DNA in three different reading frames. Since DNA is double stranded, there are really six reading frames for every piece of DNA. So if someone hands you a DNA sequence and asks you to translate it, you may have a little trouble.
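To make the six reading frames concrete, here is a minimal sketch of a six-frame translation. The test sequence ATGGCCTAA is made up for illustration, the function names are ours, and, unlike the hand translation above, this version simply ignores an incomplete codon at the end of a frame.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; * marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons only; a trailing partial codon is dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the translations of all six reading frames, keyed +1..+3, -1..-3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]   # reverse complement
    frames = {}
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
# +1 MA*
# +2 WP
# +3 GL
# -1 LGH
# -2 *A
# -3 RP
```

Only frame +1 looks like a complete (if tiny) protein here, starting with M and ending at a stop codon.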

2.2 Evolution

BLAST works because evolution is happening. Biological sequences show complex patterns of similarity to one another. In this regard, they mirror the external morphologies of the organisms in which they reside. You'll notice that birds, for example, show natural groupings. You don't have to be a biologist to see that ducks, geese, and swans comprise a reasonably natural group called the waterfowl, and that the similarities between ducks and geese seem too great to explain by mere coincidence. Biological sequences are no different. After all, ducks look like ducks and geese look like geese because of their genes. Many molecular biologists are convinced that understanding sequence evolution is tantamount to understanding evolution itself.

Sequences change over time due to three forces: mutation, natural selection, and genetic drift. If you use BLAST, it's important to understand these forces because they form the biological foundation of similarity searches. The biological and mathematical foundations aren't the same, and are sometimes at odds. You need to understand both theories in order to knowledgeably interpret the sequence alignments in a BLAST report.

2.2.1 Mutation

A mutation is simply a change in a DNA sequence. What causes mutation? Many chemicals and conditions damage DNA, so its sequence either changes or ceases to be recognizable. Mutagenic agents are often called carcinogens because cancer is caused by the accumulation of mutations in genes that control cell division. But even in a world without carcinogens there would still be mutation because the process of DNA replication isn't perfect. Every time a cell divides, it must duplicate its DNA. The human genome is about three billion letters long, and the error rate of DNA replication is about one error in every 300 million letters, so you can expect about 10 mutations per genome duplication. Genome size varies, as does the replication error rate, so don't take the 10 mutations per genome replication as any kind of biological truth. Human beings are composed of tens of trillions of cells, and you might take a moment now and consider just how much mutation is going on in your own body. Whatever that large number is, it's infinitesimal compared to what's happening in the biosphere as a whole.

What happens when a mutation occurs in the protein-coding portion of a gene? Because the DNA is mutated, the mRNA is also mutated. This in turn may lead to a different protein, but not necessarily, because the genetic code is degenerate. Take a look at an example for which you mutate just one letter in a coding sequence. If the mutation changed a codon from TTA to TTG, for example, the protein would be unchanged because both codons translate to the amino acid leucine. Such mutations are called silent, synonymous, or same-sense because they don't affect the protein sequence in any way. If the mutation changed a TTA to a TTT, however, the codon would code for a different amino acid, phenylalanine. Such substitutions are called mis-sense mutations. Molecular biologists will often classify mis-sense mutations into either conservative or nonconservative substitutions, depending on whether the two amino acids are chemically similar to one another. Leucine and phenylalanine are both hydrophobic amino acids, and such a substitution would be considered conservative. Bioinformaticists, however, give a more rigorous and quantifiable definition of conservative (see Chapter 4). If the TTA codon is mutated to TAA, the codon becomes a stop codon, which causes the ribosome to stop translating the mRNA. This represents the most destructive kind of mutation, and is called a non-sense mutation. Non-sense mutations cause translation to terminate prematurely, and the result is a truncated protein that may function partially, not function at all, or be poisonous to the cell.
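The three kinds of point mutation can be told apart mechanically by translating the codon before and after the change. The following sketch does just that; classify_substitution is a name invented for this example, and it makes no attempt to judge whether a mis-sense change is conservative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; * marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_substitution(before, after):
    """Classify a codon change as silent, mis-sense, or non-sense.

    Simplification: a change from a stop codon to an amino acid is
    reported as mis-sense, although it really extends the protein.
    """
    old, new = CODON_TABLE[before.upper()], CODON_TABLE[after.upper()]
    if old == new:
        return "silent"
    if new == "*":
        return "non-sense"
    return "mis-sense"

print(classify_substitution("TTA", "TTG"))  # silent (leucine to leucine)
print(classify_substitution("TTA", "TTT"))  # mis-sense (leucine to phenylalanine)
print(classify_substitution("TTA", "TAA"))  # non-sense (leucine to stop)
```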

Not all mutations substitute one nucleotide for another. Some mutations may insert or remove nucleotides. In addition, there are duplications, inversions, and other large-scale rearrangements that destroy genes or even fuse them together. Insertions and deletions are often destructive because they change the reading frame of translation unless the number of nucleotides added or removed is a multiple of three (one or more whole codons). After such a frame-shift mutation, there are usually several mis-sense mutations caused by the out-of-frame codons, and then a premature stop codon that was not previously in frame. Insertions and deletions are therefore usually as disruptive as non-sense mutations.

What happens to an organism with mutations? It depends on a lot of factors. A mutation may have disastrous consequences, it might prove beneficial, or it might have no effect at all. To understand the forces that govern sequence evolution, let's take a close look at natural selection and genetic drift.

2.2.2 Natural Selection

The theory of natural selection was developed to explain why organisms look the way they do and why they seem to "fit" their environments so well. For example, why do giraffes have such long necks? Historically, there have been a lot of explanations, but we'll skip those debates and focus on the theory of natural selection because it is simple and fits the data well. The theory has only three assumptions.

· There must be variation within a population.

· The variation must be heritable.

· There must be differential reproduction based on variation.

In the case of the giraffe ancestor, those individuals with slightly longer necks were at an advantage because they could reach leaves higher in the trees. This advantage translates to more surviving offspring, and since the variation is heritable, they too will tend to have longish necks. Now, within this population of longish necked pre-giraffes, there is still more variation, and the cycle of selecting for longer-necked individuals can persist until you have something that looks like a modern giraffe. People often look at the organisms today and think that their form is "complete." But all organisms are undergoing change from one generation to the next. When you look at a giraffe, try thinking about it as a particular form at a snapshot in time, on its way to something perhaps taller, or shorter, or with wings and horns and a penchant for breathing fire.

When Charles Darwin formulated the theory of natural selection, he had no idea about mutations, DNA, proteins, or the genetic code. The theory was based solely on observation; there was no known mechanism. In the last 50 years, the advances in molecular biology have revolutionized our understanding of natural selection. We now understand why there is variation and what is being selected for and against. The why is that variation exists at the DNA level (geneticists call the variants alleles). The what is differences in genes.

Consider how protein structure is selected for or against. What if a mutation causes an amino acid in the hydrophobic core of a protein to be changed to something hydrophilic? Well, it probably wouldn't fold the same way anymore because the hydrophobic core of the globular structure now has a part that wants to be in an aqueous environment. In most cases, changes in protein structure are unfavorable and therefore selected against; however, sometimes they result in altered function, which is favorable in certain conditions. Such is the case with sickle cell anemia. A charged amino acid (glutamate) is changed to a hydrophobic one (valine), causing altered protein interactions at the surface. Disease results when both alleles of the gene have this change, but it offers some protection against malaria when present in only one allele. As natural selection would predict, the sickle cell allele, and therefore sickle cell anemia, is prominent in certain parts of the world where malaria is common.

Several take-home messages are worth stating quite clearly. First, there is an inexhaustible source of variation because mutation is constantly happening. Natural selection isn't going to run out of variation. Evolution isn't going to stop. Second, a mutation can't be declared either good or bad on its own. Even a mutation that introduces a stop codon can be beneficial. Look at seedless oranges. It might seem an abomination of nature that they can't reproduce by themselves, but it is this very fact that makes humans breed them. To the seedless orange, genes that allow seeds to form are the kiss of death.

2.2.3 Genetic Drift

The interplay between mutation and natural selection that was just outlined makes a nice story. Like most stories, though, the truth is a lot more complicated. Reading the previous section, you may have concluded that natural selection is an all-powerful force, responsible for determining every nucleotide in a DNA sequence. In such a world, you would expect proteins to be perfectly functioning machines and the DNA sequences that encode them to be the best possible sequence for the job. This might be true in a mathematical model involving infinite population size and limitless generations, but the real biological world is a harsh place subject to happenstance. Even if the highly advantageous mutation enabling X-ray vision were to arise in some individual, it might not end up in the gene pool if that person thinks he's Superman and tries to stop a runaway train.

Darwin was not aware of how variation is transmitted from generation to generation; he didn't have the concept of genes. Genes were introduced by Gregor Mendel to explain how hereditary information is transmitted from one generation to the next. Combining Mendelian genetics and natural selection led to the field of population genetics, which is chiefly concerned with the changes in allele frequencies over time. Mathematical simulations show quite clearly that allele frequencies can change by purely random processes. This behavior is called genetic drift, and it's based on the fact that populations aren't infinitely large.

Let's demonstrate genetic drift with an example. For simplicity, let's ignore new mutations and just consider an anonymous site that has no consequence in natural selection. Assume there are only 10 individuals in the population, and that 5 have a C at this position and 5 have a T. Keeping the population fixed, in the next generation, the allele frequencies may change to C=0.6 and T=0.4 due to a runaway train or, less spectacularly, sampling error. All things being equal, in the next generation, there's a greater chance that the C will increase and the T will decrease. If this trend continues for a few generations, the T's may disappear from the population entirely, at which point the C allele is considered fixed in the population. Alleles can be fixed very rapidly if some individuals move away to form a new population. This is called the founder effect. As you can see, changes in allele frequencies don't require mutation or natural selection.
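The example above is easy to simulate. The sketch below assumes a simple resampling scheme in the spirit of the Wright-Fisher model: each of the 10 individuals in the next generation inherits its allele from a randomly chosen member of the current one. The function name and parameters are ours, and the seed only makes a run repeatable.

```python
import random

def drift(pop_size=10, start_c=5, seed=None, max_generations=10_000):
    """Follow the number of C alleles at a neutral site until C or T is fixed.

    Returns a list of C counts, one per generation, starting with start_c.
    No mutation and no selection are involved, only sampling error.
    """
    rng = random.Random(seed)
    counts = [start_c]
    while 0 < counts[-1] < pop_size and len(counts) <= max_generations:
        freq_c = counts[-1] / pop_size
        # Each new individual carries C with probability equal to the
        # current frequency of C in the population.
        counts.append(sum(rng.random() < freq_c for _ in range(pop_size)))
    return counts

for seed in range(5):
    history = drift(seed=seed)
    winner = "C" if history[-1] == 10 else "T"
    print(f"run {seed}: {winner} fixed after {len(history) - 1} generations")
```

Running this with different seeds shows that sometimes C wins and sometimes T wins, and that in a population this small one allele usually takes over within a few dozen generations.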

2.2.4 The Neutral Theory of Evolution

Molecular biology and the discovery of the genetic code had a profound effect on evolutionary biology. One shocking realization was that many sites for mutation, for example the third position in a codon or a nucleotide in the middle of an intron (a term defined later), are expected to be invisible to natural selection. This led Motoo Kimura to propose the neutral theory of evolution. It was somewhat heretical when first proposed because it deemphasized the role of natural selection, but the theory states that the majority of sequence evolution is purely random, the product of mutation and drift.

Imagine what happens to a sequence as it accumulates random mutations over time. At first, the sequence is nearly identical to the original. If the rate of mutation is relatively consistent, you can count the number of mismatches to determine how much time has passed. This turns out to be very useful and forms the basis for determining the probability that a DNA sample matches a particular person, for example. Eventually, the number of mutations becomes so great that the sequence is no longer recognizably similar to the original. At this point, the sequence is saturated for mutation. Saturated sequences can't be used to measure time, but they are still very useful because they indicate which sequences aren't under selective pressure. By inference, those that remain similar over long periods of time are under selective pressure. As a practical example, when comparing the human and puffer fish genomes, you find that most of the conserved sequence is in genes.
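Saturation is easy to see in a simulation. The sketch below applies random substitutions to a random 1,000-nt sequence and measures how much of it still matches the original; it assumes that every site and every substitution is equally likely, which real sequences don't obey, and the function names are ours.

```python
import random

def mutate(seq, n_mutations, rng):
    """Apply n random point substitutions (a site may be hit more than once)."""
    seq = list(seq)
    for _ in range(n_mutations):
        pos = rng.randrange(len(seq))
        # Always change the base to one of the three other nucleotides.
        seq[pos] = rng.choice([b for b in "ACGT" if b != seq[pos]])
    return "".join(seq)

def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

rng = random.Random(1)
original = "".join(rng.choice("ACGT") for _ in range(1000))
for n in (10, 100, 1000, 10000):
    mutated = mutate(original, n, rng)
    print(n, "mutations:", round(identity(original, mutated), 2), "identical")
```

With few mutations, the number of mismatches tracks the number of mutations closely. With many, the identity levels off near 25 percent, which is what you expect between two unrelated sequences over a four-letter alphabet, and counting mismatches no longer tells you how much time has passed.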

One of the great debates of evolutionary biology is the relative importance of natural selection and neutral evolution in the formation of species. We don't need to be overly concerned with this argument because we're more interested in how sequences change over time, and for this we can observe actual sequence data.

2.2.5 Molecular Clocks

If you compare the sequences from related organisms, it is clear that certain positions don't change much over time while others change very rapidly. For example, parts of the ribosomal RNA are identical in every organism sequenced to date, from bacteria to humans. These subsequences are so important that if they change, the organism dies. Clearly, these are under intense selective pressure. There are other sites, such as third codon positions, that are only mildly affected by selection and tend to drift. There are even sequences, such as viral coat proteins, in which selection acts to promote variation, and these change very rapidly. Regardless of the underlying mechanism, it is possible to use the rate of change as a molecular clock.

If you know the mutation rate for a particular sequence, you can use it to determine how long ago two sequences diverged. Suppose you have the same protein sequence from both cats and dogs, and there are 10 differences between them. From the fossil record, you estimate that cats and dogs had a common ancestor 50 million years ago. Now when you compare the cat sequence to the same sequence in humans, you find 12 differences. You can now estimate that carnivores and humans shared a common ancestor 60 million years ago. We're using a very simple model here that treats all positions identically and we're not using real data, but this is the general idea behind molecular clocks.
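The arithmetic behind this example is a simple proportion, sketched below. The function name is ours, and the model is the same deliberately naive one used in the text: differences accumulate at a constant rate and every position is treated identically.

```python
def divergence_time(differences, calib_differences, calib_million_years):
    """Estimate divergence time from a linear molecular clock.

    The clock is calibrated with a pair of sequences whose divergence
    time is known (here, from the fossil record).
    """
    return differences * calib_million_years / calib_differences

# Calibration: cat vs. dog, 10 differences, 50 million years.
# Query: cat vs. human, 12 differences.
print(divergence_time(12, 10, 50))  # 60.0 (million years)
```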

The key to using molecular clocks is that the sequences must "tick" at the appropriate rate. The hypothetical protein in the last example is a poor choice for determining how long ago humans and chimps last shared a common ancestor because one difference here or there would lead to a large difference in the estimated time. Sequences that tick too fast are also not appropriate because they are prone to saturation.

2.2.6 Homology, Phylogeny, and Trees

When looking at the biological world around you, you see only what exists today. You can't get a clear picture of what the world looked like 100 million years ago. However, you can see relationships between organisms and make inferences. For example, you don't know what the last common ancestor of humans, chimpanzees, and gorillas looked like, but you can guess that it looked more like an ape than a bird. This is also the case at the sequence level; proteins from humans and chimps are much more similar to each other than either is to a bird. The study of relationships between organisms is called phylogenetics.

By definition, two sequences are homologous if they share a common ancestor. Two sequences are either homologous or they aren't. However, people often misuse the term and say something like "these two sequences are 80 percent homologous." What they usually mean is that two sequences are 80 percent identical and not that there is an 80 percent chance that they have a common ancestor. Determining if two sequences are indeed homologous requires making inferences. This isn't always a simple task; sometimes homology can be stated with near certainty, but not always. Sequences may appear to be related from chance similarity (or convergent evolution).
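Percent identity, unlike homology, is something you can compute directly from an alignment. Here is one way to do it, as a sketch: the two aligned sequences are made up, gaps are written as -, and the denominator is the full alignment length, which is only one of several conventions in use.

```python
def percent_identity(a, b):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as a match only if the residues agree and are not gaps.
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Two hypothetical aligned sequences that differ at 2 of 10 positions.
print(percent_identity("ACGTACGTAC", "ACGTTCGTAA"))  # 80.0
```

The result says nothing by itself about common ancestry; deciding whether 80 percent identity reflects homology still requires an inference.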

Sequence homology is further refined by the terms orthologous and paralogous. Sequences separated by speciation are called orthologs, while sequences separated by duplication are called paralogs. The genes for myoglobin in humans and mice are orthologs; they are the same gene in different species. If the myoglobin gene is duplicated in humans, the two myoglobins will be paralogs of each other. It's somewhat confusing, but both human paralogs would be considered orthologous to the mouse myoglobin. It is generally the case that the most similar genes between species are orthologs, and this is often used as an operational definition.

2.2.7 The Tree of Life

An introduction to molecular evolution would be incomplete without an overview of life on Earth. You may have learned in an introductory biology class that there are five taxonomic kingdoms (animals, plants, fungi, monera, and protista). This is based largely on what can be seen with your eyes or a microscope. Molecular biology opened up a new way to classify organisms based on sequences rather than external features. Figure 2-4 shows a tree for various organisms based on ribosomal DNA sequence. There are three obvious domains that Carl Woese called the Bacteria, Archaea, and Eucarya. Note that the arrow in the figure points to the root of the plants, animals, and fungi. From this perspective, the traditional five kingdoms are a bit nearsighted.

Figure 2-4. Tree of life based on rRNA sequence (Diagram courtesy of Norman Pace. Used with permission.)


In terms of genomes and overall cell structure, there are only two major divisions: the prokaryotes (bacteria and archaea) and eukaryotes. Except in rare cases, prokaryotes are microscopic organisms that are usually shaped like rods or spheres. Some of the more famous prokaryotes include Escherichia coli (a bacterium that lives in your gut and is a favorite model organism for microbiologists) and Yersinia pestis (the bacterium that causes bubonic plague). The major distinguishing feature of prokaryotes is that DNA replication, transcription, and translation all take place in the same compartment of the cell because there is only one compartment in the cell.

Eukaryotes come in many shapes and sizes, primarily because they can form multi-cellular organisms such as birds and trees. But some eukaryotes are simple, single-celled organisms such as Saccharomyces cerevisiae (the yeast used for making beer). All eukaryotes have a nucleus (karya is Greek for nucleus) in which DNA is stored, in addition to other membranous organelles. Interestingly, most eukaryotes contain mitochondria. These organelles have their own genome and are descended from bacteria that long ago entered a cooperative relationship with eukaryotes. This is also true of chloroplasts, which are responsible for photosynthesis in plants. It is thought that eukaryotes are a fusion of two prokaryotes, one a eubacterium and one an archaebacterium. So the next time you munch on a carrot, you might consider how many genomes are really in there.

So far, this chapter has neglected viruses. Where do they fit in? By most definitions, viruses aren't even alive; they don't grow or have repair processes. Viruses seem to break every rule of biology. Some viruses infect prokaryotes, and others parasitize eukaryotes. Viruses come in many different shapes and have wildly different lifestyles. Some have genomes made from RNA instead of DNA, and others have single-stranded rather than double-stranded genomes.

2.3 Genomes and Genes

In general, the genomic structure of prokaryotes is very different from that of eukaryotes (Figure 2-5). Genomes are organized into chromosomes. Prokaryotes often have a single circular chromosome, and eukaryotes usually have multiple linear chromosomes. People are sometimes surprised to find that genome size and chromosome number aren't reflected in organismal complexity. For example, the single-celled Amoeba dubia has a genome that is about 200 times larger than the human genome. Although dogs and cats have very similar genome sizes, dogs have twice as many chromosomes. One rule to keep in mind when thinking about genomic organization is that genomes of viruses and prokaryotic organisms generally contain little noncoding sequence, whereas the genomes of more complex organisms usually contain a much higher percentage of noncoding sequence.

Figure 2-5. Prokaryote and eukaryote cells


2.3.1 Prokaryotic Genes

Prokaryotic genes are relatively simple. They contain a promoter that determines when the gene is transcribed and a coding region that contains the DNA sequence for a protein. It is relatively easy to find genes in prokaryotic genomes. Since stop codons are expected about every 21 triplets (there are three stop codons out of 64 total triplet combinations), long open reading frames (ORFs) should be very rare, at least from an unbiased random model. On average, proteins are 300 amino acids long, so finding an ORF that is 900 nucleotides long is really unexpected and a pretty clear signal that the ORF codes for a real protein. Of course, some genes encode small proteins, and finding these is a bit more difficult.
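
The arithmetic behind this argument can be sketched in a few lines of Python. This is a minimal illustration, assuming the unbiased random model described above (all 64 triplets equally likely). The function names are invented for this example, and the scan ignores start codons and the reverse strand, both of which a real gene finder would consider.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def prob_no_stop(n_codons):
    """Chance that n_codons random triplets contain no stop codon,
    assuming all 64 codons are equally likely (61 of them are not stops)."""
    return (61 / 64) ** n_codons

def longest_open_frame(seq):
    """Length in nucleotides of the longest stop-free run of codons
    in the three forward reading frames of seq."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0          # a stop codon closes the current open frame
            else:
                run += 3
                best = max(best, run)
    return best

print(64 / 3)                    # expected spacing between stops, in triplets
print(prob_no_stop(300))         # chance of a 300-codon open frame by luck
print(longest_open_frame("ATGAAATAGCCC"))
```

Under this model the chance that 300 consecutive random triplets avoid all three stop codons is below one in a million, which is why a 900-nucleotide ORF stands out so clearly.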

2.3.2 Eukaryotic Genes

Eukaryotic gene structure is more complicated than prokaryotic gene structure. Unlike prokaryotic genes, eukaryotic genes are often broken into pieces that are assembled before they are translated. Like prokaryotes, eukaryotes also have promoters to regulate when genes are turned on, but they are often much larger and may exist a great distance from the start of translation. In addition, many genes respond to additional sequences called enhancers and suppressors that aren't necessarily upstream of a gene and may be many kilobases away.

In eukaryotes, mRNAs are processed before they are translated (Figure 2-6). Two kinds of processing are common: splicing and poly-adenylation. Splicing brings together the coding sequences and throws out the intervening stuff. The sequences that end up in the mature mRNA are called exons, and the intervening stuff is called introns. The part of the mRNA that codes for protein is called the coding sequence (CDS), and the parts at either end are called untranslated regions (UTRs). The other common post-transcriptional modification is poly-adenylation. In this process, 50 or more adenine nucleotides are added to the 3' end of the mRNA, forming what is called a poly-A tail.

Figure 2-6. Eukaryotic mRNA processing
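
The two processing steps described above can be mimicked with simple string operations. The sketch below uses a toy transcript invented for illustration (two short exons around one intron). The function names and the coordinate convention (0-based, end-exclusive) are assumptions of this example, not part of any standard tool.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of pre_mrna and discard everything else.
    exons is a list of (start, end) pairs, 0-based and end-exclusive."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def polyadenylate(mrna, tail_length=50):
    """Append a poly-A tail of tail_length adenines to the 3' end."""
    return mrna + "A" * tail_length

# Toy transcript: exon, intron, exon (sequences made up for this example)
exon1, intron, exon2 = "GGAUGGCC", "GUAAGUUUCAG", "GAAUAAGC"
pre_mrna = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

spliced = splice(pre_mrna, exons)
mature = polyadenylate(spliced)

# In this toy message the CDS runs from the AUG through the UAA stop codon;
# whatever lies before and after it plays the role of the UTRs.
cds_start = spliced.find("AUG")
cds_end = spliced.find("UAA", cds_start) + 3
print(spliced[:cds_start], spliced[cds_start:cds_end], spliced[cds_end:])
```

Locating the stop codon with `find` only works here because the toy CDS happens to keep UAA in frame; real coordinates would come from an annotation.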


2.3.3 Transcripts

To many people, the most interesting parts of a genome are its genes. However, genes may account for a small fraction of a genome. In the human genome, for example, only 1 to 2 percent of the sequence codes for proteins. So why not just sequence the proteins? This procedure turns out to be much more difficult than sequencing nucleotides, but you can sequence the transcripts that code for proteins. Using some clever molecular biology techniques, it's possible to separate mRNAs from the rest of the cellular material and in this way specifically select for protein-coding genes. However, the mRNAs aren't sequenced directly. First they are copied into complementary DNA (cDNA) by an enzyme called reverse transcriptase. This enzyme converts mRNA into DNA, flouting the first rule, which is the Central Dogma of Molecular Biology. A collection of cDNAs is called a cDNA library, and it is common to have cDNA libraries from many kinds of tissues. The mRNAs present in the liver may be very different from those in the brain (the tissues have very different properties due to different collections of proteins). If you're interested in certain cancers, for example, you might develop and sequence cDNA libraries derived from specific types of tumors.

In the world of sequencing, it is therefore common to find cDNA sequencing projects in addition to, or instead of, genome sequencing projects. The downside to cDNA sequencing is that many interesting sequences aren't transcribed, and those that are transcribed may be difficult to capture if they aren't abundant. In your quest for jargon compliance, note that sequencing reads from cDNA sequences are often called expressed sequence tags (ESTs). You will probably come across this term frequently in your BLAST searches.
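
The relationship between an mRNA and its cDNA is easy to express in code. This is a minimal sketch with function names chosen for illustration: the strand made by reverse transcriptase is complementary to the mRNA, while the opposite strand carries the same sequence as the mRNA with T in place of U.

```python
# Pairing between an RNA template base and the DNA base copied opposite it
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}

def first_strand_cdna(mrna):
    """DNA strand complementary to the mRNA, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))

def sense_cdna(mrna):
    """DNA strand with the same sequence as the mRNA (T replaces U)."""
    return mrna.upper().replace("U", "T")

mrna = "AUGGCC"
print(first_strand_cdna(mrna))
print(sense_cdna(mrna))
```

Because an EST can come from either end of a cDNA clone, a read may correspond to either strand, which is one reason nucleotide searches usually consider both.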

2.3.4 Repeats

Repeats are one of the most mysterious features of genomes. All genomes sequenced to date contain some form of repeat, but the big eukaryotic genomes are richest. About half the human genome is easily recognized as repetitive. Understanding repeats is critical to BLAST users because if they aren't dealt with properly, they can tie up your computer for days, dominate your report, invalidate your statistics, and obscure your findings.

The words "repeat" and "repetitive sequence" are used very loosely in genomics, and this causes a lot of confusion for novices. Broadly speaking, repeats can be classified as simple and complex. Simple repeats generally consist of low-complexity sequences (see Chapter 4); examples include runs of a single nucleotide such as An, Tn, Gn, and Cn; dinucleotide repeats such as [CA]n; tri-nucleotide repeats in the form of [CAG]n; and so on. The strange thing about these sequences is that they occur much more frequently in genomes than you'd expect by chance. Simple sequence repeats occur just about everywhere in the genome, even in the protein coding exons of genes, but they are especially common in heterochromatic, pericentromeric, and telomeric regions of eukaryotic chromosomes that play structural roles and don't contain many genes.
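
Simple repeats of the kind just listed are regular enough to find with a pattern match. The sketch below is only an illustration, not the low-complexity filtering that BLAST itself uses. The function name, the five-copy threshold, and the limit of three nucleotides per unit are all assumptions made for this example.

```python
import re

def find_simple_repeats(seq, min_copies=5):
    """Find tandem repeats of a 1 to 3 nucleotide unit.
    Returns (start, unit, copies) tuples, with start 0-based."""
    # Capture the shortest possible unit, then require it to recur
    # at least min_copies - 1 more times immediately afterwards.
    pattern = re.compile(r"([ACGT]{1,3}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

# Toy sequence containing an A run, a [CA]n repeat, and a [CAG]n repeat
seq = "GGT" + "A" * 6 + "TC" + "CA" * 5 + "GT" + "CAG" * 5 + "T"
for start, unit, copies in find_simple_repeats(seq):
    print(start, unit, copies)
```

The lazy quantifier makes the pattern prefer the shortest unit, so a run of A's is reported as [A]n and not as [AA]n or [AAA]n.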

The term complex repeat usually describes any genomic repeat that doesn't consist of low complexity/low entropy sequence. Noncoding RNAs, such as rRNAs and tRNAs, comprise one commonly encountered class of complex repeat, but because they have known important functions, they are often not lumped together with the rest of the repeats. The term complex repeat can also denote some form of mobile genetic element or selfish DNA (a phrase coined by Francis Crick). These entities are a bit like the fleas and ticks of the genome: they copy and spread themselves within and between genomes and are generally believed to do little for the host genome. Selfish DNAs are usually further classified into three subcategories: transposons, retroviruses, and retrotransposons. If you see these names in a BLAST report, you may need to use a repeat filter.

2.3.5 Pseudogenes

One of the most confounding problems in similarity searches is the presence of pseudogenes. As the name suggests, pseudogenes are "fake genes"; that is, they look like they could encode a protein, but they aren't functional. Pseudogenes come from a variety of sources. A mutation that introduces a stop codon into a gene creates a pseudogene, but more commonly, pseudogenes are created from some kind of duplication event. Sometimes, through various mechanisms, regions of a chromosome may become duplicated. The extra copies of genes are generally free of selective pressures and may become pseudogenes as they accumulate mutations. Duplication may also result from repetitive elements that include neighboring DNA as they copy themselves into new locations. In eukaryotes, a very common form of pseudogene is the retro-pseudogene, in which the mRNA from a gene is reverse-transcribed into DNA and inserted back into the genome. Because retro-pseudogenes come from mRNA, they contain the hallmarks of transcripts, notably an absence of introns and the presence of a poly-A tail. They are therefore easy to detect if you know what to look for. Most retro-pseudogenes come from highly transcribed genes such as the protein components of the ribosome.

2.4 Biological Sequences and Similarity

The beginning of this chapter asked why biological sequences are similar to one another. Let's answer that question now. You've seen that biological sequences like proteins may have important functions necessary for the survival of an organism. You've also seen that DNA sequence can mutate randomly, and this may change how a sequence functions. Over time, both functional constraints and random processes impact the course of sequence evolution. The degree to which a sequence follows a functional or random path depends on natural selection and neutral evolution. So sequences are similar to one another because they start out similar to one another and then follow different paths. When you read a BLAST report, you will find that your knowledge of molecular and evolutionary biology will help you interpret the similarities and differences you see.

2.5 Further Reading

Genetics, molecular biology, and evolution aren't especially difficult topics, but they are filled with many potentially unfamiliar terms. The following books are recommended for those just getting started in these fields. They are informative and entertaining, and can help more experienced readers communicate effectively with novices.

Clark, David P. and Lonnie D. Russell, Molecular Biology Made Simple and Fun (Cache River Press).

Gonick, Larry and Mark Wheelis, The Cartoon Guide to Genetics (Perennial).

Tagliaferro, Linda and Mark Vincent Bloom, The Complete Idiot's Guide to Decoding Your Genes (Alpha Books).

The following are typical textbooks for college-level courses in molecular biology, genetics, and evolution:

Alberts, Bruce et al., Molecular Biology of the Cell (Garland).

Futuyma, Douglas J., Evolutionary Biology (Sinauer Associates, Inc.).

Graur, Dan and Wen-Hsiung Li, Molecular Evolution (Sinauer).

Hartl, Daniel L. and Elizabeth W. Jones, Genetics: Analysis of Genes and Genomes (Jones & Bartlett).

Lewin, Benjamin, Genes VII (Oxford University Press).

Lodish, Harvey et al., Molecular Cell Biology (W.H. Freeman & Co.).

Page, Roderic D. M. and Edward C. Holmes, Molecular Evolution: A Phylogenetic Approach (Blackwell Science).

Watson, James D. and Joan Steitz, Molecular Biology of the Gene (Addison-Wesley).