BLAST (Basic Local Alignment Search Tool)

Chapter 3. Sequence Alignment

BLAST finds statistically significant similarities between sequences by evaluating alignments, but how are sequences aligned? In principle, there are many ways to align two sequences, but in practice, one method is used more often than any other. This chapter explains this technique with the biologist in mind, without using the mathematical notation and jargon that is usually employed to describe such algorithm. Divested of unfamiliar language and notation, these algorithms are quite simple.

Finding the optimal alignment between two sequences can be a computationally complex task. Fortunately, a technique called dynamic programming (DP) makes sequence alignment tractable as long as you follow a few rules. Rather than have you struggle with a confusing definition of DP, this chapter demonstrates how the technique works for sequence alignment and then gets back to the generalities. There are fundamentally two kinds of alignment: global and local. In global alignment, both sequences are aligned along their entire lengths and the best alignment is found. In local alignment, the best subsequence alignment is found. For example, if you want to find the two most similar sentences between two books, you use local alignment. If you want to compare the sentences end to end, use global alignment. This chapter describes global alignment, then local alignment. The example uses English words instead of biological sequences and the algorithms are quite general.

3.1 Global Alignment: Needleman-Wunsch

The global alignment algorithm described here is called the Needleman-Wunsch algorithm. We will explain it in a way that seems natural to biologists, that is, it tells the end of the story first, and then fills in the details. (This is why biologists make terrible comedians; they always tell the punch line first.) We will align the words COELANCANTH and PELICAN using a simple scoring scheme: +1 for letters that match, -1 for mismatches, and -1 for gaps. The alignment will eventually look like one of the following, which are equivalent given our scoring scheme:

COELACANTH      COELACANTH

P-ELICAN--      -PELICAN--

Note that every letter of each sequence is aligned to a letter or a gap. In local alignments, discussed later, this isn't the case.

The alignment takes place in a two-dimensional matrix in which each cell corresponds to a pairing of one letter from each sequence. To get an intuitive understanding of the alignment algorithm, look at Figure 3-1, which shows where the maximum scoring alignment lies in the matrix. The alignment starts at the upper left and follows a mostly diagonal path down and to the right. When two letters are aligned, the path follows a diagonal trajectory. There are several places in which the letters from COELACANTH are paired to gap characters. In this case, the graph is followed horizontally. Although not shown here, the path may be also be followed vertically when the letters from PELICAN are paired with gap characters. Gap characters can never be paired to each other. Note that the first row and column are blank. The reason for this will become clear shortly.

Figure 3-1. Example of an alignment matrix

figs/blst_0301.gif

In reality, you don't store letters in the matrix as shown in Figure 3-1. Each cell of the matrix actually contains two values: a score and a pointer. The score is derived from the scoring scheme. Here, this means +1 or -1, but when aligning biological sequences, the values come from a scoring matrix (a topic of the next chapter). The pointer is a directional indicator (an arrow) that points up, left, or diagonally up and left. The pointer navigates the matrix, and its use will become clearer later in the chapter. Now, let's look at the algorithm in detail. There are three major phases: initialization, fill, and trace-back.

3.1.1 Initialization

In the initialization phase, you assign values for the first row and column (Figure 3-2). The next stage of the algorithm depends on this. The score of each cell is set to the gap score multiplied by the distance from the origin. Gaps may be present at the beginning of either sequence, and their cost is the same as anywhere else. The arrows all point back to the origin, which ensures that alignments go all the way back to the origin (a requirement for global alignment).

Figure 3-2. Initialization of the alignment matrix

figs/blst_0302.gif

3.1.2 Fill

In the fill phase (also called induction), the entire matrix is filled with scores and pointers using a simple operation that requires the scores from the diagonal, vertical, and horizontal neighboring cells. You will compute three scores: a match score, a vertical gap score, and a horizontal gap score. The match score is the sum of the diagonal cell score and the score for a match (+1 or -1). The horizontal gap score is the sum of the cell to the left and the gap score (-1), and the vertical gap score is computed analogously. Once you've computed these scores, assign the maximum value to the cell and point the arrow in the direction of the maximum score. Continue this operation until the entire matrix is filled, and each cell contains the score and pointer to the best possible alignment at that point.

If you look at the initialized matrix, you'll find that there's only one cell where you can compute the maximum score because there's only one cell with the required 3 neighbors. Now you can see why you needed to leave one extra column and row and why you needed the initialization phase. Without them, you wouldn't be able to start the fill. Ignoring the rest of the table, compute this cell (the one that matches C to P).

The match score is the sum of the preceding diagonal cell (score = 0) and the score for aligning C to P (-1). The total match score is -1. The horizontal gap score is the sum of the score to the left (-1) and the gap score (-1). The horizontal gap score is therefore -2. The same is true for the vertical gap score. Your maximum score is therefore the diagonal score (-1), and the pointer is set to the diagonal (Figure 3-3). Now that this first cell is computed, you can compute the cell to the right or the cell below. Calculate one more cell, the right neighbor (the one that aligns O and P).

Figure 3-3. Beginning to fill the alignment matrix

figs/blst_0303.gif

The match score is the sum of the preceding diagonal cell (-1) and the mismatch score (-1), which equals -2. The horizontal gap score is the score of the cell to the left (-1) and the gap penalty (-1), which is also -2. The vertical gap score is the cell above (-2) and the gap penalty (-1), which totals -3. The maximum score is -2, but this can come from the diagonal or from the left. This is where you must make a consistent, arbitrary choice—for example, always choose the diagonal over a gap.

The matrix up to this point is shown in Figure 3-4. It may seem rather trivial, but the maximum global alignment between CO and P can be calculated. It has a score of -2 and has either of the following equivalent forms:

CO    CO

-P    P-

Figure 3-4. Second cell filled in the alignment matrix

figs/blst_0304.gif

Using the same maximizing procedure for each cell, you can fill the entire matrix. At the end of the fill, the matrix appears as in Figure 3-5.

Figure 3-5. Filled alignment matrix

figs/blst_0305.gif

3.1.3 Trace-Back

The trace-back lets you recover the alignment from the matrix. Like the other parts of this algorithm, it's pretty simple. Start at the bottom-right corner and follow the arrows until you get to the beginning. To produce the alignment, at each cell, write out the corresponding letters or a hyphen for the gap symbol. Since you're following it from the end to the start, the alignment will be backward, and you just reverse it. The final alignment looks like this:

COELACANTH

-PELICAN-

Example 3-1 shows a Perl script.

Example 3-1. Trace-back with Needleman-Wunsch algorithm

# Needleman-Wunsch  Algorithm

# usage statement

die "usage: $0 <sequence 1> <sequence 2>\n" unless @ARGV == 2;

# get sequences from command line

my ($seq1, $seq2) = @ARGV;

# scoring scheme

my $MATCH    =  1; # +1 for letters that match

my $MISMATCH = -1; # -1 for letters that mismatch

my $GAP      = -1; # -1 for any gap

# initialization

my @matrix;

$matrix[0][0]{score}   = 0;

$matrix[0][0]{pointer} = "none";

for(my $j = 1; $j <= length($seq1); $j++) {

    $matrix[0][$j]{score}   = $GAP * $j;

    $matrix[0][$j]{pointer} = "left";

}

for (my $i = 1; $i <= length($seq2); $i++) {

    $matrix[$i][0]{score}   = $GAP * $i;

    $matrix[$i][0]{pointer} = "up";

}

# fill

for(my $i = 1; $i <= length($seq2); $i++) {

    for(my $j = 1; $j <= length($seq1); $j++) {

my ($diagonal_score, $left_score, $up_score);

# calculate match score

my $letter1 = substr($seq1, $j-1, 1);

my $letter2 = substr($seq2, $i-1, 1);       

if ($letter1 eq $letter2) {

     $diagonal_score = $matrix[$i-1][$j-1]{score} + $MATCH;

}

else {

     $diagonal_score = $matrix[$i-1][$j-1]{score} + $MISMATCH;

}

# calculate gap scores

$up_score   = $matrix[$i-1][$j]{score} + $GAP;

$left_score = $matrix[$i][$j-1]{score} + $GAP;

# choose best score

if ($diagonal_score >= $up_score) {

     if ($diagonal_score >= $left_score) {

  $matrix[$i][$j]{score}   = $diagonal_score;

  $matrix[$i][$j]{pointer} = "diagonal";

     }

else {

  $matrix[$i][$j]{score}   = $left_score;

  $matrix[$i][$j]{pointer} = "left";

     }

} else {

     if ($up_score >= $left_score) {

  $matrix[$i][$j]{score}   = $up_score;

  $matrix[$i][$j]{pointer} = "up";

     }

     else {

  $matrix[$i][$j]{score}   = $left_score;

  $matrix[$i][$j]{pointer} = "left";

     }

}

    }

}

# trace-back

my $align1 = "";

my $align2 = "";

# start at last cell of matrix

my $j = length($seq1);

my $i = length($seq2);

while (1) {

    last if $matrix[$i][$j]{pointer} eq "none"; # ends at first cell of matrix

    if ($matrix[$i][$j]{pointer} eq "diagonal") {

$align1 .= substr($seq1, $j-1, 1);

$align2 .= substr($seq2, $i-1, 1);

$i--;

$j--;

    }

    elsif ($matrix[$i][$j]{pointer} eq "left") {

$align1 .= substr($seq1, $j-1, 1);

$align2 .= "-";

$j--;

    }

    elsif ($matrix[$i][$j]{pointer} eq "up") {

$align1 .= "-";

$align2 .= substr($seq2, $i-1, 1);

$i--;

    }   

}

$align1 = reverse $align1;

$align2 = reverse $align2;

print "$align1\n";

print "$align2\n";

3.2 Local Alignment: Smith-Waterman

The local alignment algorithm we describe here, the Smith-Waterman algorithm, is a very simple modification of Needleman-Wunsch. There are only three changes:

· The edges of the matrix are initialized to 0 instead of increasing gap penalties.

· The maximum score is never less than 0, and no pointer is recorded unless the score is greater than 0.

· The trace-back starts from the highest score in the matrix (rather than at the end of the matrix) and ends at a score of 0 (rather than the start of the matrix).

These simple changes have a profound effect on the behavior of algorithm. Using the same words and scoring scheme as you did in global alignment, look at the filled matrix in Figure 3-6.

Figure 3-6. Filled Smith-Waterman alignment matrix

figs/blst_0306.gif

As you can see, there are a lot of zeroes. That's because there are a lot of places where you can't get a positive score. Also note that one cell is circled. This is the cell with the maximum score in the matrix. There may be more than one place in the matrix with the same maximum score, and in this case, some kind of arbitrary decision must be made. Start your trace-back from the maximum score and follow it to a score of zero, creating your alignment as you step backward, just as you did with global alignment. At the end, you have an alignment with a score of 4 that looks like this:

ELACAN

ELICAN

Example 3-2 shows the Perl code.

Example 3-2. Local alignment with the Smith-Waterman algorithm

# Smith-Waterman  Algorithm

# usage statement

die "usage: $0 <sequence 1> <sequence 2>\n" unless @ARGV == 2;

# get sequences from command line

my ($seq1, $seq2) = @ARGV;

# scoring scheme

my $MATCH    =  1; # +1 for letters that match

my $MISMATCH = -1; # -1 for letters that mismatch

my $GAP      = -1; # -1 for any gap

# initialization

my @matrix;

$matrix[0][0]{score}   = 0;

$matrix[0][0]{pointer} = "none";

for(my $j = 1; $j <= length($seq1); $j++) {

    $matrix[0][$j]{score}   = 0;

    $matrix[0][$j]{pointer} = "none";

}

for (my $i = 1; $i <= length($seq2); $i++) {

    $matrix[$i][0]{score}   = 0;

    $matrix[$i][0]{pointer} = "none";

}

# fill

my $max_i     = 0;

my $max_j     = 0;

my $max_score = 0;

for(my $i = 1; $i <= length($seq2); $i++) {

    for(my $j = 1; $j <= length($seq1); $j++) {

my ($diagonal_score, $left_score, $up_score);

# calculate match score

my $letter1 = substr($seq1, $j-1, 1);

my $letter2 = substr($seq2, $i-1, 1);      

if ($letter1 eq $letter2) {

     $diagonal_score = $matrix[$i-1][$j-1]{score} + $MATCH;

}

else {

     $diagonal_score = $matrix[$i-1][$j-1]{score} + $MISMATCH;

}

# calculate gap scores

$up_score   = $matrix[$i-1][$j]{score} + $GAP;

$left_score = $matrix[$i][$j-1]{score} + $GAP;

if ($diagonal_score <= 0 and $up_score <= 0 and $left_score <= 0) {

     $matrix[$i][$j]{score}   = 0;

     $matrix[$i][$j]{pointer} = "none";

     next; # terminate this iteration of the loop

}

# choose best score

if ($diagonal_score >= $up_score) {

     if ($diagonal_score >= $left_score) {

  $matrix[$i][$j]{score}   = $diagonal_score;

  $matrix[$i][$j]{pointer} = "diagonal";

     }

     else {

  $matrix[$i][$j]{score}   = $left_score;

  $matrix[$i][$j]{pointer} = "left";

     }

} else {

     if ($up_score >= $left_score) {

  $matrix[$i][$j]{score}   = $up_score;

  $matrix[$i][$j]{pointer} = "up";

     }

     else {

  $matrix[$i][$j]{score}   = $left_score;

  $matrix[$i][$j]{pointer} = "left";

     }

}

# set maximum score

if ($matrix[$i][$j]{score} > $max_score) {

     $max_i     = $i;

     $max_j     = $j;

     $max_score = $matrix[$i][$j]{score};

}

    }

}

# trace-back

my $align1 = "";

my $align2 = "";

my $j = $max_j;

my $i = $max_i;

while (1) {

    last if $matrix[$i][$j]{pointer} eq "none";

    if ($matrix[$i][$j]{pointer} eq "diagonal") {

$align1 .= substr($seq1, $j-1, 1);

$align2 .= substr($seq2, $i-1, 1);

$i--; $j--;

    }

    elsif ($matrix[$i][$j]{pointer} eq "left") {

$align1 .= substr($seq1, $j-1, 1);

$align2 .= "-";

$j--;

    }

    elsif ($matrix[$i][$j]{pointer} eq "up") {

$align1 .= "-";

$align2 .= substr($seq2, $i-1, 1);

        $i--;

    }  

}

$align1 = reverse $align1;

$align2 = reverse $align2;

print "$align1\n";

print "$align2\n";

3.3 Dynamic Programming

Now that you've seen the typical approach to global and local alignment, consider the generality of dynamic programming. The advantage of DP can be seen in the fill phase. Each cell represents the maximum scoring alignment between the two sequences up to that point. When you calculate the next cell, you use previously stored values. In other words, DP is an optimizing function whose definition is extended as the algorithm proceeds.

3.4 Algorithmic Complexity

The complexity of algorithms is often described in big-O notation, a shorthand for the asymptotic behavior of the algorithm. For example, searching for a name in a phone book by starting at the beginning and going name-by-name takes on average, n/2 operations, where n is the number of names in the phone book. Such a search has O(n) time complexity (constants are dropped from the notation). It scales linearly in time; a phone book twice as long takes twice as long to search. An approach that scales more efficiently successively splits the phone book in half based on the alphabetical order. This is called a binary search and has O(log2n) complexity. For example, a phone book eight times longer takes only three times longer to search. The alignment algorithms as described have O(nm) complexity in both time and memory, where n and m are the lengths of the sequences. It's also common to label such algorithms as O(n2) and speak of them as scaling quadratically.

3.5 Global Versus Local

Although global and local alignment are mechanistically similar, they have very different properties. Consider the alignment between the genomic sequence of two eukaryotic genes from distantly related organisms. You'd expect the exons to remain the same because their coding sequences are evolutionarily constrained, but the introns may no longer be recognizably similar, especially if they have acquired many insertions or deletions. The problem is that exons may account for only 1 to 2 percent of the sequence. As a result, a global alignment between these sequences is an alignment of mostly random letters. In such a scenario, it's very likely (especially if introns change size, as they often do) that the exons will not align to one another because their score contribution is very small compared to the rest of the sequence. In contrast, local alignment can pick out conserved exons, but unfortunately, just the maximum scoring one. The shortcomings of the standard alignment algorithms have been addressed by numerous variants.

3.6 Variations

The algorithms as described and implemented earlier are rarely used. Over the years, important innovations have made the general algorithms more applicable to aligning biological sequences and running efficiently in a computer.

3.6.1 Gap Modifications

Most alignment algorithms employ affine gaps. The previous description used a simple gap-scoring scheme in which every letter inserted or deleted cost the same amount. This isn't a very good model of biology because large gaps tend to occur fairly often. To better model this phenomenon, affine gaps are used where there is a greater penalty for opening a gap than extending the gap. Figure 3-7 shows a graphical view.

Figure 3-7. Comparison of simple and affine gap scoring schemes

figs/blst_0307.gif

Affine gap penalties are a simple modification to either algorithm. All you need to do is to record a third value in each cell of the matrix that keeps track of whether a gap has already been opened or not and then assign the appropriate gap penalty. Some algorithms even allow multiple gap scoring schemes so that very long gaps are not penalized as much. You can visualize how this works by following the minimal penalty in Figure 3-7. These algorithms are slightly more complicated because scores for each affine gap must be tracked. Usually two affine penalties are employed, and the algorithms are labeled double affine.

3.6.2 Reduced Memory

You shouldn't attempt to align long sequences (e.g., genomic DNA) with the versions of Needleman-Wunsch and Smith-Waterman described in Examples Example 3-1 and Example 3-2. The number of cells in the alignment matrix is the product of the sequence lengths, and each cell may take 8 bytes or more of memory. Therefore, aligning two 100,000-bp DNA sequences requires approximately 80 gigabytes (GB) of RAM. Fortunately, there are linear-memory versions of the algorithms.

You may have noticed during the fill phase of the algorithms that you use only two rows at a time. The other rows just sit there, either waiting to be calculated or waiting for the trace-back. Why not just use two rows at a time and not allocate the whole matrix? If you do this, you can calculate the score of the maximum scoring alignment and record its ending coordinates. By reversing direction, you can then compute the starting coordinates. Unfortunately, you lose the ability to perform a trace-back. But the memory required is now just two times the length of one of the sequences rather than the product of the two sequences. Using this approach, you can compare two 100,000-bp DNA sequences in just 1.6 megabytes (MB) of RAM. The alignment algorithm is now O(n) in memory, but still O(nm) in time.

How do you get the actual alignment? Given the coordinates of the maximum scoring alignment you can restrict the search space to a region between the two endpoints (Figure 3-8). With this approach, a diagonal line is drawn between the endpoints, and the search space is restricted to a certain number of cells on either side of the diagonal. The width of the search space is called the bandwidth. If the bandwidth is too small, the alignment will walk along the edge of the band and may no longer follow the path of the maximum score. If the bandwidth is needlessly large, both memory and computation are wasted. Now that the search space is sufficiently small, you can perform a complete fill and trace-back. However, if the alignment is especially large, even this restricted space can be quite large. For this reason, it is common to use a divide-and-conquer strategy that requires twice the computation, but very little memory.

Figure 3-8. Banded alignment

figs/blst_0308.gif

In Example 3-2, most of the cells had a score of zero. Doesn't it seem a waste of time and memory to explore regions in which the score is poor? Why not make alignments only where positive scores are likely? That's exactly the thinking behind BLAST, as discussed in Chapter 5.

3.6.3 Aligning Transcripts to Genomic Sequence

Determining the correct exon-intron structure of genes isn't always easy. One very successful approach is to align transcripts back to their origin in a genome. However, this isn't as simple as it may appear. Several solutions to this problem use either global or local techniques.

A global alignment between a transcript and a genomic sequence is expected to have huge gaps corresponding to the introns. In eukaryotic genomes, such as the human genome, exons may account for only 1 to 2 percent of a genome. As a result, the gap scores may completely dominate the scoring function, and the alignment may be of little consequence. If the gap costs are too low, the alignment may spread out, and exons may not be faithfully aligned. These problems are largely solved by gapping with double affine penalties, but there are still potential problems with short exons and introns.

A standard local alignment between a transcript and a genome typically identifies the longest exon as the maximum scoring pair. This isn't as useful, but many local alignment algorithms, like BLAST, produce more than one alignment. With these variants, mapping a transcript back to a genome is simply a matter of chaining the individual alignments together. This turns out to be another tricky problem, but it works well most of the time

3.7 Final Thoughts

When you use the Needleman-Wunsch or Smith-Waterman algorithms to find the maximum scoring alignment, you're playing by computational, not biological, rules. As a result, the maximum scoring alignment only approximates the truth. However, even if all the nuances of biology were clear and you could code this in a computer algorithm, you might still favor the approximation because the computational cost of the correct algorithm can be excessive. In any case, the fact that you can align the unrelated words pelican and coelacanth merits consideration. It's possible to align any sequence; finding proper meaning in alignments is up to the user, not the algorithm.

3.8 Further Reading

For more information on the Perl programming language, consider these books:

Christiansen, Tom and Nathan Torkington, Perl Cookbook (O'Reilly & Associates).

Schwartz, Randal L. and Tom Phoenix, Learning Perl (O'Reilly).

Tisdall, James, Beginning Perl for Bioinformatics (O'Reilly).

Wall, Larry, Tom Christiansen, and Jon Orwant, Programming Perl (O'Reilly).

The following texts are indispensable resources for information on sequence alignment and algorithms in general:

Cormen, Thomas H. et al., Introduction to Algorithms (MIT Press).

Gusfield, Dan, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (Cambridge University Press).