BLAST statistics are everywhere in biology today. In fact, it's hard to find a molecularbiology paper, grant proposal, patent application, or biotech business plan that doesn't refer to the Expect or Pvalue of a BLAST result. The BLAST Expect has so permeated biological thinking in recent years that for many scientists it has become synonymous with biological truth. Tell a colleague that you've just cloned a gene that's homologous to something trendy, and odds are that he will ask what the Expect of its alignment was in a BLAST search.
Of course, what some see as a sweeping change, others label a fad. While some researchers consider BLAST statistics a welcome injection of mathematical rigor into the biological world, others lament the abandonment of biological insight for faith in a number. No matter where you stand on this issue, there is no avoiding the reality of BLAST statistics in today's bioinformatics workplace. Understanding what the numbers in a BLAST report mean and how they are derived isn't just for mathematicians; it's a realworld survival skill for biologists and bioinformatics professionals in academia and industry alike.
The material covered in this chapter is practical rather than theoretical in nature. Chapter 4 summarized some of the theory behind local alignment statistics. Read that chapter to learn more about the basic parameters of BLAST: l, k, and H. This chapter shows how to calculate the numbers in a BLAST report and use this knowledge to better understand the results of a BLAST search.
7.1 Basic BLAST Statistics
Primary BLAST literature doesn't focus on the arithmetic involved in calculating an Expect. Understandably, the papers discuss the theoretical underpinnings of KarlinAltschul statistics, leaving BLAST users without the kinds of examples found in a text book. This chapter provides this missing information.
Figure 71 summarizes the basic operations involved in converting the raw score of a highscoring pair into an Expect, or Pvalue. The following discussion explains how to perform each calculation and any accessory operations required to derive an Expect.
Figure 71. The essential calculations involved in converting an HSP's raw score into a probability
Calculating KarlinAltschul statistics isn't easily done with a pencil, paper, and calculator. Applying KarlinAltschul statistics to an HSP and avoiding a migraine requires code. We use simple Perl subroutines to take the grind out of the calculations and help clarify the equations. Hopefully these subroutines will also provide a basis for further investigations of your own BLAST results. If you'd like to add these subroutines to your codebase, you can download them from this book's web site; the module name is BlastStats.pm.
Example 71 shows a sample HSP from a BLASTP search. Use this alignment as a reference to check your calculations.^{[1]} For our present purposes, the details of the search aren't important, but the values given in the header and footer of the report are. The header and footer are omitted from the search that generated this alignment.
^{[1]} If you plan to perform the calculations described in this chapter and have searched the NCBI web site (http://www.ncbi.nlm.nih.gov/BLAST), make sure you don't select the compositionbased statistics in the Advanced Options checkbox because the values of the compositionbased lambdas aren't given in the report. Currently, it's selected by default for BLASTP, and you unselect it by clicking on it.
Example 71. A sample HSP from a BLASTP search
>CT12121 translation from_gene[CG3603 CG3603 FBgn0029648]
seq_release:2 mol_weight=18568
cds_boundaries:(X:2,699,992..2,700,667[+]) aa_length:176
transcript_info:[CT12121 seq_release:2] gene_info:[gene
symbol:CG3603 FBgn0029648
gene_boundaries:(X:2,699,935..2,700,670[+]) (GO:0016614
"oxidoreductase, acting on CHOH group of donors")]
Length = 176
Score = 70.9 bits (172), Expect = 8e13
Identities = 49/170 (28%), Positives = 85/170 (49%), Gaps = 6/170 (3%)
Query: 50 IAGEVAVVTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAKAYKA 109
+AG+VA+VTGAG G+GRA LA+ G + VD N+ A++TV Q++ R+ A +
Sbjct: 6 LAGKVALVTGAGSGIGRATCRLLARDGAKVIAVDRNLKAAQETVQELGSERSAALEV 62
Query: 110 NVTNYDDLVELNSKVVEEMGPVTVLVNNAGVMMHRNMFNPDPADVQLMINVNLTSHFWT 168
+V++ + ++ +++ T++VN+AG+ + D + VNL F
Sbjct: 63 DVSSAQSVQFSVAEALKKFQQAPTIVVNSAGITRDGYLLKMPERDYDDVYGVNLKGTFLV 122
Query: 169 KLVFLPKMKELRKGFIVTISSLAGVFPLPYSATYTTTKSGALAHMRTL 216
+ M ++L G IV +SS+ A Y TK+G ++ L
Sbjct: 123 TQAYAKAMIEQKLENGTIVNLSSIVAKMNNVGQANYAATKAGVISFTERL 172
Table 71. The parameters and their values required for KarlinAltschul statistical calculations 

Parameter 
Value 
l 
0.267 nats (gapped) 
k 
0.0410 nats (gapped) 
H 
0.140 nats/aligned residue 
m 
321 (length of the query sequence) 
n 
9418064 (number of letters in the database searched) 
Effective HSP length 
99 
Number of sequences in database 
17878 
7.1.1 Actual Versus Effective Lengths
The KarlinAltschul equation is probably the most recognized equation in bioinformatics. It states that the number of alignments expected by chance (E) during a sequence database search is a function of the size of the search space (m*n), the normalized score (lS), and a minor constant (k). (For more information about the KarlinAltschul equation and its associated parameters, see Chapter 4). The first step toward understanding how to use the KarlinAltschul equation is to understand what values for m and n BLAST uses when calculating an Expect. Many users assume that the m and n in the equation refer to the length of the query and subject sequences; this isn't the case. In practice, you use the effective lengths rather than actual lengths of the query and database when calculating the Expect of an HSP.
The terms used in the BLAST literature to denote the actualand effective lengths of the query, subject, and database can be confusing. Sometimes m and n denote actual lengths, and sometimes they denote the effective lengths. To avoid confusion, this chapter uses m and n to denote actual lengths and m´ and n´ to denote effective lengths.
The effective length of the query, m´, is its actual length minus the expected HSP length. The effective length of the database, n´, is the sum of the effective length of every sequence within it. In other words, if the sequence database used in a BLAST search contains only a single sequence, the actual length of that database, n, is the length of that sequence; the effective length, n´, of that database is the effective length of the single sequence it contains. If the database contains two sequences, its actual length is the sum of the lengths of those two sequences; the effective length of that database is the sum of those two sequences' effective lengths.
Calculating m´ and n´ requires knowing the expected HSP length, which is the length of an HSP that has an Expect of 1. You can find this value in the NCBIBLAST report footer (where it's called, confusingly, effective HSP length). You can also calculate it, as you'll learn later in this chapter. For now, though, use the value from Table 71. Here are two Perl functions used to calculate m´ and n´:
sub effectiveLengthSeq {
my $m = shift; # actual length sequence
my $expected_HSP_length = shift;
my $k = shift; # gapped k
my $m_prime = $m  $expected_HSP_length;
if ($m_prime < 1/$k){
return 1/$k;
}
else {
return $m_prime;
}
}
sub effectiveLengthDb {
my $n = shift; # actual length
my $expected_HSP_length = shift;
my $num_seqs_in_db = shift;
my $k = shift; # gapped k
my $n_prime = $n  ($num_seqs_in_db*$expected_HSP_length);
if ($n_prime < 1/$k){
return 1/$k;
}
else {
return $n_prime;
}
}
Notice that no effective length of the query or the database can ever be less than 1/k. Setting an effective length to 1/k basically amounts to ignoring a short sequence for statistical purposes; in cases when both m and n are less than 1/k, BLAST searches are illadvised. For more information about effective lengths and short sequences, see Chapter 4.
Using these functions with the information provided in Table 71 gives a value of 222 and 7648142 for m´ and n´, respectively.
7.1.2 The Raw Score and Bit Score
BLAST reports two scores: the raw score and a normalized score called abit score. The raw score, S, is the sum of the alignment's pairwise scores using a specific scoring matrix (see Chapter 4 for more information). When deriving the bit score for an HSP, first convert its raw score to a nat score using the following equation:
Because BLAST reports l in nats (or more precisely, nats per unit raw score) rather than bits, you must divide the nat score, S´, by ln(2) to convert it to a bit score. As you become a more critical user and interpreter of BLAST reports, you may find yourself converting nats to bits and raw scores to bit scores regularly. The following two Perl subroutines convert bits to nats, and vice versa:
sub nToB {
my $n = shift;
return $n/log(2); # converts nats to bits
}
sub bToN {
my $n = shift;
return $n*log(2); # converts bits to nats
}
Try using l and k tocalculate the bit score for the alignment shown in Figure 71. If you use the following Perl function and the values of kand l given in Table 71, you get a bit score of 70.9 for a raw score of 172.
sub rawScoreToBitScore {
my $raw_score = shift;
my $k = shift;
my $l = shift; #lambda in nats
# S'_{bits} = (l_{nats}S  ln (k))/ln(2)
return nToB($l*$raw_score  log($k));
}
7.1.3 The Expect of an HSP
Now calculate the Expect for the HSP shown in Figure 71, recalling that . Again, a simple Perl function is useful:
sub rawScoreToExpect {
my $raw_score = shift;
my $k = shift;
my $l = shift; # lambda in nats
my $m = shift; # effective length of query
my $n = shift; # effective length of database
# E = km'n'e^{}^{l}^{S}
return $k*$m*$n*exp(1*$l*$raw_score);
}
Using this function, the values of kand l, given in Table 71, combined with the values m´ (222) and n´ (7648142) that you calculated in your discussion of effective lengths, gives an Expect of 8e^{13} for the alignment shown in Figure 71.
You can also calculate the Expect of an alignment with a normalized score, S´ (Figure 71). The KarlinAltschul equation is formulated for the raw score, S, not the normalized score S´. To calculate an Expect using a normalized score S´ whose units are nats, use the equation E = m'n'e^{S'}. Note that k doesn't appear in this equation; it has already been accounted for when deriving the normalized (nat) score (e.g., ).
To calculate the Expect of an HSP from its bit score (Figure 71) use the Perl function shown next. The formula is similar to that used to calculate an Expect from a nat score. However, the base of the exponent is 2 rather than e because you're using bits rather than nats.
sub bitScoreToExpect {
my $bit_score = shift;
my $m = shift; # effective length query
my $n = shift; # effective length of database
# reformulated for bits
# E = m'n'2^{bit_score}
return $m*$n*2**(1*$bit_score);
}
7.1.4 The WUBLAST PValue
Another important difference between WUBLAST and NCBIBLAST is that WUBLAST reports a Pvalue as well as an Expect for an alignment. The two functions shown below will convert between these two related measures of statistical significance. Because P is equal to 1e^{E}, PE if either value is less than 0.001.
sub EtoP {
my $e = shift;
# P = 1  e^{E}
return 1exp(1*$e);
}
sub PtoE {
my $p = shift;
# E = ln(1  P)
return 1*log(1$p);
}
Some of the calculations discussed in this chapter don't apply to sum statistics. The sum score for a set of alignments isn't merely the sum of the raw scores for a set of HSPs. Likewise, the familiar equation isn't used when calculating the Expect of a sum score. Thus, the rawScoreToBitScore, rawScoreToExpect, and bitScoreToExpect functions must be modified for sum statistics.
7.1.5 Sum Statistics
Now that you've learned how BLAST calculates the Expect of an individual HSP, let's examine how BLAST assigns an Expect to a group of HSPs. Unlike the SmithWaterman algorithm, which finds the single maximum scoring alignment, BLAST finds multiple highscoring pairs. As a result, aligning two sequences often results in multiple HSPs. In some cases, BLAST groups several HSPs in a hit,^{[2]} recalculates, and reports their aggregate statistical significance in place of each HSP's individual Expect. The ubiquitous KarlinAltschul equation isn't used to calculate the aggregate statistical significance of a group of HSPs; instead, a related measure is used that employs a sum score..
^{[2]} Here hit means one or more HSPs. You'll encounter the word "hit" frequently in the BLAST literature and when using BLAST.
Many BLAST users are surprised to learn the BLAST employs not one, but two measures of statistical significance. This misconception is understandable, as little in a BLAST report alerts the casual user to this fact. In the default BLAST format, the only indication that sum statistics were applied to a set of HSPs is the presence of the Expect(n) (in an NCBI BLAST report) and the Sum P(n) (in a WUBLAST report). Figure 72 provides an overview of the procedure used by NCBIBLAST to derive an Expect(n), and the following section discusses each calculation in detail.
Figure 72. The essential calculations involved in deriving the aggregate Expect for a group of HSPs
7.1.6 An Expect(n) Means That Sum Statistics Were Applied
Unless you know what to look for, you probably won't notice that the output in Example 72 contains two HSPs that were grouped for statistical purposes. The reported Expect(2) for these two alignments is the Expect for their combined or sum score rather than their reported bit scores. As such, it doesn't refer to the actual statistical significance of either alignment's reported bit score.
Example 72. Two BLASTX HSPs to which sum statistics were applied
Score = 71.2 bits (173), Expect(2) = 1e15
Identities = 31/59 (52%), Positives = 44/59 (74%)
Frame = 1
Query: 24837 WLDFLYYCSYVKLTITIIKYVPQALMNYRRKSTSGWSIGNILLDFTGGTLSMLQMILNA 24661
WL + + +++ +T +KY+PQA MN+ RKST GWSIGNILLDFTGG + LQM++ +
Sbjct: 148 WLWLISIFNSIQVFMTCVKYIPQAKMNFTRKSTVGWSIGNILLDFTGGLANYLQMVIQS 206
Score = 38.5 bits (88), Expect(2) = 1e15
Identities = 15/34 (44%), Positives = 21/34 (61%)
Frame = 3
Query: 24595 DDWVSIFGDPTKFGLGLFSVLFDVFFMLQHYVFY 24494
+ W + +G+ K L L S+ FD+ FM QHYV Y
Sbjct: 210 NSWKNFYGNMGKTLLSLISIFFDILFMFQHYVLY 243
The reported Expect(2) for both alignments is identical, yet their reported raw scores are 173 and 88, respectively. Obviously, the reported Expect for these two alignments can't have been calculated using these raw scores. When BLAST applies sum statistics to a set of HSPs, it uses their sum score to calculate their combined Expect. Unfortunately, this score isn't available anywhere in the report. (For more information about sum scores, see Chapter 4). The sum score, however, can be calculated with the various parameters listed in Table 72. These values are taken from the same search as the BLASTX hit shown in Example 72. Normally, these values are located in the header and footer of a BLAST report.
Table 72. Parameters and their values required for calculating the aggregate statistical significance of HSPs 

Parameter 
Value 
l 
0.267 nats (gapped) 
k 
0.0410 nats (gapped) 
H 
0.140 nats/aligned residue 
m 
40206 (length of the query sequence) 
n 
270 (length of the subject sequence) 
Gap decay constant 
0.1 
Effective _db_length 
78368169 
Effective HSP length 
144 
7.1.7 Sum Statistics Are PairWise in Their Focus
Pairwise is a term to consider when thinking about sum statistics. Previous discussions of BLAST statistics involved formulations that are most intuitive in the context of database searches; for example, the n in the equation refers to the size of the database. Yes, a database may sometimes consist of a single sequence, but in most cases it won't. The published formulations for sum statistics, on the other hand, are pairwise in their focus; only after all the pairwise scores and significance values are calculated are adjustments made for database size. In the following discussion, for example, n refers to the actual length of the subject sequence of the alignment, no matter how many sequences make up the database.
7.1.8 The Sum Score
While neither WUBLAST nor NCBIBLAST reports the sum score for a group of HSPs anywhere in its output, this invisible number is the basic currency of sum statistics; thus, you should understand how it's calculated. Whether or not sum statistics are applied to a group of HSPs depends on the details of the alignments themselves. If the HSPs are ordered consistently with respect to the subject and query begin and end coordinates, BLAST calculates a sum score. If not, it reports the raw score and individual Expect for each HSP in the BLAST Hit. The following Perl function calculates the (pairwiseordered) sum score for a collection of r HSPs. (For more information about sum scores see Chapter 4.)
sub sumScore {
my $raw_scores = shift; # raw scores are in an array reference
my $k = shift;
my $l = shift;
my $m = shift; # effective length of query sequence
my $n = shift; # effective length of sbjct sequence
my $g = shift; # gap_size;for NCBIBLAST this value is 50
my $r = @{$raw_scores};
die "do not take sum for a single score!\n"
if $r == 1;
my $total_raw_score = 0;
foreach my $individual_raw_score (@{$raw_scores}){
$total_raw_score += $individual_raw_score;
}
my $n_score = $l*$total_raw_score;
return
$n_score  log($k*$m*$n)($r 1)*(log($k)+2*log($g))log(fac($r));
}
7.1.9 Effective Length of a BLASTX Query
To calculate a sum score, you need to calculate the effective length of the query sequence. Recall, however, that for BLASTX, the query is a nucleotide sequence, and yet a partial translation of that sequence is being aligned. What, then, is its effective length of m´, for purposes of sum score calculations? BLASTX considers the effective length of the nucleotide query sequence, m´, to be equal to its translated actual length (m/3) minus the expected HSP length. The following Perl function calculates the effective length of a BLASTX query:
sub effectiveLengthOfBlastxQuery {
my $m = shift; # actual nucleotide length of the query
my $exp = shift; # expected HSP length.
# m' = m/3expected_HSP_length
return $m/3  $exp;
}
Recall that calculating a sum score also requires you to calculate the effective length, n´, of the subject sequence. To do so, use the Perl function, effectiveLengthSeq, given earlier in the chapter, because it also applies to the subject sequence for purposes of calculating a sum score.
7.1.10 Calculating a Sum Score
If you look at Example 72, you'll see that these two BLASTX HSPs comprise an ordered set. In other words, these two alignments suggest that the query sequence contains a minus strand gene, at least two exons of which are homologous to the subject sequence. Because these two alignments comprise a consistently ordered set, you will calculate their pairwise ordered sum score. Using the Perl function sumScore that's in Section 7.1.8, the sum score for these two HSPs is about 53 nats, or 77 bits.
7.1.11 Calculating the PairWise Sum PValue
The sum score (77 bits) for these two HSPs isn't much more than that of the first HSP's individual bit score (71.2). Why, then, is the resulting Expect (2) for these two HSPs so low: Expect (2) = 1e^{15}, when the first HSP with only a slightly lower bit score has a much less significant individual Expect of 3.7e^{10}?
The reason for the discrepancy is that the familiar KarlinAltschul equation isn't used when calculating the Expect of a sum score. Sum statistics uses a very different formula. In fact, an Expect isn't calculated, but rather a pairwise sum Pvalue. A Perl function for calculating this value is shown next:
sub pairwiseSumP {
my $sumS = shift; # the sum score
my $r = shift; # number of HSPs being combined
return (exp(1*$sumS)*$sumS**fac($r1))/(fac($r)*fac($r 1));
}
sub fac {
my $r = shift;
my $fac;
for (my $i = $r; $i > 0; $i){
$fac = defined($fac) ? $fac*$i: $i;
}
$fac = 1 unless defined($fac);
return $fac;
}
Using this function, the pairwise sum P of a sum score of 53 nats is about 2e^{22}. That's a lot less than the reported Expect(2) of 1e^{15}. The discrepancy occurs because 2e^{22} isn't the Expect(2), but the pairwise Pvalue for these two alignments. You must perform two additional calculations using thepairwise Pvalue to derive the Expect(n). First, adjust the pairwise sum Pvalue for additional significance tests performed when identifying combinations of alignments whose sum P is more significant than any one of its member's individual significance. Second, convert the adjusted Pvalue into an Expect(n) by correcting for database size.
7.1.12 Correcting for Multiple Tests
BLAST will group a set of HSPs only if the Expect(n) of the group is less than the Expect of any individual member, and if that group is an optimal partition such that no other grouping might result is a lower Expect(n). Identifying these optimal groups is done internally by BLAST and requires testing many potential groups for statistical significance. You must make a correction for these tests. BLAST uses a testcorrection function that takes the number of HSPs in the group—the "n" of the Expect(n), rather than the actual number of comparisons made. A function for performing this correction on the pairwise sum Pvalue is shown next. In the function, r is simply the number of HSPs being grouped. Because you're dealing with an Expect(2) in Example 72, r = 2 (again, see Chapter 4 for more information).
sub rTestsCorrectedP {
my $r = shift;
my $sum_p = shift;
my $beta = shift; # gap decay constant
# P'_{r} = P_{r}/b^{r1}(1b)
return $sumP/($beta**($r1)*(1$beta));
}
b in the function above is the gap decay constant which, by default, is 0.1 for NCBIBLASTX. Applying this function to the pairwise sum Pvalue gives a testcorrected value of 2.2e^{21}. This value isn't very different from the original, as when r = 2, so is equal to 1/11. Notice, however, that for values of r > 5, this correction becomes much less trivial.
7.1.13 Correcting for Database Size
Converting the (testcorrected) pairwise sum Pvalue to a databasesize corrected Expect is the final step in calculating an Expect(n). How best to do this isn't an axiomatic issue, but a practical one. Chapter 4 discusses some of the issues surrounding database size correction in more detail. NCBIBLASTX applies a size correction that assumes the number of HSPs are proportional to the length of the subject sequence.
sub dbSizeCorrectedExpect {
my $sumP = shift; # test corrected sumP
my $effective_db_length_db = shift;
my $sbjct_seq_length = shift; # actual length
# = (effective_db_length_db/n)P
return ($effective_db_length_db/$sbjct_seq_length)*$sumP;
}
Using this function, the testcorrected, pairwise sum Pvalue of 2.2e^{21} gives an Expect(2) of 7e^{16}, a fairly close match to the reported value (in Example 72) of 1e^{15}. The difference between the two values can be attributed to rounding and floatingpoint error.
7.1.14 Frame and SizeCorrected Expects
BLASTX translates all six frames of the query and uses these translations to search the protein database. At first, you might think that correcting for six frames would entail multiplying the final databasesizecorrected Expect by six, but that isn't the case. Neither version of BLASTX applies such a correction. In fact, WUBLASTX posts a notice in its header stating that it doesn't apply the correction (illustrated in Example 73). Like NCBIBLAST, WUBLAST searches all six frames by default, but assumes that only one frame was searched when calculating Expects. The reason why is that BLAST statistics assume open reading frames don't overlap. Accordingly, BLASTX statistics assume that the query contains only a single ORF whose frame may change from HSP to HSP.
Example 73. The header from a WUBLASTX report
BLASTX 2.0MPWashU [09Nov2000] [linuxi686 19:13:41 11Nov2000]
Copyright (C) 19962000 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.
Reference: Gish, W. (19962000) http://blast.wustl.edu
Gish, Warren and David J. States (1993). Identification of protein coding
regions by database similarity search. Nat. Genet. 3:26672.
Notice: statistical significance is estimated under the assumption that the
equivalent of one entire reading frame in the query sequence codes for protein
and that significant alignments will involve only coding reading frames.
Query= 3R 3R.3 [18846615 18886821] flank:20000 length:40206
(40,206 letters)
Translating both strands of query sequence in all 6 reading frames
Database: rscu.fsa
386,401 sequences; 134,009,913 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done
So far in this chapter, we've just walked through most basic operations of KarlinAltschul statistics to provide you with the knowledge necessary to calculate bit scores, effective lengths, and Expects. We've explained that BLAST uses one statistical measure to calculate the Expect of an HSP and another to calculate the aggregate Expect of a group of HSPs. Hopefully, you've gained a better understanding of how all of these operations of fit into the larger picture of KarlinAltschul statistics.
You have also seen that it's possible to use KarlinAltschul statistics to recover statistical measures that are calculated by BLAST internally, but not included in the report—principally, sum scores and the individual Expect for an HSP for which an Expect(n) has been reported. Learning to calculate these values is the first step toward becoming a power user of BLAST statistics. The remaining sections of this chapter will show you how to use what you've learned to deal with critical questions about BLAST results.
7.2 Using Statistics to Understand BLAST Results
KarlinAltschul statistics is much more than a way to determine the statistical significance of a sequence alignment in the context of a database search. It also provides a framework with which to probe the complex relationships that exist between BLAST parameters and results. Using KarlinAltschul statistics to ask and answer questions about a BLAST search is much like using stoichiometry at the lab bench; it doesn't require theoretical savvy, just a little algebra. It's also useful; you no longer need to be frustrated when confronted with an inexplicable BLAST result.
Now let's look at a practical application of KarlinAltschul statistics: using BLASTN to map a PCR primer to a genome. The application is a simple but striking example of how to use KarlinAltschul statistics to understand the way parameter choice determines BLAST results. Finally, KarlinAltschul statistics reveal much about BLASTN's strengths and weaknesses and its potential as a tool to detect the conserved, cisregulatory regions of genes.
7.3 Where Did My Oligo Go?
First, try to identify the position of the following oligonucleotide in the Drosophila melanogaster genome using WUBLASTN with its default parameters:
TACATCCGGCACTTAGCCGGGCTCG
Example 74 shows that the oligo isn't found in the Drosophila melanogaster genome that uses WUBLASTN with default parameters.
Example 74. The oligo isn't found
Reference: Gish, W. (19962000) http://blast.wustl.edu
Notice: this program and its default parameter settings are optimized to find
nearly identical sequences rapidly. To identify weak similarities encoded in
nucleic acid, use BLASTX, TBLASTN or TBLASTX.
Query= oligo
(25 letters)
Database: na_wholegenome_genomic_dmel_RELEASE3.FASTA
7 sequences; 124,181,667 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done
Smallest
Sum
High Probability
Sequences producing Highscoring Segment Pairs: Score P(N) N
*** NONE ***
There are, of course, many reasons why you might not be able to identify an oligo in the Drosophila melanogaster genome. First, the oligo might contain repetitive sequence and thus be masked out. However, because WUBLAST doesn't mask by default, that can't be the reason. Second, the assembled genome may be incomplete. Every sequenced genome to date is incomplete to some degree. In fact, a 99 percent complete 124mb genome is still missing 1.24 megabases of a euchromatic (nonrepetitive DNA) sequence, leaving plenty of space for an oligo to go missing in. The incompleteness of the genome is a possible explanation for our WUBLAST result, but is it the correct one? Before concluding that the oligo falls into a sequencing gap, let's try to run NCBIBLASTN with its default parameters. Aha! The NCBIBLASTN results in Example 75 show that the oligo is present in the Drosophila melanogastergenome and the HSP is assigned a significant Expect.
Example 75. Using NCBIBLASTN to find the oligo
Sequences producing significant alignments: (bits) Value
2R 2R.3 assembled 23112001 50 1e06
X X release:2 length:21666217bp Assembled X chromosome arm seque... 32 0.25
3R 3R.3 32 0.25
U GenomicInterval:U 30 0.99
3L 3L.3 v.3e 23351213bp BCM HGSC guide:3lmtpeval.08apr02 28 3.9
2L 2L release:3 length:22217931bp Assembled 2L chromosome arm se... 28 3.9
>2R 2R.3 assembled 23112001
Length = 20302755
Score = 50.1 bits (25), Expect = 1e06
Identities = 25/25 (100%)
Strand = Plus / Plus
Query: 1 tacatccggcacttagccgggctcg 25

Sbjct: 16190927 tacatccggcacttagccgggctcg 16190951
Results like these frustrate a lot of BLAST users. Why does NCBIBLAST find the oligo when WUBLAST doesn't? The results may seem contradictory, but they make perfect sense, and understanding why this is so will help you use KarlinAltschul statistics to ask questions about your own BLAST results.
7.3.1 KarlinAltschul Statistics as a Tool for Further Investigation
Parameter choice seems a likely explanation for the results shown in Examples Example 74 and Example 75. If you assume the failure of WUBLASTN to report the alignment isn't due to a bug, maybe the hit wasn't significant in the context of the current search. By now you've been exposed to enough KarlinAltschul statistics to know that BLAST parameters determine the significance of an alignment. The scoring scheme used in a search is the fundamental BLAST parameter, so you should begin your investigation there.
WUBLASTN and NCBIBLASTN have very different default scoring schemes. WUBLASTN uses a +5/4 scheme to score alignments by default, whereas NCBIBLASTN uses a +1/3 scheme. One central theorem of KarlinAltschul statistics is that every scoring scheme is implicitly a logodds scoring scheme, and every logodds scoring scheme implies a target frequency. This insight is significant because it means that a scoring scheme always hunts for alignments having a particular percent identity. One great thing about KarlinAltschul statistics is they enable you to calculate that percent identity.
Chapter 4 provides a Perl script called Qcalc for calculating various nucleotide scoring scheme target frequencies; Table 41 summarizes the l and percent identity implied by some common scoring schemes. The default +5/4 scoring scheme used by WUBLAST implies an ungapped target frequency of 65 percent identity, whereas the +1/3 default scheme used by NCBIBLASTN looks for alignments with 99 percent identity. But why would a +5/4 scoring scheme miss a real alignment with 100 percent identity? The answer to this question lies in the value of l and the role played by gap penalties.
Table 73 gives 0.104 nats for the value of gapped l associated with the WUBLASTN default scoring scheme and gap penalties. On the other hand, the l associated with the NCBIBLASTN default scoring scheme and gap penalties is 1.37 nats (Table 74). The difference between the two values is tenfold. To see how the value of l impacts your search results, use KarlinAltschul statistics to delve more deeply into the relationship between raw scores, normalized (bit and nat) scores, and l.
Table 73. Selected WUBLASTN parameters and values from the search shown in Example 75 

Parameter 
Value 
l 
0.104 nats (gapped) 
k 
0.0151 nats (gapped) 
H 
0.0600 nats/aligned residue 
m 
25 (length of the query sequence) 
n 
124,181,667(number of letters in the database) 
Number of sequences in database 
7 
Table 74. Selected NCBIBLASTN parameters and values from the search shown in Example 75 

Parameter 
Value 
l 
1.37 nats (gapped) 
k 
0.711 nats (gapped) 
H 
1.31 nats/aligned residue 
m 
25 (length of the query sequence) 
n 
124,181,667 (number of letters in the database searched) 
Number of sequences in database 
7 
First, see how the value of l impacts the expected HSP length. Recall that this value is the length of an alignment that has an Expect=1; more precisely, it's the expected HSP length for E equal to 1 associated with a scoring matrix for a search space of size m´*n´. This value is reported in the footer of NCBIBLAST reports, where it's called "Effective HSP length." However, it's absent from the WUBLAST footer, so you should calculate it with the Perl function expectedHSPlength and the information from Tables Table 73 and Table 74.
Please note that the following function shows the traditional way to calculate the expected HSP length. Recent work by Altschul and colleagues suggests that this function overestimates the effective HSP length for short sequences, and the manner in which it's calculated may change in the future. Therefore, use the value reported in the BLAST report footer, if it's available. For the purposes here, though, this function is fine.
sub expectedHSPlength{
my $k = shift;
my $m = shift; # actual length of query
my $n = shift; # actual length of query
my $h = shift; # average nats/aligned pair
# l = ln(kmn)/H
return log($k*$m*$n)/$h;
}
expectedHSPlength returns an expected HSPlength of about 16 nucleotides for the NCBI defaults. The expected length of the WUBLASTN HSP with Expect = 1, however, is higher—about 294 nucleotides. That's a big difference. Once again, the reason for the difference lies with the scoring matrix. Recall that the implied target frequency for the NCBIdefault +1/3 scoring scheme was 99 percent, but it was 65% for the WUBLAST defaults (see Table 41). This is why the effective HSP length for the WUBLAST search is so much longer. The hypothetical 294nucleotide alignment is expected to have a percent identity of less than 65 percent. In other words, taking into account mismatches and gaps, it needs to be 294 bases long to attain a raw score sufficient to generate an Expect of 1. Thus, the WUBLAST defaults implicitly assume that nucleotide homologies will have low identity (<65%), but be long—294 nucleotides in the context of the Drosophila melanogaster genome. Is this biological assumption valid? Yes and no.
The WUBLASTN defaults are well suited for detecting long regions of low identity such as poorly conserved exons. On the other hand, the NCBIBLAST parameters are suitable for finding shorter but nearly identical sequences. Both sets of default parameters will likely fail to detect other kinds of homology, especially short, conserved sequences such as cisregulatory elements, which tend to be highly conserved and are often less than 10 nucleotides long.
Just how short can an HSP be and still generate a significant hit using WUBLASTN defaults? Again, KarlinAltschul statistics provide a basis for answering this question. First you need to know what raw score corresponds to an Expect of 1. The following Perl function calculates this value:
sub rawScoreOfExpectOne {
my $k = shift;
my $m = shift; # actual length of query
my $n = shift; # actual length of database
my $l = shift; # gapped lambda in nats
# S_{E=1} = ln(kmn)/l
return log($k*$m*$n)/$l;
}
For the ~124 megabase Drosophila melanogaster genome, that raw score is about 15 with the NCBIBLASTN defaults and about 170 for WUBLASTN. Recall that the maximum score for aligning an oligonucleotide 25 bases long under the +5/4 scheme is 125 (25*5). Even an alignment with an Expect of 1 has a raw score (170) greater than the maximal attainable score for a 25mer under the WUBLAST defaults! This is another reason why WUBLASTN didn't report a hit.
To determine the length of an ungapped alignment that has 100 percent identity for a given raw score, divide the raw score by the match score. By this calculation, any oligo shorter than about 15 nucleotides (15/1) for NCBIBLASTN and 34 nucleotides (170/5) for WUBLASTN will have an Expect > 1. This means that the NCBIBLASTN defaults are fine for mapping oligonucleotides to the Drosophila melanogaster genome. On the other hand, it appears that looking for short—less than 15 basepair—cisregulatory elements using either version of BLASTN with the default parameters is unlikely to be successful.
So what was the unreported WUBLASTN Expect? Let's calculate it. With the data in Table 73 and the previously calculated effective HSP length of 294, first calculate m´ and n´ using the Perl functions effectiveLengthSeq and effectiveLengthDB. Plugging m´ and n´ together with the WUBLASTN l and k and a raw score of 125 into the rawScoreToExpectfunction gives an Expect of 281. Recall that the NCBIBLASTN Expect was 1e^{6}. That's a 281millionfold difference. BLAST is clearly parametersensitive! Using the default parameters, you instructed NCBIBLASTN to search for short highly conserved regions, and it found one. WUBLASTN, on the other hand, is parameterized to look for large regions of relatively low percent identity. This would be fine for crossspecies searches of poorly conserved exons but is inappropriate for finding oligos.
Using BLAST intelligently requires using the correct parameters for the task at hand and not placing too much faith in the reported Expect. See the section on BLAST protocols in Chapter 9 for practical suggestions on BLAST parameter choice. Remember, you get what you look for.
7.3.2 What It All Means
You now know how bit scores, sum scores, Expects, and Pvalues are calculated. You've also seen firsthand that scoring matrices and target frequencies aren't merely theoretical abstractions but realities that determine the outcome of a BLAST search. In some ways, choosing the right scoring scheme for a BLAST search is like choosing the right pair of eyeglasses. If your scoring scheme is too stringent, BLAST becomes nearsighted and will miss distant homologies. If your scheme is too lenient, BLAST becomes farsighted and fails to detect the obvious. Unfortunately, there's no optimal scoring scheme. As in real life, sometimes the best you can do is put on bifocals.
You've also seen that searching the same sequence and database with varied parameters can result in different alignments having very different Expects. Scores and Evalues aren't implicit in a sequence or an alignment; they are solely contingent upon parameter values and the methods used to assess significance. There is nothing absolute about a BLAST significance value; it merely denotes the significance of an alignment in the context of a given search. Like everything else in bioinformatics, the biological implications of a (significant) alignment are inferred by the user and should be tested experimentally, if possible.
Hopefully, you've also learned that there is more to KarlinAltschul statistics than simply calculating an Expect for an alignment. KarlinAltschul statistics provide a theoretical framework from which to interpret alignment scores in the context of parameter choice. They also give you the means to tune BLAST for specific purposes. Without them, you'd have no way of knowing what a given scoring scheme was looking for, and you'd cast around in the dark for the right set of parameters. KarlinAltschul statistics remove the mystery from parameter choice. BLAST certainly has its limitations, but thanks to its statistical foundation, at least you know what you're looking for.