﻿ A BLAST Statistics Tutorial - BLAST (Basic Local Alignment Search Tool)

## BLAST (Basic Local Alignment Search Tool)

### Chapter 7. A BLAST Statistics Tutorial

BLAST statistics are everywhere in biology today. In fact, it's hard to find a molecular-biology paper, grant proposal, patent application, or biotech business plan that doesn't refer to the Expect or P-value of a BLAST result. The BLAST Expect has so permeated biological thinking in recent years that for many scientists it has become synonymous with biological truth. Tell a colleague that you've just cloned a gene that's homologous to something trendy, and odds are that he will ask what the Expect of its alignment was in a BLAST search.

Of course, what some see as a sweeping change, others label a fad. While some researchers consider BLAST statistics a welcome injection of mathematical rigor into the biological world, others lament the abandonment of biological insight for faith in a number. No matter where you stand on this issue, there is no avoiding the reality of BLAST statistics in today's bioinformatics workplace. Understanding what the numbers in a BLAST report mean and how they are derived isn't just for mathematicians; it's a real-world survival skill for biologists and bioinformatics professionals in academia and industry alike.

The material covered in this chapter is practical rather than theoretical in nature. Chapter 4 summarized some of the theory behind local alignment statistics. Read that chapter to learn more about the basic parameters of BLAST: l, k, and H. This chapter shows how to calculate the numbers in a BLAST report and use this knowledge to better understand the results of a BLAST search. 7.1 Basic BLAST Statistics

Primary BLAST literature doesn't focus on the arithmetic involved in calculating an Expect. Understandably, the papers discuss the theoretical underpinnings of Karlin-Altschul statistics, leaving BLAST users without the kinds of examples found in a text book. This chapter provides this missing information.

Figure 7-1 summarizes the basic operations involved in converting the raw score of a high-scoring pair into an Expect, or P-value. The following discussion explains how to perform each calculation and any accessory operations required to derive an Expect.

Figure 7-1. The essential calculations involved in converting an HSP's raw score into a probability Calculating Karlin-Altschul statistics isn't easily done with a pencil, paper, and calculator. Applying Karlin-Altschul statistics to an HSP and avoiding a migraine requires code. We use simple Perl subroutines to take the grind out of the calculations and help clarify the equations. Hopefully these subroutines will also provide a basis for further investigations of your own BLAST results. If you'd like to add these subroutines to your codebase, you can download them from this book's web site; the module name is BlastStats.pm.

Example 7-1 shows a sample HSP from a BLASTP search. Use this alignment as a reference to check your calculations. For our present purposes, the details of the search aren't important, but the values given in the header and footer of the report are. The header and footer are omitted from the search that generated this alignment.

 If you plan to perform the calculations described in this chapter and have searched the NCBI web site (http://www.ncbi.nlm.nih.gov/BLAST), make sure you don't select the composition-based statistics in the Advanced Options checkbox because the values of the composition-based lambdas aren't given in the report. Currently, it's selected by default for BLASTP, and you unselect it by clicking on it.

Example 7-1. A sample HSP from a BLASTP search

>CT12121 translation from_gene[CG3603 CG3603 FBgn0029648]

seq_release:2  mol_weight=18568

cds_boundaries:(X:2,699,992..2,700,667[+]) aa_length:176

transcript_info:[CT12121  seq_release:2] gene_info:[gene

symbol:CG3603 FBgn0029648

gene_boundaries:(X:2,699,935..2,700,670[+]) (GO:0016614

"oxidoreductase, acting on CH-OH group of donors")]

Length = 176

Score = 70.9 bits (172), Expect = 8e-13

Identities = 49/170 (28%), Positives = 85/170 (49%), Gaps = 6/170 (3%)

Query: 50  IAGEVAVVTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAKAYKA 109

+AG+VA+VTGAG G+GRA    LA+ G  +  VD N+  A++TV   Q++   R+ A +

Sbjct: 6   LAGKVALVTGAGSGIGRATCRLLARDGAKVIAVDRNLKAAQETV---QELGSERSAALEV 62

+V++   +    ++ +++     T++VN+AG+     +      D   +  VNL   F

Sbjct: 63  DVSSAQSVQFSVAEALKKFQQAPTIVVNSAGITRDGYLLKMPERDYDDVYGVNLKGTFLV 122

Query: 169 KLVFLPKM--KELRKGFIVTISSLAGVFPLPYSATYTTTKSGALAHMRTL 216

+   M  ++L  G IV +SS+   A Y  TK+G ++    L

Sbjct: 123 TQAYAKAMIEQKLENGTIVNLSSIVAKMNNVGQANYAATKAGVISFTERL 172

 Table 7-1. The parameters and their values required for Karlin-Altschul statistical calculations Parameter Value l 0.267 nats (gapped) k 0.0410 nats (gapped) H 0.140 nats/aligned residue m 321 (length of the query sequence) n 9418064 (number of letters in the database searched) Effective HSP length 99 Number of sequences in database 17878

7.1.1 Actual Versus Effective Lengths

The Karlin-Altschul equation is probably the most recognized equation in bioinformatics. It states that the number of alignments expected by chance (E) during a sequence database search is a function of the size of the search space (m*n), the normalized score (lS), and a minor constant (k). (For more information about the Karlin-Altschul equation and its associated parameters, see Chapter 4). The first step toward understanding how to use the Karlin-Altschul equation is to understand what values for m and n BLAST uses when calculating an Expect. Many users assume that the m and n in the equation refer to the length of the query and subject sequences; this isn't the case. In practice, you use the effective lengths rather than actual lengths of the query and database when calculating the Expect of an HSP.

The terms used in the BLAST literature to denote the actualand effective lengths of the query, subject, and database can be confusing. Sometimes m and n denote actual lengths, and sometimes they denote the effective lengths. To avoid confusion, this chapter uses m and n to denote actual lengths and  and  to denote effective lengths.

The effective length of the query, , is its actual length minus the expected HSP length. The effective length of the database, , is the sum of the effective length of every sequence within it. In other words, if the sequence database used in a BLAST search contains only a single sequence, the actual length of that database, n, is the length of that sequence; the effective length, , of that database is the effective length of the single sequence it contains. If the database contains two sequences, its actual length is the sum of the lengths of those two sequences; the effective length of that database is the sum of those two sequences' effective lengths.

Calculating  and  requires knowing the expected HSP length, which is the length of an HSP that has an Expect of 1. You can find this value in the NCBI-BLAST report footer (where it's called, confusingly, effective HSP length). You can also calculate it, as you'll learn later in this chapter. For now, though, use the value from Table 7-1. Here are two Perl functions used to calculate  and :

sub effectiveLengthSeq {

my \$m      = shift; # actual length sequence

my \$expected_HSP_length = shift;

my \$k      = shift; # gapped k

my \$m_prime = \$m - \$expected_HSP_length;

if (\$m_prime < 1/\$k){

return 1/\$k;

}

else {

return \$m_prime;

}

}

sub effectiveLengthDb {

my \$n       = shift; # actual length

my \$expected_HSP_length  = shift;

my \$num_seqs_in_db       = shift;

my \$k       = shift; # gapped k

my \$n_prime = \$n - (\$num_seqs_in_db*\$expected_HSP_length);

if (\$n_prime < 1/\$k){

return 1/\$k;

}

else {

return \$n_prime;

}

}

Notice that no effective length of the query or the database can ever be less than 1/k. Setting an effective length to 1/k basically amounts to ignoring a short sequence for statistical purposes; in cases when both m and n are less than 1/k, BLAST searches are ill-advised. For more information about effective lengths and short sequences, see Chapter 4.

Using these functions with the information provided in Table 7-1 gives a value of 222 and 7648142 for  and , respectively.

7.1.2 The Raw Score and Bit Score

BLAST reports two scores: the raw score and a normalized score called abit score. The raw scoreS, is the sum of the alignment's pair-wise scores using a specific scoring matrix (see Chapter 4 for more information). When deriving the bit score for an HSP, first convert its raw score to a nat score using the following equation: Because BLAST reports l in nats (or more precisely, nats per unit raw score) rather than bits, you must divide the nat score, by ln(2) to convert it to a bit score. As you become a more critical user and interpreter of BLAST reports, you may find yourself converting nats to bits and raw scores to bit scores regularly. The following two Perl subroutines convert bits to nats, and vice versa:

sub nToB {

my \$n = shift;

return \$n/log(2); # converts nats to bits

}

sub bToN {

my \$n = shift;

return \$n*log(2); # converts bits to nats

}

Try using l and k tocalculate the bit score for the alignment shown in Figure 7-1. If you use the following Perl function and the values of kand l given in Table 7-1, you get a bit score of 70.9 for a raw score of 172.

sub rawScoreToBitScore {

my \$raw_score = shift;

my \$k   = shift;

my \$l   = shift; #lambda in nats

# S'bits = (lnatsS - ln (k))/ln(2)

return nToB(\$l*\$raw_score - log(\$k));

}

7.1.3 The Expect of an HSP

Now calculate the Expect for the HSP shown in Figure 7-1, recalling that . Again, a simple Perl function is useful:

sub rawScoreToExpect {

my \$raw_score = shift;

my \$k   = shift;

my \$l   = shift; # lambda in nats

my \$m   = shift; # effective length of query

my \$n   = shift; # effective length of database

# E = km'n'e-lS

return \$k*\$m*\$n*exp(-1*\$l*\$raw_score);

}

Using this function, the values of kand l, given in Table 7-1, combined with the values  (222) and  (7648142) that you calculated in your discussion of effective lengths, gives an Expect of 8e-13 for the alignment shown in Figure 7-1.

You can also calculate the Expect of an alignment with a normalized score,  (Figure 7-1). The Karlin-Altschul equation is formulated for the raw score, S, not the normalized score . To calculate an Expect using a normalized score  whose units are nats, use the equation E = m'n'e-S'. Note that k doesn't appear in this equation; it has already been accounted for when deriving the normalized (nat) score (e.g., ).

To calculate the Expect of an HSP from its bit score (Figure 7-1) use the Perl function shown next. The formula is similar to that used to calculate an Expect from a nat score. However, the base of the exponent is 2 rather than e because you're using bits rather than nats.

sub bitScoreToExpect {

my \$bit_score = shift;

my \$m   = shift; # effective length query

my \$n   = shift; # effective length of database

# reformulated for bits

# E = m'n'2-bit_score

return \$m*\$n*2**(-1*\$bit_score);

}

7.1.4 The WU-BLAST P-Value

Another important difference between WU-BLAST and NCBI-BLAST is that WU-BLAST reports a P-value as well as an Expect for an alignment. The two functions shown below will convert between these two related measures of statistical significance. Because P is equal to 1-e-EP E if either value is less than 0.001.

sub EtoP {

my \$e = shift;

# P = 1 - e-E

return 1-exp(-1*\$e);

}

sub PtoE {

my \$p = shift;

# E = -ln(1 - P)

return -1*log(1-\$p);

}

Some of the calculations discussed in this chapter don't apply to sum statistics. The sum score for a set of alignments isn't merely the sum of the raw scores for a set of HSPs. Likewise, the familiar equation isn't used when calculating the Expect of a sum score. Thus, the rawScoreToBitScore, rawScoreToExpect, and bitScoreToExpect functions must be modified for sum statistics.

7.1.5 Sum Statistics

Now that you've learned how BLAST calculates the Expect of an individual HSP, let's examine how BLAST assigns an Expect to a group of HSPs. Unlike the Smith-Waterman algorithm, which finds the single maximum scoring alignment, BLAST finds multiple high-scoring pairs. As a result, aligning two sequences often results in multiple HSPs. In some cases, BLAST groups several HSPs in a hit, recalculates, and reports their aggregate statistical significance in place of each HSP's individual Expect. The ubiquitous Karlin-Altschul equation isn't used to calculate the aggregate statistical significance of a group of HSPs; instead, a related measure is used that employs a sum score..

 Here hit means one or more HSPs. You'll encounter the word "hit" frequently in the BLAST literature and when using BLAST.

Many BLAST users are surprised to learn the BLAST employs not one, but two measures of statistical significance. This misconception is understandable, as little in a BLAST report alerts the casual user to this fact. In the default BLAST format, the only indication that sum statistics were applied to a set of HSPs is the presence of the Expect(n) (in an NCBI BLAST report) and the Sum P(n) (in a WU-BLAST report). Figure 7-2 provides an overview of the procedure used by NCBI-BLAST to derive an Expect(n), and the following section discusses each calculation in detail.

Figure 7-2. The essential calculations involved in deriving the aggregate Expect for a group of HSPs 7.1.6 An Expect(n) Means That Sum Statistics Were Applied

Unless you know what to look for, you probably won't notice that the output in Example 7-2 contains two HSPs that were grouped for statistical purposes. The reported Expect(2) for these two alignments is the Expect for their combined or sum score rather than their reported bit scores. As such, it doesn't refer to the actual statistical significance of either alignment's reported bit score.

Example 7-2. Two BLASTX HSPs to which sum statistics were applied

Score = 71.2 bits (173), Expect(2) = 1e-15

Identities = 31/59 (52%), Positives = 44/59 (74%)

Frame = -1

Query: 24837 WLDFLYYCSYVKLTITIIKYVPQALMNYRRKSTSGWSIGNILLDFTGGTLSMLQMILNA 24661

WL  +   + +++ +T +KY+PQA MN+ RKST GWSIGNILLDFTGG  + LQM++ +

Sbjct: 148   WLWLISIFNSIQVFMTCVKYIPQAKMNFTRKSTVGWSIGNILLDFTGGLANYLQMVIQS 206

Score = 38.5 bits (88), Expect(2) = 1e-15

Identities = 15/34 (44%), Positives = 21/34 (61%)

Frame = -3

Query: 24595 DDWVSIFGDPTKFGLGLFSVLFDVFFMLQHYVFY 24494

+ W + +G+  K  L L S+ FD+ FM QHYV Y

Sbjct: 210   NSWKNFYGNMGKTLLSLISIFFDILFMFQHYVLY 243

The reported Expect(2) for both alignments is identical, yet their reported raw scores are 173 and 88, respectively. Obviously, the reported Expect for these two alignments can't have been calculated using these raw scores. When BLAST applies sum statistics to a set of HSPs, it uses their sum score to calculate their combined Expect. Unfortunately, this score isn't available anywhere in the report. (For more information about sum scores, see Chapter 4). The sum score, however, can be calculated with the various parameters listed in Table 7-2. These values are taken from the same search as the BLASTX hit shown in Example 7-2. Normally, these values are located in the header and footer of a BLAST report.

 Table 7-2. Parameters and their values required for calculating the aggregate statistical significance of HSPs Parameter Value l 0.267 nats (gapped) k 0.0410 nats (gapped) H 0.140 nats/aligned residue m 40206 (length of the query sequence) n 270 (length of the subject sequence) Gap decay constant 0.1 Effective _db_length 78368169 Effective HSP length 144

7.1.7 Sum Statistics Are Pair-Wise in Their Focus

Pair-wise is a term to consider when thinking about sum statistics. Previous discussions of BLAST statistics involved formulations that are most intuitive in the context of database searches; for example, the n in the equation refers to the size of the database. Yes, a database may sometimes consist of a single sequence, but in most cases it won't. The published formulations for sum statistics, on the other hand, are pair-wise in their focus; only after all the pair-wise scores and significance values are calculated are adjustments made for database size. In the following discussion, for example, n refers to the actual length of the subject sequence of the alignment, no matter how many sequences make up the database.

7.1.8 The Sum Score

While neither WU-BLAST nor NCBI-BLAST reports the sum score for a group of HSPs anywhere in its output, this invisible number is the basic currency of sum statistics; thus, you should understand how it's calculated. Whether or not sum statistics are applied to a group of HSPs depends on the details of the alignments themselves. If the HSPs are ordered consistently with respect to the subject and query begin and end coordinates, BLAST calculates a sum score. If not, it reports the raw score and individual Expect for each HSP in the BLAST Hit. The following Perl function calculates the (pair-wise-ordered) sum score for a collection of r HSPs. (For more information about sum scores see Chapter 4.)

sub sumScore {

my \$raw_scores = shift; # raw scores are in an array reference

my \$k    = shift;

my \$l    = shift;

my \$m    = shift; # effective length of query sequence

my \$n    = shift; # effective length of sbjct sequence

my \$g    = shift; # gap_size;for NCBI-BLAST this value is 50

my \$r = @{\$raw_scores};

die "do not take sum for a single score!\n"

if \$r == 1;

my \$total_raw_score = 0;

foreach my \$individual_raw_score (@{\$raw_scores}){

\$total_raw_score += \$individual_raw_score;

} my \$n_score = \$l*\$total_raw_score;

return

\$n_score - log(\$k*\$m*\$n)-(\$r -1)*(log(\$k)+2*log(\$g))-log(fac(\$r));

}

7.1.9 Effective Length of a BLASTX Query

To calculate a sum score, you need to calculate the effective length of the query sequence. Recall, however, that for BLASTX, the query is a nucleotide sequence, and yet a partial translation of that sequence is being aligned. What, then, is its effective length of , for purposes of sum score calculations? BLASTX considers the effective length of the nucleotide query sequence, , to be equal to its translated actual length (m/3) minus the expected HSP length. The following Perl function calculates the effective length of a BLASTX query:

sub effectiveLengthOfBlastxQuery {

my \$m     = shift;  # actual nucleotide length of the query

my \$exp   = shift;  # expected HSP length.

# m' = m/3-expected_HSP_length

return \$m/3 - \$exp;

}

Recall that calculating a sum score also requires you to calculate the effective length, , of the subject sequence. To do so, use the Perl function, effectiveLengthSeq, given earlier in the chapter, because it also applies to the subject sequence for purposes of calculating a sum score.

7.1.10 Calculating a Sum Score

If you look at Example 7-2, you'll see that these two BLASTX HSPs comprise an ordered set. In other words, these two alignments suggest that the query sequence contains a minus strand gene, at least two exons of which are homologous to the subject sequence. Because these two alignments comprise a consistently ordered set, you will calculate their pair-wise ordered sum score. Using the Perl function sumScore that's in Section 7.1.8, the sum score for these two HSPs is about 53 nats, or 77 bits.

7.1.11 Calculating the Pair-Wise Sum P-Value

The sum score (77 bits) for these two HSPs isn't much more than that of the first HSP's individual bit score (71.2). Why, then, is the resulting Expect (2) for these two HSPs so low: Expect (2) = 1e-15, when the first HSP with only a slightly lower bit score has a much less significant individual Expect of 3.7e-10?

The reason for the discrepancy is that the familiar Karlin-Altschul equation isn't used when calculating the Expect of a sum score. Sum statistics uses a very different formula. In fact, an Expect isn't calculated, but rather a pair-wise sum P-value. A Perl function for calculating this value is shown next:

sub pairwiseSumP {

my \$sumS = shift; # the sum score

my \$r    = shift; # number of HSPs being combined return (exp(-1*\$sumS)*\$sumS**fac(\$r-1))/(fac(\$r)*fac(\$r -1));

}

sub fac {

my \$r = shift;

my \$fac;

for (my \$i = \$r; \$i > 0; \$i--){

\$fac = defined(\$fac) ? \$fac*\$i: \$i;

}

\$fac = 1 unless defined(\$fac);

return \$fac;

}

Using this function, the pair-wise sum P of a sum score of 53 nats is about 2e-22. That's a lot less than the reported Expect(2) of 1e-15. The discrepancy occurs because 2e-22 isn't the Expect(2), but the pair-wise P-value for these two alignments. You must perform two additional calculations using thepair-wise P-value to derive the Expect(n). First, adjust the pair-wise sum P-value for additional significance tests performed when identifying combinations of alignments whose sum P is more significant than any one of its member's individual significance. Second, convert the adjusted P-value into an Expect(n) by correcting for database size.

7.1.12 Correcting for Multiple Tests

BLAST will group a set of HSPs only if the Expect(n) of the group is less than the Expect of any individual member, and if that group is an optimal partition such that no other grouping might result is a lower Expect(n). Identifying these optimal groups is done internally by BLAST and requires testing many potential groups for statistical significance. You must make a correction for these tests. BLAST uses a test-correction function that takes the number of HSPs in the group—the "n" of the Expect(n), rather than the actual number of comparisons made. A function for performing this correction on the pair-wise sum P-value is shown next. In the function, r is simply the number of HSPs being grouped. Because you're dealing with an Expect(2) in Example 7-2r = 2 (again, see Chapter 4 for more information).

sub rTestsCorrectedP {

my \$r     = shift;

my \$sum_p = shift;

my \$beta  = shift; # gap decay constant

# P'r = Pr/br-1(1-b)

return \$sumP/(\$beta**(\$r-1)*(1-\$beta));

}

b in the function above is the gap decay constant which, by default, is 0.1 for NCBI-BLASTX. Applying this function to the pair-wise sum P-value gives a test-corrected value of 2.2e-21. This value isn't very different from the original, as when r = 2, so is equal to 1/11. Notice, however, that for values of r > 5, this correction becomes much less trivial.

7.1.13 Correcting for Database Size

Converting the (test-corrected) pair-wise sum P-value to a database-size corrected Expect is the final step in calculating an Expect(n). How best to do this isn't an axiomatic issue, but a practical one. Chapter 4 discusses some of the issues surrounding database size correction in more detail. NCBI-BLASTX applies a size correction that assumes the number of HSPs are proportional to the length of the subject sequence.

sub dbSizeCorrectedExpect {

my \$sumP   = shift; # test corrected sumP

my \$effective_db_length_db = shift;

my \$sbjct_seq_length    = shift; # actual length

# = (effective_db_length_db/n)P

return (\$effective_db_length_db/\$sbjct_seq_length)*\$sumP;

}

Using this function, the test-corrected, pair-wise sum P-value of 2.2e-21 gives an Expect(2) of 7e-16, a fairly close match to the reported value (in Example 7-2) of 1e-15. The difference between the two values can be attributed to rounding and floating-point error.

7.1.14 Frame- and Size-Corrected Expects

BLASTX translates all six frames of the query and uses these translations to search the protein database. At first, you might think that correcting for six frames would entail multiplying the final database-size-corrected Expect by six, but that isn't the case. Neither version of BLASTX applies such a correction. In fact, WU-BLASTX posts a notice in its header stating that it doesn't apply the correction (illustrated in Example 7-3). Like NCBI-BLAST, WU-BLAST searches all six frames by default, but assumes that only one frame was searched when calculating Expects. The reason why is that BLAST statistics assume open reading frames don't overlap. Accordingly, BLASTX statistics assume that the query contains only a single ORF whose frame may change from HSP to HSP.

Example 7-3. The header from a WU-BLASTX report

BLASTX 2.0MP-WashU [09-Nov-2000] [linux-i686 19:13:41 11-Nov-2000]

Copyright (C) 1996-2000 Washington University, Saint Louis, Missouri USA.

Reference:  Gish, W. (1996-2000) http://blast.wustl.edu

Gish, Warren and David J. States (1993).  Identification of protein coding

regions by database similarity search.  Nat. Genet. 3:266-72.

Notice:  statistical significance is estimated under the assumption that the

equivalent of one entire reading frame in the query sequence codes for protein

and that significant alignments will involve only coding reading frames.

Query=  3R 3R.3 [18846615 18886821] flank:20000 length:40206

(40,206 letters)

Translating both strands of query sequence in all 6 reading frames

Database:  rscu.fsa

386,401 sequences; 134,009,913 total letters.

Searching....10....20....30....40....50....60....70....80....90....100% done

So far in this chapter, we've just walked through most basic operations of Karlin-Altschul statistics to provide you with the knowledge necessary to calculate bit scores, effective lengths, and Expects. We've explained that BLAST uses one statistical measure to calculate the Expect of an HSP and another to calculate the aggregate Expect of a group of HSPs. Hopefully, you've gained a better understanding of how all of these operations of fit into the larger picture of Karlin-Altschul statistics.

You have also seen that it's possible to use Karlin-Altschul statistics to recover statistical measures that are calculated by BLAST internally, but not included in the report—principally, sum scores and the individual Expect for an HSP for which an Expect(n) has been reported. Learning to calculate these values is the first step toward becoming a power user of BLAST statistics. The remaining sections of this chapter will show you how to use what you've learned to deal with critical questions about BLAST results.

7.2 Using Statistics to Understand BLAST Results

Karlin-Altschul statistics is much more than a way to determine the statistical significance of a sequence alignment in the context of a database search. It also provides a framework with which to probe the complex relationships that exist between BLAST parameters and results. Using Karlin-Altschul statistics to ask and answer questions about a BLAST search is much like using stoichiometry at the lab bench; it doesn't require theoretical savvy, just a little algebra. It's also useful; you no longer need to be frustrated when confronted with an inexplicable BLAST result.

Now let's look at a practical application of Karlin-Altschul statistics: using BLASTN to map a PCR primer to a genome. The application is a simple but striking example of how to use Karlin-Altschul statistics to understand the way parameter choice determines BLAST results. Finally, Karlin-Altschul statistics reveal much about BLASTN's strengths and weaknesses and its potential as a tool to detect the conserved, cis-regulatory regions of genes. 7.3 Where Did My Oligo Go?

First, try to identify the position of the following oligo-nucleotide in the Drosophila melanogaster genome using WU-BLASTN with its default parameters:

TACATCCGGCACTTAGCCGGGCTCG

Example 7-4 shows that the oligo isn't found in the Drosophila melanogaster genome that uses WU-BLASTN with default parameters.

Example 7-4. The oligo isn't found

Reference:  Gish, W. (1996-2000) http://blast.wustl.edu

Notice:  this program and its default parameter settings are optimized to find

nearly identical sequences rapidly.  To identify weak similarities encoded in

nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Query=  oligo

(25 letters)

Database:  na_whole-genome_genomic_dmel_RELEASE3.FASTA

7 sequences; 124,181,667 total letters.

Searching....10....20....30....40....50....60....70....80....90....100% done

Smallest

Sum

High  Probability

Sequences producing High-scoring Segment Pairs:        Score  P(N)      N

*** NONE ***

There are, of course, many reasons why you might not be able to identify an oligo in the Drosophila melanogaster genome. First, the oligo might contain repetitive sequence and thus be masked out. However, because WU-BLAST doesn't mask by default, that can't be the reason. Second, the assembled genome may be incomplete. Every sequenced genome to date is incomplete to some degree. In fact, a 99 percent complete 124mb genome is still missing 1.24 mega-bases of a euchromatic (nonrepetitive DNA) sequence, leaving plenty of space for an oligo to go missing in. The incompleteness of the genome is a possible explanation for our WU-BLAST result, but is it the correct one? Before concluding that the oligo falls into a sequencing gap, let's try to run NCBI-BLASTN with its default parameters. Aha! The NCBI-BLASTN results in Example 7-5 show that the oligo is present in the Drosophila melanogastergenome and the HSP is assigned a significant Expect.

Example 7-5. Using NCBI-BLASTN to find the oligo

Sequences producing significant alignments: (bits) Value

2R 2R.3 assembled 23-11-2001 50   1e-06

X X release:2 length:21666217bp Assembled X chromosome arm seque...    32   0.25

3R 3R.3 32   0.25

U GenomicInterval:U    30   0.99

3L 3L.3 v.3e  23351213bp BCM HGSC guide:3l-mtp-eval.08apr02      28   3.9

2L 2L release:3 length:22217931bp Assembled 2L chromosome arm se...    28   3.9

>2R 2R.3 assembled 23-11-2001

Length = 20302755

Score = 50.1 bits (25), Expect = 1e-06

Identities = 25/25 (100%)

Strand = Plus / Plus

Query: 1 tacatccggcacttagccgggctcg 25

|||||||||||||||||||||||||

Sbjct: 16190927 tacatccggcacttagccgggctcg 16190951

Results like these frustrate a lot of BLAST users. Why does NCBI-BLAST find the oligo when WU-BLAST doesn't? The results may seem contradictory, but they make perfect sense, and understanding why this is so will help you use Karlin-Altschul statistics to ask questions about your own BLAST results.

7.3.1 Karlin-Altschul Statistics as a Tool for Further Investigation

Parameter choice seems a likely explanation for the results shown in Examples Example 7-4 and Example 7-5. If you assume the failure of WU-BLASTN to report the alignment isn't due to a bug, maybe the hit wasn't significant in the context of the current search. By now you've been exposed to enough Karlin-Altschul statistics to know that BLAST parameters determine the significance of an alignment. The scoring scheme used in a search is the fundamental BLAST parameter, so you should begin your investigation there.

WU-BLASTN and NCBI-BLASTN have very different default scoring schemes. WU-BLASTN uses a +5/-4 scheme to score alignments by default, whereas NCBI-BLASTN uses a +1/-3 scheme. One central theorem of Karlin-Altschul statistics is that every scoring scheme is implicitly a log-odds scoring scheme, and every log-odds scoring scheme implies a target frequency. This insight is significant because it means that a scoring scheme always hunts for alignments having a particular percent identity. One great thing about Karlin-Altschul statistics is they enable you to calculate that percent identity.

Chapter 4 provides a Perl script called Qcalc for calculating various nucleotide scoring scheme target frequencies; Table 4-1 summarizes the l and percent identity implied by some common scoring schemes. The default +5/-4 scoring scheme used by WU-BLAST implies an ungapped target frequency of 65 percent identity, whereas the +1/-3 default scheme used by NCBI-BLASTN looks for alignments with 99 percent identity. But why would a +5/-4 scoring scheme miss a real alignment with 100 percent identity? The answer to this question lies in the value of l and the role played by gap penalties.

Table 7-3 gives 0.104 nats for the value of gapped l associated with the WU-BLASTN default scoring scheme and gap penalties. On the other hand, the l associated with the NCBI-BLASTN default scoring scheme and gap penalties is 1.37 nats (Table 7-4). The difference between the two values is tenfold. To see how the value of l impacts your search results, use Karlin-Altschul statistics to delve more deeply into the relationship between raw scores, normalized (bit and nat) scores, and l.

 Table 7-3. Selected WU-BLASTN parameters and values from the search shown in Example 7-5 Parameter Value l 0.104 nats (gapped) k 0.0151 nats (gapped) H 0.0600 nats/aligned residue m 25 (length of the query sequence) n 124,181,667(number of letters in the database) Number of sequences in database 7
 Table 7-4. Selected NCBI-BLASTN parameters and values from the search shown in Example 7-5 Parameter Value l 1.37 nats (gapped) k 0.711 nats (gapped) H 1.31 nats/aligned residue m 25 (length of the query sequence) n 124,181,667 (number of letters in the database searched) Number of sequences in database 7

First, see how the value of l impacts the expected HSP length. Recall that this value is the length of an alignment that has an Expect=1; more precisely, it's the expected HSP length for E equal to 1 associated with a scoring matrix for a search space of size m´*n´. This value is reported in the footer of NCBI-BLAST reports, where it's called "Effective HSP length." However, it's absent from the WU-BLAST footer, so you should calculate it with the Perl function expectedHSPlength and the information from Tables Table 7-3 and Table 7-4.

Please note that the following function shows the traditional way to calculate the expected HSP length. Recent work by Altschul and colleagues suggests that this function overestimates the effective HSP length for short sequences, and the manner in which it's calculated may change in the future. Therefore, use the value reported in the BLAST report footer, if it's available. For the purposes here, though, this function is fine.

sub expectedHSPlength{

my \$k = shift;

my \$m = shift; # actual length of query

my \$n = shift; # actual length of query

my \$h = shift; # average nats/aligned pair

# l = ln(kmn)/H

return  log(\$k*\$m*\$n)/\$h;

}

expectedHSPlength returns an expected HSP-length of about 16 nucleotides for the NCBI defaults. The expected length of the WU-BLASTN HSP with Expect = 1, however, is higher—about 294 nucleotides. That's a big difference. Once again, the reason for the difference lies with the scoring matrix. Recall that the implied target frequency for the NCBI-default +1/-3 scoring scheme was 99 percent, but it was 65% for the WU-BLAST defaults (see Table 4-1). This is why the effective HSP length for the WU-BLAST search is so much longer. The hypothetical 294-nucleotide alignment is expected to have a percent identity of less than 65 percent. In other words, taking into account mismatches and gaps, it needs to be 294 bases long to attain a raw score sufficient to generate an Expect of 1. Thus, the WU-BLAST defaults implicitly assume that nucleotide homologies will have low identity (<65%), but be long—294 nucleotides in the context of the Drosophila melanogaster genome. Is this biological assumption valid? Yes and no.

The WU-BLASTN defaults are well suited for detecting long regions of low identity such as poorly conserved exons. On the other hand, the NCBI-BLAST parameters are suitable for finding shorter but nearly identical sequences. Both sets of default parameters will likely fail to detect other kinds of homology, especially short, conserved sequences such as cis-regulatory elements, which tend to be highly conserved and are often less than 10 nucleotides long.

Just how short can an HSP be and still generate a significant hit using WU-BLASTN defaults? Again, Karlin-Altschul statistics provide a basis for answering this question. First you need to know what raw score corresponds to an Expect of 1. The following Perl function calculates this value:

sub rawScoreOfExpectOne {

my \$k = shift;

my \$m = shift; # actual length of query

my \$n = shift; # actual length of database

my \$l = shift; # gapped lambda in nats

# SE=1 = ln(kmn)/l

return  log(\$k*\$m*\$n)/\$l;

}

For the ~124 mega-base Drosophila melanogaster genome, that raw score is about 15 with the NCBI-BLASTN defaults and about 170 for WU-BLASTN. Recall that the maximum score for aligning an oligo-nucleotide 25 bases long under the +5/-4 scheme is 125 (25*5). Even an alignment with an Expect of 1 has a raw score (170) greater than the maximal attainable score for a 25-mer under the WU-BLAST defaults! This is another reason why WU-BLASTN didn't report a hit.

To determine the length of an ungapped alignment that has 100 percent identity for a given raw score, divide the raw score by the match score. By this calculation, any oligo shorter than about 15 nucleotides (15/1) for NCBI-BLASTN and 34 nucleotides (170/5) for WU-BLASTN will have an Expect > 1. This means that the NCBI-BLASTN defaults are fine for mapping oligo-nucleotides to the Drosophila melanogaster genome. On the other hand, it appears that looking for short—less than 15 base-pair—cis-regulatory elements using either version of BLASTN with the default parameters is unlikely to be successful.

So what was the unreported WU-BLASTN Expect? Let's calculate it. With the data in Table 7-3 and the previously calculated effective HSP length of 294, first calculate  and  using the Perl functions effectiveLengthSeq and effectiveLengthDB. Plugging  and  together with the WU-BLASTN l and k and a raw score of 125 into the rawScoreToExpectfunction gives an Expect of 281. Recall that the NCBI-BLASTN Expect was  1e-6. That's a 281-million-fold difference. BLAST is clearly parameter-sensitive! Using the default parameters, you instructed NCBI-BLASTN to search for short highly conserved regions, and it found one. WU-BLASTN, on the other hand, is parameterized to look for large regions of relatively low percent identity. This would be fine for cross-species searches of poorly conserved exons but is inappropriate for finding oligos.

Using BLAST intelligently requires using the correct parameters for the task at hand and not placing too much faith in the reported Expect. See the section on BLAST protocols in Chapter 9 for practical suggestions on BLAST parameter choice. Remember, you get what you look for.

7.3.2 What It All Means

You now know how bit scores, sum scores, Expects, and P-values are calculated. You've also seen first-hand that scoring matrices and target frequencies aren't merely theoretical abstractions but realities that determine the outcome of a BLAST search. In some ways, choosing the right scoring scheme for a BLAST search is like choosing the right pair of eyeglasses. If your scoring scheme is too stringent, BLAST becomes nearsighted and will miss distant homologies. If your scheme is too lenient, BLAST becomes farsighted and fails to detect the obvious. Unfortunately, there's no optimal scoring scheme. As in real life, sometimes the best you can do is put on bifocals.

You've also seen that searching the same sequence and database with varied parameters can result in different alignments having very different Expects. Scores and E-values aren't implicit in a sequence or an alignment; they are solely contingent upon parameter values and the methods used to assess significance. There is nothing absolute about a BLAST significance value; it merely denotes the significance of an alignment in the context of a given search. Like everything else in bioinformatics, the biological implications of a (significant) alignment are inferred by the user and should be tested experimentally, if possible.

Hopefully, you've also learned that there is more to Karlin-Altschul statistics than simply calculating an Expect for an alignment. Karlin-Altschul statistics provide a theoretical framework from which to interpret alignment scores in the context of parameter choice. They also give you the means to tune BLAST for specific purposes. Without them, you'd have no way of knowing what a given scoring scheme was looking for, and you'd cast around in the dark for the right set of parameters. Karlin-Altschul statistics remove the mystery from parameter choice. BLAST certainly has its limitations, but thanks to its statistical foundation, at least you know what you're looking for. ﻿﻿