BLAST (Basic Local Alignment Search Tool)

Chapter 8. 20 Tips to Improve Your BLAST Searches

 

8.1 Don't Use the Default Parameters

You shouldn't use BLAST the way you use an Internet search engine such as Google. BLAST results are very sensitive to parameters, and the defaults aren't suitable for all searches. Historically, BLAST parameters have changed periodically. In addition, the NCBI-BLAST and WU-BLAST defaults have very different properties, and your results may differ depending on where you perform your BLAST search. Thus, you should get out of the habit of using default parameters. There are situations in which the default parameters are just fine, but using them knowledgeably and accepting them out of ignorance are two very different motivations. If you don't know which settings are appropriate for a particular search, you're not alone; most BLAST users don't know how to set up a search. That's why we wrote this book, so keep reading.

8.2 Treat BLAST Searches as Scientific Experiments

Scientists are often taught to structure their experiments into four parts: the question (hypothesis), the approach (experimental design), the results (data), and the interpretations(beliefs). This approach shows that beliefs depend on the experiment's data. Whether or not an experiment is capable of answering the question is  one way to separate good science from bad.

When setting up a BLAST experiment, the most important thing to remember is "you get what you look for." In other words, search parameters determine what you find. For example, the BLASTN program from NCBI with the default settings assumes that the alignments you are seeking are nearly identical because the parameters (match +1, mismatch -3) have a target frequency of 99 percent identity. If your experimental question is "How many worm genes are related to my favorite human gene," using the default parameters would be foolish because the approach (looking for nearly identical sequences) isn't expected to answer the question; too many sequences have changed in the 500 million years that separate worms and humans.

8.3 Perform Controls, Especially in the Twilight Zone

Controls are crucial to any scientific experiment. The random model underlying BLAST statistics provides one kind of control, but performing an explicit control can give you greater confidence in your results. This is especially true when looking for weak similarities, commonly called the twilight zone. One of the simplest and most effective ways to determine if an alignment is believable is to shuffle your query sequence and repeat the search. If the shuffled sequence returns similar results, the alignment is based on compositional biases or the search parameters aren't specific enough. The following Perl script shuffles a FASTA file:

#!/usr/bin/perl -w

use strict;

my ($def, @seq) = <>;

print $def;

chomp @seq;

@seq = split(//, join("", @seq));

my $count = 0;

while (@seq) {

    my $index = rand(@seq);

    my $base = splice(@seq, $index, 1);

    print $base;

    print "\n" if ++$count % 60 == 0;

}

Now let's put this script into action. Let's make the dubious hypothesis that ALU repeats aren't specific to primates but are present in all genomes. They haven't been found because people just haven't looked hard enough. Your search parameters use +1/-1 match/mismatch scores and a gap opening cost of 1 and extension cost of 1. (WU-BLAST users would understand this as a cost of 2 for the first gap and 1 for each additional gap). Figure 8-1a shows an alignment between a human ALU (a variety of repeats are available from ftp://ftp.ncbi.nih.gov/repository/repbase) and the Caenorhabditis elegans genome (see http://www.wormbase.org). Without a control, you might be able to convince yourself that you found a match to a C. elegans ALU. However, because a shuffled control (Figure 8-1b) produces an alignment that is approximately 100 times more significant, this conclusion isn't very likely.

Figure 8-1. Searching (a) an ALU element and (b) a shuffled version against the C. elegans genome

figs/blst_0801.gif

You might wonder why the alignments in Figure 8-1 seem to have significant E-values. The search employed low gap penalties and ungapped alignment statistics. When using gapped alignment statistics, these alignments are expected at random.

8.4 View BLAST Reports Graphically

BLAST reports can be complicated. Viewing them graphically can help you understand them better, especially when the reports are very long. Appendix D includes a simple Perl program that converts tabular output from NCBI-BLAST reports into a GIF, PNG, or JPEG image. Figure 8-2Figure 8-4Figure 8-8, and Figure 8-9 were created with this program. Appendix E contains a program that converts the standard BLAST reports to the NCBI tabular format. The programs are available at this book's O'Reilly web site.

Figure 8-2 is an example of a BLASTX search. Appendix D contains more information on the display.

Figure 8-2. A graphical view of a BLASTX report

figs/blst_0802.gif

8.5 Use the Karlin-Altschul Equation to Design Experiments

The Karlin-Altschul equation is very useful for predicting the outcome of a BLAST experiment, especially in large search spaces. Suppose you want to find exons in the human genome by looking for similarities in the pufferfish genome. These genomes last shared a common ancestor about 450 million years ago. You might assume that any similarities at this distance must be due to evolutionary conservation.

Recall from Chapter 4 that the number of alignments expected by chance (E) is a function of the search space (M, N), the normalized score (lS), and a minor constant (K).

figs/equation08aa.gif

The typical cross-species parameters +1/-1 match/mismatch have a target frequency of 75 percent identity and 0.55 nats per aligned letter on average (H). A 50-bp alignment therefore contains about 27.5 nats. Substituting this normalized score into the Karlin-Altschul equation with K=0.334, M=1.5 GB (assuming half of the human genome contains repeats), and N=450 MB (the size of the repeat-poor pufferfish genome), you expect about 230,000 alignments by chance. That's roughly the same as the number of exons in the human genome. If you want to look for 50-bp exons, you'll have to sift through a lot of false positives.

To change the Karlin-Altschul expectation to something more manageable, either look for larger exons or reduce your search space. A 72-nucleotide alignment is expected only once by chance, and an alignment the size of a typical exon (110 bp) has a probability of about 1 in 1 billion of occurring. An even better approach is to restrict the search to orthologous regions of the size of a typical gene. Here 50-bp alignments have a probability of approximately 1 in 10,000.

8.6 When Troubleshooting, Read the Footer First

Novices usually focus on the one-line summaries, regular users concentrate on the alignments and their statistics, and professionals first read the footer. When it comes to solving the two most common problems, no hits and too many hits, the one-line summaries aren't much help. Regular users can often look at alignments and diagnose compositional biases and unidentified repeats, but determining the cause of no hits isn't easy. Examining the footer to determine what the search was actually looking for is the best way to determine what happened. Always answer the following questions first:

· What are the values for the seeding parameters W, T, and two-hit distance? If the seeding parameters are too stringent, divergent alignments may not be seeded. In NCBI-BLAST, W is unfortunately not displayed in the footer. The value for T and two-hit distance are given as T: and A:, respectively.

· What is the scoring scheme expecting to find (i.e., target frequency)? If the scoring matrix expects nearly identical sequences, highly divergent sequences may be missed.

· What is the alignment threshold? If the alignment threshold is too high, low scoring alignments will be thrown away. The gapped and ungapped values are given after S1: and S2:in NCBI-BLAST. In WU-BLAST, they are on the  rows beneath S2.

· What are B and V set to? If they are set too low, the number of one-line summaries and database hits may be truncated.

· What is the score and expected length of a significant alignment? Use the Karlin-Altschul equation to solve for the normalized score and then divide by H to calculate the length.

· Was complexity filtering employed, and if so, was it hard or soft? Complexity filtering is generally a good idea, but may prevent some sequences from generating significant alignments. NCBI-BLAST doesn't not currently report which filters were employed.

8.7 Know When to Use Complexity Filters

Low-complexity sequence occurs much more frequently than expected by chance in both proteins and nucleic acids. When a BLAST search takes longer than expected, it is almost always due to low complexity sequence or repeats. Low-complexity filters can sometimes be destructive. Figure 8-3a shows what happens when a query sequence is filtered: the low complexity region is replaced with Xs (or Ns for nucleotide sequences). This operation always reduces the score and can terminate an alignment extension. For this reason, it is almost always better to use soft-masking (see Figure 8-3b). This technique masks low-complexity sequence in the seeding phase but allows the extension phase to see the sequence normally. See -F in Chapter 13 and wordmask in Chapter 14.

Figure 8-3. Complexity filters (a) hard-masking and (b) soft-masking

figs/blst_0803.gif

What if your query is almost entirely low-complexity? If soft-masking doesn't work, you may have to perform the search without complexity filters. In this case, expect many false-positive alignments and a slow search. Setting a lower E-value to remove low-scoring alignments can help reduce the size of the output.

8.8 Mask Repeats in Genomic DNA

As mentioned in Chapter 2, genomes may be full of repetitive elements and low-complexity sequence. They can be very problematic in BLAST searches, and if not masked prior to a BLAST search, will waste computer time and inflate BLAST reports with meaningless, redundant information (Figure 8-4).

Figure 8-4. BLASTX search with (a) repeats intact and (b) repeats masked (the alignments were removed from the display)

figs/blst_0804.gif

8.9 Segment Large Genomic Sequences

Nucleotide sequences can be very, very long. For example, the shortest human chromosome, number 22, is over 47 million bp. BLAST wasn't designed for large sequences and runs poorly in such an environment. You can easily run out of memory with chromosome-sized sequences. Even if you have a computer with sufficient memory, searching large sequences is inefficient because the procedure for assessing combined statistical significance scales quadratically with the number of alignments.

The simplest way to deal with large sequences is to split them into overlapping fragments. For genomes with high gene density, each fragment should be 100 Kb or less. For the human genome and others with low gene density, fragments can be larger, but try not to exceed 1 Mb.

The following Perl script splits a FASTA file into overlapping fragments. Each sequence fragment is given a unique identifier, and the definition contains the original coordinates and complete definition.

#!/usr/bin/perl -w

use strict;

die "usage: $0 <fasta file> <size> <overlap>\n" unless @ARGV == 3;

my ($file, $size, $overlap) = @ARGV;

my $def = "";

my $dna = "";

my $sequence = 0;

my $fragment = 0;

open(IN, $file) or die;

while (<IN>) {

    chomp;

    if (/^>(.+)/) {

segment(  );

$def = $1;

        $sequence++;

$fragment = 1;

$dna = "";

    }

    else {

$dna .= $_;

    }

    while (length($dna) > $size) {segment(  )}   

}

segment(  );

sub segment {

    return unless $dna;

    my $output = substr($dna, 0, $size);

    if (length($output) == $size) {

$dna = substr($dna, $size - $overlap);

    }

    else {

$dna = "";

    }

    my $start = ($fragment -1) * ($size - $overlap) + 1;

    my $end = $start + length($output) -1;

    print ">$sequence-$fragment {$start..$end} $def\n";

    for (my $i = 0; $i < length($output); $i+= 80) {

print substr($output, $i, 80), "\n";

    }

    $fragment++;

}

8.10 Be Skeptical of Hypothetical Proteins

Amino acid sequencing is more difficult than nucleic acid sequencing, and therefore, sequences of most proteins are inferred from DNA translations. Some inferences come from gene predictions and others come from transcript translations. Finding the correct structure of genes in genomic DNA is very difficult; algorithms are incomplete approximations, and people make mistakes. Some research groups are conservative and only report proteins when there is good evidence. Others submit hypothetical proteins and hope that they will be useful (and they often are). As a result, many proteins in the public database are slightly incorrect or even fictitious. Unfortunately, hypothetical gene structures aren't always clearly labeled.

The most accurate protein sequences come from translating full-length cDNAs. But determining the protein encoded by a transcript isn't as simple as it sounds. While there is usually only one long open reading frame (ORF), the longest ORF won't necessarily correspond to a real protein. Be suspicious of all short proteins. Even in a full-length cDNA with a very large ORF, determining the start of translation isn't straightforward. The first methionine in the longest ORF is usually picked as the start of translation, but as a rule of convenience, not a biological truth. Many protein sequences have erroneous N-terminal extensions.

8.11 Expect Contaminants in EST Databases

A simple view is that ESTs are sequencing reads from cDNAs, cDNAs are derived from mRNAs, and mRNAs are derived from genes. Theoretically, this is true, but in practice ESTs frequently don't correspond to genes (e.g., rather than match an exon or UTR, they overlap part of a repeat on the wrong strand within an intron). The fraction of nontranscript sequence depends on the way the library was created. Some libraries are nearly devoid of extra-genic material, while others are essentially random shotgun sequence. How can you tell the difference? It's difficult to determine directly from the EST sequences.

Before the human genome was completed, the number of genes was estimated at 100,000 to 200,000. Current estimates are 25,000 to 30,000. One of the reasons for the initial high figure was that EST clustering experiments found many clusters, and people believed each cluster was a gene. One of the best ways to sort out real transcripts from pollutants is to align ESTs back to their genome. See Section 9.1.5 for more details.

8.12 Use Caution When Searching Raw Sequencing Reads

The largest source of raw sequencing reads comes from the early stages of genome projects and from EST sequencing. Most sequencing reads have an error rate of about 1 percent. This rate isn't uniform; there is a spike near the beginning and a gradual increase towards the end of the read. In addition, some regions have intrinsically high error rates due to compositional properties such as high GC content. DNA sequencing involves several steps, and there are abundant opportunities for mechanical and human error. Thus, you will need to be careful when using large word sizes. For redundant sequence collections, such as 3x shotgun coverage of a genome, large word sizes are fine, but if the absence of a single alignment is troublesome, scale down the word size to keep sequencing errors from preventing seeding.

Raw sequencing reads may be contaminated from a variety of sources. Cloning vectors are one expected source. Depending on the sequencing center, the vectors may or may not have been clipped from the sequence. Other kinds of contamination are also possible. Nuclear DNA is sometimes contaminated with mitochondrial or viral DNA, and any collection of sequence can be contaminated from another organism (genome centers usually sequence more than one entity at a time, and sometimes there's a mix up of who did what and when). ESTs sometimes have their poly-A tail intact, and whether or not this is contamination is a matter of perspective. Taken together, there are many opportunities for contamination, and it's a good idea to be cautious when using raw sequencing reads.

8.13 Look for Stop Codons and Frame-Shifts to find Pseudo-Genes

Stop codons generally aren't found in protein-coding genes. They are common, however, in pseudo-genes. It's important to recognize pseudo-genes early in a sequence-analysis pipeline because they may confound downstream analyses. Pseudo-genes usually have stop codons and insertions or deletions (Figure 8-5). Stop codons are easy to spot in alignments, but insertions and deletions must be inferred from alignment coordinates. Look for HSPs that are in different frames and appear too close to be separated by an intron (< 25 bp).

Figure 8-5. BLASTX alignment of a pseudo-gene (stop codons are circled)

figs/blst_0805.gif

8.14 Consider Using Ungapped Alignment for BLASTX, TBLASTN, and TBLASTX

The first versions of BLAST produced strictly ungapped alignments but were still very useful. Although gapped alignment has some advantages, it may produce surprising results. When running the translating BLAST programs (BLASTX, TBLASTN, and TBLASTX), you generally look for protein coding regions and therefore don't expect to see stop codons. Stop codons are very frequent in alignments from these programs, and it isn't possible to eliminate stop codons by simply making their scores highly negative. In standard alignment algorithms (see Chapter 3), no match score can be more negative than the cost of two gaps. In Figure 8-6, all stop codon scores are given a value of -999 (for more details, see Chapter 10). Notice how two alternating gaps skip over the stops in this TBLASTX alignment between two noncoding sequences (this is a WU-BLAST alignment; NCBI-BLAST is always ungapped for TBLASTX). You can avoid stop codons only by using ungapped alignment in addition to highly negative stop scores. Doing so segments the alignment in Figure 8-6 into three short alignments with insignificant E-values.

Figure 8-6. Alternating gaps skip over highly negative scores

figs/blst_0806.gif

Figure 8-7 demonstrates another feature of gapped alignment: alignments may extend far beyond the end of an exon because gapped extension is generally less specific. This is especially annoying in genomes with short introns in which gapped alignments can extend between nonadjacent exons and obscure intervening introns and exons. To reduce these lengthy extensions, decrease X, increase the gap extension cost, select a more stringent scoring matrix, or use ungapped alignment.

Figure 8-7. Extension is sometimes excessive: the real exon region is boxed in this BLASTX alignment

figs/blst_0807.gif

8.15 Look for Gaps in Coverage as a Sign of Missed Exons

The seeding parameters and alignment thresholds may prevent short or highly divergent exons from appearing in BLAST reports. Figure 8-8a shows an alignment between a genomic query and an EST. Most alignments overlap by a few bp, except for the 2 at the 5´ end (left side). Gaps and overlaps in coverage are easier to see by using the reciprocal search shown in Figure 8-8b. To find the missing 7-bp exon in Figure 8-8c, use bl2seq (see Chapter 13) with the following command line:

bl2seq -i est -I 21,29 -j genomic -J 76047,76744 -pblastn -W 7

The -I and -J parameters let you select a specific region of each sequence. What you've done is a BLASTN search between the missing part of the EST and the region between the alignments.

Figure 8-8. Finding missed exons: (a) an alignment between a genomic query and EST, (b) the reciprocal alignment showing a gap (d) and overlap (e) in coverage, (c) the tiny missed exon can be found (f) by changing the word size to 7

figs/blst_0808.gif

8.16 Parse BLAST Reports with Bioperl

The traditional BLAST output format is meant to be human readable, but when your BLAST report is 1,000 pages long, it isn't much fun to read. Sometimes all you want is the names of all sequences that have alignments above 90 percent identity. Such tasks require a BLAST parser that lets you select only the information you want. Many freely available BLAST parsers can be downloaded from the Internet, but the ones in most common use come from the Bioperl project. Bioperl is an open-source community of bioinformatics professionals that develops and maintains code libraries and applications written in the Perl programming language. If your daily routine finds you running BLAST or other sequence analysis applications, learning to use the Bioperl system can save you many hours of work and frustration.

Let's see how Bioperl can help solve the problem posed earlier: to report the names of all sequences that are more than 90 percent identical to your query.

#!/usr/bin/perl -w

use strict;

use Bio::SearchIO;

my $blast = new Bio::SearchIO(

    -format => 'blast',

    -file   => $ARGV[0]);

my %Name;

my $result = $blast->next_result;

while(my $sbjct = $result->next_hit) {

    while(my $hsp = $sbjct->next_hsp) {

$Name{$sbjct->name} = 1 if $hsp->frac_identical >= 0.9;

    }

}

print join("\n", sort keys %Name), "\n";

Pretty simple, huh? With BLAST and Bioperl, it's possible to create all kinds of useful applications.

8.17 Perform Pilot Experiments

Before embarking on a large BLAST experiment, first try some pilot experiments. For example, if you want to compare all human proteins to all nonhuman proteins, try 100 proteins first. Or, if you want to annotate a 5 mb chromosomal region with BLASTX similarities, search 100 Kb first. If you're unsure of which parameters to use, try several and see which ones give you the kinds of results you're looking for. It may seem like a waste of time, but performing pilot experiments will actually save you time in the end.

8.18 Examine Statistical Outliers

In a high-throughput setting, BLAST reports may be huge and number in the thousands. There's no way you can look at all of them, but for quality control, you should examine some of them. Keep global statistics on BLAST reports, such as number of hits per Kb. Statistical outliers may point to general problems that become more apparent in certain sequences.

8.19 Use links and topcomboN to Make Sense of Alignment Groups

WU-BLAST has two very useful parameters for displaying alignment groupings. topcomboN sorts alignments into groups and labels them. The links parameter shows the order of alignments in a group, which is much like the order of a gene's exons. Figure 8-9 displays these features.

Figure 8-9. WU-BLAST topcomboN and links (the top-to-bottom order of alignments in the graphic (a) are the same as the statistics lines from the BLASTX report (b))

figs/blst_0809.gif

8.20 How to Lie with BLAST Statistics

Several techniques can help you massage BLAST statistics to either hide significant alignments or make meaningless alignments appear highly significant. Why would you want to do this? If you have to ask, you're not the intended audience. Dishonest evil doers read on.

The easiest method to adjust the significance of all scores is to set the effective size of the search space either higher or lower. Command-line parameters in both NCBI-BLAST (-Y) and WU-BLAST (Y and Z) are available. You can also alter the scoring scheme by editing the scoring matrices. A more involved approach involves hacking the source code to set your own values for l, k, and H. WU-BLAST makes it all too easy because you can alter scores or set Karlin-Altschul parameters on the command line. Whatever approach you take, you will, of course, want to edit the footer to cover your tracks. The easiest way to do this is to run the search twice and diff the footers to determine what needs fixing.

With low gap penalties, you can make alignments between just about anything. For BLASTN, NCBI-BLAST always uses ungapped statistics, so you don't have to do much work to lie. Just hope that nobody notices all the gaps. This works best if you have a supervisor who is either too busy to look at alignments or wouldn't know a decent alignment if it bit him. NCBI-BLAST is very restrictive about what gap penalties you can employ for the protein-based BLAST programs. Your only choice here is to hack and recompile. WU-BLAST is very easy; set your gap costs low and include warnings on the command line to suppress messages about ungapped statistics.

Another way to trick the unobservant is to remove complexity filters. This works especially well when claiming that some anonymous low-complexity region or transcript is a cool gene. You can almost always find a small ORF that has a poor match to something with an interesting definition line. A poor match is only poor if you don't know how to fix the statistics. This approach even works when fooling scientific journals. (It really does. We've seen it happen.)