BLAST (Basic Local Alignment Search Tool)

Chapter 9. BLAST Protocols

This chapter contains protocols for the most common BLAST searches. Because every BLAST experiment is unique, you should treat the protocols as a starting point and use your own knowledge about BLAST to modify the procedures. The discussions include what to do, as well as why. Although this approach makes the descriptions more verbose, explaining the logic behind these choices will help you make intelligent choices when creating your own protocols.

Most BLAST experiments fall into one of two categories: mapping and exploring. Mapping is the process of finding the position of one sequence within another—for example, finding a gene within a genome. When mapping, you can expect the alignments to be nearly identical, and the coordinates are generally the focus of the results. When exploring, the goal is usually to find functionally related sequences. When exploring, your alignment statistics (score, expectation, percent identity, etc.) are often of greatest importance, at least initially. Making functional and phylogenetic inferences, especially between distantly related sequences, often requires inspecting the alignments from a biological rather than a statistical perspective. There is, of course, a continuum between mapping and exploring, but keeping this dichotomy in mind can help you zero in on the fundamental aspects of a search strategy.

The notation used here and in the reference chapters in Section 5 is the command-line interface. If you're unfamiliar with shells and terminals, the command line is where you type in program names and options. It may seem a little odd at first, but it is analogous to filling in a web form and then clicking the submit button. While most people use BLAST via some web interface, not all pages look the same or support the same parameters. Behind the scenes, though, they all interact with a command-line version of BLAST. BLAST pages frequently let you set advanced options; usually it's a text box or boxes in which you can enter the command-line options.

9.1 BLASTN Protocols

As we said earlier, most searches can be categorized as either mapping or exploring searches. When sequences are expected to be nearly identical, you should use the +1/-3 match-mismatch parameters, which have a target frequency of 99 percent identity. Cross-species exploration requires a change in the scoring parameters and word size. We like +1/-1 for both its simplicity and its 75 percent identity target frequency. The choice of word size depends on balancing sensitivity and specificity. The default word size of 11 is too risky; use 9, which corresponds to a little more stringency than three identical amino acids because there's no allowance for degenerate codons. The choice of gap costs depends on the size of the expected gap. For simulating sequencing errors, the gap costs should be uniform and relatively high, but for modeling amino acid gaps or nucleotide hybridization bubbles, the cost of extension should be lower.

9.1.1 Mapping Oligos to a Genome

Many kinds of experiments, both molecular and computational, employ short nucleotide sequences called oligonucleotides, or just oligos (oligo is Greek for few). For example, the polymerase chain reaction (PCR) is a routine laboratory procedure for amplifying a specific nucleotide sequence from DNA or indirectly from RNA. In PCR, oligos are used as templates for DNA replication, and the subsequence between the oligos is amplified. The most important feature of oligos is they may be short enough to give rise to many false-positive matches. In a test tube, we would say the oligo hybridizes nonspecifically, and in a BLAST experiment, we would say the alignments have high expectations.

9.1.1.1 Approach

Our goal here is to simulate the interaction between an oligo and a genome in a test tube. The thermodynamics of annealing are complex and depending on the conditions of the experiment (temperature, salt concentration, length, and composition of oligo), some mismatches between the sequences and even gaps may be possible. Still, the sequences are expected to be nearly identical, so we use corresponding match-mismatch parameters. The default word size is fine here; we don't increase it because a fortuitous mismatch can prevent seeding for a short oligo. Complexity filtering is turned off because we want the entire oligo to match, and low complexity isn't expected to be a problem with such a short query sequence. Because there is quite a bit of variation from one oligo to the next, we can't set a specific E value. Instead, we use the default and visually inspect the report after the search.

9.1.1.2 NCBI-BLAST parameters

blastall -p blastn -d <genome> -i <oligo> -G 2 -E 1 -F F

megablast -d <genome> -i <oligo> -W 11 -F F -D 2

9.1.1.3 WU-BLAST parameters

blastn <genome> <oligo> M=1 N=-3 Q=3 R=1

9.1.1.4 Expected results

There may be several alignments between the oligo and genome, and not all of them may align end to end. If you are simulating PCR, mismatches at the 3´ end of the oligo are of particular interest because they may prevent priming.

If you don't find any hits, the oligo may be too short for its alignments to achieve statistical significance. For short oligos, even a 100 percent matching alignment may have a score that is expected at random in a large search space. Try raising E. Also, make sure that the scoring scheme favors near identity. Otherwise, lambda may transform the score to a very low amount of information, and you may not be able to set E high enough to recover the alignment. Other possibilities include too large a value for W or the use of complexity filters.

If you find too many hits, increase the stringency of the search by decreasing E. The suggested scoring scheme is already pretty strict, but you may want to set the gap penalties higher or turn off gapping entirely if you find too many gapped alignments. It may be that the query is just found in many places. If you don't care about the details of the alignments, tabular format is convenient to parse and takes up much less space. See Appendix A and Appendix E to learn how to report in tabular format.

9.1.1.5 Optimizations and variations

If you have many oligos to map, a technique called query packing (see Chapter 10) can greatly improve your speed. If you're interested only in exact matches, you can set the word size to the same size as the oligo. This is probably not a good simulation for what happens in a test tube, but it will make the search faster. Here, you might consider using MegaBLAST rather than BLASTN because it automatically packs queries and uses large word sizes but make sure the word size isn't larger than the oligo. If you want to search for oligos cross-species, be prepared to sift through many alignments because the expectation for low-scoring alignments may be very high.

9.1.2 Mapping Nonspliced DNA to a Genome

Many BLASTN searches fall into the same general category in which a moderately sized DNA sequence (usually around 500 bp) is used to query a genome. There are separate protocols for spliced query sequences, searching EST databases, and exploring distantly related sequences.

9.1.2.1 Approach

Our alignment parameters favor near identity and use a large word size to make the search faster. It is probably not necessary to set a value for E because the word size alone provides specificity. But if you lower the word size, you will want to set E to a small value that depends on your search space. The value presented here is only a suggestion. As with any search involving genomic DNA, you should mask repeats before you begin. See Chapter 7 for more details.

9.1.2.2 NCBI-BLAST parameters

blastall -p blastn -d <genome> -i <dna> -G 1 -E 3 -W 30 -F "m D" -U -e 1e-20

megablast -d <genome> -i <dna> -F "m D" -U -D 2

9.1.2.3 WU-BLAST parameters

blastn <genome> <dna> M=1 N=-3 Q=3 R=3 W=30 wordmask=seg lcmask E=1e-20

9.1.2.4 Expected results

True matches between the query and the genome ought to align from end to end with near identity. Because BLAST is a local alignment program, you can't require the alignment to cover the entire length of the sequence, so you just have to look for this property in the output. If the alignment doesn't go end to end, the sequence quality of the query might drop at the ends (which happens with raw sequencing reads).

Even if you mask repeats, they can still cause trouble. Some repeats aren't very abundant, or are limited to a particular region/chromosome and therefore may not be part of your repeat library. If your report is particularly long, look for regions that are over-represented in alignments. Graphical reports are very useful (see Appendix D). You may have to mask or omit troublesome regions by hand if they continue to give you problems.

Low-complexity sequences can also be problematic because not all instances are caught by the default parameters of complexity filters. You can further reduce nonspecific hits by filtering both the query and the database, but since the database is case-insensitive, soft-masking isn't an option. If you find that repeats or low-complexity matches dominate your report, you will probably have to run the report through a parser and select the hits that are nearly full length.

Genomes sometimes have regions of large duplications. While you may expect a single near-identity alignment, you can find multiple matches if your query has paralogs. Depending on how much time has passed since the duplication event, paralogs may be very distant or identical to one another.

9.1.2.5 Optimizations and variations

If you don't care much about the alignments, tabular output will help you read the report (see Appendix A and Appendix E). If you have a number of query sequences, this is a really good place to use MegaBLAST; it was designed for tasks such as this. For sequences that aren't expected to be identical, see Section 9.1.4.

9.1.3 Mapping a cDNA/EST to a Genome

Determining the correct structure of eukaryotic protein-coding genes isn't an easy task because genes are broken up into exons. One of the most accurate methods for determining exon-intron structure involves mapping transcripts back to their origin in a genome. This procedure sounds simple, but it is actually a bit complicated, and its difficulties shouldn't be underestimated. A related, but more difficult, problem is mapping transcripts between species (see the following protocol). Also see Section 9.1.5.

9.1.3.1 Approach

Most exons are 100-200 bp long, but there is a large range from only a few nucleotides to several kilobases. Generally, 99 percent of exons are larger than 50 bp, so large word sizes work fine. We suggest that you use typical near-identity parameters, but choose a word size that isn't quite as large as the previous protocol because it may be difficult to seed short exons with low quality sequence. As with any DNA search, the query should be repeat-masked prior to the search, and lowercase masking is preferred. Including a low value for Ereduces many nonspecific hits.  The proper value for E depends on the length of the query and the size of the database. The value given here is only a suggestion.

9.1.3.2 NCBI-BLAST parameters

blastall -p blastn -d <genome> -i <cDNA> -G 1 -E 3 -W 15 -F "m D" -U -e 1e-20

megablast -d <genome> -i <cDNA> -W 12 -F "m D" -U -t 21 -N 0 -D 2

9.1.3.3 WU-BLAST parameters

blastn <genome> <cDNA> M=1 N=-3 Q=3 R=3 W=15 wordmask=seg lcmask E=1e-20

9.1.3.4 Expected results

BLAST performs local alignments and doesn't explicitly model exon-intron structure or splice sites. For this reason, HSP endpoints aren't expected to correspond to exon boundaries, though they sometimes do. It is common for the alignment to be a few bps longer than the exon boundary on each side but in a low quality sequence, the alignment may be shorter than the exon. To determine whether you missed a short exon, look for a region of the query that isn't represented in any alignment. A graphical report is useful here (see Appendix D). If you find such a region, you may wish to search just this piece against the intron between neighboring exons with the bl2seq program using a shorter word size.

The same issues involving unidentified repeats, low-complexity sequence, and paralogs you encounter when mapping nonspliced sequences also apply here. Pseudo genes may also pose a problem. They are fairly easy to detect because they look like mRNAs embedded in the genome rather than real genes. See Section 9.1.5.

9.1.3.5 Optimizations and variations

If you have several sequences, MegaBLAST is a better choice than BLASTN. If your sequences come from different species, also see Section 9.1.4. Several programs model exon-intron structure, and they often give accurate results. But don't expect them to work every time because small exons, low quality sequence, repeats, gene duplications, etc., also affect these tools. Some of the most popular programs include SIM4, SPIDEY, and EST2GENOME. If you want to align ESTs and genomes from distant species, EST2GENOME is the best choice because it doesn't seed alignments with words.

9.1.4 Cross-Species Sequence Exploration

Comparative sequence analysis is a powerful approach for finding biologically important sequences. You may search for protein-coding genes, regulatory elements, RNA genes, or other regions of interest. In most cases, you expect the sequences to be similar but probably not identical. Most changes will probably be nucleotide substitutions, insertions, or deletions, but some may be more extreme. For example, genes may gain/lose exons or introns, repetitive elements may be inserted/deleted, and large-scale duplications, inversions, and deletions and other rearrangements may occur. Be cautious. This book doesn't include multiple cross-species protocols, so use this one to modify the other BLASTN protocols.

9.1.4.1 Approach

Because we don't expect the sequences to be identical, we use relaxed parameters for both seeding and alignment. Therefore, we use typical exploration parameters (+1/-1 and word size of 9) with soft masking. These parameters are similar to the following repetitive element identification parameters, but we choose higher gap penalties here because functional sequences usually have few gaps. The choice of E is left to you because there are many appropriate values, depending on your level of stringency. Should you set it high, you may also want to increase the output reporting options (-b and -v in NCBI-BLAST; B and V in WU-BLAST).

9.1.4.2 NCBI-BLAST parameters

blastall -p blastn -d <genome> -i <dna> -r 1 -q -1 -G 1 -E 2 -W 9 -F "m D" -U

9.1.4.3 WU-BLAST parameters

blastn <genome> <dna> M=1 N=-1 Q=3 R=2 W=9 wordmask=seg lcmask

9.1.4.4 Expected results

Be on the lookout for repetitive and low-complexity sequence and pseudogenes. Cross-species alignments are difficult to interpret because many factors impact DNA evolution. Not all sequences evolve at the same rate, and it is very easy to confuse signal and noise. It's a good idea to approach your findings with skepticism. Sequences that are nearly identical may indicate a very important biological signal, or they may represent sequencing contamination.

Low-scoring alignments may be coincidental similarities of no biological significance. If your search space is large, even high-scoring alignments are expected by chance. Work out the Karlin-Altschul expectation and search fabricated sequences to appreciate how frequently false positive alignments occur. That said, some biological signals are short and may be buried in the stochastic noise. The best way to deal with them is to reduce your search space. For example, if you are interested in determining if there is a short region of interest within the intron of a gene, try aligning the intron with the orthologous intron from another genome rather than the entire genome.

If you want to identify orthologs between genomes, the most common approach is to label the best reciprocal match to the ortholog. This approach can be confounded by paralogs, so take synteny into account if possible and look for homology that extends to neighboring genes.

9.1.4.5 Optimizations and variations

Changing word size and scoring parameters are some of the most obvious alterations you can make to the protocol. Adjusting word size by a single point can alter the speed by a factor of 3 (this is a rough estimate and applies only to relatively small word sizes). How seeding affects sensitivity depends on what you're searching for and the expected divergence. Other useful scoring schemes are given in Appendix B.

If you're interested in coding sequence similarities, TBLASTX is a better choice for more distant relationships. But since this program runs relatively slowly, you are better off with BLASTN for closer relatives. As a rule, if the expected identity is less than 70 percent, switch to TBLASTX.

MegaBLAST isn't recommended for cross-species searches. The new discontiguous version is designed for this task, but the effective word size, 14, is too high.

9.1.5 Annotating Genomic DNA with ESTs

Identifying genes in genomic sequence is a difficult and important task. None of the many experimental and computational approaches is foolproof. One useful technique is to identify related transcripts. The most common form of transcript information comes from ESTs.

ESTs are sequencing reads derived from the ends of cDNAs, and they therefore conceptually correspond to the transcripts of protein-coding genes. But not all ESTs encode proteins: mRNAs have untranslated regions at both ends, and many ESTs don't actually correspond to genes. Various techniques are employed to increase the proportion of less abundant transcripts, and while these techniques are useful for discovering genes with low levels of expression, they tend to increase the fraction of nontranscript sequences (otherwise know as "junk"). As a result, some EST collections contain a lot of sequence that doesn't correspond to any protein. Unfortunately, it is no simple task to determine which EST sequences in a database correspond to transcripts and which are junk.

9.1.5.1 Approach

Even though we expect many ESTs to align to the genomic DNA with near-identity, exploration parameters are often more appropriate than mapping parameters. There may be genes for which matching transcripts haven't yet been isolated but for which similar transcripts are available. These genes may come from the same or different species.

As usual, you should repeat-mask the sequence prior to the search and (preferred, but not necessary) use lowercase masking rather than Ns. Set a low E value to cut down on false positive alignments, and set the output options high because some regions are highly expressed and may prevent the display of real, low-scoring alignments.

After this search is performed, you will probably want to use specialized alignment algorithms to determine the exon-intron boundaries. See Section 9.1.1.5 section in the protocol in Section 9.1.3.

9.1.5.2 NCBI-BLAST parameters

blastall -p blastn -d <est_db> -i <genomic> -r 1 -q -1 -G 1 -E 2 -W 9 -F "m D" -U -e

1e-20 -b 100000 -v 100000

9.1.5.3 WU-BLAST parameters

blastn <est_db> <genomic> M=1 N=-1 Q=3 R=2 W=9 wordmask=seg lcmask E=1e-20 B=100000

V=100000

9.1.5.4 Expected results

The first thing to remember is that not all transcript matches correspond to a gene. There is quite a bit of variability from one region of a genome to the next; some regions have very few nontranscript matches while others are completely covered in junk alignments. Several features can separate transcripts from junk. Here are a few rules to remember:

Multiple HSPs

Eukaryotic genes usually have introns, so if a database match has only one HSP, it may be junk. However, some genes have only one exon, and some exons are longer than sequencing reads, so you can't rely on this rule. If the exon contains coding sequence, there ought to be a large ORF. It is possible for what should be a single HSP to look like multiple HSPs if the extension terminates (low sequence quality and hard-masking cause this). True splicing events are easily identified from their coordinates; there ought to be a large coordinate gap in the genome but not the EST.

Repeat proximity

Exons almost never overlap a repetitive element and are usually at least 20 bp away. If you mask repeats with Ns, you won't find repeat overlaps, so look for HSPs that abut repeats. Low complexity is a completely different issue, and transcripts often overlap short, low-complexity regions.

Conservation

Most real genes are evolutionarily conserved. Therefore, ESTs from multiple species ought to align to a gene if the organisms aren't too diverged. However, just because ESTs pile up on a particular region doesn't mean that a gene is there. Many pseudogenes have this property, as do unmasked repeats and low entropy regions.

cDNA library

cDNA libraries constructed with subtractive/selective hybridization, micro-dissected tissues, or PCR amplification usually have a lot of nontranscript sequences. You may wish to track down the literature references for suspect ESTs to determine how their cDNA libraries were constructed.

Developmental regulation

Genes are regulated in time and space, so not all of them may be present in a particular cDNA library. For example, for ethical reasons, it is more difficult to find genes expressed in the human egg than the chicken egg. Regulation also occurs at the level of splicing, so some exons may be absent at one time or another.

Internal priming

cDNA libraries are usually constructed using poly-T primers to bind to the poly-A tail of mRNAs. If the genomic sequence has a long run of As, the resulting ESTs may all appear to end there; the real transcript may be much longer.

Stacking

You can often discriminate exons from junk by simply viewing how the alignments stack up on the genome. Exonic regions generally have short HSPs with numerous alignments (except if the exon is very long). Junk regions usually have long alignments with little overlap. Visually, exons look like towers, and junk looks like stepping stones. But internal priming can make junk look like it has a defined endpoint. A graphical report is very handy here (see Appendix D).

9.1.5.5 Optimizations and variations

If your database is very large, you may consider increasing the word size. This will, of course, reduce sensitivity, but if you're only interested in nearly identical ESTs, you can change to typical mapping parameters, and you may want to change to MegaBLAST as well. To increase sensitivity, rather than decreasing the word size, you may consider TBLASTX if you are most interested in the coding sequences.

9.1.6 Transcript Clustering and Extension

cDNA libraries are often redundant, with a handful of highly expressed genes making up most transcripts. Clustering transcripts to create a representative set with less redundancy is therefore a common task. A variant of clustering is extension, in which ESTs are assembled into larger, more complete entities. Clustering and extension are difficult even for seasoned bioinformatics professionals. Treat this protocol as a starting or learning point. BLASTN isn't the best program for this specific task. Several software packages for clustering and extension already exist, and this protocol can help you understand their features.

9.1.6.1 Approach

This is an "all versus all BLASTN" procedure, so your computational time may be immense if you have a lot of sequences. It's one of the few cases when hard-masking is preferable because repeats and low complexity can confuse clustering or extension if the wrong associations are made. To err on the side of safety, we recommend masking your sequences before creating your database.

We expect the alignments to be nearly identical, except for sequencing errors and allelic differences (polymorphisms), so we use typical mapping parameters and a very large word size (WU-BLAST parameters use a slightly smaller word size; you can include WINK to reduce the number of seeds because this combination is more efficient). It may seem risky to use large words with data that is expected to contain sequencing errors, but because the dataset is potentially very large and we're primarily interested in long, highly specific alignments, the risk is worth taking. The word size has enough specificity that it is probably not necessary to set E, but we do so just in case. Finally, we set the output options to "high" in case some clusters are particularly deep.

9.1.6.2 NCBI-BLAST parameters

blastall -p blastn -d <db> -i <EST> -G 1 -E 3 -W 30 -U -v 10000 -b 10000 -e 1e-10

9.1.6.3 WU-BLAST parameters

blastn <db> <EST> M=1 N=-3 Q=3 R=3 W=15 WINK=15 filter=seg lcmask V=10000 B=10000

E=1e-10

9.1.6.4 Expected results

In the simplest case, EST overlaps can be followed in either direction to create longer, virtual transcripts. For clustering, the representative EST is usually the one with the most matches (the longest). These straightforward expectations have many potential problems. Here are some of the common ones:

Repeats and low complexity

Alignments may not be able to cross long repetitive regions. It is therefore possible for multiple HSPs to be present for sequences that are 100 percent identical. A second alignment with unmasked sequences can solve this problem.

Multiple isoforms

Some genes have multiple promoters or undergo alternative splicing and therefore produce multiple forms of transcripts. As a result, transcripts that are identical for much of their length may have discrepancies that correspond to unique or variant exons.

Chimeras

Cloning artifacts and lane tracking errors may join two sequences artificially. It is difficult to differentiate between chimeric sequences and isoform variants with just transcript alignments. Mapping the ESTs in their source genome is the best way to sort this issue out.

Paralogs

Some genes exist in multiple copies in a genome. These may be completely identical to one another or quite diverged. Determining if two nearly identical ESTs come from the same gene isn't as simple because it depends on the sequencing error rate and the level of polymorphism. Mapping transcripts to their genomic source can help solve this problem.

Internal priming

The presence of a poly-A tail is often taken as meaning the end of a transcript, but it may just be a run of A's in the middle of an exon. Real poly-A tails often have an AATAAA consensus sequence upstream, but a more reliable measure examines the genomic source to determine if the A's come from the genome or were added to a transcript.

9.1.7 Clustering with blastclust

Given a database of DNA sequences, it is often necessary to rapidly group related sequences for further analysis or simply identify redundancy at some level. One approach is to use BLASTN with rapid, insensitive search parameters, and then parse the output for the desired properties (e.g., 97 percent identity over at least 90 percent of the sequence length), and finally group all reads that are directly or indirectly (transitively) associated. Bioperl tools can automate such a procedure, but it takes a little work. The NCBI-BLAST distribution includes a standalone program called blastclust that is designed for just this task.

9.1.7.1 Approach

Two protocols are given below—one for clustering ESTs that are expected to be nearly identical across the length of the read (99 percent identity, 90 percent coverage), and another for shotgun sequences that have high identity over a smaller region of the read (97 percent identity, 10 percent coverage). The alignment parameters are preset for near identity, but some differences that may be the result of sequencing errors or polymorphism are allowed. Unlike other BLAST programs, blastclust doesn't allow soft masking.

9.1.7.2 EST clustering

blastclust -i <fasta file> -o <output file> -p F -L 0.9 -S 99 -b F

9.1.7.3 Shotgun sequences

blastclust -i <fasta file> -o <output file> -p F -L 0.1 -S 97 -b F

9.1.7.4 Expected results

The output from blastclust is one line for each cluster. Each line contains the identifiers for sequences in the cluster and may therefore be very long.

Repetitive elements are a problem because they may lead to false associations. This is especially true in the shotgun sequence approach where you're looking for high identity over short stretches. In contrast, the EST approach requires a high-identity match over a large portion of the sequence, making it less prone to small repeat or domain problems.

9.1.8 Vector Clipping

Vectors are DNA sequences used to clone (copy) fragments of DNA. They are commonly used in DNA sequencing. For various reasons, the vector DNA may inadvertently be present in a sequencing read. Therefore, a common practice in sequencing labs is to identify and remove vector sequences. This protocol describes how to identify vectors but not actually clip them. This additional step can be accomplished in many ways and is easily automated using the Bioperl tools.

9.1.8.1 Approach

Our goal is to take a batch of sequencing reads and search them against a database of vector sequences (a comprehensive database is distributed in GenBank). We expect vector sequences to align with near identity, so our parameters reflect this. The parameters here are almost the same as for oligo mapping because vector contamination may be relatively short. However, we add complexity filters because raw sequencing reads sometimes have an abundance of low complexity sequence, and we change the gap parameters to better simulate sequence error.

9.1.8.2 NCBI-BLAST parameters

blastall -p blastn -d <vector_db> -i <read> -G 1 -E 3 -W 10

9.1.8.3 WU-BLAST parameters

blastn <vector_db> <read> M=1 N=-3 Q=3 R=3 W=10 filter=seg

9.1.8.4 Expected results

Vector similarity usually occurs on one end of the query sequence, but it may not extend all the way to the end of the read if the sequence quality drops, and the alignment deteriorates. It's difficult to tell the difference between a short piece of vector contamination and a fortuitous similarity. If the alignment is at the end of the read and longer than 15 nucleotides, it's a good bet that the alignment is to vector.

Overcalling and undercalling are two potential problems in vector clipping. Poor sequence quality may lead to undercalling, so quality clipping usually precedes vector-clipping (this isn't a BLAST-based procedure). Undercalling can also occur if the value of E is set too high. Overcalling can result from believing that all alignments reported are vector similarities when they are really only expected at one end of the sequence.

9.1.8.5 Optimizations and variations

Query packing speeds up this BLAST search by a factor of 10. The standard vector database has many more vectors than may be used by the sequencing lab, so a good way to increase your efficiency is to minimize the vector database. You can use MegaBLAST here, though the default large word size poses a small risk for short regions of vector contamination.

9.1.9 Repeat Masking

Eukaryotic genomes often contain an abundance of repetitive elements. There are many kinds of repetitive elements, and these sequences may comprise most of a genome. Libraries of repetitive sequences are available from GenBank and elsewhere (http://www.girinst.org/Repbase_Update.html), but for newly sequenced organisms, you may have to build your own library.

9.1.9.1 Approach

Finding repetitive elements requires relaxed search parameters because their sequences are free to drift, and there is usually quite a bit of divergence within a particular family. We recommend soft masking rather than ordinary complexity filtering to ensure that the elements containing a low complexity sequence are aligned over their entire length. When choosing a value for W, we try to balance speed and sensitivity. While it may make sense to choose a very small value to ensure that all repeats are found, doing so isn't practical if you have to process many sequences. For WU-BLAST, we include the kap parameter, which omits calculating scores for combinations of alignments.

9.1.9.2 NCBI-BLAST parameters

blastall -p blastn -d <repeat_db> -i <dna> -r 1 -q -1 -G 2 -E 2 -W 9 -F "m D"

9.1.9.3 WU-BLAST parameters

blastn <repeat_db> <dna> M=1 N=-1 Q=2 R=2 W=9 wordmask=seg kap

9.1.9.4 Expected results

Alignments between repeats are expected to range from perfect identity to complete obscurity. The common classification scheme applies a score cutoff to discriminate between repeats and nonrepeats. Identifying the proper score threshold takes some experimentation because each repeat family has its own length and expected divergence. Overall, the score threshold determines the balance between undercalling and overcalling.

Some repetitive elements are mobile and may therefore insert themselves into other elements. If an insertion occurs near the end of an element, the alignment on the shorter side may fall below the score threshold.

9.1.9.5 Optimizations and variations

RepeatMasker (http://repeatmasker.genome.washington.edu) is the standard program used to identify and mask repeats. It uses a range of word sizes, scoring matrices, and cutoffs to optimize the sensitivity for each repeat family. One of its special features is that it clips full-length elements from sequences and performs a second round of searches with a "compressed" sequence. This enables it to find nested repeats. If your favorite genome is supported by RepeatMasker, it is probably better to use this software than write your own. However, if you want/need to do your own repeat masking, you will find that Bioperl tools are an enormous help.

9.1.10 Contaminant Detection

This protocol departs from the usual format because it is especially difficult and requires more than a single BLAST search. Contaminants come in many forms. Some, such as mitochondrial DNA mixed with nuclear DNA, are easily detected with near-identity parameters. But cross-species contamination is very difficult to detect. If you find an exact match between two genomes, is it contamination or a highly conserved region? There's no simple answer. Some genomes, however have specific signatures. For example, the human genome has many primate-specific Alu repetitive elements. If you find many Alu elements in a database of corn sequence, it's probably a contaminant.

The most critical part of contaminant detection is having representative databases. You can't find contaminants for which you have no sequences. On the other hand, if your sequence database is too large, you may spend an inordinate amount of time looking for contaminants. Repetitive element databases are good representative databases, and a reasonable approach to contaminant detection is to look for repeats that match other genomes better than your genome of interest. This won't catch everything, but it will tell you how much of a contaminant problem you may have.

9.2 BLASTP Protocols

Most BLASTP searches fall under the exploring category, which means you're trying to learn about your query sequence by comparing it to other proteins. You might also want to determine if particular regions are highly or not so highly conserved. Or you may want to gather proteins to build a phylogenetic tree. In any case, your main concern is how deeply you want to explore. The following protocols offer three levels of sensitivity.

9.2.1 The Standard BLASTP Search

Probably the most common BLAST search is BLASTP with default parameters. It is used in various settings because it balances speed and sensitivity. For example, if you want to compare all the proteins between two organisms, this is a good place to start. If the proteomes are very distant, the default parameters may not be ideal because alignments containing less than 35 percent identity aren't as easily detected. If the proteomes are very close, the standard search is still a good strategy because not all proteins evolve at the same rate, and some may diverge rather quickly.

9.2.1.1 Approach

We'll make only one adjustment to the default NCBI parameters. We use soft masking instead of normal complexity filtering so the entire alignment is scored. The WU-BLAST parameters are approximately the same as those of NCBI-BLAST.

9.2.1.2 NCBI-BLAST parameters

blastall -p blastp -d <db> -i <query> -F "m S"

9.2.1.3 WU-BLAST parameters

blastp <db> <query> hitdist=40 wordmask=seg postsw

9.2.1.4 Expected results

If you don't find any database hits, your query sequence may correspond to a novel protein. On the other hand, it may be that the parameters of the search are obscuring the similarity. If your query is very short, it may be difficult for it to achieve statistical significance. In this case, first try raising E. However, this step alone may not be enough, and you may have to change to a scoring matrix with a higher value of H (bits per aligned letter), such as BLOSUM80.

If you want to find remote homologies with short query sequences, be prepared for many false-positive alignments. If your sequence has a long, low-complexity region, be sure to have soft masking turned on. It's difficult to find collagens, for example, if complexity filters are destroying most of the alignment. Finally, try the slower, more sensitive search described later.

If you find that you have hundreds of database hits, you may be overrunning the output reporting parameters (-b and -v in NCBI-BLAST and V and B in WU-BLAST). If this is a concern, simply increase these values. However, if you're interested in only the top hits, you can either set E higher or use a search strategy designed for more similar sequences (below).

9.2.1.5 Optimizations and variations

The two protocols below offer speed-sensitivity tradeoffs. For more subtle changes, try altering T. If you use WU-BLAST, set W=4 and scale up T appropriately.

9.2.2 Fast, Insensitive Search

Increased speed is one reason to use an insensitive search. This is particularly true when performing multiple searches. Another reason is to increase the information content in the alignments, which is helpful for short query sequences whose alignments might otherwise fall below the significance threshold. As a rule, the insensitive search shouldn't be used for sequences that are expected to have less than 50 percent identity.

9.2.2.1 Approach

A simple way to make BLASTP faster is to ignore neighborhood words and require that seeds be formed from identical words. Because the sequences are expected to be very similar, we choose the BLOSUM80 scoring matrix and set a low value for E. The proper value for E depends on the query length and database size, so treat the value given next as a starting point.

9.2.2.2 NCBI-BLAST parameters

blastall -p blastp -d <db> -i <query> -F "m S" -f 999 -M BLOSUM80 -G 9 -E 2 -e 1e-5

9.2.2.3 WU-BLAST parameters

blastp <db> <query> wordmask=seg W=3 T=999 matrix=BLOSUM80 Q=11 R=2 postsw E=1e-5

9.2.2.4 Expected results

Most of your alignments should have high percent identities. You will find some that dip to 30 percent, but this doesn't ensure that you can find such alignments in general. With such insensitive parameters, it is unlikely that you will overrun the output cutoffs, but it's worth checking anyway. Set -v and -b higher (V and B in WU-BLAST), or decrease E as you see fit.

9.2.2.5 Optimizations and variations

You can't make the NCBI-BLAST search much faster than it is because the parameters are already near optimum. Setting the two-hit distance lower gives a minute increase in speed that isn't worth the loss in sensitivity. You can play around with the WU-BLAST seeding parameters by changing W, T, and hitdist.

9.2.3 Slow, Sensitive Search

If you're having a hard time finding sequences similar to your query or if you're looking for distant relatives, you may have more success with sensitive parameters.

9.2.3.1 Approach

We recommend lowering T and choosing a scoring matrix designed for greater divergence. The NCBI-BLAST and WU-BLAST scoring parameters are slightly different because they don't have the same built-in estimates for lambda. As usual, we suggest soft masking to align the low-complexity sequence properly; here it's particularly important because we want to make sure that all positive scores are counted. When searching for remote similarities, some real signals can have very low scores. For this reason, even though it will make the report longer, we set E higher. For the same reason, we increase the output reporting parameters.

9.2.3.2 NCBI-BLAST parameters

blastall -p blastp -d <db> -i <query> -f 9 -F "m S" -M BLOSUM45 -e 100 -b 10000 -v

10000

9.2.3.3 WU-BLAST parameters

blastp <db> <query> T=9 wordmask=seg hitdist=60 matrix=BLOSUM50 Q=13 R=1 E=100

B=10000 V=10000

9.2.3.4 Expected results

Whenever you increase sensitivity, expect a decrease in specificity. These parameters are very sensitive, so many of the alignments may be chance similarities and of no biological significance. On the other hand, some biological signals aren't modeled well by BLAST statistics and what may appear as a very low score may be of real interest. Reading a BLAST report containing thousands of alignments isn't always entertaining, so if you're looking for something specific, such as an alignment to a particular region, you may be able to automate the reading with a BLAST parser.

9.2.3.5 Optimizations and variations

The probability model of BLAST assumes that amino acid pairings are independent of their neighbors. But some domains have characteristic signatures. So if your protein belongs to a family of related proteins, you may be able to find more distant relatives by choosing an algorithm with a position-specific scoring matrix, such as PSI-BLAST or HMMER. However, if your query is a novel protein, the best you can do is make your search parameters more sensitive.

To increase sensitivity even more, turn off the two-hit algorithm. In NCBI-BLAST, set -P 1 and in WU-BLAST, remove hitdist=60.

9.3 BLASTX Protocols

BLASTX is generally used to find protein coding genes in genomic DNA or to identify proteins encoded by transcripts. BLASTX runs relatively slowly, and can be the bottleneck in a sequence annotation pipeline. Most BLASTX searches are of the exploring variety. However, it is sometimes necessary to identify nearly identical sequences quickly. The last protocol gives some advice.

9.3.1 Gene Finding in Genomic DNA

Most proteins are related to other proteins. This makes BLASTX a very powerful gene-finding tool. As protein databases become larger and more diverse, BLASTX becomes even more useful because it can identify more and more genes.

9.3.1.1 Approach

As with any search involving genomic DNA, the sequence must have its repeats masked, and lowercase is preferred to Ns. If possible, low-complexity sequences shouldn't be masked with Ns to avoid terminating extension if a coding region contains a region of low complexity.

Since we want to capture a range of protein similarities, we'll use the default BLOSUM62 scoring matrix, which is a good all-around matrix. We increase the threshold score from its default of 12 to 14, which increases the speed more than twofold and is still quite sensitive.

We use a higher value for E because we don't want to miss low-scoring alignments that may be real genes. We set the output report options very high so that no matches are missed simply by truncation. Some protein families have many members and may fill up a BLASTX report by themselves.

For WU-BLAST, we offer two command lines. The first is similar to the NCBI parameter set, and the second uses a single large word rather than two small words. In our tests, the second set is slightly faster and more sensitive.

9.3.1.2 NCBI-BLAST parameters

blastall -p blastx -d <db> -i <genomic> -F "m S" -U -f 14 -b 10000 -v 10000 -e 100

9.3.1.3 WU-BLAST parameters

blastx <db> <genomic> wordmask=seg lcmask hitdist=40 T=14 B=10000 V=10000 E=100

blastx <db> <genomic> wordmask=seg lcmask W=4 T=20 B=10000 V=10000 E=100

9.3.1.4 Expected results

Percent identity is often a good indicator of the reliability of a protein match. Lengthy alignments above 50 percent identity don't occur by chance unless the sequence has some compositional bias. Alignments below 35 percent identity should be met with skepticism, and those below 30 percent aren't very reliable.

The ends of alignments often overrun exon boundaries. If the extensions are too long, they may force an exon into a separate alignment group. If the exon is short, it may not be able to achieve statistical significance when separated. To counteract this phenomenon, you can set the X parameters lower to prevent the extensions from going too far, but this can destroy weak alignments. Another option in WU-BLAST is to use olf and olmax (and their gapped counterparts golf and golmax) to change the overlap rules. These solutions aren't foolproof, so the best approach is to realize that this scenario could happen, and be on the lookout for gaps in coverage.

If your report is very long, you may have some unmasked repeats or low-complexity regions. A graphical report (Appendix D) can help determine where such repeats occur. Look for regions in which the alignment depth is very high. You may have to mask them by hand.

A less obvious problem is associated with GC-rich regions of DNA. These regions tend to have long open reading frames and can match various proteins. If the alignment threshold (Chapter 5) is low, these compositionally biased alignments may be combined to give significant scores. In WU-BLAST, you can set S2 higher, limit alignment groups with topcomboN, or ignore alignment groups with kap. A simpler and more general solution is to raise E.

All gene-finding procedures can be tricked by pseudogenes, and BLASTX is no exception. As discussed in the beginning of the chapter, you can't give stop codons highly negative scores to prevent them from appearing in alignments, that is, unless you use ungapped alignments, which are less sensitive. Rather than trying to remove pseudogenes, try to recognize them quickly. Most pseudogenes have internal stop codons or frame-shifts. While stop codons are easy to find because they appear as a * in an alignment, frame-shifts may not be immediately apparent. Usually, frame-shifts appear as overlapping HSPs in different frames, but they may also appear as HSPs that are very close rather than overlapping. Introns are rarely less than 30 nt, so any HSPs that are closer may result from a frame-shift. The presence of repeats overlapping or abutting an HSP is also a good indicator of a pseudogene. Retro-pseudogenes are derived from mRNAs, so look for poly-A tails or a lack of introns (you may have to do a little research to find out if other versions of the putative gene normally have introns). Also, retro-pseudogenes usually correspond to highly transcribed, conserved genes such as the protein components of the ribosome.

9.3.1.5 Optimizations and variations

Chapter 12 discusses the serial search strategy as a way to vastly improve the speed of translating BLAST searches without losing sensitivity. It's really the best way to run BLASTX. The protocol here and the serial strategy assume that the proteins are reasonably similar. But what if you want to look for remote coding similarities? You can first try lowering T, which will make the search take longer. However, if you're primarily interested in only a short region, lowering T can identify distant relatives. See Section 9.2 section for sensitive searches for appropriate parameters.

9.3.2 Annotating ESTs (and Shotgun Sequence)

Given a collection of ESTs, one of the first analysis tasks is to determine what proteins they encode. The usual procedure is to find the best protein match. Before annotation, the EST may have an identifier such as:

>my_EST_001

At the end, the EST may be annotated with a FASTA definition line like this:

>my_EST_001 similar to Homo sapiens GATA transcription factor

While this definition is useful for classifying transcripts, it leads to a transitive annotation problem in which the similarity is eventually misapplied. You should use this procedure with discretion, and please don't submit such descriptions to public databases without good reason. Finally, whatever you do, don't concatenate multiple FASTA definition lines for the annotation because it can become confusing later on.

Many genome sequencing projects begin with a survey of random, whole genome shotgun reads. This protocol also works for shotgun genomic sequence.

9.3.2.1 Approach

It's a good idea to check your sequences for vector and other contaminants before you begin this search, and if your sequences are genomic in origin, mask repeats as well. ESTs may also contain repeats, but for genomes that have short 3´UTRs and few repeats, this isn't necessary.

This task is much easier to accomplish with a local BLAST installation because you probably have many sequences to classify. If your FASTA file contains multiple sequences, BLAST conveniently processes each one in turn and its output will contain one report for each sequence.

The following alignment parameters are slightly less sensitive than the default parameters and are a good compromise for speed and sensitivity. Since we're really only interested in the description of the top match above some threshold, we set minimal output report parameters and E to 1e-10 (a somewhat arbitrary, but low value to prevent misclassification). Since soft masking can sometimes allow a region of low-complexity to dominate an alignment score, and since we won't look at each alignment, we err on the side of making no inference rather than making the wrong inference. We therefore use ordinary complexity filtering. In keeping with this philosophy, our choice of alignment parameters favors specificity and speed over sensitivity. For WU-BLAST, we offer two command lines. The second is slightly faster and more sensitive.

9.3.2.2 NCBI-BLAST parameters

blastall -p blastx -d <db> -i <ESTs> -U -f 14 -e 1e-10 -b 0 -v 1

9.3.2.3 WU-BLAST parameters

blastx <db> <ESTs> filter=seg lcfilter hitdist=40 T=14 E=1e-10 B=0 V=1

blastx <db> <ESTs> filter=seg lcfilter W=4 T=20 E=1e-10 B=0 V=1

9.3.2.4 Expected results

Depending on the source of the sequence, there may not be many matches. With a low E value, most matches should be real similarities.

9.3.2.5 Optimizations and variations

This is a task that greatly facilitated by the Bioperl tools (see www.bioperl.org). Using them simplifies running the BLAST job, parsing the output, editing sequence descriptions, and rewriting the files in FASTA format. BLAST parsers also let you apply other useful criteria for assignment—for example, requiring 40 percent identity. Lacking familiarity with Bioperl, you can pipe your report through common Unix tools. For example, you can capture the first line of the query and its best match with the following grep:

blastall -p blastx ... | grep "^Query=\|^>"

Unlike the previous protocol, serial searching isn't expected to help much because the first and second searches will find the same alignment.

9.3.3 Super-Fast BLASTX

At times you'll need to quickly map between a genomic sequence and its encoded proteins. This protocol represents the fastest way to find nearly identical matches with BLASTX.

9.3.3.1 Approach

Our general approach is to set the seeding, extension, and evaluation parameters as insensitively as possible, within reason; we still want to be able to find matches to sequences with allelic differences. We don't bother changing the scoring matrix; it doesn't impact the speed of the search.

The NCBI-BLAST version of BLASTX isn't the best program for quickly finding protein matches to DNA because its maximum word size is 3. The parameters that follow use this size and a minimum distance for the two-hit algorithm (4) for an effective word size of 6 (yes, 4 is a strange way of specifying zero distance between two three-letter words). Oddly, the search runs faster when using maximum gap penalties rather than specifying no gaps with -g F. We don't set lower values for because in practice, it has almost no impact on speed.

The WU-BLAST search uses a large word size without a neighborhood (automatically turned off at word sizes of 5 and above). The WINK parameter greatly reduces the total number of words and is one of the keys to the speed of this search. With W=6 WINK=6, the effective word size is 12. With such stringent seeding parameters, there is no need to change X or the alignment thresholds (see Chapter 5).

We also include a command line for the classic WU-BLAST 1.4, which works quite well for this task. (The 1.4 version is nearly identical to the 1.4 version of NCBI-BLAST that is no longer available). With this program, a word size of 5 performs best. Lowering X to terminate extensions early and increasing the alignment threshold S2 to reduce the computational burden of combined scores are both useful here. The values for X and S2 are calculated based on the default BLOSUM62 matrix. The most negative score is -4, so X=10 allows at least two mismatches. The average match score is about 5, and the average mismatch score is about -1.5 (you can estimate these scores quickly by looking at common amino acids such as alanine and glycine). S2=65 corresponds to a 90 percent identity alignment of 15 amino acids.

9.3.3.2 NCBI-BLAST parameters

blastall -p blastx -d <db> -i <dna> -f 999 -A 4 -G 32767 -E 32767

9.3.3.3 WU-BLAST parameters

blastx <db> <dna> filter=seg W=6 WINK=6 nogap

9.3.3.4 WU-BLAST 1.4 parameters

blastx <db> <dna> filter=seg W=5 S2=65 X=10

9.3.3.5 Expected results

Not all exons will be hit using such parameters, but with this mapping experiment, the general coordinates are of greatest interest. If you need an accurate alignment, this search can be followed by a more sensitive search using a serial strategy or bl2seq.

How much faster are these searches? Speed depends on sequence length and content, but as a general observation for sequences in the 50-200 Kb range, if the default NCBI parameter set is considered 1x, the fastest NCBI-BLAST is 6-8x, WU-BLAST is 100-500x, and the classic WU-BLAST 1.4 is 50-150x.

9.3.3.6 Optimizations and variations

Other alignment algorithms such as BLAT (yes, the name is nearly identical), index the database as well as the query. They may prove faster, but they may also have larger resource requirements.

9.4 TBLASTN Protocols

TBLASTN and BLASTX are very similar in that one sequence is protein and the other is nucleotide. But their usage is different. TBLASTN commonly maps a protein to a genome or searches EST databases for related proteins not yet in the protein databases.

9.4.1 Mapping a Protein to a Genome

Many avenues of investigation focus on a specific protein—for example, medical research on a genetic disease. For many proteins, there exist several closely related homologs, and understanding the role of a particular protein often means studying the near neighbors because they sometimes have interesting properties. The genomic environment of an encoded protein is often of great interest because the genomic sequence contains regulatory elements that determine where and when proteins are expressed. So, in addition to the typical BLASTP search for homologous proteins, it is also useful to do a TBLASTN search against your favorite genomes.

9.4.1.1 Approach

Even though this is conceptually a mapping experiment, we don't choose extremely insensitive parameters because we also want to identify closely related proteins that may be of interest. The seeding parameters, which require two matching words in a 40 aa window, capture a surprising amount of variability. We'll provide an additional WU-BLAST command line that uses a single, large neighborhood. It's both faster and more sensitive than the two-hit version but takes substantially more memory. We set E to a low value to cut down on the number of low scoring hits that may be prevalent in a large search space, but if your query is especially small, this value should be increased.

9.4.1.2 NCBI-BLAST parameters

blastall -p tblastn -d <genome> -i <protein> -f 999 -e 1e-5

9.4.1.3 WU-BLAST parameters

tblastn <genome> <protein> filter=seg T=999 E=1e-5

tblastn <genome> <protein> filter=seg W=5 T=25 E=1e-5

9.4.1.4 Expected results

If all goes well, you'll find your gene in the genome. You may also find several related proteins. If the genome is small, you may not find more than one, but if your source is larger, and more complex, you may find several copies. Genomic sequences in BLAST databases are sometimes not masked, so if your search takes a long time to complete, or if you find hundreds of similar genes, you may be hitting a repeat.

Some of the hits may be to pseudogenes. High stop codon penalties with ungapped extension will not remove all pseudogenes, so in addition to inspecting alignments for the presence of stop codons, also look for overlapping HSPs (from frame shifts) and single HSPs (when multiple exons are expected). Nearby repeats and poly-A tails in the genomic sequence are other useful indicators.

9.4.1.5 Optimizations and variations

We recommend using the serial search strategy described in Chapter 12 for all translating BLAST searches that employ long sequences. If you can't do this automatically, you can follow up each of the hits found here with bl2seq.

For more sensitivity, reduce the value of T to allow neighborhood words. For a less sensitivity and a lot more speed, W=5 T=999 is a useful WU-BLAST setting. It also has the added benefit of using much less memory than W=5 T=25. If you need to do a quick lookup and are only interested in identical matches, you can adapt the protocol found in Section 9.3.3 to TBLASTN.

One way to speed up this procedure is to start with a BLASTP search to identify similar proteins and then follow up each hit in its own genome with near-identity parameters. One disadvantage to this strategy is that you will have more searches to perform and a lot of sequence handling. The assumption that the protein database you're using for your BLASTP search contains all of the genome's genes is also problematic. It's much safer to assume that not all genes have been found and use TBLASTN for your search.

9.4.2 Mining ESTs (and Shotgun DNA) for Protein Similarities

Since ESTs contain fragmentary information and are often unannotated, proteins encoded in ESTs may not appear in protein databases for a while. Therefore, if you're looking for relatives of your favorite protein, search a comprehensive EST database with TBLASTN, in addition to a typical BLASTP search. You can also use this protocol to search shotgun genomic sequence.

9.4.2.1 Approach

In choosing our alignment parameters, we need to balance sensitivity and speed. We want to be able to identify a range of similar sequences, so we use the default scoring matrix and gap penalties. At the same time, EST databases (and especially shotgun genomic databases) can be quite large, so we use slightly insensitive seeding parameters.

For WU-BLAST, we include four command lines. Number 1 is approximately the same as the NCBI-BLAST parameters. Relative to the first, number 2 is slightly faster and more sensitive, number 3 is about the same speed but much more sensitive, and number 4 has the same sensitivity but is much faster.

We use the default value for E (10) because some of the EST/shotgun matches may contain only a small portion of coding sequence. We set the output parameters high so we don't miss any alignments by report truncation.

9.4.2.2 NCBI-BLAST parameters

blastall -p tblastn -d <est_db> -i <protein> -F "m S" -f 15 -b 10000 -v 10000

9.4.2.3 WU-BLAST parameters

tblastn <est_db> <protein> wordmask=seg W=3 T=15 hitdist=40 B=10000 V=10000

tblastn <est_db> <protein> wordmask=seg W=4 T=16 hitdist=40 B=10000 V=10000

tblastn <est_db> <protein> wordmask=seg W=4 T=20 B=10000 V=10000

tblastn <est_db> <protein> wordmask=seg W=4 T=99 B=10000 V=10000

9.4.2.4 Expected results

Our seeding parameters are on the insensitive side, so if you don't find what you're looking for, the first parameter to change is T (-f in NCBI-BLAST). Drop it by one or two points at a time because the search takes longer with each decrement.

Sequencing errors, especially insertions and deletions, may terminate extension. This can lead to multiple HSPs or possibly the loss of smaller HSPs. Check the coordinates of the alignments, and if large regions are missing, they may correspond to out-of-frame coding sequences. They may also be UTRs in a transcript or introns in genomic sequence. Reducing the search space to increase sensitivity enables you to recover shorter HSPs; bl2seq is convenient for this task.

9.4.2.5 Optimizations and variations

If you're looking for near identities, you can make this search much faster. See Section 9.3.3 for parameters. Because the query and database sequences are all short, you can't optimize this search with a serial strategy.

9.5 TBLASTX Protocols

As discussed in Chapter 2, coding sequences evolve slowly compared to surrounding DNA. This makes TBLASTX a powerful gene-prediction tool for genomes that are appropriately diverged. What is the proper evolutionary distance? Because genes and organisms change at varying rates, there is no simple answer like "100 million years." If the distance is too great, the similarities may no longer be visible, but if the distance is too small, sequence similarity loses discriminatory power. For example, there is little sense in performing TBLASTX searches between humans and E. coli or humans and chimpanzees.

Historically, TBLASTX has not been as popular as the other BLAST programs for several reasons. First, TBLASTX is computationally intensive. Second, until recently, there were not many completely sequenced genomes. Third, when you get a match, you will rarely find a useful description for what was found—just an alignment between two potential coding sequences. As more genomes are sequenced and computer performance continues to rise, TBLASTX will become more useful.

9.5.1 Preventing Stop Codons

The scoring matrices distributed with the BLAST programs give positive scores for aligning stop codons to one another. This is unacceptable for discriminating between coding and noncoding regions. Chapter 10 covers installation of BLAST software and describes how to create derivative scoring matrices with highly negative stop codon scores. If you don't have permission to make these changes, you can create the derivatives in your home directory. In this case, you need to specify the explicit path to your matrix rather than use just the name. WU-BLAST operates a little differently, and it is more convenient to specify alternate scores on the command line. Each protocol gives an example of this. As discussed in Chapter 8, gapped alignment can skip over stop codons. For this reason, consider using ungapped alignment for your TBLASTX searches.

9.5.2 Finding Undocumented Genes in Genomic DNA

Gene prediction is difficult. There are no genomes for which all protein coding genes are completely known. One of the most highly investigated genomes is the human genome, but despite what you read in the news, the number of genes can't be stated with much confidence. Counting genes is easier than determining their exact structure, and as a result, there are many proteins for which the true sequence is in doubt. Many genes are still waiting to be discovered (and many documented genes aren't real genes).

9.5.2.1 Approach

TBLASTX is computationally expensive because it translates both strands of the query and database sequences in three frames on each strand. To make matters worse, the sequences and databases searched by TBLASTX tend to be large. To counteract these factors we choose insensitive seeding parameters, which is appropriate, considering that the extension algorithm is gapless and therefore also less sensitive.

Like any other search employing genomic DNA, it is always a good idea to mask repeats first. Here we prefer hard masking instead of soft-masking and normal complexity filtering. Our reasoning is that low-complexity sequence is common in genomic DNA and random word hits near low-complexity sequence may result in lengthy extensions, high alignment scores, and misleading statistical significance.

For WU-BLAST, we offer two command lines. The second, which uses a single, large word rather than two small words, is faster and more sensitive, but requires more memory. It also shows how to change scoring matrix values from the command line with the altscore parameter.

9.5.2.2 NCBI-BLAST

blastall -p tblastx -d <db> -i <genomic> -f 999

9.5.2.3 WU-BLAST

tblastx <db> <genomic> filter=seg W=3 T=999 hitdist=40 nogap

tblastx <db> <genomic> filter=seg W=5 T=25 nogap altscore="* any -999" altscore="any * -999"

9.5.2.4 Expected results

This protocol can be used with either genomic or EST databases. However, searching EST databases with BLASTN is usually better. This discussion focuses on genome-genome searches. For genome-EST results and interpretations, see the appropriate BLASTN protocols.

Interpreting TBLASTX alignments isn't straightforward. It's nearly impossible to look at a report full of alignments and determine gene boundaries or the exact coordinates of coding exons. TBLASTX offers testable hypotheses. Regions with strong coding similarities may or may not correspond to real genes, but they are good candidates for experimental biology. Here are a few reasons why TBLASTX might find something missed by other approaches:

Genes in genes

Most gene-prediction algorithms don't predict genes within genes. However, the fact that large introns contain genes on their opposite strand is a relatively frequent phenomenon. TBLASTX can help identify these genes because the algorithm looks for local alignment similarities and has no bias for overall gene structure.

Alternative splicing

Some genes have several alternatively spliced forms. This is especially common in certain genomes, such as mammalian ones. Gene prediction algorithms usually find a single, optimal gene structure and no alternate forms. Because TBLASTX has no knowledge of splice sites, this doesn't pose a problem. Spliced variants may also have narrow windows of expression, which makes them difficult to find when using cDNA approaches. TBLASTX is less prone to missing these exons unless they are highly diverged.

Low expression

Genes expressed at low levels may have odd codon usage, which makes them less visible to gene prediction algorithms. Because the transcripts are rare, they are also less likely to appear in cDNA libraries and EST databases. TBLASTX isn't affected by codon biases or expression levels

9.5.2.5 Optimizations and variations

This experiment is much more efficiently run as a serial search. In this strategy, a preliminary, insensitive search identifies sequences that are similar, and a second, sensitive search produces the alignments. Chapter 12 discusses this approach. You can try the approach by running bl2seq on each sequence identified in the search.

9.5.3 Transcript-Transcript TBLASTX

When presented with a transcript of unknown function, you should first implement a BLASTX search to determine if such a transcript corresponds to a known protein. But what if it doesn't? One reason why it might not show similarities is because its encoded protein isn't yet in the protein database. It's also possible that the transcript doesn't encode a protein. If it does, though, it might have some undiscovered relatives, and the best place to look for such entities is in EST databases.

9.5.3.1 Approach

We use the same seeding parameters for the same reasons given in the previous section, Section 9.5.2. We employ soft-masking here, however, because the query sequence is short and probably doesn't contain as much of a low-complexity sequence.

9.5.3.2 NCBI-BLAST

blastall -p tblastx -d <est_db> -i <transcript> -f 999 -F "m S"

9.5.3.3 WU-BLAST

tblastx <est_db> <transcript> wordmask=seg W=3 T=999 hitdist=40 nogap

tblastx <est_db> <transcript> wordmask=seg W=5 T=25 nogap altscore="* any -999" altscore="any

* -999"

9.5.3.4 Expected results

Evolutionary distance between the sequences is the key to determining if an alignment really corresponds to a protein. If the sequences are derived from closely related species, similarity isn't much help. In such cases, try aligning the matches with bl2seq without extreme stop codon penalties. If you get alignments that cross stop codons, the sequences aren't diverged enough.

If the sequences come from distant species, the alignments correspond to coding regions. This interpretation is even more believable if there are matches from multiple species.

9.5.3.5 Optimizations and variations

If the sequence has a long open reading frame, it is more efficient to translate this frame first and then search with TBLASTN. Recognizing the true reading frame, though, isn't always easy (see Chapter 8). If in doubt, use this protocol.