BLAST (Basic Local Alignment Search Tool)

Chapter 6. Anatomy of a BLAST Report

This chapter explores the standard BLAST format. NCBI-BLAST and WU-BLAST are very similar, and their few important differences are described below. NCBI-BLAST offers several additional output options described in Appendix A.

6.1 Basic Structure

A standard BLAST report has four parts (see Figure 6-1).

Figure 6-1. The four parts of a standard BLAST report


Header

The first line contains the name of the program, its version, and its build date. If BLAST crashes or behaves unexpectedly, include this information when you report the problem to the authors. Next comes a reference to the scientific literature, which you should cite if you publish research that employed BLAST. The most important information in the header, the names of the query sequence and the database, appears next. The last line is a progress meter that is updated during the search.

One-line summaries

Each line gives the name of the sequence, the score of its highest-scoring alignment, and the lowest E-value for any HSP or group of HSPs. When the output comes from a web page, the one-line summaries are often hyperlinked to the alignments farther down. If you just want to know, for example, the names of the top matches, these one-line summaries are convenient.

Alignments

The alignments usually make up the bulk of the report. Figure 6-1 shows only one alignment. The following section discusses alignments in greater detail and gives examples from the five standard BLAST programs.

Footer

The footer reports search parameters and various other statistics. The most important features are the word size (W), neighborhood word threshold score (T), Expect (E), and the scoring scheme (scoring matrix or match/mismatch values and the gap costs) because these factors control the sensitivity and specificity of a search. The footer labels these values clearly.
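
As a rough illustration of how the one-line summaries can be pulled apart, the sketch below splits a summary line into a name, a score, and an E-value. It assumes an NCBI-style layout in which the last two whitespace-separated fields are the score and the E-value; the function name and the sample line are hypothetical, and WU-BLAST summaries carry different trailing columns, so the code would need adjusting for them.

```python
def parse_summary_line(line):
    """Split a one-line summary into (name, score, evalue).

    Assumption: the last two whitespace-separated fields are the score
    and the E-value, and everything before them is the sequence name
    and description.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("not a one-line summary: %r" % line)
    name = " ".join(fields[:-2])
    score = float(fields[-2])
    evalue_text = fields[-1]
    # Some older reports print very small E-values without a leading
    # digit (for example "e-100"), which float() would reject.
    if evalue_text.startswith("e"):
        evalue_text = "1" + evalue_text
    return name, score, float(evalue_text)


# Hypothetical summary line, made up for illustration.
example = "seq1 hypothetical example protein      210   2e-54"
name, score, evalue = parse_summary_line(example)
print(name, score, evalue)
```

Taking the fields from the right-hand end avoids having to guess how many words the description contains.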

6.2 Alignments

The alignments and alignment statistics reported by BLAST differ slightly from program to program. The rest of this chapter describes the details of BLASTP, BLASTN, BLASTX, TBLASTN, and TBLASTX alignments and shows how to recognize alignment groups.

6.2.1 BLASTP

BLASTP alignments are the simplest to understand. Figure 6-2 shows the anatomy of a typical BLASTP alignment.

Figure 6-2. A BLASTP alignment


Here are the parts you should pay attention to:

Score

This value is computed from the scoring matrix and gap penalties. A higher score indicates greater similarity. The raw score is shown without units, and the normalized score is followed by "bits."

Database sequence

The complete FASTA definition line is reported here along with the length of the sequence. All the alignments between the query and a specific database sequence are collectively called a hit. The hit in Figure 6-2 consists of a single alignment.

Expect

The number of alignments with this score or better expected at random, given the size of the search space, the scoring matrix, and the gap penalties. The lower the E-value, the less likely it is that the similarity arose by chance.

Statistics lines

Score, E-value, and percent identity always appear here. Depending on the program, percent positive scoring, P-value, group, gaps, strand, and reading frame may also be reported.

Coordinates

The coordinates of each sequence are indicated at the beginning and end of each line. The single alignment in Figure 6-2 is long enough that it is reported on three separate lines.

Alignment line

Letters that are identical between two sequences are reported here. Those that have positive scores in the scoring matrix are displayed with a plus sign. Gaps and nonpositive scores are blank.

Query and Sbjct

The query sequence is always listed first. The database sequence is abbreviated as Sbjct (subject).
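
The relationship between the alignment line and the statistics line can be sketched in a few lines of code. The example below counts identities, positives, and gaps from the three rows of an alignment block. It assumes that letters (or vertical bars in nucleotide reports) on the alignment line mark identities, that plus signs mark positive-scoring mismatches, and that gaps appear as hyphens in the sequence rows. The function name and the sample alignment are made up for illustration.

```python
def midline_stats(query, midline, sbjct):
    """Count identities, positives, and gaps in one alignment block.

    All three strings must have the same length (the aligned columns only,
    without the "Query:" / "Sbjct:" labels or coordinates).
    """
    if not (len(query) == len(midline) == len(sbjct)):
        raise ValueError("alignment rows must have equal length")
    # A letter (protein) or a vertical bar (nucleotide) marks an identity.
    identities = sum(1 for m in midline if m.isalpha() or m == "|")
    # Positives include identities plus the columns marked with "+".
    positives = identities + midline.count("+")
    gaps = query.count("-") + sbjct.count("-")
    return {
        "length": len(query),
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
    }


# Hypothetical alignment rows, made up for illustration.
stats = midline_stats("MKV-LAAGIT",
                      "MK+ LA GI ",
                      "MKIALAGGIP")
print(stats)
print("percent identity: %.0f%%" % (100.0 * stats["identities"] / stats["length"]))
```

Percent identity here is computed over the full alignment length, gap columns included.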

The definition line of the database sequence may span several lines if the BLAST database is a nonredundant database with concatenated definition lines. For more on this topic, see Chapter 11. The WU-BLAST format differs slightly from the NCBI format: gaps aren't reported on the statistics line, and the P-value (displayed as P or Sum P) is always reported in addition to the Expect.
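
To make the link between the raw score, the normalized (bit) score, and the Expect concrete, here is a minimal sketch assuming the standard Karlin-Altschul relationships: bits = (lambda * S - ln K) / ln 2 and E = N * 2^(-bits), where N is the size of the search space. The function names are hypothetical, and the values of lambda, K, and the search space are illustrative numbers chosen by the caller, not values taken from any figure in this chapter.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score to a normalized score in bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect(bits, search_space):
    """Expected number of chance alignments scoring at least `bits`."""
    return search_space * 2.0 ** (-bits)


# Illustrative parameters (assumptions, not taken from a real report).
lam, k = 0.267, 0.041
search_space = 1e9

bits = bit_score(100, lam, k)
print("bits = %.1f, E = %.2g" % (bits, expect(bits, search_space)))
```

Because the bit score folds lambda and K into the score itself, bit scores from searches that use different scoring schemes can be compared directly, whereas raw scores cannot.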

6.2.2 BLASTN

DNA is a double-stranded molecule, and genes may occur on either strand. This fact makes BLASTN alignments a little more difficult to interpret than BLASTP alignments. When a query sequence is searched against a database, both strands of the query are examined. The plus strand is the sequence in the FASTA file. The minus strand is the reverse complement of this sequence. If the similarity between the query and subject sequences is on the same strand, both sequences are labeled as being on the plus strand and the coordinates increase from left to right (Figure 6-3a). Since BLAST just aligns letters and has no model of genes or other features, it is impossible to determine on which strand a gene lies from a BLASTN alignment. Even if an alignment is labeled as "Plus/Plus," the encoded gene may be on the minus strand.

When the minus strand of the query sequence is similar to a database sequence, the alignment is reported with either the subject or the query sequence in reversed coordinates. In NCBI-BLAST, the subject coordinates are flipped (Figure 6-3b), but in WU-BLAST, the query coordinates are flipped (Figure 6-3c).

Figure 6-3. BLASTN alignments: (a) NCBI-BLAST, same strand; (b) NCBI-BLAST, different strand; (c) WU-BLAST, different strand


Table 6-1 shows how strand is displayed in the five standard BLAST programs.

Table 6-1. Strandedness

| Program | Plus / Plus | Plus / Minus | Minus / Plus | Minus / Minus |
|---|---|---|---|---|
| BLASTP | Always | Never; proteins don't have strand | Never | Never |
| BLASTN | Same strand | NCBI-BLAST flips the subject sequence | WU-BLAST flips the query sequence | Never |
| BLASTX | The query sequence is labeled as Frame +1, +2, +3 | Never | The query sequence is labeled as Frame -1, -2, -3 | Never |
| TBLASTN | The subject sequence is labeled as Frame +1, +2, +3 | The subject sequence is labeled as Frame -1, -2, -3 | Never | Never |
| TBLASTX | Any combination of positive or negative frames for either the query or subject sequence (applies to all four columns) | | | |

Here are a few minor notes:

· Both NCBI-BLAST and WU-BLAST change the alignment format for BLASTN so that matches are shown as vertical bars. Because BLASTN uses match/mismatch scoring, no mismatch can score positively, so plus signs never appear on the alignment line.

· NCBI-BLAST displays nucleotide sequences in lowercase, whereas WU-BLAST displays them in uppercase.
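
The strand conventions described in this section can be sketched as follows. The first function builds the minus strand as the reverse complement of the sequence in the FASTA file. The second infers a strand label from the direction of the reported coordinates, assuming that a flipped sequence is shown with its start coordinate larger than its end coordinate. Both function names are hypothetical, the coordinates are made up, and ambiguity codes other than A, C, G, and T are left untouched.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the minus strand of a plus-strand nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_label(q_start, q_end, s_start, s_end):
    """Label an alignment by the direction of its coordinates.

    Coordinates that decrease from left to right indicate that the
    sequence was flipped, i.e. it is shown on its minus strand.
    """
    query = "Plus" if q_start <= q_end else "Minus"
    sbjct = "Plus" if s_start <= s_end else "Minus"
    return query + "/" + sbjct


print(reverse_complement("ATGC"))
print(strand_label(1, 60, 100, 159))   # same strand
print(strand_label(1, 60, 159, 100))   # subject flipped (NCBI-BLAST style)
print(strand_label(60, 1, 100, 159))   # query flipped (WU-BLAST style)
```

The label only describes how the letters were aligned. As noted above, it says nothing about which strand a gene actually lies on.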

6.2.3 BLASTX

Alignments from BLASTX are complicated by both strand and reading frame. The query sequence is translated in three frames on both the plus and minus strands. Chapter 2 discusses the reading frame in more detail. With three nucleotides per codon, the coordinates of the query sequence increase by threes (Figure 6-4a). On the plus strand, the reading frame is computed relative to the start of the plus strand; reading frame 1 starts at position 1 and reading frame 2 starts at position 2. On the minus strand, the reading frame is calculated relative to the reverse complement of the plus strand; the last letter of the FASTA file starts frame -1 and the second-to-last letter starts frame -2. Minus strand matches invert the query coordinates (Figure 6-4b).
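
The frame rules above reduce to a little modular arithmetic. The sketch below recomputes the reading frame from the query coordinates of a BLASTX alignment, assuming 1-based coordinates and assuming that minus-strand matches are reported with inverted query coordinates, so that the first coordinate is the larger one. The function name is hypothetical; the report already prints the frame, so this only illustrates how the label relates to the coordinates.

```python
def blastx_frame(q_start, q_end, query_length):
    """Infer the reading frame of a BLASTX alignment from query coordinates."""
    if q_start <= q_end:
        # Plus strand: frame 1 starts at position 1, frame 2 at position 2.
        return (q_start - 1) % 3 + 1
    # Minus strand: the last letter of the query starts frame -1,
    # the second-to-last letter starts frame -2.
    return -((query_length - q_start) % 3 + 1)


# Made-up coordinates on a query that is 100 nucleotides long.
print(blastx_frame(1, 90, 100))    # plus strand, frame 1
print(blastx_frame(5, 94, 100))    # plus strand, frame 2
print(blastx_frame(100, 11, 100))  # minus strand, frame -1
print(blastx_frame(99, 10, 100))   # minus strand, frame -2
```

Note that the minus-strand calculation needs the query length, because those frames are counted from the end of the sequence rather than from its start.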

Figure 6-4. BLASTX alignments (ovals indicate that nucleotide coordinates increase by threes (a) and are reversed for minus strand matches (b))


6.2.4 TBLASTN

TBLASTN alignments are very similar to BLASTX alignments, except that the roles of the database and the query are exchanged. Therefore, the coordinates of the database sequence increase by threes, and the database sequence has flipped coordinates on the minus strand.

6.2.5 TBLASTX

TBLASTX has more complicated alignments because both the query and the database have strand and frame. Figure 6-5 shows examples of all strand combinations. One of the most confusing aspects of TBLASTX alignments is that a number of different frames may represent the same region from both the query and subject. A TBLASTX alignment between two genomic sequences often highlights shared coding sequences. However, the correct frame of the encoded proteins can't be determined from a TBLASTX report. Chapter 8 and Chapter 9 discuss techniques that make TBLASTX more discriminating.

Figure 6-5. TBLASTX alignments (coordinates increase by threes and may have any combination of frames)


6.2.6 Alignment Groups

Alignment groups are one of the most confusing aspects of the BLAST report. Chapter 4 and Chapter 5 discuss how and why alignments are sometimes grouped to increase their statistical significance. However, the standard BLAST format doesn't make this structure easy to see. Figure 6-6 shows the scores reported for various alignments in a single database hit. The groups can be inferred from the Expect values, because the alignments in a group report the same E-value. If several groups happen to have the same E-value, it is more difficult to determine which alignments belong to which groups.
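
A simple way to approximate this inference in code is to bucket the alignments of one hit by their reported Expect value. The sketch below does exactly that, on the assumption that alignments in one group report the same E-value. It is a heuristic, not a reconstruction of what BLAST computed internally: two distinct groups that happen to share an E-value will be merged. The function name, the dictionary keys, and the sample values are hypothetical.

```python
def infer_groups(hsps):
    """Group the alignments of a single hit by their reported E-value.

    Each HSP is a dict with an "expect" entry holding the E-value exactly
    as printed in the report; comparing the printed text avoids
    floating-point surprises.
    """
    groups = {}
    for hsp in hsps:
        groups.setdefault(hsp["expect"], []).append(hsp)
    # Groups come back in order of first appearance in the report.
    return list(groups.values())


# Made-up alignments for one database hit.
hsps = [
    {"score": 120, "expect": "3e-40"},
    {"score": 45, "expect": "3e-40"},
    {"score": 38, "expect": "0.002"},
    {"score": 60, "expect": "3e-40"},
]
for group in infer_groups(hsps):
    print(group[0]["expect"], [h["score"] for h in group])
```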

Figure 6-6. Alignment groups (groups can be inferred from Expect values)


By default, WU-BLAST alignment groups are just as difficult to recognize as NCBI-BLAST groups. WU-BLAST has a very useful command-line option called topcomboN that organizes and limits the number of groups. Chapter 8 discusses topcomboN in more detail. Figure 6-7 shows how groups are organized by strand and then by Sum P-value for a single database hit. Groups are labeled and need not be inferred. Notice that some groups contain only one alignment.

Figure 6-7. WU-BLAST alignment groups with topcomboN=9
