BLAST (Basic Local Alignment Search Tool)

Chapter 13. NCBI-BLAST Reference

This chapter describes the parameters and options for the NCBI suite of BLAST programs. The NCBI distribution includes the blastall program, plus several ancillary programs that are either necessary for blastall (e.g., formatdb) or provide other BLAST-like searches that aren't included within blastall (e.g., blastpgp and blastclust). This reference also describes the various command-line parameters for the most important executables.

13.1 Usage Statements

If you forget the syntax for a particular parameter, you can view a usage statement from most programs by typing the program name followed by a dash (in some cases the dash isn't required, but it's easier to remember to use a dash with all programs). For example:

blastall -

formatdb -

fastacmd -

megablast -

bl2seq -

blastpgp -

blastclust -

13.2 Command-Line Syntax

All parameters for NCBI-BLAST programs are single letters and must be preceded by a single dash. Unlike many common Unix programs, the parameters for NCBI programs are never concatenated. All parameters may take arguments, including those that operate as true/false (T/F) switches. For such switches, the T/F is case-insensitive, and the argument may be omitted, in which case the switch is set to T. Finally, the space between the parameter and the argument is optional. The following commands are all identical.

formatdb -i db -o T -V t

formatdb -i db -o -V

formatdb -idb -ot -VT

formatdb -idb -o -V

The following command, however, is illegal because it tries to set -o to a value of V.

formatdb -idb -oV
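The syntax rules above can be sketched as a small parser. This is only an illustration of the rules as stated in this section, not code from the NCBI toolkit: the function name parse_ncbi_args is hypothetical, and the sketch assumes that no argument begins with a dash (so it can't handle a negative number) and that any argument spelled t or f is a switch value.

```python
def parse_ncbi_args(argv):
    """Parse NCBI-style arguments into a {letter: value} dictionary.

    Hypothetical helper that follows only the rules described in this
    section. It assumes no argument starts with a dash.
    """
    options = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if len(token) < 2 or not token.startswith("-"):
            raise ValueError(f"expected a dash and a single letter: {token!r}")
        letter, attached = token[1], token[2:]
        if attached:
            # No space between parameter and argument: -idb, -ot, -VT
            value = attached
        elif i + 1 < len(argv) and not argv[i + 1].startswith("-"):
            # Argument supplied as the next token: -i db
            i += 1
            value = argv[i]
        else:
            # Argument omitted: the switch is set to T
            value = "T"
        # T/F arguments are case-insensitive
        if value.upper() in ("T", "F"):
            value = value.upper()
        options[letter] = value
        i += 1
    return options


commands = [
    "formatdb -i db -o T -V t",
    "formatdb -i db -o -V",
    "formatdb -idb -ot -VT",
    "formatdb -idb -o -V",
]
# Drop the program name, then parse the remaining tokens.
parsed = [parse_ncbi_args(command.split()[1:]) for command in commands]

# The illegal form: V is read as the argument of -o.
illegal = parse_ncbi_args(["-idb", "-oV"])
```

All four entries in parsed come out the same, while illegal shows -o holding the value V and no -V switch at all, which is why the last command is rejected.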

13.3 blastall Parameters

blastall is controlled by several parameters. Many of the parameters have default settings and don't need to be explicitly assigned. Consider this simple command:

blastall -p blastp

Behind the scenes, this command is converted to:

blastall -p blastp -d nr -i stdin -e 10 -m 0 -o stdout -F T -G 11 -E 1 -X 15 -v 500 -b 250 -f 11 -g T -a 1 -M BLOSUM62 -W 3 -z 0 -K 0 -Y 0 -T F -U F -y 0.0 -Z 0 -A 40

You can see that many parameters are set without your express knowledge. These parameters affect the results of your experiment and, as reinforced many times throughout the book, you should try to understand these parameters and set them to fit each experiment.

The following reference section explains all the parameters available for blastall and lists the default values that are used if not explicitly set. The table was compiled according to the default values for the five basic programs. Although megablast can be run from within blastall (-n T), you should use the standalone program. The parameters for it are presented later in the chapter.

-a [integer]

 

Default: 1

Programs: All

Sets the number of processors to use on a multiprocessor system. If you have multiple queries, you will get better throughput by executing multiple BLAST searches concurrently. For insensitive searches such as default BLASTN, setting -a to a higher value may not appreciably improve speed if disk I/O is the bottleneck.

 

-A [integer]

 

Default: blastn 0, others 40

Programs: All

Sets the multiple-hit window size. When BLAST is set to two-hit mode, this option requires two word hits on the same diagonal to be within [integer] letters of each other in order to extend from either one. The larger the [integer], the more sensitive BLAST will be. Setting [integer] to 0 sets the default behavior of 40, except for blastn, whose default is single word hit. To specify one-hit behavior, set -P 1.

 

-b [integer]

 

Default: 250

Programs: All

Truncates the report to [integer] number of alignments. There is no warning when you exceed this limit, so it's generally a good idea to set [integer] very high unless you're interested only in the top hits.

 

-B [integer]

 

Default: Optional

Programs: blastn, tblastn

Sets the number of queries to concatenate in a single search. Concatenating queries accelerates the search because the database is scanned just one time. This is the principle underlying megablast, but the implementation is different in blastall.

This option is new in Version 2.2.6 and still experimental. The specified [integer] must be the number of sequences in the query file. If it's less, only the first [integer] sequences are used. Also, the output is very different from what you would expect. All the query names are listed, then all the one-line summaries, followed by the alignments, and finally one footer for the whole report. Given this format, it's very difficult to discern which alignments belong to which query. This option should not be used in its current implementation.

 

-d [database]

 

Default: nr

Programs: All

Identifies the database to search. [database] must already be formatted by formatdb. BLAST looks for [database] in the following order: the local directory, the BLASTDB environment variable (Unix only), and finally, the location specified in the .ncbirc file.

You can merge multiple databases into a single virtual database by putting the individual databases in quotes. For example, to merge the nt and est databases, use: -d "nt est". You can't mix nucleotide and amino acid databases. The statistics reported are based on the sizes of the combined databases. Virtual databases may exceed file size limits imposed by the operating system.
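The quotes matter because they make the shell pass the whole database list to blastall as a single argument. The following sketch uses Python's standard shlex module to show how a POSIX shell would tokenize such a command; the query filename is a placeholder and nothing is executed.

```python
import shlex

# Hypothetical command line; "query" is a placeholder filename.
command = 'blastall -p blastn -d "nt est" -i query'

# shlex.split tokenizes the string the way a POSIX shell would.
argv = shlex.split(command)

# The quoted list stays together as the single argument that follows -d.
database_arg = argv[argv.index("-d") + 1]
databases = database_arg.split()
```

Without the quotes, est would arrive as a separate argument rather than as part of the -d value.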

 

-D [1..23]

 

Default: 1

Programs: tblastn, tblastx

The genetic code to use for translation of the database nucleotide sequence. See http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy for updates.

 

Options

1

Standard Nuclear Genetic Code

2

Vertebrate Mitochondrial

3

Yeast Mitochondrial

4

Mold, Protozoan, and Coelenterate Mitochondrial

5

Invertebrate Mitochondrial

6

Ciliate Nuclear

9

Echinoderm Mitochondrial

10

Euplotid Nuclear

11

Bacterial and Plant Plastid

12

Alternative yeast nuclear

13

Ascidian Mitochondrial

14

Flatworm Mitochondrial

15

Blepharisma Nuclear

16

Chlorophycean Mitochondrial

21

Trematode Mitochondrial

22

Scenedesmus Obliquus Mitochondrial

23

Thraustochytrium Mitochondrial

-e [real number]

 

Default: 10

Programs: All

Sets the threshold expectation value for keeping alignments. This is the E from the Karlin-Altschul equation that describes how often an alignment with a given score is expected to occur at random.

 

-E [integer]

 

Default: blastn 2, others 1

Programs: All

The penalty for each gap character. The -G parameter controls the initial cost of opening a gap. Note that -E 0 invokes the default behavior, so it's impossible to set -E to zero unless -g F is set, which turns gapping off. The default gap cost, for programs other than blastn, depends on the scoring matrix. The value shown here is for the default BLOSUM62 matrix. See Appendix C for a complete list of default and legal gap penalties.

 

-f [integer]

 

Defaults: blastp 11, blastx 12, tblastn 13, tblastx 13

Programs: blastp, blastx, tblastn, tblastx

Neighborhood word threshold score. Only those words scoring equal to or greater than [integer] will seed alignments.

 

-F [T/F], -F [string]

 

Default: T, but see below

Programs: All

Filters the query sequence for low-complexity subsequences. The default setting is T. Complexity filtering is generally a good idea, but it may break long HSPs into several smaller HSPs due to low-complexity segments. This can cause some alignments to fall below the significance threshold and be lost. To prevent this, either turn off filtering (not recommended) or use soft masking, in which the filter is used only in the word seeding phase, but not the extension phase.

The parameter argument's [string] form follows a nonintuitive syntax. If the string begins with an m, soft masking is turned on. Filtering programs are specified by a single capital letter: D for DUST, R for human repeats, V for vector sequences, S for SEG, and C for coiled-coil. D, R, and V are used only for blastn searches, and S and C are used for all other programs. More than one filter may be specified, and additional parameters may be passed to the programs. See the following tables and the -U parameter used for filtering lowercase letters in the query sequence.

To use R or V, the correct database files must be downloaded and installed in the BLASTDB directory. For human repeats, three databases are needed: humlines.lib, humsines.lib, and retrovir.lib. For vector filtering, use the UniVec_Core database (ftp://ftp.ncbi.nih.gov/pub/UniVec/).

 

String options for blastn

Behavior                                      Parameter format
No complexity filter                          -F ""
Default (DUST)                                -F "D"
Soft masking                                  -F "m D"
Lowercase soft masking                        -F "m" -U
Soft masking of DUST and lowercase letters    -F "m D" -U
Mask human repeats                            -F "R"
Mask vector sequences                         -F "V"
Soft masking of human repeats and vector      -F "m R;V"

String options for blastp, blastx, tblastn, and tblastx

Behavior                                                            Parameter format
No complexity filter                                                -F ""
Default (SEG)                                                       -F "S"
Soft masking                                                        -F "m S"
Lowercase soft masking                                              -F "m" -U
Coiled-coil                                                         -F "C"
SEG plus coiled-coil                                                -F "S;C"
SEG with settings for window size, locut, and hicut                 -F "S 10 1.0 1.5"
As above, plus coiled coil and soft masking (including lowercase)   -F "m S 10 1.0 1.5; C" -U
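Because the string syntax is easy to get wrong, it can help to assemble it programmatically. The helper below is a hypothetical sketch (build_filter_string is not an NCBI function) that reproduces the layouts in the tables above: an optional leading m for soft masking, one capital letter per filter, optional settings after the letter, and a semicolon between filters. It uses the no-space form of the separator seen in "m R;V"; the tables also show a form with a space after the semicolon.

```python
VALID_FILTERS = ("D", "R", "V", "S", "C")


def build_filter_string(filters, soft_mask=False):
    """Compose the argument for -F.

    filters is a list of (letter, settings) pairs, for example
    [("S", [10, 1.0, 1.5]), ("C", [])]. Hypothetical helper that only
    mirrors the string layouts shown in the tables.
    """
    parts = []
    for letter, settings in filters:
        if letter not in VALID_FILTERS:
            raise ValueError(f"unknown filter letter: {letter!r}")
        # Filter letter followed by its optional settings
        parts.append(" ".join([letter] + [str(s) for s in settings]))
    body = ";".join(parts)
    if soft_mask:
        # A leading m turns on soft masking
        return ("m " + body).strip()
    return body


dust_soft = build_filter_string([("D", [])], soft_mask=True)
repeats_and_vector = build_filter_string([("R", []), ("V", [])], soft_mask=True)
seg_custom = build_filter_string([("S", [10, 1.0, 1.5])])
no_filter = build_filter_string([])
```

The four results correspond to the table rows for soft masking, soft masking of human repeats and vector, SEG with explicit settings, and no complexity filter.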

-g [T/F]

 

Default: T

Programs: blastn, blastp, blastx, tblastn

Performs gapped alignment. Setting this to F invokes the older, ungapped style of alignment. You can't perform gapped alignments with tblastx, regardless of this setting.

 

-G [integer]

 

Defaults: blastn 5, others 11

Programs: All

Initial penalty for opening a gap of length 0. The penalty for extending the gap is controlled by the -E parameter. -G 0 invokes the default behavior, so setting -G to zero is impossible unless -g F is set, which turns gapping off. The default gap costs for programs other than blastn depend on the scoring matrix; the value here is for the default BLOSUM62 matrix. See Appendix C for a complete list of default and legal gap penalties.

 

-i [input file]

 

Default: stdin

Programs: All

If -i isn't included on the command line, BLAST expects input from stdin (i.e., it will wait indefinitely for you to type in a FASTA file from the keyboard). The following commands are therefore equivalent:

blastall -p blastn -d nt -i query

blastall -p blastn -d nt < query

cat query | blastall -p blastn -d nt

cat query | blastall -p blastn -d nt -i stdin

If the input file contains multiple sequences, BLAST will be run on each sequence in order, and the resulting output will contain concatenated BLAST reports.

 

-I [T/F]

 

Default: F

Programs: All

Shows GenInfo Identifier (GI) numbers in definition lines. A GI is a unique numeric identifier assigned to a sequence in GenBank. Each GI corresponds to an accession.version pair.

 

-J [T/F]

 

Default: F

Programs: All

Believe the query defline.

 

-K [integer]

 

Default: 0 - Off

Programs: All

The number of best hits from a region to keep. This option is useful when you want to limit the number of alignments that might pile up in one section of the query. This is most useful if the settings of -b or -v are low, and the abundant alignments push lower scoring alignments off the end of the report. If set, a value of 100 is recommended.

 

-l [file]

 

Default: Optional

Programs: All

Restricts database search to a list of GIs found in [file]. The database sequences must have NCBI-compliant identifiers, including GI numbers, and the database must be indexed (by running formatdb with the -o option). The [file] must be in the same directory as the database or in the directory from which blastall is called. [file] may be in text format with one GI per line or in binary format (see the -B parameter for formatdb).

 

-L [string]

 

Default: Optional

Programs: All

The location on query sequence. This lets you limit the search to a subsequence of the query sequence. For example, to search just the letters from 21 to 50, add the following parameter:

-L "21,50"

The alignments won't extend outside the specified region. In older versions of BLAST, -L set the size of the region under control of the -K parameter.

 

-m [0..11]

 

Default: 0

Programs: All

Sets the alignment viewing options. Appendix C gives examples of these display options.

 

Options

0

Pairwise

1

Query-anchored, showing identities, no gaps in query (gaps are shown as a tree-like thing in subjects), identities shown as ".", positives uppercase, negatives lowercase

2

Query-anchored, no identities, no gaps in query, negatives lowercase

3

Flat query-anchored, show identities, padding through all sequences

4

Flat query-anchored, no identities, padding through all sequences

5

Query-anchored, no identities and blunt ends (dashes [-] are used to blunt the ends)

6

Flat query-anchored, no identities and blunt ends (dashes [-] are used to blunt the ends)

7

XML output

8

Tabular

9

Tabular with comment lines

10

ASN.1 in text format (-J must be set for this option to work)

11

ASN.1 in binary format (-J must be set for this option to work)

-M [matrix file]

 

Default: BLOSUM62

Programs: All except blastn

Designates a protein similarity matrix. This is used in all BLAST programs except blastn. Matrices are sought in the following order: in the local directory, in the location specified in the .ncbirc file, in a local data directory, and finally, in the BLASTMAT environment variable (only on Unix systems). Other matrices included in the standard distribution include BLOSUM45, BLOSUM80, PAM30, and PAM70.

You can use custom matrix files, but it requires modifying the source code and defining the new matrix with all of its associated statistics for different affine gap combinations and recompiling the binary. Using these custom files isn't recommended because it requires the arduous task of calculating gapped values for lambda and maintaining a derivative branch of the source code.

 

-n [T/F]

 

Default: F

Programs: blastn

Sets the blastn program to the megablast mode, which is optimized to find near identities very quickly. The following lines are equivalent:

blastall -p blastn -n T -d est -i my_file

megablast -d est -i my_file -D 2

More program options are available if you run the megablast executable (see Section 13.6).

 

-o [output file]

 

Default: Optional

Programs: All

Designates an output file for the search results. If not used, output is printed to stdout. The following commands are equivalent:

blastall -p blastn -d nr -i query -o output

blastall -p blastn -d nr -i query > output

 

-p [program name]

 

Default: None, required parameter

Choices: blastn, blastp, blastx, tblastn, tblastx, psitblastn

When choosing psitblastn, the -R [checkpoint file] must also be specified. This special use of blastall uses the output PSSM checkpoint file of PSI-BLAST (see blastpgp -C option), combined with the protein query sequence, to implement a tblastn search against a nucleotide database.

 

-P [0/1]

 

Default: blastn 1, others 0

Programs: All

Specifies the two-hit or single-hit algorithm. The two-hit option requires two word hits on the same diagonal to extend from either one. When set to two-hit mode, the -A parameter specifies how close the two hits have to be to trigger extension.

 

Options

0

Two hit

1

Single hit
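The two-hit rule can be expressed as a short test on a pair of word hits. The sketch below is a simplification under stated assumptions: each hit is a hypothetical (query position, subject position) pair, the diagonal is taken as subject position minus query position, and the window is compared against the distance between the two hit positions. The real implementation has further details that this section doesn't describe.

```python
def triggers_extension(hit_a, hit_b, window=40):
    """Return True if two word hits satisfy the two-hit rule as described.

    Each hit is a (query_pos, subject_pos) tuple. Simplified sketch:
    the hits must lie on the same diagonal and within `window` letters
    of each other (the -A value).
    """
    (query_a, subject_a), (query_b, subject_b) = hit_a, hit_b
    # Hits on the same diagonal share the same subject-minus-query offset.
    same_diagonal = (subject_a - query_a) == (subject_b - query_b)
    distance = abs(query_b - query_a)
    return same_diagonal and 0 < distance <= window


close_pair = triggers_extension((10, 110), (30, 130))      # same diagonal, 20 apart
off_diagonal = triggers_extension((10, 110), (30, 131))    # different diagonals
too_far = triggers_extension((10, 110), (60, 160))         # same diagonal, 50 apart
wider_window = triggers_extension((10, 110), (60, 160), window=60)
```

Only the first pair triggers an extension with the default window of 40. Widening the window to 60 also accepts the last pair, which is why larger -A values make the search more sensitive.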

-q [negative integer]

 

Default: -3

Programs: blastn only

Sets the penalty for a nucleotide mismatch. Also see -r. The values chosen for -q and -r are very important because they determine your target frequencies. The default values, -r 1 -q -3, are most effective for aligning sequences that are 99 percent identical. See Appendix B for more information on nucleotide scoring schemes.

 

-Q [1..23]

 

Default: 1

Programs: blastx, tblastx

Genetic code to use for translation of the query nucleotide sequence. See the -D parameter for list of genetic codes.

 

-r [integer]

 

Default: 1

Programs: blastn only

Sets the score of a nucleotide match. See the -q parameter and Appendix B.

 

-R [checkpoint file]

 

Default: Optional

Programs: psitblastn

Designates the PSI-BLAST checkpoint file to be used in the psitblastn search. -p must be set to psitblastn. The input must be a protein sequence and be the same one used with blastpgp -C to generate the [checkpoint file].

 

-S [1..3]

 

Default: 3

Programs: blastn, blastx, tblastx

Chooses which strand of DNA-based queries is searched.

 

Options

1

Top strand

2

Bottom strand

3

Both strands

For example, the following command searches only the query's top strand.

blastall -p blastn -d nr -i query -S 1

-t [integer]

 

Default: 0

Programs: tblastn

Length of the largest intron allowed in tblastn for linking HSPs. A default of 0 means that linking is turned off.

 

-T [T/F]

 

Default: F

Programs: All

Produces HTML output with <anchor> links from the summary at the top of the report to the alignments farther below. This option should be used only with the standard report format (-m 0).

 

-v [integer]

 

Default: 500

Programs: All

Sets the number of database sequences for which to show the one-line summary descriptions at the top of a BLAST report. You won't be warned if you exceed [integer]. Also see the -b parameter.

 

-w [integer]

 

Default: 0

Programs: blastx only

Sets the frame shift penalty for the Out Of Frame (OOF) algorithm of blastx. When -w is set, it invokes the OOF mode of BLAST, which lets alignments proceed across reading frames. The expect values calculated from OOF blastx are only approximate, and BLAST issues the following warning when OOF is invoked:

[NULL_Caption] WARNING: test500: Out-of-frame option

selected, Expect values are only approximate and

calculated not assuming out-of-frame alignments

The out-of-frame alignments are signified by slashes that indicate the +1 (/), +2 (//), -1 (\), and -2 (\\) frameshifts. The following is a sample OOF alignment:

Query: 23  PLIRNSL/YCINC\\A//QSIIRAHVKGPYLTRWVVNC/E\TCSKGYAKTPGASTDLLLL 160
           PLIRNSL YCINC     QSIIRAHVKGPYLTRWVVNC   TCSKGYAKTPGASTDLLLL
Sbjct: 1   PLIRNSL YCINC  X  QSIIRAHVKGPYLTRWVVNC X TCSKGYAKTPGASTDLLLL 53

Query: 161 YKTRNSLTSASSLSPVRSQRMI/N\SFPRFQGHLVVSG/S\SAHNR/FS\FNRDSPRGSG 322
           YKTRNSLTSASSLSPVRSQRMI   SFPRFQGHLVVSG   SAHNR F  FNRDSPRGSG
Sbjct: 54  YKTRNSLTSASSLSPVRSQRMI X SFPRFQGHLVVSG X SAHNR FX FNRDSPRGSG 107

Query: 323 SYCSREPMGQIKIRRTHTDDKLFR/ND\SRHTRAGDGLNI//TLA\\RDPSFLSRVYNAN 484
           SYCSREPMGQIKIRRTHTDDKLFR    SRHTRAGDGLNI   L   RDPSFLSRVYNAN
Sbjct: 108 SYCSREPMGQIKIRRTHTDDKLFR XX SRHTRAGDGLNI  XLX  RDPSFLSRVYNAN 161

Query: 485 SYLHI 499
           SYLHI
Sbjct: 162 SYLHI 166
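If you need to process OOF reports, the frameshift markers can be pulled out of a query line with a regular expression. The sketch below is a hypothetical helper, not part of BLAST; it assumes the query line contains only residue letters and the four slash markers described above, and it checks the two-character markers before the one-character ones.

```python
import re

# Marker text mapped to the frameshift it denotes.
FRAMESHIFT = {"/": 1, "//": 2, "\\": -1, "\\\\": -2}

# Try the double markers first so that // is not read as two / markers.
MARKER = re.compile(r"//|/|\\\\|\\")


def frameshifts(query_residues):
    """List the frameshifts in an OOF query line, in order of appearance."""
    return [FRAMESHIFT[match.group()] for match in MARKER.finditer(query_residues)]


# Residue portion of the first query line in the sample alignment.
first_line = r"PLIRNSL/YCINC\\A//QSIIRAHVKGPYLTRWVVNC/E\TCSKGYAKTPGASTDLLLL"
shifts = frameshifts(first_line)
```

For the first query line of the sample, shifts holds +1, -2, +2, +1, and -1, in that order.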

 

-W [integer]

 

Defaults: blastn 11, others 3

Programs: All

Sets the word size for the initial word search. The minimum word size for blastn is 7. Word sizes for blastp, blastx, tblastn, and tblastx are 2 or 3.

 

-X [integer]

 

Default: blastn 30, others 15

Programs: All, except tblastx

Sets the X2 dropoff value for gapped alignments. The value is measured in bits. Smaller values of X2 result in earlier termination of extensions. Adjusting this parameter is generally unnecessary.

 

-y [integer]

 

Default: blastn 20, others 7

Programs: All

Sets the X1 dropoff value (in bits) for extensions. The lower X1 is set, the shorter the extension will be. It's rarely necessary to adjust this parameter.

 

-Y [real number]

 

Default: 0

Programs: All

The effective length of the search space. This is the size of the database multiplied by the size of the query (MN in the Karlin-Altschul equation).

If -Y is unset or set to 0, the actual size of the database and query is used.
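To see why the search space matters, consider how it enters the expectation value. The sketch below uses the standard Karlin-Altschul form, E = K * MN * exp(-lambda * S). The function name is made up for this example, and the lambda and K values are illustrative placeholders only; the values that apply to a real search are printed in the footer of each BLAST report.

```python
import math


def expect_value(raw_score, search_space, lam, k):
    """Karlin-Altschul expectation: E = K * MN * exp(-lambda * S).

    search_space is MN, the value that -Y overrides.
    """
    return k * search_space * math.exp(-lam * raw_score)


# Illustrative parameters only (assumptions, not taken from this chapter).
LAMBDA = 0.267
K = 0.041

small_space = expect_value(100, 1e9, LAMBDA, K)
large_space = expect_value(100, 2e9, LAMBDA, K)
higher_score = expect_value(120, 1e9, LAMBDA, K)
```

Doubling the search space doubles E for the same score, so fixing -Y keeps E values comparable between searches run against databases of different sizes.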

 

-z [real number]

 

Default: 0

Programs: All

The effective length of the database. This option is useful for maintaining consistent statistics over time as databases grow.

If -z is unset or set to 0, the actual effective length of the database is used.

 

-Z [integer]

 

Default: 25

Programs: All

Sets the X3 dropoff value (in bits) for extensions. X3 is bounded by the value for X2. It's generally not necessary to adjust this parameter.

 

13.4 formatdb Parameters

formatdb turns FASTA files into BLAST databases. (ASN.1 format is also acceptable, but because it isn't commonly used, it isn't covered in this book; you can find more information about ASN.1 at http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html/.) Chapter 11 discusses the typical methods for building BLAST databases and examines the NCBI identifier syntax required for some aspects of formatdb and blastall. Here are a few sample command lines:

formatdb -i protein_db

formatdb -p F -i nucleotide_db

zcat est*.gz | formatdb -p F -i stdin -o -n est -v 2000000000

The following reference lists the default value for each formatdb parameter.

-B [file]

 

Default: Optional

 

Specifies a binary GI output file. The advantage of using a binary GI file is that it's smaller than a corresponding text file and can be read directly into memory without being parsed. See the -F option.

To convert a text GI file to binary, use the following command:

formatdb -F text_gi_list -B binary_gi_list

 

-F [file]

 

Default: Optional

 

Specifies a GI file, either text or binary. This is used for creating an alias database that doesn't contain sequences, but pointers to sequences stored in another database (which may be an alias database as well). See the -L parameter. The databases must use the NCBI FASTA identifier syntax, include GI numbers, and be indexed with -o.

 

-i [file]

 

Default: Required

 

Sets the input FASTA file. You may specify that input come from stdin with -i stdin, but you must also set the -n parameter to give it a name. If you wish to make a single BLAST database from multiple FASTA files, pipe them to formatdb as follows:

cat file1 file2 file3 | formatdb -i stdin -n my_db

 

-l [file]

 

Default: formatdb.log

 

Specifies an output log file. Log messages are appended to this file.

 

-L [file]

 

Default: Optional

 

Creates an alias database, which has several uses. It can be a simple synonym for another database, a selection of specific records from a database (see the -F option), or a static virtual database. Alias databases have the .pal or .nal extension, depending on whether they are proteins or nucleotides.

To create an alias database with a selected set of GI numbers:

formatdb -i db -F gi_list -L alias_name -p [T/F]

To merge databases, first create a synonymous alias and then edit it to include additional database names. Chapter 11 covers this process in more detail.

 

-n [string]

 

Default: Optional, required with -i stdin

 

Sets the base name for the BLAST database. If not specified, the name of the FASTA file will be used. If the input is from stdin, this parameter must be set.

 

-o [T/F]

 

Default: Optional

 

Creates indexes. Indexing the databases isn't required but is recommended. Alias databases that use GI lists (see -F and -L options) and the blastall -l option require indexed databases. Additionally, some blastall output options specified with the -m parameter require indexing. Indexing adds four files with extensions .nnd, .nni, .nsd, and .nsi for nucleotides and .pnd, .pni, .psd, and .psi for proteins. If you know you don't need indexes, you can save space by omitting -o.

If GI numbers are included and more than one sequence has the same GI number, formatdb terminates with an error. If accession numbers aren't unique, an error won't be issued (see -V).

 

-p [T/F]

 

Default: T

 

Specifies the type of file being formatted. By default, formatdb assumes the file is protein, so you must set -p F whenever you format nucleotide databases.

 

-s [T/F]

 

Default: Optional

 

Creates indexes for accessions but not locus names. Must be used in conjunction with the -o parameter. For many sequences from DDBJ/GenBank/EMBL, the locus name and accession number are identical and some disk space can be saved by not including redundant information. In general, locus names are historical relics, so always include -s.

 

-t [string]

 

Default: Optional

 

The title for a database file. If this parameter isn't set, the title of the database will be the name of the FASTA file or the argument of -n, if it was set. -t lets you use more descriptive names that you might not want as filenames. For example:

formatdb -i proteins -t "my favorite human proteins"

In the BLAST report, this is reported in the header as:

Database: my favorite human proteins

Using this parameter can be confusing, because backtracking from reports to databases might be difficult.

 

-v [integer]

 

Default: Optional

 

The maximum number of sequence bases to be created in a volume. Values range from 1 to 2147483647 (2^31 - 1, roughly 2 billion). This parameter is useful if the filesystem doesn't support large files. Volumes with greater than [integer] letters are automatically split, and an alias is created. See Chapter 9 for more information.

 

-V [T/F]

 

Default: F

 

Reports warning messages if sequence identifiers aren't unique. Requires the -o option.

 

13.5 fastacmd Parameters

fastacmd retrieves sequences, individually or in batches, from BLAST databases. When using it, you don't have to keep FASTA files on your file system after you've formatted the BLAST database. Sequences are stored in a case-insensitive format, however, so if you use lower- and uppercase for semantic purposes, this information will be lost.

Here are a few sample command lines using fastacmd:

fastacmd -d nr -s P02042

fastacmd -d nr -s 12837002,P02042

fastacmd -d nr -D

fastacmd -d est -i file_of_gi

cat file_of_gi | fastacmd -d est -i stdin

The following reference lists the default value for each fastacmd parameter.

-a [T/F]

 

Default: F

 

Retrieves all accessions, even duplicates, when using -s or -i to retrieve sequences. If -a isn't set, only the first accession of duplicates is retrieved.

 

-c [T/F]

 

Default: F

 

Uses Control-A as a nonredundant definition line separator. This parameter applies only to nonredundant databases with concatenated definition lines. By default, a normal space is used as the separator. Using Control-A unambiguously separates sequence definitions.

 

-d [string]

 

Default: nr

 

The database from which to retrieve sequences.

 

-D [T/F]

 

Default: F

 

Dumps the entire database in FASTA format.

 

-i [file]

 

Default: Optional

 

A batch retrieval. The format of the text file is one GI or accession per line. stdin is a valid file.

cat file_of_gi | fastacmd -d est -i stdin

 

-I

 

Default: Optional

 

Prints information about a formatted database. Overrides all other retrieval options. Needs to be used with -d.

fastacmd -d my_db -I

 

-l [integer]

 

Default: 80

 

Sequence line length. The most common values are 50 (a nice round number), 60 (evenly divisible by 3), and 80 (a traditional terminal width).

 

-L [integer],[integer]

 

Default: 0,0

 

Extracts a region of the sequence. Using 0 as the start coordinate indicates the actual beginning of the sequence. Using 0 as the end coordinate indicates the end of the sequence. A colon and the sequence range are appended to the identifier to signify the region extracted.

fastacmd -d nr -s AAG39070 -L 10,50

>gi|11611819:10-50 (AF287139) Hoxa-11 [Latimeria chalumnae]

SGPDFSSLPSFLPQTPSSRPMTYSYSSNLPQVQPVREVTFR
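The coordinate convention can be mimicked in a few lines. The sketch below assumes coordinates are 1-based and inclusive, which matches the example above (10,50 returns 41 residues); extract_region is a hypothetical function written for illustration and does not read BLAST databases.

```python
def extract_region(sequence, start=0, end=0):
    """Return the subsequence for a -L style start,end pair.

    Coordinates are assumed to be 1-based and inclusive. A start of 0
    means the beginning of the sequence and an end of 0 means its end.
    """
    first = 1 if start == 0 else start
    last = len(sequence) if end == 0 else end
    if not 1 <= first <= last <= len(sequence):
        raise ValueError("region lies outside the sequence")
    # Convert the 1-based inclusive range to a Python slice.
    return sequence[first - 1:last]


letters = "ABCDEFGHIJ"
middle = extract_region(letters, 3, 5)
whole = extract_region(letters)          # default 0,0
from_start = extract_region(letters, 0, 4)
to_end = extract_region(letters, 8, 0)
```

With the default of 0,0 the whole sequence is returned, which is why omitting -L retrieves complete records.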

 

-o [file]

 

Default: stdout

 

Sends the output to the named file or stdout, if none is named.

 

-p [T/F/G]

 

Default: G

 

Options

G

Guess. Look for a protein first, and then a nucleotide.

T

Protein.

F

Nucleotide.

-P [integer]

 

Default: Optional

 

Retrieves sequences with this PIG.

 

-s [string]

 

Default: Optional

 

An identifier of the sequence to retrieve. The identifier may be a GI or accession. To retrieve multiple sequences, the identifiers must be separated by commas as follows:

fastacmd -d nr -s AAG39070,11611819

To retrieve a large number of sequences, using the -i parameter is more convenient, especially since there may be limits on the length of command-line strings.

 

-S [1..2]

 

Default: 1

 

The strand to report when extracting a subsequence (see -L). Only used with nucleotide sequences.

1

Top strand

2

Bottom strand

 

-t [T/F]

 

Default: F

 

The definition line should contain target GI only. This parameter applies only to nonredundant databases. When set, only the definition line corresponding to the GI is reported, not the redundant definition lines. No such mechanism exists for accession numbers; redundancies are always reported.

 

-T [T/F]

 

Default: F

 

Gets taxonomy information from an NCBI-formatted BLAST database. The downloadable FASTA files don't allow this feature; only the preformatted databases work. The preformatted databases can be found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/FormattedDatabases/.

 

13.6 megablast Parameters

megablast is similar to blastn but optimized to find near identities very quickly. It's much faster than the standard blastn, partly because it uses query packing. The extension algorithm differs from the standard blastn and isn't designed for cross-species searches. Many parameters are identical between megablast and blastall, but some are unique to one program or the other, and some parameters with the same symbol do different things.

Here are a few example command lines:

megablast -d my_db -i my_query -F "m D"

megablast -d my_db -i my_query -D 2 -t 18 -W 11

-a [integer]

 

Default: 1

 

The number of processors; same as blastall.

 

-A [integer]

 

Default: 40

 

The two-hit algorithm window size; same as blastall.

 

-b [integer]

 

Default: 250

 

The number of database sequences to show; same as blastall, if -D 2 is set.

 

-d [string]

 

Default: nr

 

The database; same as blastall.

 

-D [0..3]

 

Default: 0

 

The type of output. The -m option applies only if -D 2 is set here.

 

Options

0

One-line output for each alignment in the form of:

'subject-id'=='[+-]query-id' (s_beg q_beg s_end q_end) Score

For example:

'AF071362'=='+AF071357' (1 715 200 920) 8

Score for non-affine gapping parameters (the default) is the total number of differences (mismatches + gaps); it's the actual raw score when using affine gapping.

1

Same as the output of -D 0, but additionally shows the endpoints and percent identity for each ungapped segment in the alignment.

#'>AF071362'=='+AF071357' (1 715 200 920) 8

a {

  s 8

  b 1 715

  e 200 920

  l 1 715 26 740 (96)

  l 27 742 27 742 (100)

  l 28 744 47 763 (100)

  l 48 765 50 767 (100)

  l 51 769 60 778 (100)

  l 61 780 133 852 (100)

  l 134 854 200 920 (99)

}

s

Score.

b

Begin coordinates for the subject and query, respectively.

e

End coordinates for subject and query, respectively.

l

Coordinates for each ungapped segment with the percent identity in parentheses at the end.

2

A traditional BLAST output.

3

A tab-delimited, one-line format. The 12 reported tab-delimited fields are as follows:

Query

Subject

Percent identity

Alignment length

Mismatches

Gap openings

Query start

Query end

Subject start

Subject end

E value

Bit score
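The tab-delimited format is the easiest one to parse. The sketch below maps the 12 fields listed for -D 3 onto a named tuple. The class and function names are hypothetical, the numeric types are assumptions based on what each field holds, and the sample line is made up for illustration rather than taken from a real report.

```python
from typing import NamedTuple


class TabularHit(NamedTuple):
    """One alignment from the 12-field tab-delimited output."""
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    e_value: float
    bit_score: float


def parse_tabular_line(line):
    """Convert one tab-delimited line into a TabularHit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, found {len(fields)}")
    return TabularHit(
        fields[0],
        fields[1],
        float(fields[2]),
        *(int(value) for value in fields[3:10]),  # lengths, counts, coordinates
        float(fields[10]),
        float(fields[11]),
    )


# Made-up example line, for illustration only.
sample = "AF071357\tAF071362\t96.00\t200\t6\t2\t715\t920\t1\t200\t1e-95\t350\n"
hit = parse_tabular_line(sample)
```

Once parsed, fields are available by name, for example hit.percent_identity or hit.e_value, which makes filtering a large result set straightforward.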

-e [real number]

 

Default: 1,000,000

 

The expectation value; same as blastall. However, it's set to a very large number, so there is effectively no cutoff.

 

-E [integer]

 

Default: 0

 

Setting -E and -G turns on affine gapping (same as standard blastall). This causes megablast to use more memory and isn't necessary when the sequences are expected to be nearly identical. When -E and -G aren't set, the gap extension penalty is calculated from the match (-r) and mismatch (-q) so that E = r/2 - q. E is rounded down to the nearest integer. So, for the default +1/-3 matrix, the gap extension penalty equals 3.
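The arithmetic for the implied gap extension penalty is small enough to check by hand, but a sketch makes the rounding explicit. The function name is hypothetical; the formula is the one given above and applies only when -E and -G aren't set.

```python
import math


def implied_gap_extension(match=1, mismatch=-3):
    """Gap extension penalty when -E and -G aren't set: floor(r/2 - q)."""
    return math.floor(match / 2 - mismatch)


default_penalty = implied_gap_extension()        # 1/2 - (-3) = 3.5, rounded down
other_penalty = implied_gap_extension(1, -2)     # 1/2 - (-2) = 2.5, rounded down
```

With the default +1/-3 scores the result is 3, matching the value stated above.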

 

-f [T/F]

 

Default: F

 

Shows full IDs of the database sequences in the output. The default is only the accession, or just the GI if no accession is given. Applies to -D 0, -D 1, and -D 3.

 

-F [T/F] [string]

 

Default: T

 

Filters the query sequence; same as blastall.

 

-G [integer]

 

Default: 0

 

Setting -E and -G turns on affine gapping (same as standard blastall). This causes megablast to use more memory and isn't necessary when the sequences are expected to be nearly identical.

 

-H [integer]

 

Default: 0

 

The maximum number of HSPs to save per database sequence. The default of 0 means "unlimited."

 

-i [file]

 

Default: stdin

 

The query file; same as blastall.

 

-I [T/F]

 

Default: F

 

Shows GI numbers in database deflines; same as blastall.

Can be used only with -D 2.

 

-l [file]

 

Default: Optional

 

Restricts search to a list of GI numbers; same as blastall.

 

-L [string]

 

Default: Optional

 

The location on query sequence; same as blastall.

 

-m [0..11]

 

Default: 0

 

Alignment view options. Requires -D 2, in which case the options are the same as in blastall.

 

-M [integer]

 

Default: 20000000 (20 million)

 

The maximum total length of queries for a single search. Reducing this number reduces the amount of memory required by megablast.

 

-n [T/F]

 

Default: F

 

Uses dynamic programming extension for affine gap scores. The default is to use a greedy algorithm for extension.

 

-N [0,1,2]

 

Default: 0

 

The type of discontiguous template. To use discontiguous seeding, -t must be set to 16, 18, or 21, and -W must be 11 or 12.

Discontiguous templates don't require the usual exact word match employed by the other BLAST programs, but use a template pattern that must be matched to seed an alignment. If a template is specified by 1s and 0s, for example, with 1 representing required matches and 0 representing residues that need not match, then you can represent a template size 16 with a word size of 11 as:

1110010110110111

 

Options

0

Coding template. This discontiguous template uses a pattern of 110 to match coding sequence where the third codon position is variable (and therefore set to 0 and not required to match). Here are all coding template combinations:

1101101101101101         [11 of 16]

1111101101101101         [12 of 16]

101101100101101101       [11 of 18]

101101101101101101       [12 of 18]

100101100101100101101    [11 of 21]

100101101101100101101    [12 of 21]

1

Optimal. This template pattern tries to minimize the correlation between successive words. Here are all optimal template combinations:

1110010110110111         [11 of 16]

1110110110110111         [12 of 16]

111010010110010111       [11 of 18]

111010110010110111       [12 of 18]

111010010100010010111    [11 of 21]

111010010110010010111    [12 of 21]

2

Simultaneous optimal and coding. This option increases sensitivity by allowing seeding from a match to either template at a given position.
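To make the template idea concrete, the following Python sketch checks two windows against a template of 1s and 0s. It illustrates the matching rule described above and is not megablast's implementation; the function names are hypothetical, and the two 16-base windows are invented test data.

```python
from typing import Iterable


def template_weight(template: str) -> int:
    """Number of positions that must match (the 1s); this is the word size."""
    return template.count("1")


def seeds_match(template: str, a: str, b: str) -> bool:
    """True if windows a and b agree at every position marked 1 in the template."""
    if not (len(template) == len(a) == len(b)):
        raise ValueError("template and both windows must have equal length")
    return all(x == y for t, x, y in zip(template, a, b) if t == "1")


def seeds_match_any(templates: Iterable[str], a: str, b: str) -> bool:
    """Seed if either template matches, in the spirit of -N 2."""
    return any(seeds_match(t, a, b) for t in templates)


CODING_11_OF_16 = "1101101101101101"
OPTIMAL_11_OF_16 = "1110010110110111"

# Invented windows that differ only at every third position.
window_a = "ACGACGACGACGACGA"
window_b = "ACTACTACTACTACTA"

print(seeds_match(CODING_11_OF_16, window_a, window_b))
print(seeds_match(OPTIMAL_11_OF_16, window_a, window_b))
```

The two windows differ only where the coding template has a 0, so they seed under the coding template but not under the optimal one, whose third position is required to match.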

-o [file]

 

Default: Optional

 

Output file; same as blastall.

 

-p [real number]

 

Default: 0

 

Percent identity cutoff. Alignments with a percent identity below [real number] aren't reported. If using -D 0, all alignments are kept regardless of percent identity (no trace-back is performed, so percent identity can't be calculated).

 

-P [integer]

 

Default: 0

 

The maximum number of positions for a hash value. If set to nonzero, redundant subsequences will be masked in the word seeding phase. This allows a simple type of filtering by masking out subsequences that occur in the query sequences more than [integer] times. When the word size (-W) is set to 16 or higher, -P applies to subsequences of length 12; when -W is set to less than 16, it applies to subsequences of length 8.
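The filtering idea can be sketched in Python by counting fixed-length subsequences across the queries. This is only an illustration of the rule stated above, not how megablast builds its hash table; both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def masked_length_for_word_size(word_size: int) -> int:
    """Subsequence length that -P applies to, following the rule in the text."""
    return 12 if word_size >= 16 else 8


def overrepresented_subsequences(
    queries: Iterable[str], max_positions: int, word_size: int = 28
) -> Set[str]:
    """Return subsequences occurring more than max_positions times in the queries."""
    if max_positions == 0:  # 0 means the filter is off
        return set()
    k = masked_length_for_word_size(word_size)
    counts: Counter = Counter()
    for seq in queries:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n > max_positions}
```

For example, with a word size of 12 (so length-8 subsequences) and a limit of 4, a query made of ACGTACGT repeated three times contains ACGTACGT five times, so only that subsequence would be flagged.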

 

-q [negative integer]

 

Default: -3

 

Mismatch penalty; same as blastall.

 

-Q [file]

 

Default: Optional

 

Masked query output. Each query sequence is reported to [file], but with any region hit turned to Ns. This works only in conjunction with -D 2.

 

-r [integer]

 

Default: 1

 

Match score; same as blastall.

 

-R [T/F]

 

Default: Optional

 

Reports a short log message at the end of the run.

 

-s [integer]

 

Default: Optional

 

The minimum hit score to report. All alignments scoring less than [integer] aren't reported. By default, this is set to the word size, which results in all hits being reported.

 

-S [0..3]

 

Default: 3

 

The strands to search; same as blastall.

 

-t [16,18,21]

 

Default: Optional

 

Sets discontiguous template size. This, combined with the word size (-W) of either 11 or 12 and the template type (-N), sets discontiguous megablast.

 

-T [T/F]

 

Default: F

 

The HTML output; same as blastall, but is active only if -D 2 is set.

 

-U [T/F]

 

Default: F

 

Lowercase filtering; same as blastall.

 

-v [integer]

 

Default: 500

 

The number of one-line descriptions. Same as blastall if -D 2 is set.

 

-W [integer]

 

Default: 28

 

Word size. The default word size is very high because sequences aligned by megablast are expected to be nearly identical. For discontiguous searches (-t), word size can be only 11 or 12. megablast generates words every four bases (similar to the WU-BLAST wink parameter), so using a word size divisible by four assures that all words of that size will be found.

 

-X [integer]

 

Default: 20

 

The X dropoff value for a gapped alignment; same as blastall.

 

-y [integer]

 

Default: 10

 

The X dropoff value for an ungapped extension; same as blastall.

 

-z [real number]

 

Default: 0

 

The effective length of a database; same as blastall.

 

-Z [integer]

 

Default: 50

 

The X dropoff value for a dynamic programming gapped extension.

 

13.7 bl2seq Parameters

bl2seq runs the basic BLAST searches on two sequences. Many parameters are identical between bl2seq and blastall, but some are unique to one program or the other, and some parameters with the same symbol do different things.

Here are a few sample command lines:

bl2seq -p blastp -i protein1 -j protein2

bl2seq -p blastn -i nucleotide1 -j nucleotide2 -F F -D 1

bl2seq -p blastx -i nucleotide -j protein

bl2seq -p tblastn -i protein -j nucleotide

bl2seq -p tblastx -i nucleotide1 -j nucleotide2

The following reference describes the parameters for bl2seq.

-a [file]

 

Default: Optional

 

Specifies the SeqAnnot output file. The [file] will be in the Abstract Syntax Notation 1 (ASN.1) format for import into and use with the NCBI toolbox.

 

-A [T/F]

 

Default: F

 

Input sequences are NCBI identifiers. When set to T, the program makes an online connection to the NCBI databases to retrieve the FASTA sequences.

bl2seq -A -p blastx -i AF287139 -j AAG39070

(This function was just enabled in the 2.2.6 release.)

 

-d [real number]

 

Default: 0

 

Sets the theoretical size of the database. This is useful for maintaining consistent E-values between blastall and bl2seq searches. Identical to the blastall -z parameter. If -d isn't set, the database size is set to the length of the -j sequence.

 

-D [0/1]

 

Default: 0

 

Sets the output format to tabular, which corresponds to the blastall setting -m 8. The other -m report options available in blastall aren't available in bl2seq.

Unlike the blastall parameter of the same name, -D doesn't set the genetic code for translating database sequences. All bl2seq translations use the standard nuclear genetic code.

 

Options

0

Traditional

1

Tabular

-e [real number]

 

Default: 10

 

The expectation value; same as blastall.

 

-E [integer]

 

Default: 1

 

The gap extension value; same as blastall.

 

-F [T/F] [string]

 

Default: T

 

Complexity filtering; same as blastall.

 

-g [T/F]

 

Default: T

 

The gapped alignment; same as blastall.

 

-G [integer]

 

Defaults: blastn 5, others 11

 

The gap initiation penalty; same as blastall.

 

-i [file]

 

Default: Required

 

Sets the input (query) file for the search. For blastx, [file] must be nucleotide, and for tblastn, [file] must be protein. Setting [file] to stdin or using multisequence files isn't recommended.

 

-I [integer],[integer]

 

Default: 0,0

 

The location on the input sequence defined by -i. Follows the blastall -L syntax.

 

-j [file]

 

Default: Required

 

Sets the database file for the search. For blastx, [file] must be protein, and for tblastn, [file] must be nucleotide. Setting [file] to stdin or using multisequence files isn't recommended.

 

-J [integer],[integer]

 

Default: 0,0

 

The location on a sequence defined by -j. Follows the blastall -L syntax.

 

-m [T/F]

 

Default: F

 

Sets a blastn search to megablast mode; same as blastall -n.

 

-M [string]

 

Default: BLOSUM62

 

The scoring matrix; same as blastall.

 

-o [file]

 

Default: Optional

 

The output file; same as blastall.

 

-p [string]

 

Default: None, required parameter

 

The program name; same as blastall.

 

-q [negative integer]

 

Default: -3

 

The nucleotide mismatch score; same as blastall.

 

-r [integer]

 

Default: 1

 

The nucleotide match score; same as blastall.

 

-S [1..3]

 

Default: 3

 

The search strand; same as blastall.

 

-t [integer]

 

Default: 0

 

The longest intron allowed in tblastn for linking HSPs; same as blastall.

 

-T [T/F]

 

Default: F

 

HTML output; same as blastall.

 

-U [T/F]

 

Default: F

 

Lowercase masking; same as blastall.

 

-W [integer]

 

Defaults: blastn 11, others 3

 

The word size; same as blastall.

 

-X [integer]

 

Defaults: blastn 30, others 15

 

The extension cutoff; same as blastall.

 

-Y [real number]

 

Default: 0

 

The search space; same as blastall.

 

13.8 blastpgp Parameters (PSI-BLAST and PHI-BLAST)

blastpgp is the program used to run PSI-BLAST and PHI-BLAST. These programs are specialized protein BLAST comparisons that are more sensitive than the standard BLASTP search. PSI-BLAST considers position-specific information when searching for significant hits. PHI-BLAST uses a pattern, or profile, to seed an alignment, which is then extended by the normal BLASTP algorithm.

13.8.1 PSI-BLAST

PSI-BLAST (position-specific iterated BLAST) uses a specialized scoring matrix that assigns scores to each position (hence, position-specific) in the query sequence based on alignments defined by consecutive iterations of searches (hence, iterated). The specialized matrix is a position-specific scoring matrix (PSSM) that assigns a score for every amino acid at each position in the query sequence (see Figure 13-1).

Figure 13-1. PSSM for the first 10 amino acids of the coelacanth HoxA11 protein

Figure 13-1 shows a portion of a PSSM calculated for the coelacanth HoxA11 protein (AAG39070). The query amino acids are numbered in the left column, with the position-specific scores for each of the 20 amino acids shown across each row. The diverse scores of the three tyrosines (Y) at positions 1, 7, and 8 highlight the position-specific aspect of this scoring scheme compared to traditional BLAST matrices, which would contain the same scores for Y in all three positions.

The PSSM, or checkpoint file, is created internally by PSI-BLAST, but it can also be exported to a file using the -C option of blastpgp. This option is extremely useful. You can use the checkpoint file in subsequent PSI-BLAST (blastpgp) searches or as a database entry for the RPS-BLAST program. You can also use the PSSM in a specialized tblastn search in blastall by using the -p psitblastn and -R <checkpoint file> options with a nucleotide database.

To run PSI-BLAST, the -j parameter must be set to something greater than 1. The default of -j 1 means that there are no iterations and that it's therefore the same as a single BLASTP search. Setting -j sets the maximum number of iterations to run; the program stops earlier if the search converges. Convergence occurs when no new sequences are found that are better than the E value threshold set by the -h parameter.

Here are a few sample command lines:

blastpgp -d nr -i my_protein -s T -j 5

blastpgp -d nr -i my_protein -R my_protein.ckp -j 5 -h 0.001

13.8.2 PHI-BLAST

PHI-BLAST stands for pattern-hit initiated BLAST. The program uses an input sequence and a defined pattern to query a protein database. The pattern is defined in PROSITE format (http://ca.expasy.org/prosite/) and is used as the seed for the alignment. The pattern is used instead of the words that are usually generated for seeding alignments in BLASTP. Here's a sample profile:

ID  HoxA11 pattern1

PA  Y-S-[SA]-X-[LVIM]

The profile's syntax has a line starting with ID, followed by two spaces and the name of the pattern. The name is free text. The next line should start with PA, followed by two spaces, and then the pattern in PROSITE format. The PROSITE format is simple. A dash (-) separates letters, an X means any letter, and the brackets ([]) specify a choice of amino acids. You can find more information on the pattern syntax in the README.bls file that comes with the NCBI-BLAST distribution.

Additionally, if the pattern occurs more than once in the query and you would like to limit which occurrences are used as seeds, specify those locations by using the HI (hit initiation) tag in the pattern file. In that case, set -p to seedp instead of patseedp (explained in the reference section that follows). The following example specifies that the pattern starting at position 143 should be used. (In this case, there's also an occurrence at 34, which is ignored.)

ID  HoxA11 pattern2

PA  Y-S-[SA]-X-[LVIMK]

HI  143
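The pattern file layout and the subset of PROSITE syntax described above (dashes, X, and bracketed choices) can be sketched in Python as follows. This isn't how blastpgp parses the file; the names HitFile, parse_hit_file, pattern_to_regex, and find_occurrences are hypothetical, other PROSITE constructs are deliberately rejected, and counting positions from 1 is an assumption.

```python
import re
from typing import List, NamedTuple


class HitFile(NamedTuple):
    name: str
    pattern: str
    positions: List[int]  # values of any HI lines


def parse_hit_file(text: str) -> HitFile:
    """Collect the ID, PA, and HI lines of a single-pattern file."""
    name, pattern, positions = "", "", []
    for line in text.splitlines():
        tag, _, value = line.partition(" ")
        value = value.strip()
        if tag == "ID":
            name = value
        elif tag == "PA":
            pattern = value
        elif tag == "HI":
            positions.append(int(value))
    return HitFile(name, pattern, positions)


def pattern_to_regex(pattern: str) -> str:
    """Translate the simple pattern subset into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        if element.upper() == "X":
            parts.append("[A-Z]")  # X means any letter
        elif element.startswith("[") and element.endswith("]") and element[1:-1].isalpha():
            parts.append(element)  # a choice of amino acids
        elif len(element) == 1 and element.isalpha():
            parts.append(element)  # a single required letter
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
    return "".join(parts)


def find_occurrences(pattern: str, sequence: str) -> List[int]:
    """Start positions (counted from 1, an assumption) of all matches, overlaps included."""
    regex = re.compile("(?=" + pattern_to_regex(pattern) + ")")
    return [m.start() + 1 for m in regex.finditer(sequence)]


hit_file = parse_hit_file("ID  HoxA11 pattern2\nPA  Y-S-[SA]-X-[LVIMK]\nHI  143\n")
# Invented peptide used only to exercise the pattern.
occurrences = find_occurrences(hit_file.pattern, "AAYSAGLKKYSSTV")
```

Listing every occurrence first is a convenient way to decide which positions to put on HI lines before running with -p seedp.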

PHI-BLAST can also be a jumping-off point for a PSI-BLAST run. In the following command line, the pattern in hit_file initiates the first iteration of PSI-BLAST for the development of the PSSM, followed by normal rounds of PSI-BLAST iterations.

blastpgp -d nr -i my_protein -k hit_file -p patseedp -j 5

Here are a few sample PHI-BLAST command lines:

blastpgp -d nr -i my_protein -k hit_file -p patseedp

blastpgp -d nr -i my_protein -k multi_hit_file -p seedp

blastpgp -d HoxDB.pep -i AAG39070.pep -k hit_file.hox -p patseedp

The following reference describes parameters used with blastpgp, which executes PSI- and PHI-BLAST searches.

-a [integer]

 

Default: 1

 

The number of processors to use; same as blastall.

 

-A [integer]

 

Default: 40

 

The multiple-hit window size; same as blastall.

 

-b [integer]

 

Default: 250

 

The number of alignments to show; same as blastall.

 

-B [file]

 

Default: Optional

Program: PSI-BLAST only

The input alignment file for a PSI-BLAST restart. It allows a PSI-BLAST run to start with a curated multiple sequence alignment instead of allowing the program to generate it from the first round of database alignments. For example:

blastpgp -i query -B multiple_alignment -j 5 -d nr

The alignment file must be based on the Clustal format but without the header and footer. The file should have a row for each sequence and can be broken into blocks separated by one or more blank lines. The query sequence (specified by -i) must be included in the alignment (though it doesn't need to be the first one), and all rows must be padded with dashes (-) to make them equal lengths. Also, each column must contain either all uppercase or all lowercase letters. An uppercase letter signifies that the column should be given a position-specific score; a lowercase letter means that the matrix (specified by -M) score should be used. Here is a portion of the example alignment file included in README.bls (the query is 26SPS9_Hs, in this case):

26SPS9_Hs     IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllc

F57B9_Ce      LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymll

YDL097c_Sc    ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlky

YMJ5_Ce       LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymil

FUS6_ARATH    KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvn

COS41.8_Ci    SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetad

644879        KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvs

YPR108w_Sc    IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvt

eif-3p110_Hs  SKAMKMGDWKTCHSFIINEKMNGkvw---------------

T23D8.4_Ce    SKAMLNGDWKKCQDYIVNDKMNQkvw---------------

YD95_Sp       IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavis

KIAA0107_Hs   LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyv

F49C12.8_Hs   LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvit

Int-6_Mm      KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklase

26SPS9_Hs     kimlntpedvqalvsgklalryagrqtealkcvaqasknr

F57B9_Ce      ckvmldlpdevnsllsaklalkyngsdldamkaiaaaaqk

YDL097c_Sc    mllskimlnliddvknilnakytketyqsrgidamkavae

YMJ5_Ce       ckimlneteqlagllaakeivayqkspriiairsmadafr

FUS6_ARATH    kaeqnpetlepmvnaklrcasglahlelkkyklaarkfld

COS41.8_Ci    eqlqihykvcyarvldyrrkfleaaqrynelsyksaihet

644879        kaestpeiaeqrgerdsqtqailtklkcaaglaelaarky

YPR108w_Sc    glftlertdlkskvidspellslisttaalqsissltisl

eif-3p110_Hs  ----------------------------------------

T23D8.4_Ce    ----------------------------------------

YD95_Sp       gaisldrvdvktkivdspevlavlpqnesmssleacinsl

KIAA0107_Hs   smialerpdlrekvikgaeilevlhslpavrqylfslyec

F49C12.8_Hs   ttfaldrpdlrtkvircnevqeqltggglngtlipvreyl

Int-6_Mm      ilmqnwdaamedltrlketidnnsvssplqslqqrtwlih
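Before handing an alignment to -B, it can be worth checking it against the rules above. The following Python sketch does so under several assumptions: each row is a name followed by whitespace and a segment, the query is identified by its row name (a simplification made for this sketch), and dashes are ignored when checking the case of a column. The function names are hypothetical.

```python
from typing import Dict, List


def read_alignment(text: str) -> Dict[str, str]:
    """Join the blocks of a headerless Clustal-style alignment by row name."""
    rows: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:  # blank lines separate blocks
            continue
        name, segment = parts
        rows[name] = rows.get(name, "") + segment
    return rows


def check_alignment(rows: Dict[str, str], query_name: str) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if query_name not in rows:
        problems.append(f"query {query_name!r} is missing")
    if len({len(seq) for seq in rows.values()}) > 1:
        problems.append("rows differ in length; pad with dashes")
    for col, residues in enumerate(zip(*rows.values()), start=1):
        letters = [c for c in residues if c != "-"]
        if any(c.isupper() for c in letters) and any(c.islower() for c in letters):
            problems.append(f"column {col} mixes uppercase and lowercase")
    return problems


# Tiny invented two-block alignment, used only to exercise the checks.
good = read_alignment("q   ABcd\ns1  AC-d\n\nq   EF\ns1  E-\n")
bad = read_alignment("q   ABc\ns1  Abc\n")
```

Here check_alignment(good, "q") returns an empty list, while the second alignment is flagged because its second column mixes cases.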

 

-c [integer]

 

Default: 9

Program: PSI-BLAST only

Sets a constant in pseudocounts for PSSM. It's generally not necessary to change this parameter.

 

-C [file]

 

Default: Optional

Program: PSI-BLAST only

Outputs a file for PSI-BLAST checkpointing. This outputs the final PSSM for a multipass run of PSI-BLAST. The checkpoint file can then be used in a PSI-BLAST restart (see -R), in a blastall -p psitblastn run (also see -R), or as an entry in an RPS-BLAST database.

blastpgp -d nr -i my_protein -j 5 -C my_protein.ckp

 

-d [string]

 

Default: nr

 

The database name; same as blastall.

 

-e [real]

 

Default: 10

 

The expectation value; same as blastall.

 

-E [integer]

 

Default: 1

 

The penalty to extend a gap; same as blastall.

 

-f [integer]

 

Default: 11

 

The threshold for extending a hit; same as blastall.

 

-F [string]

 

Default: F

 

Filters the query sequence; same as blastall.

 

-g [T/F]

 

Default: T

 

Performs gapped alignment; same as blastall.

PHI-BLAST requires gapping and therefore forbids -g F.

 

-G [integer]

 

Default: 11

 

The penalty to open a gap; same as blastall.

 

-h [real number]

 

Default: 0.005

Program: PSI-BLAST only

The E-value threshold for inclusion in PSSM. All alignments better than this threshold are used in constructing the PSSM.

 

-H [integer]

 

Default: -1

 

The end of the required region in query. The default of -1 indicates the actual end of the query. This option can be used in combination with -S to specify a particular region to use.

 

-i [file]

 

Default: stdin

 

The query file; same as blastall.

 

-I [T/F]

 

Default: F

 

Shows GIs in defline; same as blastall.

 

-j [integer]

 

Default: 1

 

The maximum number of passes to use in a multipass version. The default of 1 is just a regular BLASTP search.

 

-J [T/F]

 

Default: F

 

Believes the query definition line; same as blastall.

 

-k [file]

 

Default: hit_file

Program: PHI-BLAST only

Specifies the file containing the PROSITE pattern to be used for seeding in a PHI-BLAST run. If -k isn't specified when running PHI-BLAST (i.e., with -p patseedp or -p seedp), the program looks for a file called hit_file.

 

-K [integer]

 

Default: 0 - Off

 

The number of best hits from a region to keep; same as blastall.

 

-l [string]

 

Default: Optional

 

Restricts the search of the database to a list of GIs; same as blastall.

 

-L [integer]

 

Default: 0 (disabled)

 

The cost to decline an alignment.

 

-m [0..9]

 

Default: 0

 

Alignment view options; same as blastall.

 

-M [string]

 

Default: BLOSUM62

 

The matrix; same as blastall.

 

-N [real number]

 

Default: 22.0

 

The number of bits required to trigger gapping.

 

-o [file]

 

Default: Optional

 

The output file for alignment; same as blastall.

 

-O [file]

 

Default: Optional

 

SeqAlign file output; same as blastall.

-p [string]

 

Default: blastpgp

 

Specifies whether to run in PSI- or PHI-BLAST mode.

 

Options

blastpgp

PSI-BLAST mode

patseedp

PHI-BLAST mode. Uses all occurrences of the hit_file pattern to seed alignments. Any HI tags (see later) in the hit_file are ignored.

seedp

PHI-BLAST mode. Use this mode when the pattern occurs more than once in the query and only some of the occurrences should be used as seeds. The occurrences to use are specified with HI tags in the hit_file. For example, the following hit_file designates seeding from a pattern that occurs at position 143 of the coelacanth HoxA11 protein:

ID  HoxA11 pattern2

PA  Y-S-[SA]-X-[LVIMK]

HI  143

seedp throws an exception if the hit_file doesn't contain the HI tags.

-Q [file]

 

Default: Optional

 

Output file for a PSI-BLAST matrix in ASCII format. This [file] can't be used in any subsequent programs. Use -C to output a matrix for subsequent searches.

 

-R [file]

 

Default: Optional

 

The input checkpoint file for a PSI-BLAST restart. Uses a checkpoint file that was output with -C.

 

-s [T/F]

 

Default: F

 

Calculates locally optimal Smith-Waterman alignments. Because of the heuristic nature of BLAST, it sometimes produces nonoptimal local alignments. This option causes BLAST to run the full Smith-Waterman alignment algorithm on subjects found by the normal BLAST heuristic. There may be some speed cost using this option, but it helps guarantee high-quality alignments, which are important in PSSM generation. Setting -s T is highly recommended.

 

-S [integer]

 

Default: 1

 

The start of the required region in query. Used in combination with -H, this sets a specific region of the query to be used when generating the PSSM.

 

-t [T/F]

 

Default: T

 

Uses composition-based statistics. With this set to T, the score is adjusted based on composition biases in the query and subject sequences. This helps avoid corrupting the PSSM with low-entropy false positives that would otherwise be introduced into the multiple sequence alignment.

 

-T [T/F]

 

Default: F

 

Produces HTML output; same as blastall.

 

-U [T/F]

 

Default: F

 

Uses lowercase filtering of a query sequence; same as blastall.

 

-v [integer]

 

Default: 500

 

The number of one-line descriptions to show; same as blastall.

 

-W [1..3]

 

Default: 3

 

The word size; same as blastall.

 

-X [integer]

 

Default: 15

 

The X dropoff for gapped alignments; same as blastall.

 

-y [real number]

 

Default: 7.0

 

The X dropoff for ungapped extensions; same as blastall.

 

-Y [real number]

 

Default: 0

 

The effective length of the search space; same as blastall.

 

-z [real number]

 

Default: 0

 

The effective database size; same as blastall.

 

-Z [integer]

 

Default: 25

 

The X dropoff for final gapped alignment; same as blastall.

 

13.9 blastclust Parameters

blastclust clusters a database of protein or nucleotide sequences. It outputs rows of sequence identifiers from the database with clustered sequences occurring on the same row and clusters sorted from largest to smallest. The program can generate a list of clusters for input into another program (e.g., an alignment program such as PHRAP); however, it should be used only on a relatively small number of sequences (10-1000) because it runs only on a single computer, and the RAM requirements quickly exceed most capacities.

Here are a few sample command lines:

blastclust -i my_nucdb -p F -o my_nucdb.clusters

blastclust -i my_pepdb -o my_pepdb.clusters -L 0.7 -S 90

The following reference describes parameters used with blastclust.

-a [integer]

 

Default: 1

Programs: All

Specifies the number of CPUs to use on a multiprocessor machine.

 

-b [T/F]

 

Default: T

 

Requires coverage on both sequences. If set to T, the program requires both sequences to pass the coverage criteria set with -L before they are called neighbors and clustered together.

 

-c [file]

 

Default: Optional

 

Specifies a configuration file with advanced options. The configuration file is simply a list of the options that you commonly use.

 

-C [T/F]

 

Default: F

 

The crash recovery option. Set it to complete unfinished clustering. Set to T if using the -r option with a file to restore the clustering. Use the same command line as the crashed run with the same -s, with only -C T and -r being added. This restarts the run using the hit list file specified by -r and then appending to it (as specified by -s).
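As a small illustration of that recipe, the following Python sketch derives the recovery command line from the crashed one. It assumes the original command was split into a list with a space between each parameter and its argument, and that -r should point at the same hit list file given to -s; the function name and file names are hypothetical.

```python
from typing import List


def recovery_command(crashed_argv: List[str]) -> List[str]:
    """Return the crashed command line with -C T and -r <hit list> added."""
    if "-s" not in crashed_argv:
        raise ValueError("the crashed run must have saved a hit list with -s")
    hit_list = crashed_argv[crashed_argv.index("-s") + 1]
    return crashed_argv + ["-C", "T", "-r", hit_list]


# Hypothetical crashed run that was saving its hit list to my_pepdb.hits.
crashed = ["blastclust", "-i", "my_pepdb", "-o", "my_pepdb.clusters", "-s", "my_pepdb.hits"]
print(" ".join(recovery_command(crashed)))
```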

 

-d [file]

 

Default: Optional

 

The input file is a BLAST database, not a FASTA file.

 

-e [T/F]

 

Default: F

 

Enables ID parsing in the database-formatted report.

 

-i [file]

 

Default: stdin

 

Specifies the FASTA input file for clustering.

 

-l [file]

 

Default: Optional

 

Restricts the reclustering to the IDs specified in [file]. It can be useful when you have a very large FASTA database and wish to cluster a subset of sequences.

 

-L [real number]

 

Default: 0.9

 

Specifies the length coverage threshold.

 

-p [T/F]

 

Default: T

 

Input sequences are proteins. Set to F for nucleotides.

 

-r [file]

 

Default: Optional

 

Specifies the file used to restore neighbors for reclustering. Set -C to T. This file is created by the -s parameter of a previous run. Use it if the program crashes during a run.

 

-s [file]

 

Default: Optional

 

Specifies the file in which to save the hit list. This file can restore a crashed run and is the input file specified by -r.

 

-v [file]

 

Default: stdout

 

Prints progress messages. Progress is reported to standard output if no file is specified.

 

-W [integer]

 

Default: Protein 3, Nucleotide 32

 

The word size; same as blastall.