BLAST (Basic Local Alignment Search Tool)

Chapter 10. Installation and Command-Line Tutorial

This chapter shows you how to install NCBI-BLAST and WU-BLAST software on your own computer. This is necessary if you want to use BLAST in a high-throughput setting or develop specialized applications. After the installation section, this chapter presents a command-line tutorial that also serves as a test suite to make sure your BLAST installation behaves as expected.

10.1 NCBI-BLAST Installation

NCBI-BLAST, as the name implies, is available from the National Center for Biotechnology Information (NCBI). Precompiled binaries and source code are available for free and without restriction. The source code is in the public domain, so there are quite a few derivative works, both commercial and free (see Chapter 12). NCBI-BLAST is currently available as precompiled binaries for 11 popular operating system-hardware combinations. In addition, there is this very generous statement in the README.bls file:

BLAST binaries are provided for IRIX6.2, Solaris2.6 (Sparc) Solaris2.7 (Intel), DEC OSF1 (ver. 5.1), LINUX/Intel, HPUX, AIX, BSD Unix, Darwin, MacIntosh, and Win32 systems. We will attempt to produce binaries for other platforms upon request.

If you have a platform that isn't supported as a precompiled binary, you may wish to take up the offer from the NCBI, or you may be able to find one using an Internet search engine such as Google. You can also compile the executables yourself; the source code may be obtained as part of the NCBI toolbox: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools. For more information about the toolbox, see http://www.ncbi.nlm.nih.gov/IEB/ToolBox.

This chapter will take you through the installation procedures for Unix, Windows, and Macintosh. It doesn't cover how to build the NCBI executables from source. If you are a Windows or Macintosh user, please read the Unix installation first because it has some information that isn't duplicated in the other sections.

10.1.1 Unix Installation

The first step is to download a compressed Unix tape archive, often called a tarball, to your computer. Find the appropriate executable for your system at ftp://ftp.ncbi.nih.gov/blast/executables. A note of caution here: the files in the tarball aren't contained in a subdirectory so it is a good idea to place the tarball in its own directory before you expand the archive. If you're downloading via a browser, you may have plug-ins that automatically expand the archive. This could leave you with a bunch of files all over your system, or it may create a directory for you. To be safe, if you're using a browser, download the tarball to a new directory, for example, /usr/local/pkg/ncbi-blast, or perhaps ncbi-blast in your home directory if you don't have root access.

If the archive hasn't already been expanded, you can expand it with this command, where your_platform_name will be something like linux.tar.Z or linux.tar.gz:

tar -xzf blast.your_platform_name

Not all versions of tar support the -z option above, in which case you can use the following command line:

zcat blast.your_blastform_name | tar -xf -

10.1.1.1 Files and directories

More than 20 files come with the installation. Table 10-1 shows the files and a very brief description in logical order. See the NCBI-BLAST reference in Chapter 13 for comprehensive coverage of each program.

Table 10-1. NCBI-BLAST installation files

File

Description

blastall

The main blast executable. This program runs the five most common BLAST programs: blastnblastpblastxtblastn, and tblastx.

blastpgp

The executable for running PSI-BLAST and PHI-BLAST searches.

bl2seq

Program to align two sequences with the BLAST algorithms.

megablast

Specialized nucleotide BLAST algorithm optimized to rapidly find nearly identical sequences that differ due to sequencing or other similar errors. This can also be called within the BLASTALL program using the -n option.

data/

Directory that contains the scoring matrices and other information necessary for default running of BLAST.

formatdb

Program for formatting BLAST databases from either FASTA or ASN.1 formats.

fastacmd

Program to retrieve sequences from a BLAST database if it was formatted using the -o option of formatdb.

rpsblast

Reverse PSI-BLAST program. This program searches a query sequence against a database of profiles. This is the reverse of PSI-BLAST, which uses a profile to search against a database of sequences.

seedtop

A companion program to PHI-BLAST that can find the positions of patterns in a sequence and all sequences that contain a particular pattern.

blastclust

Program to automatically cluster protein or nucleotide sequences based on pairwise matches.

impala

Integrated Matrix Profiles and Local Alignments. Used to search a database of score matrices (prepared by copymat) and produce BLAST-like output.

makemat

Primary profile preprocessor for IMPALA. Converts a collection of binary profiles into ASCII format.

copymat

Secondary profile preprocessor for IMPALA. Converts ASCII matrix profiles, produced by makemat, into a database that can be read into memory quickly.

README.bcl

Instruction file for blastclust program.

README.bls

Instruction file for blastall program.

README.formatdb

Instruction file for formatdb program.

README.imp

Instruction file for impala program.

README.mbl

Instruction file for megablast program.

README.rps

Instruction file for rpsblast program.

VERSION

Version and build information.

10.1.1.2 The .ncbirc file

The next step is to create a resource file that tells blastall where to find its scoring matrices and other related files. The contents of the file are just these two lines:

[NCBI]

Data="/usr/local/pkg/ncbi-blast/data/"

You may also add to this file a line giving the location of the BLAST database files.

[BLAST] BLASTDB=path_to_db

This file must be named .ncbirc (including the leading dot) and should be located in every user's home directory (although it can also be in the directory where blastall resides).

10.1.1.3 Setting the PATH and BLASTDB environment variables

The next step is to make sure the programs can be called without explicit paths—that is, without having to type the full pathname every time you want to run the program. You should either place symbolic links from the executables in /usr/local/bin or modify your PATH environment variable. If you're not sure how to do this, ask your Unix system administrator to help you or consult an introductory Unix book.

The final step allows you to select databases by name rather than by explicit path. This is more than just a convenience; the abstraction also lets you provide a similar interface on multiple machines where the underlying directory structure may be different. Here is an example of what you might put in your .cshrc file if you use csh or its derivatives as your shell:

setenv BLASTDB /usr/local/blastdb

If you're using one of the sh derivatives, such as bash, use the following:

export BLASTDB=/usr/local/blastdb

That's it, except that you can't use the software without sequences. If you don't need to know about Mac or Windows installation, skip ahead to the command-line tutorial.

10.1.2 Windows Installation

Download the blastz.exe file from ftp://ftp.ncbi.nih.gov/blast/executable, and place this in its own directory, such as C:\ncbi-blast\. This is a self-extracting archive, so you can simply double-click on it, and all the files will be extracted into the current directory. See Table 10-1 for a description of all the files.

10.1.2.1 The ncbi.ini file

Similar to the Unix install, a special file must be created with the path to the data directory. Create a file called ncbi.ini in either the Windows or WINNT directory with the following contents:

[NCBI]

Data="C:\ncbi-blast\data"

Unlike Unix, rather than setting the BLASTDB environment variable to the location of the BLAST databases, add the following to the ncbi.ini file.

[BLAST]

BLASTDB="C:\ncbi-blast\db"

10.1.2.2 Setting the PATH environment variable

The PATH environment variable works like its Unix counterpart. The easiest way to set it is to right-click on the My Computer icon, click on the Advanced tab, and then click on the Environment Variables button (Figure 10-1).

Figure 10-1. Selecting the PATH environment variable

figs/blst_1001.gif

This brings up the System Variables window. Select the Path variable to edit and add ;C:\ncbi-blast to the end of the Path (Figure 10-2) Note that there's a semicolon before the C, which is the separator between directories. Now the BLAST executables can be used from any DOS prompt.

Figure 10-2. Adding the PATH environment variable

figs/blst_1002.gif

10.1.3 Macintosh OS X Installation

MacOS X is Unix under the hood, so you can follow the previous Unix installation procedures (the file is called blast.darwin.tar.Z because Darwin is the actual name of the Unix that MacOS X uses). Alternatively, you can use the friendly installer available from the folks at http://bioteam.net who have put together a CD containing quite a few common bioinformatics application suites including Apple-Genentech-BLAST (an optimized version of NCBI-BLAST, see Chapter 12). The CD image is located athttp://gm.sonsorol.org:8080/BioInfxToolsInstaller.cdr.

The installation procedure could not be much simpler. Double-click on the BioInfxToolsInstaller.cdr image, open the BioInfxToolsInstaller that appears on your desktop, and then double-click the agncbi12-20-2001.pkg. This launches a typical installer, and after a few clicks and keystrokes, you're done. At the end, you need to do two more things: add one line to your .cshrc file and copy the .ncbirc file to your home directory. To do this, open the Terminal application and type the following two lines exactly as they appear here:

echo "source/usr/local/biotools/cshrc.biotools" >> ~/.cshrc

cp /usr/local/biotools/.ncbirc ~/.ncbirc

10.1.4 Macintosh OS 9 Installation

The OS 9 archive is called blast.hqx. If you click on the file icon, your browser will most likely launch the appropriate tools to automatically expand the archive. If not, you can use Stuffit Expander, which is available for free from http://www.stuffit.com. The OS 9 applications look completely different from the command-line versions because they all have a graphical interface. Don't worry about this because the interface isn't pretty, and you have to drag the window across your screen several times to see all the buttons and text fields. (You may also experience a few system crashes because OS 9 isn't the ideal environment for BLAST.) You must also create a special file to tell BLAST where to find its data directory. Create a file called ncbi.cnf in your system folder that contains the path to the data folder. For example, if the data folder is in a computer named MyMac and in a folder called Blast, the ncbi.cnf file should look like this:

[BLAST]

BLASTDB=MyMac:Blast:data

Installation instructions for OS 9 are included for completeness, but Apple no longer supports this operating system. You might want to upgrade to OS X or install one of the Linux distributions for PPC. If you install Linux, you may have to compile the executables from the source, but it's worth checking if anyone has already done this. A Google search for "Mac linux BLAST" is a good place to start.

10.2 WU-BLAST Installation

Obtaining WU-BLAST software is slightly more complicated than NCBI-BLAST because it requires a license from Washington University in St. Louis. If you are affiliated with an academic institution or a nonprofit organization, the license is free. If you are part of a for-profit enterprise, you must pay a licensing fee. The price is expensive by shrink-wrapped software standards, but is similar to other bioinformatics software packages available from universities. If you find the cost prohibitive, an earlier version of WU-BLAST is available for free. The free version contains fewer features, and is available for a limited number of operating systems, but for most people, it works just fine. If your operating system isn't supported and your specific use doesn't require gapped alignment, a free version of the classic, ungapped BLAST with public domain source code also exists. This older version, 1.4.9, is nearly identical to NCBI-BLAST Version 1.4, which is no longer available from the NCBI.

Should you wish to license WU-BLAST or download the free versions, visit the official site for the WU-BLAST software at http://blast.wustl.edu. The free versions can be downloaded with a couple clicks, but more patience is required for the licensed version. After the license is issued, you will be sent a user-specific URL from which to download the software. It's a good idea to save this information because you will use it again to download the free updates. Licensed users are notified by email as new features are added (usually a few times per year).

WU-BLAST is available only for Unix operating systems. If you don't have access to a Unix computer, you can run Linux or FreeBSD under a virtual machine with products such as VMWare (http://www.vmware.com) or VirtualPC (http://www.connectix.com).

10.2.1 Expanding the tarball

The software comes as a compressed Unix archive, or tarball. First, create a directory such as /usr/local/pkg/wu-blast; if you don't have root access, create a wu-blast directory inside your home directory. Next, download the tarball to that directory. If you do this from a browser, the files may be extracted automatically. If not, use the following command, where your_platform_name will be something like linuxi686.tar.Z:

tar -xzf blast2.your_platform_name

Not all versions of tar support the -z option above, in which case you can use the following command line:

zcat blast.your_blastform_name | tar -xf -

Before you continue with the rest of the installation procedures, look at what's inside the tarball.

10.2.2 Files and Directories

There are a number of files and two subdirectories. The most important items are described very briefly in Table 10-2 in logical, rather than alphabetical, order. See the WU-BLAST reference in Chapter 14 for more information.

Table 10-2. WU-BLAST files and directories

Name

Description

blasta

The WU-BLAST executable. Unlike the free version, which comes with five different BLAST executables, the licensed version has only one.

blastnblastpblastxtblastntblastx

Symbolic links (aliases) to blastablasta figures out what kind of program to run based on the name of the symbolic link.

xdformat

Executable for formatting both nucleotide and protein databases.

xdget

Executable that allows you to retrieve sequences by accession number from a WU-BLAST database.

nrdbpatdb

Programs used to create nonredundant databases. nrdb keeps only unique sequences and concatenates the descriptions of identical sequences. patdb goes a little further and removes sequences that are perfect substrings of other sequences.

gb2fastagt2fasta,pir2fastasp2fasta

Programs to convert GenBank, SwissProt, and PIR files to FASTA files. gb2fasta extracts the nucleotides, and gt2fasta extracts the proteins.

filter

Directory containing the complexity filtering programs used by WU-BLAST (segdust, and xnu).

matrix

Directory containing two subdirectories, aa and nt, which contain, respectively, the amino acid and nucleotide scoring matrices. The amino acid matrices like BLOSUM 62 are singular files, but the nucleotide matrices exist in two forms, with the extension 4.2 or 4.4 that corresponds to 4- and 16-symbol matrices.

setdbpressdb

Executable used to format protein and nucleotide databases. The xdformat executable replaces these programs, but they are included for those who prefer the old interface or require compatibility with older executables.

wu-blastallwu-formatdb

Perl scripts that mimic the NCBI-BLAST command-line interface while executing the WU-BLAST counterparts.

sysblast

Configuration file that allows administrators to enforce system-level resource limitations on BLAST jobs.

10.2.3 Executables

Let's assume the tarball has been downloaded to /usr/pkg/wu-blast, and you normally keep your executables in /usr/local/bin. Issue the following commands to put the executables in your path.

ln -s /usr/pkg/wu-blast/blasta /usr/local/bin/blastn

ln -s /usr/pkg/wu-blast/blasta /usr/local/bin/blastp

ln -s /usr/pkg/wu-blast/blasta /usr/local/bin/blastx

ln -s /usr/pkg/wu-blast/blasta /usr/local/bin/tblastn

ln -s /usr/pkg/wu-blast/blasta /usr/local/bin/tblastx

ln -s /usr/pkg/wu-blast/xdformat /usr/local/bin

ln -s /usr/pkg/wu-blast/xdget /usr/local/bin

Note, unlike the NCBI program blastallblasta can not be executed by its own name, but only through aliases.

10.2.4 Environment Variables

You'll need to set three environment variables: BLASTDB, BLASTMAT, and BLASTFILTER. These variables correspond to the locations of the databases, scoring matrices, and complexity filters. WU-BLAST environment variables use a colon-delimited list of locations, like the PATH variable. This is especially useful for database files, which can be placed in several locations in the filesystem and then be accessed by name rather than explicit path. This is convenient because it allows computers to access databases on a networked server or on their local disks, and this is invisible to the user. Databases are looked for from a colon-delimited list of locations defined in the BLASTDB environment variable (similar to the PATH variable for executables). If BLASTDB isn't set, blasta looks in the current directory and in /usr/ncbi/blast/db. In these cases, FASTA databases of the same name must be present (or symbolic links to such databases). It's generally a better idea to use the BLASTDB variable because this strategy uses less disk space and is much less confusing.

Two environment variables, BLASTMAT and BLASTFILTER, must be set so blasta can find the scoring matrices and complexity filters. These variables also use colon-delimited lists, but there's little reason to have them in more than one location.

Now set the BLASTMAT and BLASTFILTER environment variables to the explicit paths of the matrix and filter directories (we'll assume that the software was unpackaged in/usr/local/wu-blast). Here's how to do so in csh and its derivatives:

setenv BLASTMAT /usr/local/wu-blast/matrix

setenv BLASTFILTER /usr/local/wu-blast/filter

And in sh and its derivatives:

BLASTMAT=/usr/local/wu-blast/matrix

BLASTFILTER=/usr/local/wu-blast/filter

export BLASTMAT BLASTFILTER

10.2.5 Setting Resource Limits with /etc/sysblast

WU-BLAST has a special file called /etc/sysblast that sets systemwide resource limitations for each machine running BLAST jobs. The /etc/sysblast file currently supports three commands: nicecpus, and cpusmax. The nice value gives BLAST processes a lower priority (nice values are generally in the range of 1 to 20, with 20 being the least demanding). If the computer is used for other jobs, such a workstation, setting this to 5 makes the workstation more responsive, but the BLAST job will take over at idle times. The cpus value is the default number of CPUs to use, and cpusmax defines the maximum number of CPUs allowed. These two should be set on any large, multiprocessor machine. Here is a sample /etc/sysblast file:

nice = 5

cpus = 1

cpusmax = 4

The behavior of WU-BLAST on multiprocessor systems is worth discussing, and if you're one of the lucky people who have access to a computer with 16 processors or more,/etc/sysblast will definitely help you. WU-BLAST lets users control the number of CPUs with the -cpus command-line parameter. If this parameter isn't given an explicit value, the programs uses all the processors in the computer (except for BLASTN, which reins itself in at four processors). While this may be good for BLAST users, your other users may not be so happy. This is where the /etc/sysblast file is critical because it allows you to modify the default behavior and set limits for CPU usage.

10.3 Command-Line Tutorial

Now that you've installed the NCBI-BLAST and/or WU-BLAST software, it's time to try it out. To do this, you will need sequences for both queries and databases. It's generally a good idea to start with something small and make sure it works before attempting to analyze large databases. To begin, download the book's example files from http://examples.oreilly.com/BLAST. Copy the Sequence directory to a local hard disk. This directory contains several database and sequences in FASTA format. The six files you will need for testing are described in Table 10-3.

Table 10-3. Contents of the Sequence directory

Name

Description

ESTs

2000 nucleotide sequences from the nematode Caenorhabditis elegans. These sequences originated from http://www.wormbase.org.

EST

A single EST from the previous collection.

globins

1203 protein sequences corresponding to the globin family (Pfam Version 7.6). These sequences and other protein families are available from http://www.sanger.ac.uk/Software/Pfam.

globin

A single protein from the previous collection.

AF287139

Latimeria chalumnae (Coelacanth) Hoxa-11 nucleotide sequence.

AAG39070

Latimeria chalumnae (Coelacanth) HoxA-11 protein sequence.

fugu_genomic

Takifugu rubripes (Pufferfish) genomic sequence containing globin genes.

chicken_genomic

Gallus gallus (Chicken) genomic sequence containing globin genes.

HoxDB

Nine homeobox protein sequences from different organisms.

p53

The Drosophila melanogaster p53 protein sequence.

p53DB

Twelve p53 homologous protein sequences from different organisms.

hit_file.p53

p53 pattern file for PHI-BLAST search.

10.3.1 NCBI-BLAST

In the following subsections, you'll run the various components of the BLAST software you just installed. You'll see brief descriptions of each program and its arguments, but the programs all have more options those displayed. See Chapter 13 for more information about each program. To begin, change your directory to the sequence directory.

10.3.1.1 formatdb

We'll start by formatting the ESTs database with the following command:

formatdb -i ESTs -p F -l ESTs.logfile -o

The -i ESTs indicates the name of the input file. The -p F indicates this isn't a protein database. The -l ESTs.logfile creates a log file. The -o creates indexing files so that sequences may be retrieved from the database. This command creates six files: ESTs.nhrESTs.ninESTs.nsdESTs.nsiESTs.nsq, and ESTs.logfile. With the exception of the log file, the files are in binary format and not human-readable. The log file looks like this (the version number will differ depending on your installation):

========================[ Oct 9, 2002 11:40 AM ]========================

Version 2.2.2 [Jan-08-2002]

Started database file "ESTs"

Formatted 2000 sequences

(END)

Now, format the globins database with the following command:

formatdb -i globins -p T -l globins.logfile -o

The parameters here are similar to those previous, with the exception of -p T because this is a protein database. Because this is the default condition, you can omit -p T. This command also creates six files: globins.phrglobins.pinglobins.psdglobins.psiglobins.psq, and globins.logfile.

10.3.1.2 blastn

Let's query the ESTs database with a single EST from that database and see if you can find it:

blastall -p blastn -d ESTs -i EST > ncbi-blastn_test

After the standard header, you should see output identical to this:

Query= AU108953 AU108953  Caenorhabditis elegans cDNA clone:yk701a6 :

5' end, single read.

  (322 letters)

Database: ESTs

    2000 sequences; 673,456 total letters

Searching....done

    Score     E

Sequences producing significant alignments:    (bits)  Value

AU108953 AU108953  Caenorhabditis elegans cDNA clone:yk701a6 : 5...   632  0.0

AU109042 AU109042  Caenorhabditis elegans cDNA clone:yk702b3 : 5...   567  e-163

AU109359 AU109359  Caenorhabditis elegans cDNA clone:yk705g2 : 5...   436  e-124

AU110478 AU110478  Caenorhabditis elegans cDNA clone:yk718h12 : ...   383  e-107

AU109122 AU109122  Caenorhabditis elegans cDNA clone:yk703d7 : 5...    32  0.040

AU110305 AU110305  Caenorhabditis elegans cDNA clone:yk718a9 : 5...    30  0.16

The summary of the report shows that the top sequence has accession number AU108953, which is the same as the query used in the file named EST.

If you would like to benchmark the performance of your system as a way to compare different machines, you can search all the ESTs against one another. This will take about 10 minutes on a 1-GHz processor. You'll see this same benchmark in Chapter 12. Read that chapter to understand why this is or isn't a good benchmark for your system. Here is the Unix command line for the benchmarking procedure:

time blastall -p blastn -d ESTs -i ESTs > /dev/null

10.3.1.3 megablast

megablast is designed for high-performance blastn searches. Do the same two searches as in previous sections.

megablast -d ESTs -i EST > megablast-test

The default output is tabular. For more information on this format, see Appendix A.

'AU108953'=='+AU108953' (1 1 322 322) 0

'AU109042'=='+AU108953' (18 14 324 319) 5

'AU109359'=='+AU108953' (12 58 282 322) 13

'AU109359'=='+AU108953' (12 58 282 322) 13

'AU110478'=='+AU108953' (25 21 240 236) 8

If you want to see how much faster megablast is than standard blastn, time them both. To compare them on equal footing, set the blastn word size higher (-W 30) and use tabular output (-m 8).

time blastall -p blastn -d ESTs -i ESTs -W 30 -m 8 > /dev/null

time megablast -d ESTs -i ESTs > /dev/null

(megablast ran about 12 times faster on the Ian's laptop in these tests.)

10.3.1.4 blastp

Run blastp by searching a single protein against a database of proteins:

blastall -p blastp -d globins -i globin > ncbi-blastp_test

This BLAST report is fairly long, and the following is only part of the summary. The top of your report should look like this:

Query= Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.

  (145 letters)

Database: globins

    1203 sequences; 211,084 total letters

Searching...done

    Score     E

Sequences producing significant alignments:    (bits)  Value

Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.       293  6e-82

HBG_ATEGE P06891 Hemoglobin gamma chain.   277  4e-77

HBG1_CALMO Q9GLX4 Hemoglobin gamma-1 chain.       276  8e-77

HBG_ALOBE P56284 Hemoglobin gamma chain.   275  1e-76

Q9GJS7 Q9GJS7 GAMMA1-GLOBIN (GAMMA2-GLOBIN).      274  2e-76

Q9GLX7 Q9GLX7 HYBRID GAMMA1/GAMMA2-GLOBIN.        273  4e-76

You can also benchmark blastp in the same way as blastn, by doing all-versus-all protein searches. This will take about twice as long to complete as the blastn benchmark.

time blastall -p blastp -d globins -i globin > /dev/null

10.3.1.5 blastx

It is common to use blastx as a gene-finding tool by searching a genomic sequence against a protein database:

blastall -p blastx -d globins -i fugu_genomic > ncbi-blastx_test

The genomic sequence contains a cluster of alpha globin genes, so you can expect to find quite a few matches to the globins database. Your report should look like this:

Query= gi|18463974|gb|AY016024.1| Takifugu rubripes alpha globin gene

cluster, complete sequence

  (35,793 letters)

Database: globins

    1203 sequences; 211,084 total letters

Searching...done

    Score     E

Sequences producing significant alignments:    (bits)  Value

Q9PVU6 Q9PVU6 EMBRYONIC ALPHA-TYPE GLOBIN.        137  3e-46

HBA_THUTH P11748 Hemoglobin alpha chain.   119  3e-40

HBA2_NOTCO P16308 Hemoglobin alpha-2 chain.       130  5e-40

HBA2_TRENE P45719 Hemoglobin alpha-2 chain.       126  6e-40

Q98974 Q98974 ALPHA-GLOBIN IV.      119  7e-39

Q98SE6 Q98SE6 ALPHANCP2 (ALPHANCP1).       140  9e-39

HBA_ELEEL P14520 Hemoglobin alpha chain.   119  9e-39

10.3.1.6 tblastn

Run tblastn in a typical application that searches an EST databases for protein similarities.

blastall -p tblastn -d ESTs -i globin  > ncbi-tblastn_test

Here's the first alignment of the report in addition to the summary. We'll leave the question of biological and statistical significance as an exercise for you.

Query= Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.

  (145 letters)

Database: ESTs

    2000 sequences; 673,456 total letters

Searching....done

   Score     E

Sequences producing significant alignments:    (bits)  Value

AU109565 AU109565  Caenorhabditis elegans cDNA clone:yk708f8 : 5...    28  0.043

AU109149 AU109149  Caenorhabditis elegans cDNA clone:yk703g6 : 5...    27  0.096

AU109370 AU109370  Caenorhabditis elegans cDNA clone:yk705h2 : 5...    25  0.37

AU110425 AU110425  Caenorhabditis elegans cDNA clone:yk716a3 : 5...    24  0.62

AU109448 AU109448  Caenorhabditis elegans cDNA clone:yk706h11 : ...    20  6.9

AU109900 AU109900  Caenorhabditis elegans cDNA clone:yk712d7 : 5...    20  6.9

AU110608 AU110608  Caenorhabditis elegans cDNA clone:yk720c9 : 5...    20  9.0

AU109391 AU109391  Caenorhabditis elegans cDNA clone:yk706b11 : ...    20  9.0

>AU109565 AU109565  Caenorhabditis elegans cDNA clone:yk708f8 : 5' end,

    single read.

   Length = 327

 Score = 27.7 bits (60), Expect = 0.043

 Identities = 12/34 (35%), Positives = 20/34 (58%)

 Frame = -1

Query: 21  VEDAGGETLGRLLVVYPWTQRFFDSFGSLCSPSA 54

    + + G   + R+  V P +QRF  S  ++CSP+A

Sbjct: 144 IREVGESPVIRIFFVLPGSQRFIVSRRAICSPTA 43

10.3.1.7 tblastx

tblastx is commonly used as a gene-finding tool to identify potential coding regions between genomes. You can simulate this by aligning alpha globin clusters from chicken and fish. You haven't formatted the chicken genomic sequence database yet, so do this first and then run the search:

formatdb -i chicken_genomic -p F

blastall -p tblastx -d chicken_genomic -i fugu_genomic> ncbi-blastx_test

You should see the following error message while your search is running. This warns you that there are more than 200 significant alignments between the two sequences. You can't increase this number with a command-line switch, and the only workaround is to cut your query sequence into smaller pieces. This limitation applies only in ungapped searches (TBLASTX or any search with the setting -g F).

[blastall] WARNING:  [000.000] gi|18463974|gb|AY016024.1|: Reached max 200 HSPs in

BlastSaveCurrentHsp, continuing with this limit

Since there's only one sequence in the database, look at the first alignment instead of the summary:

>gi|17104478|gb|AY016020.1| Gallus gallus alpha globin gene cluster,

      complete sequence

   Length = 103190

 Score =  141 bits (302), Expect = 1e-34

 Identities = 61/73 (83%), Positives = 66/73 (89%)

 Frame = -2 / -3

Query: 25076 ASSDDMTLTSPSMDNSSAELLPGGDSPLNKRITETLLASLSEHERQVILSVPAAQNPEDL 24897

      ASSDDMTLTSPSMDNSSAEL+PGGDSPLNKR+TE LLASL EHER+ IL+VPAAQNPEDL

Sbjct: 5289  ASSDDMTLTSPSMDNSSAELIPGGDSPLNKRMTENLLASLLEHEREAILNVPAAQNPEDL 5110

Query: 24896 RMFAR*NHLSTKC 24858

      RMFAR*   S +C

Sbjct: 5109  RMFAR*EIGSAEC 5071

Note the stop codon (*) near the end of the alignment. The scoring matrices distributed with BLAST set the score of all stop codon matches to +1. If you want to terminate alignments at stop codons, you have to edit the scoring matrix. This procedure is described at the end of this chapter.

10.3.1.8 bl2seq

The previous tblastx test used only two sequences. Whenever you want to align just two sequences, you can use the bl2seq program to save the effort of having to format one sequence as a database and remove the database files later. Here's the command:

bl2seq -p tblastx -j chicken_genomic -i fugu_genomic

You won't see a summary, but the rest of the report will be nearly identical to the one in the previous section.

10.3.1.9 fastacmd

The fastacmd program can retrieve one or more FASTA-formatted sequences from a BLAST database. Try this by retrieving a single sequence from the ESTs database:

fastacmd -d ESTs -s AU108953 -l 60

The -d ESTs designates the database from which you retrieve the sequence. The database must have been formatted with the -o option of formatdb for fastacmd to work. The -sAU108953 is the string for which to search the database. The -l 60 specifies the sequence line lengths; the default line length is 80. The output from this test should be the following:

>lcl|AU108953 AU108953  Caenorhabditis elegans cDNA clone:yk701a6 : 5' end, sing

le read

TGGCCTACTGGANAAAACAACAATGCGTGCTTTACTATTCACCTCTGTTGTTCTTTTGGC

TTTGGCTTTTGTTGAGGCAAAGAAGCAGACTATCACTGTCAAGGGTACAACTATTTGTAA

TAAGAAGAGAATTCAGGCGGAGGTTACCCTTTGGGAGAAAGATACTCTCGACCCCGATGA

CAAGCTCGCCTCAATGCAATCGAACAAAGAAGGAGAGTTCTCACTTACCGGATCCGACGA

CGAGATCACCTCAATCTCTCCATACCTCATAATCACCCACAACAGCAACGTGAAGAAGGC

CGGATGAAGCCGTGTTTCAGAG

The beginning of the definition has a prepended lcl|. The lcl| isn't in the original FASTA definition line and doesn't show up in BLAST reports; it shows up here because the database identifier wasn't in NCBI format (see Chapter 11).

fastacmd has several other useful features. It can report a summary of a BLAST database, dump a BLAST database to FASTA format, report a subsequence of a particular sequence, and even display taxonomic information if the databases are downloaded from ftp://ftp.ncbi.nih.gov/blast/db/. See Chapter 13 for more information on fastacmd.

10.3.1.10 PSI-BLAST

PSI-BLAST identifies weak amino acid similarities and is executed using the blastpgp program. Test it by iteratively searching the Drosophila melanogaster p53 protein against a database of p53 homologs. First, format the p53 database, just as you would for any protein database:

formatdb -i p53DB -p T -l p53DB.logfile -o

Now you can search with your p53 protein{

blastpgp -d p53DB -i p53 -j 5 > ncbi-psiblast_test

You'll get a report with four iterations. All four rounds of automated searching are concatenated into one report. The following represents the summary section for these four rounds.

In round 1, there are 11 matches out of 12 total database sequences. Notice that there is an insignificant hit to a hypothetical protein at the bottom of the report with a score of 20 and an E-value near 1.

Query= gi|7211767|gb|AAF40427.1|AF224713_1 transcription factor p53

[Drosophila melanogaster]

  (385 letters)

Database: p53DB

    12 sequences; 3745 total letters

Results from round 1

Score    E

Sequences producing significant alignments:       (bits) Value

gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST]     98   3e-024

gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi...    75   2e-017

gb|AAA98563.1| p53 tumor suppressor homolog        70   9e-016

gb|AAC24830.1| p53 homolog [Homo sapiens]   63   1e-013

gb|AAC37335.1| p53 [Canis familiaris]       59   1e-012

gb|AAC05704.1| tumor suppressor p53 [Mus musculus]        58   4e-012

gb|AAC60746.1| p53 [Xenopus laevis] 57   9e-012

gb|AAD12203.1| tumor suppressor protein [Canis familiaris]       43   1e-007

gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss]     42   2e-007

gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ...    35   4e-005

gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana]       20   0.98

In round 2, all 12 sequences from the p53DB have hits better than the E-value threshold of 10. All scores are higher and an annotated p53 peptide fragment from mouse (gb|AAC71764.1|) is now found with a significant score.

Results from round 2

        Score    E

Sequences producing significant alignments:       (bits) Value

Sequences used in model and found again:

gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi...   293   4e-083

gb|AAC24830.1| p53 homolog [Homo sapiens] 291   3e-082

gb|AAC05704.1| tumor suppressor p53 [Mus musculus]       279   7e-079

gb|AAC37335.1| p53 [Canis familiaris]      279   9e-079

gb|AAA98563.1| p53 tumor suppressor homolog       278   1e-078

gb|AAC60746.1| p53 [Xenopus laevis]        270   5e-076

gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST]    251   3e-070

gb|AAD12203.1| tumor suppressor protein [Canis familiaris]      227   4e-063

gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ...   221   2e-061

gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss]    127   4e-033

Sequences not found previously or not previously below threshold:

gb|AAC71764.1| p53 protein; Trp53 [Mus musculus musculus]        40   8e-007

gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana]       25   0.028

In round 3, the scores of all hits are even better as the position-specific scoring matrix is increasingly tuned to the p53 profile.

Results from round 3

Score    E

Sequences producing significant alignments:       (bits) Value

Sequences used in model and found again:

gb|AAC60746.1| p53 [Xenopus laevis]        323   6e-092

gb|AAC05704.1| tumor suppressor p53 [Mus musculus]       309   7e-088

gb|AAA98563.1| p53 tumor suppressor homolog       306   5e-087

gb|AAC24830.1| p53 homolog [Homo sapiens] 304   2e-086

gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi...   295   1e-083

gb|AAC37335.1| p53 [Canis familiaris]      285   2e-080

gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST]    271   2e-076

gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ...   241   2e-067

gb|AAD12203.1| tumor suppressor protein [Canis familiaris]      231   3e-064

gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss]    124   3e-032

gb|AAC71764.1| p53 protein; Trp53 [Mus musculus musculus]        43   1e-007

Sequences not found previously or not previously below threshold:

gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana]       32   3e-004

In round 4, no new sequences are found below the level of significance (0.005). Therefore the search has converged, and no new iterations are performed.

Results from round 4

Score    E

Sequences producing significant alignments:       (bits) Value

Sequences used in model and found again:

gb|AAC60746.1| p53 [Xenopus laevis]        333   3e-095

gb|AAL99584.1|AF285104_1 p53-like transcription factor p120 [Spi...   319   6e-091

gb|AAC24830.1| p53 homolog [Homo sapiens] 319   6e-091

gb|AAC05704.1| tumor suppressor p53 [Mus musculus]       313   5e-089

gb|AAA98563.1| p53 tumor suppressor homolog       311   3e-088

gb|AAC37335.1| p53 [Canis familiaris]      289   8e-082

gb|EAA07957.1| agCP1306 [Anopheles gambiae str. PEST]    272   6e-077

gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ...   241   2e-067

gb|AAD12203.1| tumor suppressor protein [Canis familiaris]      230   3e-064

gb|AAF78533.1|AF223793_1 tumor supressor p53 [Oncorhynchus mykiss]    125   2e-032

gb|AAM96822.1| hypothetical protein [Arabidopsis thaliana]       70   1e-015

gb|AAC71764.1| p53 protein; Trp53 [Mus musculus musculus]        44   8e-008

Sequences not found previously or not previously below threshold:

CONVERGED!

10.3.1.11 PHI-BLAST

PHI-BLAST seeds extensions from important regions of a protein—for example, an enzyme active site or a conserved domain. It is also executed with the blastpgp program. To specify a PHI-BLAST search, a hit_file must be available (-k) and the program (-p) must be specified as either patseedp or seedp, with seedp specifying a special hit_file. You'll run the program using a conserved part of the Drosophila melanogaster p53 protein as a pattern seed to search the p53DB. The pattern is designated in hit_file.p53:

ID  p53 Pattern

PA  [YF]-[ST]-X-X-L-N-K-L-[YF]

The following command uses this pattern as a seed in a search of the p53 database:

blastpgp -d p53DB -i p53 -k hit_file.p53 -p patseedp > ncbi-phiblast_test

In the results, the position of the pattern in the query and the probability of finding the pattern are shown, followed by the summary lines and alignments. In the first alignment, asterisks show the position of the pattern in the query and subject.

1 occurrence(s) of pattern in query

 p53 Pattern

 pattern [YF]-[ST]-X-X-L-N-K-L-[YF]

 at position 107 of query sequence

effective database length=3.6e+003

 pattern probability=1.4e-008

lengthXprobability=5.0e-005

Number of occurrences of pattern in the database is 4

WARNING: There may be more matching sequences with e-values below the threshold of

10.000000

       Score     E

      (bits)  Value

Significant matches for pattern occurrence 1 at position 107

gb|AAC37335.1| p53 [Canis familiaris]       47   2e-014

gb|AAC60746.1| p53 [Xenopus laevis]         41   2e-012

gb|AAC05704.1| tumor suppressor p53 [Mus musculus]        38   2e-011

gb|AAG35765.1|AF209191_1 p53 alternative splice isoform p35/HAS ...    16   7e-005

Significant alignments for pattern occurrence 1 at position 107

>gb|AAC37335.1| p53 [Canis familiaris]

   Length = 281

 Score = 47.2 bits (132), Expect = 2e-014

 Identities = 56/220 (25%), Positives = 94/220 (42%), Gaps = 27/220 (12%)

Query:  105 WMYSIPLNKLYIRMNKAFNVDVQFKSKMPIQPLNLRVFLCF--SNDVSAPVVRCQNHLSV 162

pattern 107   *********

     W YS  LNKL+ ++ K   V +   S  P     +R    +  S  V+  V RC +H  

Sbjct:  16  WTYSPLLNKLFCQLAKTCPVQLWVSSPPPPNTC-VRAMAIYKKSEFVTEVVRRCPHHERC 74

10.3.1.12 Environment variables and .ncbirc

If everything works, make sure the environment variables and data directory function correctly. Now, move to any random directory on your system and attempt one of the previous searches, but substitute the correct path to the query file.

If the executable path is incorrect, you will have an error message such as:

blastall: command not found

Databases that can't be found produce error messages like the following:

[blastall] WARNING: <query> Could not find index files for database <db>

If the data directory can't be found, blastall reports that it is unable to open the scoring matrix:

Searching[blastall] WARNING: <query> [000.000] Unable to open <matrix>

10.3.2 WU-BLAST

As in the previous section, you'll run the various components of the WU-BLAST software in typical sequence analysis settings.

10.3.2.1 xdformat

Start by formatting the ESTs database with the following command. The -n indicates that this is a nucleotide database. This command creates three files: ESTs.xndESTs.xns, and ESTs.xnt.

xdformat -n ESTs

xdformat also prints some interesting statistics to stdout. You can always retrieve this information later with xdformat -n -r ESTs.

XDFORMAT-WashU 1.0 [02-Apr-2002] [macosx-ppc-ILP32F64 2002-04-02T01:26:43]

Start:  2002-09-29T12:10:05

Input: "ESTs"

XDF Output Database:  ESTs

 Alphabet:  NCBI2na(1)

 No. of sequences (letters) written:  2000  (673,456)

 Longest sequence written (in database):  933  (933)

 Edit Alphabet:  WUStLna(1)

 Sequences edited for ambiguity codes:  961

 No. of edits (total length) written:  1901  (2102)

 Index entries written (in database):  2000  (2000)

Total cpu time:  0.12u 0.00s 0.12t  Elapsed: 00:00:00

End:  2002-09-29T12:10:05

If you use the free version of WU-BLAST, use pressdb instead of xdformat. The output is quite brief (so it's not shown here). The files are named differently, too: ESTs.csqESTs.nhd, and ESTs.ntb.

pressdb ESTs

Format the globins database with the following command. The -p indicates that this is a file or protein sequences. This command also creates three files: globins.xpdglobins.xps, and globins.xpt.

xdformat -p globins

If you use the free version of WU-BLAST, use setdb and name the files globins.ahdglobins.atd, and globins.bsq:

setdb globins

10.3.2.2 blastn

Let's query the ESTs database with a single EST from that database and see if you can find it:

blastn ESTs EST > wu-blastn_test

The output of your search contains header information and a summary that should look like this:

Query=  AU108953 AU108953  Caenorhabditis elegans cDNA clone:yk701a6 : 5' end,

    single read.

(322 letters)

Database:  ESTs

    2000 sequences; 673,456 total letters.

Searching....10....20....30....40....50....60....70....80....90....100% done

      Smallest

Sum

      High  Probability

Sequences producing High-scoring Segment Pairs:        Score  P(N)      N

AU108953 AU108953  Caenorhabditis elegans cDNA clone:yk70...  1603  3.3e-69   1

AU109042 AU109042  Caenorhabditis elegans cDNA clone:yk70...  1507  7.1e-65   1

AU109359 AU109359  Caenorhabditis elegans cDNA clone:yk70...  1203  3.8e-51   1

AU110478 AU110478  Caenorhabditis elegans cDNA clone:yk71...  1052  3.4e-44   1

AU109925 AU109925  Caenorhabditis elegans cDNA clone:yk71...   310  7.9e-11   1

AU110873 AU110873  Caenorhabditis elegans cDNA clone:yk72...   121  0.998     1

The query sequence, which was in a file named EST, has the accession number AU108953, which is the highest scoring match in the report summary. There were several other hits, and you may want to browse the rest of the file for fun.

If you wish to benchmark your system for comparisons to other systems, you can search all the ESTs against one another. This will take about 10 minutes on a 1-GHz processor. We use this same benchmark in Chapter 12 for NCBI-BLAST; read that chapter to understand why this is or isn't a good benchmark for your system. Don't compare this WU-BLAST benchmark to that produced by NCBI-BLAST because the default parameters produce different results. A proper cross-comparison requires that the results be the same. Here is the command line for the benchmark (note the addition of the -warnings flag to suppress warning messages).

time blastn ESTs ESTs -warnings > /dev/null

10.3.2.3 blastp

You can run blastp as you did blastn, by searching a single protein against a database of proteins:

blastp globins globin > wu-blastp_test

The output of this search is quite a bit longer than the previous EST example. Only part of the summary is shown here:

Query=  Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.

(145 letters)

Database:  globins

    1203 sequences; 211,084 total letters.

Searching....10....20....30....40....50....60....70....80....90....100% done

      Smallest

Sum

      High  Probability

Sequences producing High-scoring Segment Pairs:        Score  P(N)      N

Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.        749  1.0e-76   1

HBG_ATEGE P06891 Hemoglobin gamma chain.   710  1.4e-72   1

HBG1_CALMO Q9GLX4 Hemoglobin gamma-1 chain.       707  2.9e-72   1

HBG_ALOBE P56284 Hemoglobin gamma chain.   705  4.7e-72   1

Q9GJS7 Q9GJS7 GAMMA1-GLOBIN (GAMMA2-GLOBIN).      703  7.7e-72   1

Q9GLX7 Q9GLX7 HYBRID GAMMA1/GAMMA2-GLOBIN.        701  1.3e-71   1

You can benchmark blastp in the same way as blastn, however, it takes about twice as long to complete as the blastn benchmark.

time blastp globins globins -warnings > /dev/null

10.3.2.4 blastx

It is common to use blastx as a gene-finding tool by searching a genomic sequence against a protein database:

blastx globins fugu_genomic filter=seg  > wu-blastx_test

While the search is running, you'll see the following warning message that indicates that some of the default parameters were overrun:

WARNING:  Descriptions of 333 database sequences were not reported due to the

   limiting value of parameter V = 500.

WARNING:  HSPs involving 583 database sequences were not reported due to the

   limiting value of parameter B = 250.

The top of the report should look like this:

Query=  gi|18463974|gb|AY016024.1| Takifugu rubripes alpha globin gene cluster,

    complete sequence

(35,793 letters)

  Translating both strands of query sequence in all 6 reading frames

Database:  globins

    1203 sequences; 211,084 total letters.

Searching....10....20....30....40....50....60....70....80....90....100% done

      Smallest

Sum

      High  Probability

Sequences producing High-scoring Segment Pairs:        Score  P(N)      N

Q9PVU6 Q9PVU6 EMBRYONIC ALPHA-TYPE GLOBIN.        277  7.3e-44   3

HBAA_SERQU Q9PVM4 Hemoglobin alpha-A chain.       275  2.6e-40   3

O13136 O13136 ALPHA-GLOBIN. 278  8.6e-40   3

Q90487 Q90487 AA1 ALPHA GLOBIN.     279  1.4e-39   3

HBA_CYPCA P02016 Hemoglobin alpha chain.   278  1.8e-39   3

10.3.2.5 tblastn

It is common to use tblastn to search EST databases for protein similarities. Here's the command:

tblastn ESTs globin filter=seg  > wu-tblastn_test

The top of the report and the first alignment are shown here. You might consider the significance of the alignment as an experiment.

Query=  Q9GJY8 Q9GJY8 GAMMA2-GLOBIN.

(145 letters)

Database:  ESTs

    2000 sequences; 673,456 total letters.

Searching....10....20....30....40....50....60....70....80....90....100% done

      Smallest

        Sum

    Reading  High  Probability

Sequences producing High-scoring Segment Pairs: Frame Score  P(N)      N

AU109565 AU109565  Caenorhabditis elegans cDNA clone:y... -1    60  0.42      1

AU109149 AU109149  Caenorhabditis elegans cDNA clone:y... -1    57  0.75      1

AU109885 AU109885  Caenorhabditis elegans cDNA clone:y... +1    53  0.996     1

>AU109565 AU109565  Caenorhabditis elegans cDNA clone:yk708f8 : 5' end, single

   read.

Length = 327

  Minus Strand HSPs:

 Score = 60 (26.2 bits), Expect = 0.55, P = 0.42

 Identities = 12/34 (35%), Positives = 20/34 (58%), Frame = -1

Query:    21 VEDAGGETLGRLLVVYPWTQRFFDSFGSLCSPSA 54

      + + G   + R+  V P +QRF  S  ++CSP+A

Sbjct:   144 IREVGESPVIRIFFVLPGSQRFIVSRRAICSPTA 43

10.3.2.6 tblastx

It is common to use tblastx as a gene-finding tool to identify potential coding regions between genomes, and you can simulate this by aligning alpha globin clusters from chicken and fish. You haven't formatted the chicken genomic sequence database yet, so do this first. If you're using the free version, substitute pressdb for xdformat and don't include the -nparameter.

xdformat -n chicken_genomic

tblastx chicken_genomic fugu_genomic filter=seg  > wu-blastx_test

The following warning is reported during the search, indicating that the number of alignments has passed a default threshold. This can be avoided with hspmax=0.

WARNING:  hspmax=1000 was exceeded with 1 of the database sequences.

The first alignment is shown here:

>gb|AY016020.1| Gallus gallus alpha globin gene cluster, complete sequence

Length = 103,190

  Plus Strand HSPs:

 Score = 183 (69.5 bits), Expect = 9.2e-67, Sum P(14) = 9.2e-67

 Identities = 40/53 (75%), Positives = 44/53 (83%), Frame = +2 / +1

Query: 24479 DWKIEITGSS*LVTTRTLRNLSNSVLRCERLMFSLYMISSRWWCPRK**SSLK 24637

      D K E+TGSS LVTT TLRNLSNS+  CER MFSLYMISS+W  PRK**SSL+

Sbjct:   496 D*KTEMTGSSWLVTTSTLRNLSNSISSCERRMFSLYMISSKWCRPRK**SSLR 654

Note that the alignment contains stop codons (*). If you wish to terminate alignments at stop codons, you will have to increase the gap penalties and align without gaps. You will learn more about this topic at the end of this chapter.

10.3.2.7 xdget

The xdget program retrieves sequences from a BLAST database. You can use this only if you have the licensed software, and you must index the databases with xdformat before you can use xdget. The ESTs database can be indexed by the following command:

xdformat -n -X ESTs

You'll see some diagnostic output, and there will be an additional file called ESTs.xni. You can index at the time of formatting if you include the -I flag. The globins database is indexed similarly:

xdformat -p -X globins

You'll also see some diagnostic output that includes 33 lines similar to the following:

 *Duplicate ID:  Q17154

This indicates that you have duplicate database identifiers in the FASTA database. The duplicates are intentional, and you can remove them with the nrdb and patdb tools described later. Now retrieve some sequences:

xdget -n ESTs AU109017

You should see the following output:

>AU109017 AU109017  Caenorhabditis elegans cDNA clone:yk701g7 : 5' end, single read.

TGGCCTACTGGGGTTTAATTACCCAAGTTTGAGATGGCTGCTGCTTCAGTGAAAGGCTTT

TTCCAGCGGACCGGAATCAGCATCAAAGAATATTTTAAACGAATGGGAAATGATTATGCT

ACTGTAGCTAGGGAAACTGTCCAAGGATGTAAAGATAGACCTGTTAAAGCTGGAGTTGTT

TTCTCTGGGCTCGGTTTTTTAACCTATGCATATCAGACAAATCCAACAGAGCTGGAAATG

TATGATTATTTATGCGAGAGACGACAAAAGTTAGTTTTGGTCCCGAATTCTGAGCATAAT

CCGGCTACAACTAAAGAATTAACTGCTCGCGA

And for proteins, you can rely on a similar action:

xdget -p globins HBP_CANLI

You should see the following output:

>HBP_CANLI P42511 Leghemoglobin.

MGAFSEKQESLVKSSWEAFKQNVPHHSAVFYTLILEKAPAAQNMFSFLSNGVDPNNPKLK

AHAEKVFKMTVDSAVQLRAKGEVVLADPTLGSVHVQKGVLDPHFLVVKEALLKTFKEAVG

DKWNDELGNAWEVAYDELAAAIKKAMGSA

10.3.2.8 nrdb and patdb

The nrdb and patdb programs are useful for removing the redundant sequences you saw earlier. Use nrdb first:

nrdb globins > globins_nr

The additional output from the program is as follows:

    --------- Records ---------  -------------- Residues -----------

Database      Read  Duplicate  Written   Read   Duplicate     Written

globins       1203       39       1164      211,084      37,819     173,265

Totals:       1203       39       1164      211,084      37,819     173,265

No. of base word hits:  53 (53 total)

No. of 32-bit hash hits:  39

Total memory allocated:  0.500 MB

Longest comment line read:  184

Longest comment line written:  423

Longest sequence read:  1431

The report shows 39 duplicate records. When indexing the globins database, you'll find that there were 33 duplicate identifiers. However, nrdb looks for duplicate sequences not identifiers. When nrdb finds duplicate sequences, it concatenates the identifiers, separating them with a Control-A character. This is normally a whitespace character in a terminal window but is visible in some text editors and pagers.

patdb is even more aggressive than nrdb for removing redundancies; it removes identical substrings so the sequences MVLQ and MVLQKP are merged into the same entry.

10.3.2.9 Environment variables

Now it's time to make sure the environment variables function correctly. To start, move to any random directory on your system and attempt one of the previous searches, substituting the correct path to the query file:

If the executable path is incorrect, you will see an error message such as:

blastp: command not found

If the databases can't be found, WU-BLAST reports:

FATAL:  Could not open the database:  <database name>

WU-BLAST offers informative messages when it fails to find scoring matrices and complexity filters:

FATAL:  "No such file or directory" error encountered (errno=2), when attempting to

FIND the requested sequence filter program "nseg". Please check the setting of the

BLASTFILTER or WUBLASTFILTER environment variables and the permissions on the

program.

EXIT CODE 32

FATAL:  Could not find or open a substitution matrix file named: BLOSUM62. Check

file access permissions or the setting of the WUBLASTMAT

(BLASTMAT) environment variable.

EXIT CODE 8

For a complete list of errors and messages, see Chapter 14.

10.4 Editing Scoring Matrices

The amino scoring matrix files distributed with NCBI-BLAST and WU-BLAST assign a score of +1 to paired stop codons. This doesn't make much biological sense and reduces the ability of TBLASTX to discriminate between coding and noncoding similarities. Therefore, you should edit the scoring matrices to change stop codon pairs to a highly negative score. Be sure to edit the original matrices. The NCBI-BLAST scoring matrices are in the data directory. For WU-BLAST, they are in the matrix/aa directory. The final line of the scoring matrix files looks like this:

* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

Just change the final number to -999:

* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -999

You have to do this only once if you remember to keep your edited matrices when updating your BLAST installation.

For any of the translating BLAST programs, you can also change all stop scores to highly negative values. If used in conjunction with ungapped extension, doing so prevents a lot of noncoding sequences from appearing in significant alignments. The following Perl script modifies the standard matrices:

#!/usr/bin/perl

while (<>) {

    if (/^#|^\s/) {

print;

    }

    elsif (/^\*/) {

print '*', ' -999' x 24, "\n";

    }

    else {

s/\S+\s*$/\-999\n/;

print;

    }

}

Both NCBI-BLAST and WU-BLAST require matrices to have specific names. Unrecognized names cause NCBI-BLAST to terminate the search. WU-BLAST continues searching, but it employs ungapped values for l, k, and H (it issues a warning to this effect). Try to maintain the names of the matrices, but in a location with an obvious name such as stop-999-matrices. Both versions of BLAST look for scoring matrices in the local directory, and on Unix systems, they recognize the BLASTMAT environment variable. Therefore, prior to the search, you can either create a symbolic link (alias) to the scoring matrix of choice or set the BLASTMAT environment variable to point to the location of specialized matrices. In the following examples, the derivative matrices are located in /my_computer/stop-999-matrices:

ln -s /my_computer/stop-999-matrices/BLOSUM62 .

blastall -p tblastx -d db -i query

rm BLOSUM62

setenv BLASTMAT /my_computer/stop-999-matrices

blastall -p tblastx -d db -i query

unsetenv BLASTMAT

WU-BLAST users can use the altscore parameter to change the scores of any pair of letters rather than edit the matrix files. See Chapter 14 for more information on the altscoreparameter.