Remote batch sequence similarity searches against the public NCBI databases

When batch blast is useful

Comparing many sequences (queries) at once to the NCBI public databases saves time compared with running a separate search for each query, and has the advantage of searching the most up-to-date version of the public databases. Unfortunately, this task cannot be performed directly from the web interface of the various BLAST programs offered on the NCBI BLAST website. Instead, the command-line programs written by the NCBI team must be run from the terminal; the main commands and steps are detailed here.

Batch blast at the QCBS

As part of a collaborative project undertaken by a team of QCBS researchers, we aimed to isolate from the NCBI databases the orthologs of 29 different genes in 94 different species whose genome or transcriptome is partially sequenced. We soon realized that performing these searches and downloading the best hits could not be done quickly by repeated copying and pasting. The steps described below were done on Mac OS X 10.6.8, in summer 2012.

Downloading and installing the BLAST program

BLAST+, the new version of the BLAST programs written by NCBI, offers improved performance and speed, and allows batch BLAST searches to be sent from your computer to the online NCBI databases. It is available for download on this NCBI ftp page. At the time, ncbi-blast-2.2.26+-universal-macosx.tar.gz was downloaded, but the package is frequently updated.

As explained in the BLAST+ installation instructions pages (UNIX and MAC or Windows PC), to unpack and install the program, open the terminal, change to the directory that contains the downloaded archive, and type the command (on a Mac):

tar xvpf ncbi-blast-2.2.26+-universal-macosx.tar.gz 

The program is now installed on your computer.

To run one of the BLAST programs (blastp, blastn, blastx…), change directory to where the BLAST+ program is located. For instance, /Applications/ncbi-blast-2.2.26+/bin.
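
Rather than changing into the bin directory each time, the directory can be added to the shell's search path. The lines below are a minimal sketch, assuming the archive was unpacked under /Applications as in the example path above; the variable name BLAST_BIN is only illustrative and should point to your own location.

```shell
# Assumed install location, taken from the example path above; adjust as needed.
BLAST_BIN="/Applications/ncbi-blast-2.2.26+/bin"

# Append the directory to PATH for the current terminal session only.
export PATH="$PATH:$BLAST_BIN"

# Report whether the shell can now find blastp.
if command -v blastp >/dev/null 2>&1; then
    echo "blastp found at: $(command -v blastp)"
else
    echo "blastp not found; check that BLAST_BIN points to the bin directory"
fi
```

With this in place, the programs can be called as blastp instead of ./blastp from any directory. The change lasts only for the current terminal session.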

To find the options that will be useful for your search, read the BLAST+ user manual and the command-line manual. To learn more about the meaning of the output statistics, read this NCBI tutorial. Also consult the help menu of the program you want to use. For instance:

./blastp -help

Read all the options carefully, then launch the command. For instance, if the objective is to find, among the amino acid sequences from Aspergillus fumigatus in the non-redundant (nr) database, those that are similar to a set of 32 different Schizosaccharomyces pombe proteins saved in a FASTA-format text file, the command may be:

./blastp -evalue 1e-20 -max_target_seqs 1 -db nr -query /Users/yourusername/Documents/Folder1/Folder1_1/Folder1_1_1/Schizosaccharomyces_pombe.fasta -entrez_query "Aspergillus fumigatus[Organism]" -out /Users/yourusername/Documents/Folder1/Folder1_1/Folder1_1_1/Aspergillus_fumigatus_blast.output.txt -remote

If the output should contain only the query name, the database sequence name, the e-value, and the number of identical sites, the following option can be added at the end of the command:

-outfmt '6 qseqid sseqid evalue nident'

Combined into a single command:

./blastp -evalue 1e-20 -max_target_seqs 1 -db nr -query /Users/yourusername/Documents/Folder1/Folder1_1/Folder1_1_1/Schizosaccharomyces_pombe.fasta -entrez_query "Aspergillus fumigatus[Organism]" -out /Users/yourusername/Documents/Folder1/Folder1_1/Folder1_1_1/Aspergillus_fumigatus_blast.output.txt -remote -outfmt '6 qseqid sseqid evalue nident'
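
When the same query file has to be compared against many species, as in the project described above, the command can be generated in a loop. The sketch below only prints the commands (a dry run) so that they can be checked before being launched. The species list, the QUERY file name, the OUTDIR folder and the function name are hypothetical placeholders; the options are the ones used in the command above.

```shell
# Hypothetical placeholders: adjust to your own query file and output folder.
QUERY="Schizosaccharomyces_pombe.fasta"
OUTDIR="blast_output"

# Print (not run) one blastp command per species name read from standard input.
print_blast_commands() {
    while IFS= read -r species; do
        # Skip empty lines in the species list.
        [ -z "$species" ] && continue
        # Replace spaces with underscores to build the output file name.
        prefix=$(printf '%s' "$species" | tr ' ' '_')
        printf '%s\n' "./blastp -evalue 1e-20 -max_target_seqs 1 -db nr -query $QUERY -entrez_query \"${species}[Organism]\" -out $OUTDIR/${prefix}_blast.output.txt -remote -outfmt '6 qseqid sseqid evalue nident'"
    done
}

# Example species list, one name per line (illustrative only).
commands=$(printf '%s\n' "Aspergillus fumigatus" "Neurospora crassa" | print_blast_commands)
printf '%s\n' "$commands"
```

Once the printed commands look right, they can be saved to a file and run from the folder that contains the BLAST programs. In practice the species names would more conveniently be read from a text file than typed in the script.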

Here, we needed to extract only the gi number, which would be used to retrieve the sequence from NCBI Batch Entrez. The following command extracts the second field (-f 2), where fields are delimited by the "|" character (-d "|"), into a new text file named Aspergillus_fumigatus_blast.output.list.onlygi.txt in the ../onlygi folder, which must already exist.

cut -f 2 -d "|" Aspergillus_fumigatus_blast.output.txt > ../onlygi/Aspergillus_fumigatus_blast.output.list.onlygi.txt
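
To see what this command does, it can be tried on a single line of tabular output. The line below is made up for illustration (the query name, gi number and accession are invented), and the sketch assumes that the subject identifier has the form gi|number|database|accession| and that the query name contains no "|" character.

```shell
# Made-up line of '6 qseqid sseqid evalue nident' output (tab-separated).
sample_line=$(printf 'query_protein_1\tgi|123456789|ref|XP_000000.1|\t1e-50\t210')

# Splitting on "|": field 1 is "query_protein_1<TAB>gi", field 2 is the gi number.
gi=$(printf '%s\n' "$sample_line" | cut -f 2 -d "|")
echo "$gi"
```

If the query names themselves contain "|" characters, the second field will come from the query name instead, so it is worth checking a few lines of the result by eye.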

Another useful command

Saving comma- or tab-separated values files on Mac OS X 10.6.8 often introduced carriage-return end-of-line characters (^M), which are problematic when manipulating text files. The following command converts each ^M character into a standard newline, making the file readable by most programs.

tr '\r' '\n' < filename.txt > filename_nolineend.txt
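
The effect can be checked on a small test file before applying the command to real data. This is a minimal sketch: the file contents are invented, and the files are created in a temporary directory so that nothing existing is overwritten.

```shell
# Work in a temporary directory so no existing file is overwritten.
tmpdir=$(mktemp -d)

# Create a three-record test file whose records end with carriage returns (^M) only.
printf 'geneA,1\rgeneB,2\rgeneC,3\r' > "$tmpdir/filename.txt"

# Convert every carriage return into a newline.
tr '\r' '\n' < "$tmpdir/filename.txt" > "$tmpdir/filename_nolineend.txt"

# Count newline-terminated lines before and after the conversion.
lines_before=$(wc -l < "$tmpdir/filename.txt" | tr -d ' ')
lines_after=$(wc -l < "$tmpdir/filename_nolineend.txt" | tr -d ' ')
echo "lines before: $lines_before, lines after: $lines_after"
```

Before the conversion the whole file counts as zero lines, because it contains no newline character; afterwards it has three. Note that for files with Windows-style line endings (a carriage return followed by a newline), this conversion would produce an empty line after every record; in that case tr -d '\r' removes the carriage returns instead.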

Conclusion

These steps allow the sequences of the best hits for a large number of query sequences to be retrieved.