Difference between revisions of "RITA"

Revision as of 18:12, 4 March 2012

Rapid identification of taxonomic assignments (RITA)

This program runs several BLAST algorithms and combines their results with those produced by a composition-based Naive Bayes algorithm. Sequences that are classified early in the pipeline are not passed to later algorithms, which greatly reduces the total runtime.

This work has been submitted for publication:

MacDonald NJ, Parks DH, and Beiko RG. Rapid identification of taxonomic assignments. Submitted 2011.

If you have any questions or bug reports, please let us know at <beiko@cs.dal.ca>.

Web Server

For smaller datasets, taxonomic attributions can be obtained with the RITA web server.

License

RITA is released under the Creative Commons Share-Alike Attribution 3.0 License.

Downloads

RITA v1.0.0 RITA source code

Rank-specific Setup

Prerequisites: You must have BLAST+ 2.2.21 or higher installed.

Unzip RITA:

 > unzip RITA_v1_0_0.zip

Download, unzip, and run FCP install (FCP_install.py) with '--protein ncbi_genomes.faa' flag:

 > unzip FCP_1_0_3.zip -d ./FCP
 > cd FCP
 > python FCP_install.py --protein ncbi_genomes.faa

Concatenate the files in the training/sequences folder of FCP into a single input nucleotide file (e.g. ncbi_genomes.fna):

 > cat ./training/sequences/*.fasta > ncbi_genomes.fna

RITA is designed to run over multiple BLAST databases in order to reduce memory consumption. Therefore, you should split both the nucleotide and protein files into X pieces using the splitfasta.py script in the scripts directory (X can be 1 if you do not wish to split the files; we recommend X=10):

 > cd ../rita
 > mv ../FCP/ncbi_genomes.faa .
 > mv ../FCP/ncbi_genomes.fna . 
 > cd scripts
 > python splitfasta.py ../ncbi_genomes.fna 10
 > python splitfasta.py ../ncbi_genomes.faa 10
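
The splitting step can be pictured with the following minimal sketch. It is not the splitfasta.py shipped with RITA: the round-robin distribution of records and the function name split_fasta are assumptions made for illustration. Only the output naming (ncbi_genomes.p1.fna ... ncbi_genomes.pX.fna) is taken from the instructions on this page.

```python
import os
import tempfile


def split_fasta(path, parts):
    """Distribute FASTA records round-robin into `parts` files.

    Output names follow the pattern used on this page:
    ncbi_genomes.fna -> ncbi_genomes.p1.fna ... ncbi_genomes.pX.fna
    (the round-robin strategy itself is an assumption).
    """
    base, ext = os.path.splitext(path)
    out_paths = ["%s.p%d%s" % (base, i, ext) for i in range(1, parts + 1)]
    outputs = [open(p, "w") for p in out_paths]
    try:
        index = -1
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    index += 1  # a header line starts the next record
                if index >= 0:
                    outputs[index % parts].write(line)
    finally:
        for out in outputs:
            out.close()
    return out_paths


# Small demonstration on a throwaway file with five records.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "ncbi_genomes.fna")
with open(source, "w") as handle:
    for n in range(5):
        handle.write(">seq%d\nACGT\n" % n)

pieces = split_fasta(source, 2)
print([os.path.basename(p) for p in pieces])
```

Whatever strategy is used, keeping each record whole matters more than balancing the pieces exactly, because each piece becomes its own BLAST database.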

Create a nucleotide database with the BLAST+ makeblastdb tool for each input ncbi_genomes.p*.fna:

 > makeblastdb -in "ncbi_genomes.p1.fna" -dbtype nucl
 > makeblastdb -in "ncbi_genomes.p2.fna" -dbtype nucl
 > ...
 > makeblastdb -in "ncbi_genomes.pX.fna" -dbtype nucl

Create a protein database with the BLAST+ makeblastdb tool for each input ncbi_genomes.p*.faa:

 > makeblastdb -in "ncbi_genomes.p1.faa" -dbtype prot
 > makeblastdb -in "ncbi_genomes.p2.faa" -dbtype prot
 > ...
 > makeblastdb -in "ncbi_genomes.pX.faa" -dbtype prot
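
Rather than typing each makeblastdb call by hand, the 2X commands can be generated in a loop. The sketch below only builds and prints the command lines. The helper name makeblastdb_commands is hypothetical, and the file prefix and X=10 are taken from the example above.

```python
def makeblastdb_commands(prefix, parts):
    """Build one makeblastdb command per split file, nucleotide then protein."""
    commands = []
    for ext, dbtype in ((".fna", "nucl"), (".faa", "prot")):
        for i in range(1, parts + 1):
            fasta = "%s.p%d%s" % (prefix, i, ext)
            commands.append(["makeblastdb", "-in", fasta, "-dbtype", dbtype])
    return commands


commands = makeblastdb_commands("ncbi_genomes", 10)
for command in commands:
    print(" ".join(command))
    # To actually build the databases, pass each list to
    # subprocess.check_call(command) from the directory holding the files.
```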

Set BLASTDB_PARTS to X in globalconfig.cfg and the name of the database to <database_name>.p%%d (e.g. ncbi_genomes.p%%d.fna if this was your output BLAST database name). %%d is the placeholder for the database number identifier, 1..X.

Configure globalsettings.cfg appropriately for the above installation directories.
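
The doubled percent sign in the %%d placeholder suggests that the configuration file is read with Python's standard configparser, where %% is the escape for a literal %. That is an assumption, not something stated on this page. Under that assumption, the placeholder would expand roughly as follows. The section name [blast] and the key BLASTN_DB are invented for the sketch; only BLASTDB_PARTS comes from the instructions above.

```python
import configparser

# Hypothetical excerpt of the configuration file; the section name and the
# BLASTN_DB key are placeholders, only BLASTDB_PARTS is named on this page.
config_text = """
[blast]
BLASTDB_PARTS = 10
BLASTN_DB = ncbi_genomes.p%%d.fna
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

parts = parser.getint("blast", "BLASTDB_PARTS")
# configparser's default interpolation turns '%%' into a single '%'.
template = parser.get("blast", "BLASTN_DB")

# Substitute the database number identifier, 1..X.
database_names = [template % i for i in range(1, parts + 1)]
print(database_names[0], "...", database_names[-1])
```

If a single % were written in the file instead, configparser's default interpolation would reject the value, which is a likely reason for the doubled form.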

To use the UBLASTX classifier, you must also obtain a licensed copy of usearch and set up the configuration file appropriately.

Rank-specific example usage

 python rita.py --rank PHYLUM --pipeline NB_DCMEGABLAST,DCMEGABLAST_RATIO,NB_RATIO,NB_ML --query fragments.fasta --out results.txt

Note: Use the --jobid flag with a unique job identifier if running rita in parallel to ensure intermediate temporary files are not overwritten.
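
When several runs are launched side by side, one simple way to obtain a distinct --jobid for each is to generate a random identifier. The sketch below builds the command lines without running them. It assumes that any unique string is acceptable as a job identifier, and the input and output file names are placeholders.

```python
import uuid

PIPELINE = ("NB_DCMEGABLAST", "DCMEGABLAST_RATIO", "NB_RATIO", "NB_ML")


def rita_command(query, out, rank="PHYLUM", pipeline=PIPELINE):
    """Build a rita.py command line with its own random job identifier."""
    job_id = uuid.uuid4().hex  # assumed format: any unique string
    return [
        "python", "rita.py",
        "--jobid", job_id,
        "--rank", rank,
        "--pipeline", ",".join(pipeline),
        "--query", query,
        "--out", out,
    ]


# One command per input chunk; the file names are illustrative only.
commands = [
    rita_command("fragments_%d.fasta" % n, "results_%d.txt" % n)
    for n in range(1, 4)
]
for command in commands:
    print(" ".join(command))
```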

The above command classifies the fragments in 'fragments.fasta' at the rank of PHYLUM, using a pipeline that starts with the consensus of Naive Bayes and DCMEGABLAST, then takes the most confident DCMEGABLAST results, then the most confident Naive Bayes results, and finally the maximum likelihood NB prediction. A fragment that has already been classified at an earlier step is not reconsidered at later steps, so the order of the pipeline matters. See the Pipeline Components section below.
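
The cascading behaviour can be illustrated with a small sketch. It is not RITA's implementation: the labellers here are toy lookup tables and run_pipeline is a hypothetical name. It only shows why a fragment labelled at an early step never reaches a later one.

```python
def run_pipeline(fragments, labellers):
    """Apply labellers in order; each one sees only unclassified fragments.

    `labellers` is a list of (name, function) pairs, where the function
    returns a label or None when it declines to classify a fragment.
    """
    labels = {}
    remaining = list(fragments)
    for name, labeller in labellers:
        if not remaining:
            break  # everything is classified, later steps are skipped
        still_unlabelled = []
        for fragment in remaining:
            label = labeller(fragment)
            if label is None:
                still_unlabelled.append(fragment)
            else:
                labels[fragment] = (label, name)
        remaining = still_unlabelled
    return labels, remaining


# Toy labellers backed by dictionaries (dict.get returns None when absent).
consensus = {"frag1": "Proteobacteria"}.get
blast_ratio = {"frag1": "Firmicutes", "frag2": "Firmicutes"}.get
nb_ml = {"frag1": "Actinobacteria", "frag2": "Actinobacteria",
         "frag3": "Actinobacteria"}.get

labels, unclassified = run_pipeline(
    ["frag1", "frag2", "frag3", "frag4"],
    [("NB_DCMEGABLAST", consensus),
     ("DCMEGABLAST_RATIO", blast_ratio),
     ("NB_ML", nb_ml)],
)
print(labels)
print(unclassified)
```

In this toy run frag1 keeps the label from the first step even though later steps would have assigned a different one, which is the sense in which the order matters.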

Rank-flexible setup

Prerequisites:

Install BioPython (needed for tree manipulation).
Install MOTHUR for 16S DNA alignments.
Install FastTree for building 16S trees.

Follow the instructions for rank-specific RITA installation above.

Update globalsettings.cfg appropriately for the installation paths of MOTHUR and FastTree.

You must provide a 16S sequence for each genome in the database. To do this, create a 16S database from a reliable source (e.g. RDP), and BLAST all of the complete genomes against it to find the positions of all 16S matches within the genomes. Next, use the get16s.py script to extract a single 16S sequence from each genome based on the best BLAST match.
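
The idea of keeping one 16S region per genome based on the best BLAST match can be sketched as below. This is not get16s.py. It assumes the BLAST results were written in tabular format (-outfmt 6) with the genomes as queries, and it treats the hit with the highest bit score as the best match. The function name and the sample rows are invented for illustration.

```python
import csv
import io


def best_16s_hits(blast_tabular):
    """Return {genome: (start, end, bitscore)} for the top-scoring hit.

    Expects the 12 default columns of BLAST tabular output: qseqid, sseqid,
    pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue,
    bitscore. Query coordinates locate the 16S region within the genome.
    """
    best = {}
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        genome = row[0]
        start, end, bitscore = int(row[6]), int(row[7]), float(row[11])
        if genome not in best or bitscore > best[genome][2]:
            best[genome] = (start, end, bitscore)
    return best


# Invented example rows: two hits for genomeA, one for genomeB.
sample = io.StringIO(
    "genomeA\trdp_1\t99.0\t1500\t15\t0\t1000\t2499\t1\t1500\t0.0\t2700\n"
    "genomeA\trdp_2\t95.0\t1500\t75\t0\t50000\t51499\t1\t1500\t0.0\t2300\n"
    "genomeB\trdp_1\t97.0\t1480\t44\t0\t7000\t8479\t1\t1480\t0.0\t2500\n"
)
hits = best_16s_hits(sample)
print(hits)
```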

Once you have a file with one 16S sequence per genome, build an alignment using MOTHUR and update the MOTHUR_16S_ALIGNMENT setting. You should now be able to use rank-flexible RITA.

The same labellers apply for the rank-flexible case.

Rank-flexible example usage

To run rank-flexible RITA, you must first generate a proxy file:

 python rita.py --buildproxy <sample_16s_fragments.fasta> --out proxy.txt

Then run rank-flexible RITA in the same way as rank-specific RITA, but specify the proxy file and set the rank to FLEXIBLE:

 python rita.py --proxy proxy.txt --rank FLEXIBLE --pipeline NB_DCMEGABLAST,DCMEGABLAST_RATIO,NB_RATIO,NB_ML --query fragments.fasta --out results.txt

For more information on how rank-flexible RITA works, please see the publication.

Pipeline Components

Included pipeline components (labellers) (specify with --pipeline A,B,C,...)

 NB_DCMEGABLAST     - labels fragments for which NB and DCMEGABLAST agree at rank X
 NB_BLASTN          - labels fragments for which NB and BLASTN agree at rank X
 NB_BLASTX          - labels fragments for which NB and BLASTX agree at rank X
 DCMEGABLAST_RATIO  - labels fragments where the best DCMEGABLAST match evalue is at least Y times greater than the next best
 BLASTN_RATIO       - labels fragments where the best BLASTN match evalue is at least Y times greater than the next best
 BLASTX_RATIO       - labels fragments where the best BLASTX match evalue is at least Y times greater than the next best
 NB_RATIO           - labels fragments where the best NB match likelihood is at least Y times greater than the next best
 NB_ML              - labels fragments with the best NB match (if there are no ties)
 NULL_LABELLER      - labels all remaining fragments with NONE (this should only ever be the last step in the pipeline).
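
The ratio and maximum-likelihood rules can be made concrete with a short sketch. It follows the wording of NB_RATIO and NB_ML above but is not RITA's code. It assumes plain (not log-transformed) likelihoods where higher is better, and it assumes a fragment with a single candidate taxon passes the ratio test. The function names are hypothetical.

```python
def ratio_label(scores, y):
    """Return the best taxon if its score is at least `y` times the runner-up.

    `scores` maps taxon -> likelihood (higher is better). Returns None when
    the best match is not sufficiently ahead of the next best.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]  # assumption: no competitor means confident
    (best_taxon, best), (_, runner_up) = ranked[0], ranked[1]
    return best_taxon if best >= y * runner_up else None


def ml_label(scores):
    """Return the single highest-scoring taxon, or None if the top is tied."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


confident = ratio_label({"Proteobacteria": 0.9, "Firmicutes": 0.05}, 10)
ambiguous = ratio_label({"Proteobacteria": 0.6, "Firmicutes": 0.4}, 10)
fallback = ml_label({"Proteobacteria": 0.6, "Firmicutes": 0.4})
tied = ml_label({"Proteobacteria": 0.5, "Firmicutes": 0.5})
print(confident, ambiguous, fallback, tied)
```

Placing a ratio labeller before NB_ML, as in the example pipeline, lets the stricter rule claim the clear cases first and leaves only the ambiguous fragments to the maximum-likelihood fallback.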

Contact Information

Suggestions, comments, and bug reports can be sent to Rob Beiko (beiko [at] cs.dal.ca). If reporting a bug, please provide as much information as possible and a simplified version of the data set which causes the bug. This will allow us to quickly resolve the issue.

Funding

The development and deployment of RITA has been supported by several organizations:

Genome Atlantic
The Dalhousie Centre for Comparative Genomics and Evolutionary Bioinformatics, and the Tula Foundation
The Natural Sciences and Engineering Research Council of Canada
The Dalhousie Faculty of Computer Science


@@ Line 27: / Line 27: @@
 * Unzip RITA:
    > unzip RITA_v1_0_0.zip
-  > cd rita
 * Download, unzip, and run [http://kiwi.cs.dal.ca/Software/FCP FCP] install (FCP_install.py) with '--protein ncbi_genomes.faa' flag:
@@ Line 34: / Line 33: @@
    > python FCP_install.py --protein ncbi_genomes.faa
-* Concatenate the files in the training/species folder of FCP into a single input nucleotide file (e.g. ncbi_genomes.fna):
+* Concatenate the files in the training/sequences folder of FCP into a single input nucleotide file (e.g. ncbi_genomes.fna):
-   > cat ./training/species/*.fna > ncbi_genomes.fna
+   > cat ./training/sequences/*.fasta > ncbi_genomes.fna
-* RITA is designed to run over multiple BLAST databases in order to reduce memory consumption. Therefore you should split both the nucleotide and protein files into X pieces using the splitfasta.py script in the scripts directory (X can be 1 if you do not wish to split the files):
+* RITA is designed to run over multiple BLAST databases in order to reduce memory consumption. Therefore, you should split both the nucleotide and protein files into X pieces using the splitfasta.py script in the
-   > python splitfasta.py ncbi_genomes.fna 10
+scripts directory (X can be 1 if you do not wish to split the files, we recommend X=10):
-   > python splitfasta.py ncbi_genomes.faa 10
+  > cd ../rita
+  > mv ../FCP/ncbi_genomes.faa .
+  > mv ../FCP/ncbi_genomes.fna .
+  > cd scripts
+   > python splitfasta.py ../ncbi_genomes.fna 10
+   > python splitfasta.py ../ncbi_genomes.faa 10
 * Create a nucleotide database with makeblastdb BLAST+ for each input ncbi_genomes.p*.fna:
-   >
+   > makeblastdb -in "ncbi_genomes.p1.fna" -dbtype nucl
+  > makeblastdb -in "ncbi_genomes.p2.fna" -dbtype nucl
+  > ...
+  > makeblastdb -in "ncbi_genomes.pX.fna" -dbtype nucl
 * Create a protein database with makeblastdb BLAST+ for each input ncbi_genomes.p*.faa.
-   >
+   > makeblastdb -in "ncbi_genomes.p1.faa" -dbtype prot
+  > makeblastdb -in "ncbi_genomes.p2.faa" -dbtype prot
+  > ...
+  > makeblastdb -in "ncbi_genomes.pX.faa" -dbtype prot
 * Set BLASTDB_PARTS to X in globalconfig.cfg and the name of the database to
-<database_name>.p%%d  (e.g. ncbi_genomes.nucleotide.p%%d.blast_db if this was your output BLAST database name)
+<database_name>.p%%d  (e.g. ncbi_genomes.p%%d.fna if this was your output BLAST database name)
 %%d is the placeholder for the database number identifier, 1..X.
@@ Line 64: / Line 74: @@
 consensus of Naive Bayes and DCMEGABLAST, then the most confident DCMEGABLAST results, then the most confident Naive Bayes results
 and finally with the maximum likelihood NB prediction.  Note that fragments are not attempted to be classified at
-a given step in the pipeline if they have already been classified at an earlier step (the order matters).
+a given step in the pipeline if they have already been classified at an earlier step (the order matters). See the Pipeline Components
+section below.
 == Rank-flexible setup ==
@@ Line 71: / Line 82: @@
 Install MOTHUR for 16S DNA alignments
 Install FastTree for building 16S trees
 Follow the instructions for rank-specific RITA installation above.