RITA

Latest revision as of 11:18, 4 April 2012

Overview of RITA

RITA is a standalone software package and Web server for taxonomic assignment of metagenomic sequence reads. By combining homology predictions from BLAST or UBLAST with compositional classifications from a Naive Bayes classifier, RITA is able to achieve very high accuracy on short reads. Unlike other hybrid approaches which combine these predictions for all sequences to be classified, RITA uses a pipeline to first identify cases where both types of classifier are in agreement, which constitute the highest-confidence set. Sequences not classified in this manner are subjected to a series of downstream classification steps.

This work has been accepted for publication:

MacDonald NJ, Parks DH, and Beiko RG. Rapid identification of taxonomic assignments. Accepted to Nucleic Acids Research April 4, 2012.

If you have any questions or bug reports, please let us know at <beiko@cs.dal.ca>.

Web Server

For smaller datasets, taxonomic attributions can be obtained with the RITA web server.

License

RITA is released under the Creative Commons Share-Alike Attribution 3.0 License.

Downloads

  • RITA v1.0.1 (RITA_v1_0_1.zip): RITA source code

Rank-specific Setup

Prerequisites: You must have BLAST+ 2.2.21 or higher installed (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/).

  • These instructions assume the following directory structure:
/home/<user>/RITA/
  FastTree/
  FCP/
  mothur/
  rita/

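which will contain installations of FastTree (http://www.microbesonline.org/fasttree/), FCP (http://kiwi.cs.dal.ca/Software/FCP), mothur (http://www.mothur.org/), and RITA, respectively.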

  • Unzip RITA:
 > unzip RITA_v1_0_1.zip
  • Download and unzip FCP, then run its installer (FCP_install.py) with the '--protein ncbi_genomes.faa' flag:
 > unzip FCP_1_0_3.zip -d ./FCP
 > cd FCP
 > python FCP_install.py --protein ncbi_genomes.faa
  • Concatenate the files in the training/sequences folder of FCP into a single input nucleotide file (e.g. ncbi_genomes.fna):
 > cat ./training/sequences/*.fasta > ncbi_genomes.fna
  • RITA is designed to run over multiple BLAST databases in order to reduce memory consumption. Therefore, you should split both the nucleotide and protein files into X pieces using the splitfasta.py script in the scripts directory (X can be 1 if you do not wish to split the files; we recommend X=10):

 > cd ../rita
 > mv ../FCP/ncbi_genomes.faa .
 > mv ../FCP/ncbi_genomes.fna . 
 > cd scripts
 > python splitfasta.py ../ncbi_genomes.fna 10
 > python splitfasta.py ../ncbi_genomes.faa 10
  • Create a nucleotide database with the BLAST+ makeblastdb tool for each input ncbi_genomes.p*.fna:
 > makeblastdb -in "ncbi_genomes.p1.fna" -dbtype nucl
 > makeblastdb -in "ncbi_genomes.p2.fna" -dbtype nucl
 > ...
 > makeblastdb -in "ncbi_genomes.pX.fna" -dbtype nucl
  • Create a protein database with the BLAST+ makeblastdb tool for each input ncbi_genomes.p*.faa:
 > makeblastdb -in "ncbi_genomes.p1.faa" -dbtype prot
 > makeblastdb -in "ncbi_genomes.p2.faa" -dbtype prot
 > ...
 > makeblastdb -in "ncbi_genomes.pX.faa" -dbtype prot
  • Set BLASTDB_PARTS to X in globalconfig.cfg and set the name of the database to <database_name>.p%%d (e.g. ncbi_genomes.p%%d.fna if this was your output BLAST database name), where %%d is the placeholder for the database number identifier, 1..X.

  • Configure globalsettings.cfg appropriately for the above installation directories.
  • To use the UBLASTX classifier, you must also obtain a licensed copy of usearch and set up the configuration file appropriately.
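The per-part makeblastdb calls above can be generated rather than typed by hand. The following is a minimal sketch, assuming the split files follow the ncbi_genomes.pN.fna / ncbi_genomes.pN.faa naming shown above; the helper names are hypothetical and are not part of RITA. It also assumes that the doubled percent sign in the configuration file is an escaped percent sign (as in Python's configparser), so the pattern is written here with a single %d.

```python
import shlex


def makeblastdb_commands(prefix, extension, dbtype, parts):
    """Return one makeblastdb argument list per split file, numbered 1..parts.

    Hypothetical helper: assumes files are named <prefix>.p<N>.<extension>.
    """
    commands = []
    for part in range(1, parts + 1):
        fasta = "%s.p%d.%s" % (prefix, part, extension)
        commands.append(["makeblastdb", "-in", fasta, "-dbtype", dbtype])
    return commands


def expand_database_names(pattern, parts):
    """Expand a pattern containing %d into the database names for parts 1..parts."""
    return [pattern % part for part in range(1, parts + 1)]


if __name__ == "__main__":
    parts = 10  # the value of X recommended above
    commands = (makeblastdb_commands("ncbi_genomes", "fna", "nucl", parts)
                + makeblastdb_commands("ncbi_genomes", "faa", "prot", parts))
    # Print the commands for review; each list could instead be passed to
    # subprocess.run(cmd, check=True) from the directory holding the split files.
    for cmd in commands:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands first makes it easy to check the file names against the actual output of splitfasta.py before any database is built.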

Rank-specific example usage

 python rita.py --rank PHYLUM --pipeline NB_DCMEGABLAST,DCMEGABLAST_RATIO,NB_RATIO,NB_ML --query fragments.fasta --out results.txt

Note: If running RITA in parallel, use the --jobid flag with a unique job identifier to ensure intermediate temporary files are not overwritten.

The above command will classify the fragments in 'fragments.fasta' at the rank of PHYLUM, using a pipeline that starts with the consensus of Naive Bayes and DCMEGABLAST, then the most confident DCMEGABLAST results, then the most confident Naive Bayes results, and finally the maximum likelihood NB prediction. Note that a fragment already classified at an earlier step is not passed to later steps, so the order of the components matters. See the Pipeline Components section below.
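The ordering rule described above can be illustrated with a small sketch. This is not RITA's implementation: the function name, the (name, labeller) pairs, and the toy labellers are assumptions made purely for illustration. Each labeller returns a label or None, and a fragment is only offered to a step if every earlier step left it unlabelled.

```python
def run_pipeline(fragments, labellers):
    """Apply labellers in order; each one only sees still-unlabelled fragments.

    labellers is a list of (step_name, function) pairs, where the function
    returns a label for a fragment or None if it cannot classify it.
    Returns (labels, remaining): labels maps fragment -> (label, step_name),
    remaining lists the fragments no step could classify.
    """
    labels = {}
    remaining = list(fragments)
    for step_name, labeller in labellers:
        still_unlabelled = []
        for fragment in remaining:
            label = labeller(fragment)
            if label is None:
                still_unlabelled.append(fragment)
            else:
                labels[fragment] = (label, step_name)
        remaining = still_unlabelled
    return labels, remaining


if __name__ == "__main__":
    # Toy stand-ins for real classifiers (hypothetical data).
    consensus = {"read1": "Proteobacteria"}.get
    homology_only = {"read1": "Firmicutes", "read2": "Firmicutes"}.get
    labels, unclassified = run_pipeline(
        ["read1", "read2", "read3"],
        [("STEP_A", consensus), ("STEP_B", homology_only)],
    )
    print(labels)
    print(unclassified)
```

In the toy run, read1 keeps the label from the first step even though the second step would have labelled it differently, which is why the order of the pipeline components matters.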

Rank-flexible setup

Prerequisites:

  • Install BioPython (http://biopython.org/wiki/Biopython), which is needed for tree manipulation.
  • Install MOTHUR for 16S DNA alignments.
  • Install FastTree for building 16S trees.

To configure RITA for rank-flexible classifications, follow these steps:

  • Follow the instructions for a rank-specific RITA installation given above.
  • Update the MOTHUR and FastTree installation paths in globalsettings.cfg.
  • Build a trusted BLAST database of 16S sequences. We recommend using the hand-curated sequences from RDP (http://rdp.cme.msu.edu/).
    • Download and extract the unaligned Bacteria and Archaea sequences from RDP (http://rdp.cme.msu.edu/download/release10_28_unaligned.fa.gz) into a directory called RDP:
 > gunzip release10_28_unaligned.fa.gz
    • Create the BLAST database:
 > makeblastdb -in "release10_28_unaligned.fa" -dbtype nucl
  • From the rita directory, BLAST the complete genomes against the 16S database:
 > blastn -query ncbi_genomes.fna -db ../RDP/release10_28_unaligned.fa -out ncbi_genomes_16S.blast.txt -evalue 1e-10 -outfmt 6
  • Use the get16s.py script to extract a single 16S sequence from each genome based on the best BLAST match:
 > cd ./scripts
 > python get16s.py ../ncbi_genomes_16S.blast.txt ../../FCP/training/sequences ../../FCP/taxonomy.txt
 > mv sequences_of_16s.fasta ../
  • The above script will produce the file sequences_of_16s.fasta, which must be aligned. This can be done with MOTHUR using the following commands:
 mothur > set.dir(input=../rita)
 mothur > set.dir(output=../rita)
 mothur > align.seqs(candidate=sequences_of_16s.fasta, template=core_set_aligned.imputed.fasta, flip=t)
 mothur > quit()
  • Place a copy of the 1349 character Lane Mask (http://www.mothur.org/wiki/Lane_mask) in your mothur directory.
    • Note: core_set_aligned.imputed.fasta can be obtained from the MOTHUR wiki (http://www.mothur.org/wiki/Greengenes-formatted_databases).
  • Update the MOTHUR_16S_ALIGNMENT setting in globalsettings.cfg to point to the file sequences_of_16s.align which will be in the rita directory.

You are now ready to use rank-flexible RITA.
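The idea behind the 16S extraction step, choosing the best BLAST match for each genome sequence, can be sketched as follows. This is only an illustration and not the code of get16s.py, whose actual selection criteria are not described here; the function name is hypothetical. It relies on the standard column order of BLAST+ tabular output (-outfmt 6): qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.

```python
def best_hits(blast_lines):
    """Return {query_id: (subject_id, qstart, qend, bitscore)} per query.

    Keeps the hit with the highest bit score for each query sequence.
    Assumes default BLAST+ -outfmt 6 columns (tab-separated, 12 fields).
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        bitscore = float(fields[11])
        if query_id not in best or bitscore > best[query_id][3]:
            best[query_id] = (subject_id, qstart, qend, bitscore)
    return best


if __name__ == "__main__":
    # File name taken from the blastn command above.
    with open("ncbi_genomes_16S.blast.txt") as handle:
        for query_id, hit in best_hits(handle).items():
            print(query_id, *hit)
```

Because the genomes are the queries in the blastn command above, qstart and qend give the position of the candidate 16S region within each genome sequence.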

Rank-flexible example usage

To run rank-flexible RITA, you must first generate a proxy file for the 16S sequences contained in your sample:

 python rita.py --buildproxy <sample_16s_fragments.fasta> --out proxy.txt

Then run rank-flexible RITA in the same way as rank-specific RITA, but specify the proxy file and the rank as FLEXIBLE:

 python rita.py --proxy proxy.txt --rank FLEXIBLE --pipeline NB_DCMEGABLAST,DCMEGABLAST_RATIO,NB_RATIO,NB_ML --query fragments.fasta --out results.txt

For more information on how rank-flexible RITA works, please see the publication.
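When many samples are processed, the two commands above can be assembled programmatically. The sketch below only builds argument lists from the flags documented on this page; the helper names are hypothetical, and running the commands (for example with subprocess.run from the rita directory) is left to the caller.

```python
def buildproxy_command(sample_16s_fasta, proxy_out):
    """Argument list for generating a proxy file from a sample's 16S fragments."""
    return ["python", "rita.py", "--buildproxy", sample_16s_fasta, "--out", proxy_out]


def rita_command(query, out, rank, pipeline, proxy=None, jobid=None):
    """Argument list for a RITA classification run.

    pipeline is a list of component names, joined with commas as --pipeline expects.
    proxy is only needed for rank-flexible runs; jobid is optional and is useful
    when several runs execute in parallel.
    """
    args = ["python", "rita.py"]
    if proxy is not None:
        args += ["--proxy", proxy]
    args += ["--rank", rank, "--pipeline", ",".join(pipeline),
             "--query", query, "--out", out]
    if jobid is not None:
        args += ["--jobid", str(jobid)]
    return args


if __name__ == "__main__":
    steps = ["NB_DCMEGABLAST", "DCMEGABLAST_RATIO", "NB_RATIO", "NB_ML"]
    print(" ".join(buildproxy_command("sample_16s_fragments.fasta", "proxy.txt")))
    print(" ".join(rita_command("fragments.fasta", "results.txt", "FLEXIBLE",
                                steps, proxy="proxy.txt")))
```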

Pipeline Components

Included pipeline components (labellers) (specify with --pipeline A,B,C,...)

  NB_DCMEGABLAST     - labels fragments that agree at rank X for NB and DCMEGABLAST
  NB_BLASTN          - labels fragments that agree at rank X for NB and BLASTN
  NB_BLASTX          - labels fragments that agree at rank X for NB and BLASTX
  NB_UBLASTX         - labels fragments that agree at rank X for NB and UBLASTX
  
  DCMEGABLAST_RATIO  - labels fragments where the best DCMEGABLAST match E-value is at least Y times better than the next best
  BLASTN_RATIO       - labels fragments where the best BLASTN match E-value is at least Y times better than the next best
  BLASTX_RATIO       - labels fragments where the best BLASTX match E-value is at least Y times better than the next best
  UBLASTX_RATIO      - labels fragments where the best UBLASTX match E-value is at least Y times better than the next best
  NB_RATIO           - labels fragments where the best NB match likelihood is at least Y times greater than the next best
  
  NB_ML              - labels fragments with the best NB match (if there are no ties)
  NULL_LABELLER      - labels all remaining fragments with NONE (this should only ever be the last step in the pipeline).
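The ratio labellers listed above are not defined precisely on this page, so the following is one possible reading rather than RITA's actual logic. It assumes that for E-values a smaller value is the better match, that NB likelihoods are compared as plain (not log-transformed) values, and that exact ties never pass; the function names are hypothetical.

```python
def evalue_ratio_ok(best_evalue, next_evalue, ratio):
    """True if the best E-value is at least `ratio` times smaller than the next best.

    Assumption: lower E-values indicate better matches, so the next-best
    E-value must be at least ratio * best_evalue.
    """
    if best_evalue == next_evalue:
        return False  # a tie gives no basis for preferring one match
    return next_evalue >= ratio * best_evalue


def likelihood_ratio_ok(best_likelihood, next_likelihood, ratio):
    """True if the best likelihood is at least `ratio` times the next best.

    Assumption: likelihoods are plain probabilities, not log-likelihoods.
    """
    if best_likelihood == next_likelihood:
        return False
    return best_likelihood >= ratio * next_likelihood


if __name__ == "__main__":
    print(evalue_ratio_ok(1e-30, 1e-20, 1e5))
    print(likelihood_ratio_ok(0.9, 0.05, 10))
```

Under this reading, the thresholds Y would correspond to the --blastnratio, --dblastnratio, --blastxratio, --ublastxratio, and --nb_ratio parameters.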

RITA Parameters

rita.py accepts the following command-line parameters:

--help         Provides a description of accepted command-line parameters.

--pipeline     Specify the components of the pipeline.
--rank         Taxonomic rank to classify at.

--blastne      BLASTN E-value threshold.
--dblastne     Discontiguous MegaBLASTN E-value threshold.
--blastxe      BLASTX E-value threshold.
--ublastxe     UBLASTX (usearch) E-value threshold.

--blastnratio  BLASTN E-value ratio.
--dblastnratio Discontiguous MegaBLASTN E-value ratio.
--blastxratio  BLASTX E-value ratio.
--ublastxratio UBLASTX (usearch) E-value ratio.
--nb_ratio     NB Likelihood ratio.

--query        FASTA file with query sequences.
--out          Output filename.

--jobid        Specify a job number. Default is a random 4-digit identifier.
--buildproxy   Build a proxy for rank-flexible classifications with the provided 16S sequences.
--proxy        Proxy for rank-flexible classifications created with --buildproxy.

Contact Information

Suggestions, comments, and bug reports can be sent to Rob Beiko (beiko [at] cs.dal.ca). If reporting a bug, please provide as much information as possible and a simplified version of the data set which causes the bug. This will allow us to quickly resolve the issue.

Funding

The development and deployment of RITA has been supported by several organizations: