Express Beta Diversity (EBD)
============================
Taxon- and phylogenetic-based beta diversity measures.
-------------------------------------------------------------------------------
EBD is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
EBD is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with EBD. If not, see .
Installation:
-------------------------------------------------------------------------------
EBD is a command-line program written in C++. To install EBD, download
and uncompress it with the unzip command:
unzip EBD_1_0_4.zip
To compile EBD on OSX or Linux simply type 'make' from within the source
directory of EBD. The resulting executable will be in the bin directory.
A precompiled executables for Windows is provided in the bin directory.
Please note that under Windows, EBD must be run from the command-line
(i.e., the DOS prompt).
Program usage:
-------------------------------------------------------------------------------
Usage: EBD [OPTIONS]
Calculates taxon- and phylogenetic-basec beta diversity measures.
Options:
-h, --help Produce help message.
-l, --list-calc List all supported calculators.
-u, --unit-tests Execute unit tests.
-t, --tree-file Tree in Newick format (if phylogenetic beta-diversity is desired).
-s, --seq-count-file Sequence count file.
-p, --output-prefix Output prefix.
-g, --clustering Hierarchical clustering method: UPGMA, SingleLinkage, CompleteLinkage, NJ (default = UPGMA).
-j, --jackknife Number of jackknife replicates to perform (default = 0).
-d, --seqs-to-draw Number of sequence to draw for jackknife replicates.
-z, --sample-size Print number of sequences in each sample.
-c, --calculator Desired calculator (e.g., Bray-Curtis, Canberra).
-w, --weighted Indicated if sequence abundance data should be used.
-m, --mrca Apply 'MRCA weightings' to each branch (experimental).
-r, --strict-mrca Restrict calculator to MRCA subtree.
-y, --count Use count data as opposed to relative proportions.
-x, --max-data-vecs Maximum number of profiles (data vectors) to have in memory at once (default = 1000).
-a, --all Apply all calculators and cluster calculators at the specified threshold.
-b, --threshold Correlation threshold for clustering calculators (default = 0.8).
-o, --output-file Output file for cluster of calculators (default = clusters.txt).
-v, --verbose Provide additional information on program execution.
Example of applying a specific calculator:
./ExpressBetaDiversity -t input.tre -s seq.txt -p bray_curtis -c Bray-Curtis -w
which will result in two output files, the raw dissimilarity matrix in bray_curtis.diss
and a UPGMA hierarchical cluster tree in bray_curtis.tre.
Example of querying number of sequences in each sample:
./ExpressBetaDiversity -s seq.txt -z
which will result in the number of sequences in each sample being written to standard out.
Example of applying a specific calculator with jackknife replicates:
./ExpressBetaDiversity -t input.tre -s seq.txt -p bray_curtis -c Bray-Curtis -w -j 100 -d 500
which will result in two output files, the raw dissimilarity matrix in bray_curtis.diss
and a UPGMA hierarchical cluster tree in bray_curtis.tre with jackknife support values.
Example of applying all calculators and clustering these based on their Pearson correlation:
./ExpressBetaDiversity -t input.tre -s seq.txt -a -b 0.9 -o clusters.txt
which will result in the output file clusters.txt (see file format below).
Verifying software installation:
-------------------------------------------------------------------------------
A set of unit tests is included to verify proper installation of the EBD
software. The unit tests can be run with:
./ExpressBetaDiversity -u
The software should not be used if any of the unit tests fail.
Input file formats:
-------------------------------------------------------------------------------
EBD uses Newick formatted trees as input. Information on this tree format can
be found at: http://evolution.genetics.washington.edu/phylip/newicktree.html.
Here is a simple Newick tree with three leaf nodes labelled A, B, and C:
(A:1,(B:1,C:1):1);
Taxon-based beta-diversity is calculated if an input tree is not specified.
Sequence count information must be specified as a tab-delimited table where
each row is a sample and each column is the name of a leaf node in the provided
tree. Data must be provided for all leaf nodes in the tree. Consider the
following example:
A B C
Sample1 1 2 3
Sample2 10 1 0
Sample3 0 0 1
The first row begins indicates each leaf node in the tree seperated by a tab.
Please note that this line MUST start with a tab. The number of sequences
associated with each leaf node is then indicated for each sample on a seperate
row. In this example, the first sample is labelled 'Sample1' and contains 1
instance of sequence/OTU A, 2 instances of B, and 3 instances of C. Sample3
contains only instances of C, but note that zeros must be specified for the
other sequence/OTU types.
Example input files are avaliable in the unit-tests directory.
Converting from QIIME/UniFrac file formats:
-------------------------------------------------------------------------------
The script convertToEBD.py in the scripts directory can be used to
convert sparse or dense UniFrac-style OTU tables into the format
required by EBD. The UniFrac format is used by many popular services
including the UniFrac web services and QIIME. EBD uses a different
input file format in order to efficently handle data sets consisting
of thousands of samples. The script can be run as follows:
./convertUniFracToEBD.py
Dissimilarity output file format:
-------------------------------------------------------------------------------
The resulting dissimilarity between samples is written as a tab-delimited,
lower-triangular dissimilarity matrix with the first line indicating the number
of samples. Consider the following output:
3
A
B 1
C 2 3
The first line indicates that there are 3 samples. The dissimilarity between
samples A and B is 1, A and C is 2, and B and C is 3.
Clustering output file format:
-------------------------------------------------------------------------------
The clustering file indicates clusters of calculators which are correlated. The
clustering threshold is specified by the user with the --threshold (-b)
parameter. All calculators in a cluster will be at least as correlated as the
specified threshold. Results are reported as follows:
Minimum r Calculators
[0.0] uChi-squared;
[0.86] Canberra;CS;uCanberra;uCS;uGower;uManhattan;
[0.91] uBray-Curtis;uSoergel;uKulczynski;
[0.81] Bray-Curtis;Kulczynski;Soergel;
...
Complete linkage cluster tree (branch lengths are 1 - Pearson's correlation):
((('Bray-Curtis':5.60596e-006,'Kulczynski':5.60596e-006):4.13975e-005 ...
The first line indicates the column headers. Each subsequent line indicates a
cluster of calculators. The number within the brackets indicates that minimum
Pearson's correlation between any pair of calculators in the cluster. A
semicolon seperated list indicates which calculators are in the cluster. The
last line of the file gives the complete linkage tree used to cluster measures.
This can be copied into a seperate file and visualized in any program which
can read a Newick tree file.
The dissimilarity matrix for calculator X is saved to the file 'X.cluster.diss'
within the same directory as the EBD executable.
Citing EBD:
-------------------------------------------------------------------------------
If you use EBD in your research, please cite:
Parks, D.H. and Beiko, R.G. 2013. Measures of phylogenetic differentiation provide
robust and complementary insights into microbial communities. ISME J, 7:173-83.
Contact Information:
-------------------------------------------------------------------------------
Donovan Parks
donovan.parks@gmail.com
Robert Beiko
beiko@cs.dal.ca
Program website: http://kiwi.cs.dal.ca/Software/EBD