RSPR README
From Bioinformatics Software
Jump to navigationJump to search
################################################################################ rspr ################################################################################ Usage: rspr [OPTIONS] Calculate approximate and exact Subtree Prune and Regraft (rSPR) distances and the associated maximum agreement forests (MAFs) between pairs of rooted binary trees from STDIN in newick format. Supports arbitrary labels. The second tree may be multifurcating. Can also compare the first input tree to each other tree with -total or compute a pairwise distance matrix with -pairwise. Copyright 2009-2014 Chris Whidden whidden@cs.dal.ca http://kiwi.cs.dal.ca/Software/RSPR April 29, 2014 Version 1.2.2 This file is part of rspr. rspr is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. rspr is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with rspr. If not, see <http://www.gnu.org/licenses/>. ******************************************************************************* ALGORITHM ******************************************************************************* These options control what algorithm is used -fpt Calculate the exact rSPR distance with an FPT algorithm -bb Calculate the exact rSPR distance with a branch-and-bound FPT algorithm. This is enabled by default. -approx Calculate just a linear-time 3-approximation of the rSPR distance -split_approx -split_approx x Calculate the exact rSPR distance if it is k or less and otherwise use the exponential-time approximation -cluster_test Use the cluster reduction to speed up the exact algorithm. This is enabled by default. -total Find the total SPR distance from the first input tree to the rest of the list of trees. Uses the other algorithm options as specified (including unrooted options). ******************************************************************************* OPTIMIZATIONS ******************************************************************************* These options control the use of optimized branching. All optimizations are enabled by default. Specifying any subset of -cob, -cab, and -sc will use just that subset of optimizations. -allopt Use -cob -cab -sc and a new set of improvements. This is the default option -noopt Use 3-way branching for all FPT algorithms -cob Use "cut one b" improved branching -cab Use "cut all b" improved branching -sc Use "separate components" improved branching ******************************************************************************* MULTIFURCATING COMPARISON OPTIONS ******************************************************************************* -support x Collapse bipartitions with less than x support ******************************************************************************* UNROOTED COMPARISON OPTIONS ******************************************************************************* -unrooted Compare the first input tree to each other input tree. Output the best found distance and agreement forest. This option can be used with gen_rooted_trees.pl to provide the rootings. Note that this option is a bit unintuitive to maintain compatibility with previous versions of rSPR. If -total or -pairwise analysis is used then there is no need to specify rootings. -unrooted_min_approx Compare the first input tree to each other input tree. Run the exact algorithms on the pair with the minimum approximate rspr distance -simple_unrooted Root the gene trees using a bipartition balanced accuracy measure (fast but potentially less accurate). Only used with -total. ******************************************************************************* PAIRWISE COMPARISON OPTIONS ******************************************************************************* -pairwise -pairwise a b -pairwise a b c d Compare each input tree to each other tree and output the resulting SPR distance matrix. If -unrooted is enabled this will compute the "best rooting" SPR distance by testing each rooting of the trees. The optional arguments a b c d compute only rows a-b and/or columns c-d of the matrix. -no-symmetric-pairwise By default, -pairwise will ignore the symmetric lower left triangle of the matrix. With this option the lower triangle is filled in. -pairwise_max x Use with -pairwise to only compute distances at most x. Larger values are output as -1. Very efficient for small distances (e.g. 1-10). ******************************************************************************* OTHER OPTIONS ******************************************************************************* -cc Calculate a potentially better approximation with a quadratic time algorithm -q Quiet; Do not output the input trees or approximation ******************************************************************************* Example: $ ./rspr.exe -fpt <test_trees/trees2.txt T1: ((((1,2),(3,4)),((5,6),(7,8))),(((9,10),(11,12)),((13,14),(15,16)))) T2: (((7,8),((1,(2,(14,5))),(3,4))),(((11,(6,12)),10),((13,(15,16)),9))) F1: (((1,2),(3,4)),(7,8)) 14 13 5 12 11 6 9 10 (15,16) F2: ((7,8),((1,2),(3,4))) 14 5 13 12 6 11 9 (15,16) 10 approx drSPR=9 3 4 F1: ((((1,2),(3,4)),(7,8)),((10,(11,12)),(13,(15,16)))) 14 6 9 5 F2: (((7,8),((1,2),(3,4))),(((11,12),10),(13,(15,16)))) 14 6 9 5 exact drSPR=4 ################################################################################ CONTACT INFORMATION Chris Whidden whidden@cs.dal.ca http://kiwi.cs.dal.ca/Software/RSPR ################################################################################ FILES ClusterForest.h Cluster Decomposition Forest.h Forest data structure gen_rooted_trees.pl Generate all rootings of an unrooted binary tree gpl.txt The GPL license LCA.h Compute LCAs of tree leaves Makefile Makefile Node.h Node data structure README.txt This README rspr.h Library to calculate rSPR distances between pairs of trees rspr.cpp Calculate rSPR distances between pairs or sets of trees test_trees/ Folder of test tree pairs SiblingPair.h Sibling Pair class ################################################################################ INSTALLATION rSPR is a command-line program written in C++. To use it, simply compile rspr.cpp and execute the resulting program. On systems with the g++ compiler and make program, the included make file will compile rspr; simply run `make'. ################################################################################ INPUT rSPR requires pairs of Newick format trees with arbitrary labels as input. The first tree must be binary and rooted. The second tree may be multifurcating and rooted. A sample Newick tree is shown below: ((1,2),(3,4),(5,6)); rSPR can also compare a rooted reference tree to an unrooted test tree. First use gen_rooted_trees.pl to generate all rootings of the unrooted test tree. Then use the -unrooted or -unrooted_min_approx options and input the test tree and the set of rootings. rSPR will find the best rooting of the test tree with the -unrooted option and guess the best rooting based on the approximation algorithm with the -unrooted_min_approx option. Alternatively, the -total option with the -unrooted or -unrooted_min_approx options will provide just the distance. The -total option with -simple_unrooted will use a faster biparition based measure to approximate the optimal rooting. The -support x option can be used to collapse poorly supported branches of the second tree. With the -pairwise option, rSPR will compare each pair of input trees and output the results as a distance matrix. To save time, only the upper right triangle is output as the lower left triangle is symmetric. Use the included fill_matrix program to fill in missing values or the -no-symmetric-pairwise option to explicitly compute these values. Optional arguments to -pairwise can be used to compute subsets of the matrix (e.g. for partitioning computation over multiple processes). The -pairwise_max x option can be used to quickly find trees with SPR distance at most x when x is small (e.g. 1-10). ################################################################################ OUTPUT rspr writes to standard output. A sample command line and output are shown below: ///////////////////// $ ./rspr < test_trees/trees2.txt T1: ((((1,2),(3,4)),((5,6),(7,8))),(((9,10),(11,12)),((13,14),(15,16)))) T2: ((((3,4),(8,(2,((11,12),1)))),((15,16),(7,(6,5)))),(14,((10,13),9))) F1: ((3,4),(5,6)) 13 14 10 (11,12) 9 1 8 7 2 (15,16) F2: ((3,4),(6,5)) 13 10 14 (11,12) 1 9 8 2 7 (15,16) approx drSPR=12 4 F1: ((((1,2),(3,4)),((5,6),7)),((9,10),14)) 13 (11,12) 8 (15,16) F2: ((((3,4),(2,1)),(7,(6,5))),(14,(10,9))) 13 (11,12) 8 (15,16) exact BB drSPR=4 ///////////////////// The first set of lines show the input trees. The second set of lines are the approximate agreement forests and the corresponding approximate rSPR distance. The third set of lines are the maximum agreement forests and the corresponding exact rSPR distance. When calculating exact distances, the distance currently being considered is printed on the first line of this section. Each component of an agreement forest corresponds to an rSPR operation. The set of rSPR operations required to turn one tree into the other can be found by applying rSPR operations that move these components to their correct place in the other tree. An agreement forest may contain p (rho) as a component. This represents the root of the trees and indicates that an extra rSPR operation is required to correctly root the tree. ################################################################################ OUTPUT WITH CLUSTERING ///////////////////// $ rspr < test_trees/cluster_test T1: (((x,((b1,b3),b2)),y),(f,(a,c))) T2: (((x,y),f),((a,((b1,b2),b3)),c)) F1: (((0,((1,2),3)),4),(5,(6,7))) F2: (((0,4),5),((6,((1,3),2)),7)) approx drSPR=9 CLUSTERS C1_1: ((1,2),3) C1_2: ((1,3),2) cluster approx drSPR=3 1 F1_1: (1,2) 3 F1_2: (1,2) 3 cluster exact drSPR=1 C2_1: (((0,(1,2)),4),(5,(6,7))) C2_2: (((0,4),5),((6,(1,2)),7)) cluster approx drSPR=6 2 F2_1: (5,(6,7)) (1,2) (0,4) F2_2: (5,(6,7)) (1,2) (0,4) cluster exact drSPR=2 F1: (f,(a,c)) b2 (b1,b3) (x,y) F2: (f,(a,c)) b2 (b1,b3) (x,y) total exact drSPR=3 ///////////////////// When clustering is enabled (as it is by default), each solved cluster is displayed along with its approximate and exact distance in an intermediate representation with labels mapped from 0-(N-1) where N is the number of labels. The final agreement forest and distance are output last. ################################################################################ OUTPUT WITH PAIRWISE $ cat test_trees/big_test* | rspr -pairwise 0,46,0,46 ,0,46,50 ,,0,46 ,,,0 cat test_trees/big_test* | rspr -pairwise | fill_matrix 0,46,0,46 46,0,46,50 0,46,0,46 46,50,46,0 ################################################################################ EFFICIENCY The 3-approximation algorithm runs in O(n) time, where n is the number of leaves in the trees. The unoptimized FPT and branch-and-bound algorithms run in O(3^k n) time, where k is the rSPR distance and n is the number of leaves in the trees. The branch-and-bound algorithm should be significantly faster in practice. Using all 3 of the -cob -cab and -sc optimizations improves the running times of the algorithms to O(2.42^k n) time. This provides a significant improvement in practice and is provably correct, thus this is the default. In addition, this version contains new improvements that give a bound of O(2^k n). This provides another significant improvement and is provably correct so these options are also enabled by default. For much larger trees, the -split_approx option will compute an exponential time approximation of the distance that is exact for small distances and generally within a few percent of the optimal distance otherwise. When using the -unrooted option, the exact algorithms run in O(2^k n^2) time. The cluster reduction improves the running time of the algorithm to O(2^k n) time where k is the largest rSPR distance of any cluster (as opposed to the full rSPR distance). This provides a large speedup when the trees are clusterable. With the -pairwise option on m rooted trees, the program takes O(m^2 2^k n) time, where k is the largest SPR distance computed. With -unrooted this becomes O(m^2 2^k n^3). The -pairwise_max x option limits k to x, but does not use clustering and is slow for large distances. - NOTE: This is an exponential algorithm that exactly solves an NP-hard problem. Thus the algorithms may not finish in a reasonable amount of time for large rSPR distances (> 20 without optimizations and > 70 with optimizations). ################################################################################ REFERENCES For more information on the algorithms see: Whidden, C., Zeh, N., Beiko, R.G. Fixed-Parameter and Approximation Algorithms for Maximum Agreement Forests of Multifurcating Trees. (Submitted). 2013. Preprint available at http://arxiv.org/abs/1305.0512 Whidden, C., Zeh, N., Beiko, R.G. Supertrees based on the subtree prune-and-regraft distance. To appear in Systematic Biology. 2014. Preprint available at https://peerj.com/preprints/18/ Whidden, C., Beiko, R.G., Zeh, N. Fixed-Parameter Algorithms for Maximum Agreement Forests. SIAM Journal on Computing 42.4 (2013), pp. 1431-1466. Available at http://epubs.siam.org/doi/abs/10.1137/110845045 Whidden, C. Efficient Computation of Maximum Agreement Forests and their Applications. PhD Thesis. Dalhousie University, Canada. 2013. Available at www.cs.dal.ca/~whidden Whidden, C., Beiko, R.G., Zeh, N. Fast FPT Algorithms for Computing Rooted Agreement Forests: Theory and Experiments. Experimental Algorithms. Ed. by P. Festa. Vol. 6049. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010, pp. 141-153. Available at http://link.springer.com/chapter/10.1007/978-3-642-13193-6_13 Whidden, C., Zeh, N. A Unifying View on Approximation and FPT of Agreement Forests. In: WABI 2009. LNCS, vol. 5724, pp. 390.401. Springer-Verlag (2009). Available at http://www.springerlink.com/content/n56q2846v645p655/ Whidden, C. A Unifying View on Approximation and FPT of Agreement Forests. Masters Thesis. Dalhousie University, Canada. 2009. Available at www.cs.dal.ca/~whidden ################################################################################ CITING rSPR If you use rSPR in your research, please cite: Whidden, C., Beiko, R.G., Zeh, N. Computing the SPR Distance of Binary Rooted Trees in O(2^k n) Time. (In Preparation). 2013. Whidden, C., Beiko, R.G. Zeh, N. Fixed-Parameter and Approximation Algorithms for Maximum Agreement Forests of Multifurcating Trees. (Submitted). 2013. Whidden, C., Zeh, N., Beiko, R.G. Supertrees based on the subtree prune-and-regraft distance. Syst. Biol. Advance Access published April 2, 2014, doi:10.1093/sysbio/syu023. Whidden, C., Beiko, R.G., Zeh, N. Fixed-Parameter Algorithms for Maximum Agreement Forests. SIAM Journal on Computing 42.4 (2013), pp. 1431-1466. Available at http://epubs.siam.org/doi/abs/10.1137/110845045 Whidden, C., Beiko, R.G., Zeh, N. Fast FPT Algorithms for Computing Rooted Agreement Forests: Theory and Experiments. Experimental Algorithms. Ed. by P. Festa. Vol. 6049. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010, pp. 141-153. Available at http://link.springer.com/chapter/10.1007/978-3-642-13193-6_13 Whidden, C., Zeh, N. A Unifying View on Approximation and FPT of Agreement Forests. In: WABI 2009. LNCS, vol. 5724, pp. 390.401. Springer-Verlag (2009). ################################################################################