Difference between revisions of "SPR Supertrees README"

From Bioinformatics Software
Jump to navigationJump to search
(Created page with "<pre> ################################################################################ rspr ################################################################################ ...")
 
Line 45: Line 45:
 
             FPT algorithm. This is enabled by default.
 
             FPT algorithm. This is enabled by default.
  
-approx         Calculate just a linear-time 3-approximation of the rSPR distance
+
-approx     Calculate just a linear-time 3-approximation of the rSPR distance
  
 
-split_approx
 
-split_approx
 
-split_approx x  Calculate the exact rSPR distance if it is k or less and
 
-split_approx x  Calculate the exact rSPR distance if it is k or less and
        otherwise use the exponential-time approximation
+
                otherwise use the exponential-time approximation
  
 
-cluster_test  Use the cluster reduction to speed up the exact algorithm.
 
-cluster_test  Use the cluster reduction to speed up the exact algorithm.
Line 56: Line 56:
 
-total          Find the total SPR distance from the first input tree to
 
-total          Find the total SPR distance from the first input tree to
 
                 the rest of the list of trees. Uses the other algorithm
 
                 the rest of the list of trees. Uses the other algorithm
options as specified (including unrooted options).
+
                options as specified (including unrooted options).
  
 
*******************************************************************************
 
*******************************************************************************
Line 66: Line 66:
 
just that subset of optimizations.
 
just that subset of optimizations.
  
-allopt Use -cob -cab -sc and a new set of improvements. This is the
+
-allopt   Use -cob -cab -sc and a new set of improvements. This is the
          default
+
          default
 
option
 
option
  
-noopt Use 3-way branching for all FPT algorithms
+
-noopt     Use 3-way branching for all FPT algorithms
  
-cob Use "cut one b" improved branching
+
-cob       Use "cut one b" improved branching
  
-cab Use "cut all b" improved branching
+
-cab       Use "cut all b" improved branching
  
-sc Use "separate components" improved branching
+
-sc       Use "separate components" improved branching
  
 
*******************************************************************************
 
*******************************************************************************
Line 91: Line 91:
 
             Output the best found distance and agreement forest.
 
             Output the best found distance and agreement forest.
 
             This option can be used with gen_rooted_trees.pl to provide
 
             This option can be used with gen_rooted_trees.pl to provide
the rootings.
+
            the rootings.
Note that this option is a bit unintuitive to maintain
+
            Note that this option is a bit unintuitive to maintain
compatibility with previous versions of rSPR.
+
            compatibility with previous versions of rSPR.
If -total or -pairwise analysis is used then there is no need
+
            If -total or -pairwise analysis is used then there is no need
to specify rootings.
+
            to specify rootings.
  
 
-unrooted_min_approx    Compare the first input tree to each other input tree.
 
-unrooted_min_approx    Compare the first input tree to each other input tree.
Line 101: Line 101:
 
                         minimum approximate rspr distance
 
                         minimum approximate rspr distance
  
-simple_unrooted Root the gene trees using
+
-simple_unrooted       Root the gene trees using
 
                         a bipartition balanced accuracy measure
 
                         a bipartition balanced accuracy measure
(fast but potentially less accurate). Only
+
                        (fast but potentially less accurate). Only
used with -total.
+
                        used with -total.
  
 
*******************************************************************************
 
*******************************************************************************
Line 164: Line 164:
  
 
ClusterForest.h  Cluster Decomposition
 
ClusterForest.h  Cluster Decomposition
Forest.h   Forest data structure
+
Forest.h         Forest data structure
gen_rooted_trees.pl Generate all rootings of an unrooted binary tree
+
gen_rooted_trees.pl   Generate all rootings of an unrooted binary tree
gpl.txt   The GPL license
+
gpl.txt           The GPL license
 
LCA.h            Compute LCAs of tree leaves
 
LCA.h            Compute LCAs of tree leaves
Makefile   Makefile
+
Makefile         Makefile
Node.h   Node data structure
+
Node.h           Node data structure
README.txt This README
+
README.txt       This README
rspr.h     Library to calculate rSPR distances between pairs of trees
+
rspr.h           Library to calculate rSPR distances between pairs of trees
rspr.cpp Calculate rSPR distances between pairs or sets of trees
+
rspr.cpp         Calculate rSPR distances between pairs or sets of trees
test_trees/ Folder of test tree pairs
+
test_trees/       Folder of test tree pairs
SiblingPair.h     Sibling Pair class
+
SiblingPair.h     Sibling Pair class
  
 
################################################################################
 
################################################################################

Revision as of 22:59, 29 April 2014

################################################################################
rspr

################################################################################

Usage: rspr [OPTIONS]
Calculate approximate and exact Subtree Prune and Regraft (rSPR)
distances and the associated maximum agreement forests (MAFs) between pairs
of rooted binary trees from STDIN in newick format. Supports arbitrary labels.
The second tree may be multifurcating. 

Can also compare the first input tree to each other tree with -total or
compute a pairwise distance matrix with -pairwise.

Copyright 2009-2014 Chris Whidden
whidden@cs.dal.ca
http://kiwi.cs.dal.ca/Software/RSPR
April 29, 2014
Version 1.2.2

This file is part of rspr.

rspr is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
rspr is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with rspr.  If not, see <http://www.gnu.org/licenses/>.

*******************************************************************************
ALGORITHM
*******************************************************************************

These options control what algorithm is used

-fpt        Calculate the exact rSPR distance with an FPT algorithm

-bb         Calculate the exact rSPR distance with a branch-and-bound
            FPT algorithm. This is enabled by default.

-approx     Calculate just a linear-time 3-approximation of the rSPR distance

-split_approx
-split_approx x  Calculate the exact rSPR distance if it is k or less and
                 otherwise use the exponential-time approximation

-cluster_test   Use the cluster reduction to speed up the exact algorithm.
                This is enabled by default.

-total          Find the total SPR distance from the first input tree to
                the rest of the list of trees. Uses the other algorithm
                options as specified (including unrooted options).

*******************************************************************************
OPTIMIZATIONS
*******************************************************************************

These options control the use of optimized branching. All optimizations are
enabled by default. Specifying any subset of -cob, -cab, and -sc will use
just that subset of optimizations.

-allopt    Use -cob -cab -sc and a new set of improvements. This is the
           default
option

-noopt     Use 3-way branching for all FPT algorithms

-cob       Use "cut one b" improved branching

-cab       Use "cut all b" improved branching

-sc        Use "separate components" improved branching

*******************************************************************************
MULTIFURCATING COMPARISON OPTIONS
*******************************************************************************

-support x     Collapse bipartitions with less than x support

*******************************************************************************
UNROOTED COMPARISON OPTIONS
*******************************************************************************

-unrooted   Compare the first input tree to each other input tree.
            Output the best found distance and agreement forest.
            This option can be used with gen_rooted_trees.pl to provide
            the rootings.
            Note that this option is a bit unintuitive to maintain
            compatibility with previous versions of rSPR.
            If -total or -pairwise analysis is used then there is no need
            to specify rootings.

-unrooted_min_approx    Compare the first input tree to each other input tree.
                        Run the exact algorithms on the pair with the
                        minimum approximate rspr distance

-simple_unrooted        Root the gene trees using
                        a bipartition balanced accuracy measure
                        (fast but potentially less accurate). Only
                        used with -total.

*******************************************************************************
PAIRWISE COMPARISON OPTIONS
*******************************************************************************

-pairwise
-pairwise a b
-pairwise a b c d        Compare each input tree to each other tree and output
                         the resulting SPR distance matrix. If -unrooted is
                         enabled this will compute the "best rooting" SPR
                         distance by testing each rooting of the trees. The
                         optional arguments a b c d compute only rows a-b and/or
                         columns c-d of the matrix.

-no-symmetric-pairwise   By default, -pairwise will ignore the symmetric lower
                         left triangle of the matrix. With this option the
                         lower triangle is filled in.

-pairwise_max x          Use with -pairwise to only compute distances at most x.
                         Larger values are output as -1. Very efficient for
                         small distances (e.g. 1-10).

*******************************************************************************
OTHER OPTIONS
*******************************************************************************
-cc         Calculate a potentially better approximation with a quadratic time
            algorithm

-q          Quiet; Do not output the input trees or approximation
*******************************************************************************

Example:
$ ./rspr.exe -fpt <test_trees/trees2.txt
T1: ((((1,2),(3,4)),((5,6),(7,8))),(((9,10),(11,12)),((13,14),(15,16))))
T2: (((7,8),((1,(2,(14,5))),(3,4))),(((11,(6,12)),10),((13,(15,16)),9)))

F1: (((1,2),(3,4)),(7,8)) 14 13 5 12 11 6 9 10 (15,16)
F2: ((7,8),((1,2),(3,4))) 14 5 13 12 6 11 9 (15,16) 10
approx drSPR=9

3 4
F1: ((((1,2),(3,4)),(7,8)),((10,(11,12)),(13,(15,16)))) 14 6 9 5
F2: (((7,8),((1,2),(3,4))),(((11,12),10),(13,(15,16)))) 14 6 9 5
exact drSPR=4

################################################################################

CONTACT INFORMATION

Chris Whidden
whidden@cs.dal.ca
http://kiwi.cs.dal.ca/Software/RSPR

################################################################################

FILES


ClusterForest.h   Cluster Decomposition
Forest.h          Forest data structure
gen_rooted_trees.pl    Generate all rootings of an unrooted binary tree
gpl.txt           The GPL license
LCA.h             Compute LCAs of tree leaves
Makefile          Makefile
Node.h            Node data structure
README.txt        This README
rspr.h            Library to calculate rSPR distances between pairs of trees
rspr.cpp          Calculate rSPR distances between pairs or sets of trees
test_trees/       Folder of test tree pairs
SiblingPair.h     Sibling Pair class

################################################################################

INSTALLATION

rSPR is a command-line program written in C++. To use it, simply
compile rspr.cpp and execute the resulting program. On systems with
the g++ compiler and make program, the included make file will
compile rspr; simply run `make'.

################################################################################

INPUT

rSPR requires pairs of Newick format trees with arbitrary labels
as input. The first tree must be binary and rooted. The second tree
may be multifurcating and rooted. A sample Newick tree is shown below:

((1,2),(3,4),(5,6));

rSPR can also compare a rooted reference tree to an unrooted test tree.
First use gen_rooted_trees.pl to generate all rootings of the unrooted
test tree. Then use the -unrooted or -unrooted_min_approx options and
input the test tree and the set of rootings. rSPR will find the best
rooting of the test tree with the -unrooted option and guess the best 
rooting based on the approximation algorithm with the
-unrooted_min_approx option. Alternatively, the -total option with
the -unrooted or -unrooted_min_approx options will provide just the
distance. The -total option with -simple_unrooted will use a faster
biparition based measure to approximate the optimal rooting.

The -support x option can be used to collapse poorly supported branches
of the second tree.

With the -pairwise option, rSPR will compare each pair of input trees
and output the results as a distance matrix. To save time, only the
upper right triangle is output as the lower left triangle is symmetric.
Use the included fill_matrix program to fill in missing values or the
-no-symmetric-pairwise option to explicitly compute these values.
Optional arguments to -pairwise can be used to compute subsets of the
matrix (e.g. for partitioning computation over multiple processes).
The -pairwise_max x option can be used to quickly find trees with
SPR distance at most x when x is small (e.g. 1-10).


################################################################################

OUTPUT

rspr writes to standard output.

A sample command line and output are shown below:

/////////////////////

$ ./rspr < test_trees/trees2.txt
T1: ((((1,2),(3,4)),((5,6),(7,8))),(((9,10),(11,12)),((13,14),(15,16))))
T2: ((((3,4),(8,(2,((11,12),1)))),((15,16),(7,(6,5)))),(14,((10,13),9)))

F1: ((3,4),(5,6)) 13 14 10 (11,12) 9 1 8 7 2 (15,16)
F2: ((3,4),(6,5)) 13 10 14 (11,12) 1 9 8 2 7 (15,16)
approx drSPR=12

4
F1: ((((1,2),(3,4)),((5,6),7)),((9,10),14)) 13 (11,12) 8 (15,16)
F2: ((((3,4),(2,1)),(7,(6,5))),(14,(10,9))) 13 (11,12) 8 (15,16)
exact BB drSPR=4

/////////////////////

The first set of lines show the input trees. The second set of lines are the
approximate agreement forests and the corresponding approximate rSPR distance.
The third set of lines are the maximum agreement forests and the corresponding
exact rSPR distance. When calculating exact distances, the distance
currently being considered is printed on the first line of this section.

Each component of an agreement forest corresponds to an rSPR operation. 
The set of rSPR operations required to turn one tree into the other can
be found by applying rSPR operations that move these components to their
correct place in the other tree.

An agreement forest may contain p (rho) as a component. This represents
the root of the trees and indicates that an extra rSPR operation is
required to correctly root the tree.

################################################################################

OUTPUT WITH CLUSTERING

/////////////////////

$ rspr < test_trees/cluster_test 
T1: (((x,((b1,b3),b2)),y),(f,(a,c)))
T2: (((x,y),f),((a,((b1,b2),b3)),c))

F1: (((0,((1,2),3)),4),(5,(6,7))) 
F2: (((0,4),5),((6,((1,3),2)),7)) 
approx drSPR=9


CLUSTERS
C1_1: ((1,2),3) 
C1_2: ((1,3),2) 
cluster approx drSPR=3

1 
F1_1: (1,2) 3 
F1_2: (1,2) 3 
cluster exact drSPR=1

C2_1: (((0,(1,2)),4),(5,(6,7))) 
C2_2: (((0,4),5),((6,(1,2)),7)) 
cluster approx drSPR=6

2 
F2_1: (5,(6,7)) (1,2) (0,4) 
F2_2: (5,(6,7)) (1,2) (0,4) 
cluster exact drSPR=2

F1: (f,(a,c)) b2 (b1,b3) (x,y) 
F2: (f,(a,c)) b2 (b1,b3) (x,y) 
total exact drSPR=3

/////////////////////

When clustering is enabled (as it is by default), each solved
cluster is displayed along with its approximate and exact distance in
an intermediate representation with labels mapped from 0-(N-1) where
N is the number of labels. The final agreement forest and distance
are output last.

################################################################################

OUTPUT WITH PAIRWISE

$ cat test_trees/big_test* | rspr -pairwise
0,46,0,46
,0,46,50
,,0,46
,,,0

cat test_trees/big_test* | rspr -pairwise | fill_matrix
0,46,0,46
46,0,46,50
0,46,0,46
46,50,46,0

################################################################################

EFFICIENCY

The 3-approximation algorithm runs in O(n) time, where n is the number of
leaves in the trees.

The unoptimized FPT and branch-and-bound algorithms run in O(3^k n) time, where
k is the rSPR distance and n is the number of leaves in the trees. The
branch-and-bound algorithm should be significantly faster in practice.

Using all 3 of the -cob -cab and -sc optimizations improves the running times of
the algorithms to O(2.42^k n) time. This provides a significant improvement in
practice and is provably correct, thus this is the default.

In addition, this version contains new improvements that
give a bound of O(2^k n). This provides another significant improvement
and is provably correct so these options are also enabled by default.

For much larger trees, the -split_approx option will compute an
exponential time approximation of the distance that is exact for
small distances and generally within a few percent of the optimal 
distance otherwise.

When using the -unrooted option, the exact algorithms run in O(2^k n^2) time.

The cluster reduction improves the running time of the
algorithm to O(2^k n) time where k is the largest rSPR distance of
any cluster (as opposed to the full rSPR distance). This provides a large
speedup when the trees are clusterable.

With the -pairwise option on m rooted trees, the program takes O(m^2
2^k n) time, where k is the largest SPR distance computed. With -unrooted
this becomes O(m^2 2^k n^3). The -pairwise_max x option limits k to x, 
but does not use clustering and is slow for large distances.
-

NOTE: This is an exponential algorithm that exactly solves an NP-hard problem.
Thus the algorithms may not finish in a reasonable amount of time for large
rSPR distances (> 20 without optimizations and > 70 with optimizations).

################################################################################

REFERENCES

For more information on the algorithms see:

Whidden, C., Zeh, N., Beiko, R.G.  Fixed-Parameter and Approximation
Algorithms for Maximum Agreement Forests of Multifurcating Trees.
(Submitted). 2013. Preprint available at
http://arxiv.org/abs/1305.0512

Whidden, C., Zeh, N., Beiko, R.G.  Supertrees based on the subtree
prune-and-regraft distance.  To appear in Systematic Biology. 2014. Preprint
available at https://peerj.com/preprints/18/

Whidden, C., Beiko, R.G., Zeh, N. Fixed-Parameter Algorithms for Maximum
Agreement Forests. SIAM Journal on Computing 42.4 (2013), pp. 1431-1466.
Available at http://epubs.siam.org/doi/abs/10.1137/110845045

Whidden, C. Efficient Computation of Maximum Agreement Forests and their
Applications. PhD Thesis. Dalhousie University, Canada. 2013. Available at
www.cs.dal.ca/~whidden

Whidden, C., Beiko, R.G., Zeh, N. Fast FPT Algorithms for Computing
Rooted Agreement Forests: Theory and Experiments. Experimental Algorithms.
Ed. by P. Festa. Vol. 6049. Lecture Notes in Computer Science. Springer
Berlin Heidelberg, 2010, pp. 141-153. Available at
http://link.springer.com/chapter/10.1007/978-3-642-13193-6_13

Whidden, C., Zeh, N. A Unifying View on Approximation and FPT of
Agreement Forests. In: WABI 2009. LNCS, vol. 5724, pp. 390.401.
Springer-Verlag (2009). Available at
http://www.springerlink.com/content/n56q2846v645p655/

Whidden, C. A Unifying View on Approximation and FPT of Agreement Forests.
Masters Thesis. Dalhousie University, Canada. 2009. Available at
www.cs.dal.ca/~whidden

################################################################################

CITING rSPR

If you use rSPR in your research, please cite:

Whidden, C., Beiko, R.G., Zeh, N.  Computing the SPR Distance of Binary
Rooted Trees in O(2^k n) Time. (In Preparation). 2013.

Whidden, C., Beiko, R.G. Zeh, N.  Fixed-Parameter and Approximation
Algorithms for Maximum Agreement Forests of Multifurcating Trees.
(Submitted). 2013.

Whidden, C., Zeh, N., Beiko, R.G.  Supertrees based on the subtree
prune-and-regraft distance. Syst. Biol. Advance Access published April
2, 2014, doi:10.1093/sysbio/syu023.

Whidden, C., Beiko, R.G., Zeh, N. Fixed-Parameter Algorithms for Maximum
Agreement Forests. SIAM Journal on Computing 42.4 (2013), pp. 1431-1466.
Available at http://epubs.siam.org/doi/abs/10.1137/110845045

Whidden, C., Beiko, R.G., Zeh, N. Fast FPT Algorithms for Computing
Rooted Agreement Forests: Theory and Experiments. Experimental Algorithms.
Ed. by P. Festa. Vol. 6049. Lecture Notes in Computer Science. Springer
Berlin Heidelberg, 2010, pp. 141-153. Available at
http://link.springer.com/chapter/10.1007/978-3-642-13193-6_13

Whidden, C., Zeh, N. A Unifying View on Approximation and FPT of
Agreement Forests. In: WABI 2009. LNCS, vol. 5724, pp. 390.401.
Springer-Verlag (2009).

################################################################################