By Bang Ye Wu, E-mail; Home Page
A triple specifies the
relation of three species in an evolutionary tree. Given a species set S and
some local trees of subsets of S, the maximum consensus tree is an evolutionary
tree of S such that the number of satisfied triples of the local trees is
maximized.
The problem has been show to
be NP-hard. ConTrip is a computer program that constructs the maximum consensus
tree exactly or approximately.
The algorithms implemented
in ConTrip are:
n
The
exact algorithm uses the dynamic programming strategy and is used to find the
optimal solution.
n
The Best-One-Split-First
(BOSF) algorithm (top-down splitting strategy).
n
The
Min-Cut-Split-First (MCSF) algorithm (top-down splitting strategy), derived
from the exact algorithm for compatible triples.
n
The
Best-Pair-Merge-First (BPMF) algorithm (bottom-up merging strategy), including
6 scoring functions.
Details of the algorithms
and performance report should be referred to the original papers.
n
Gasieniec,L.,
Jansson,J., Lingas,A. and \"{O}stlin,A. (1999) On the complexity of
constructing evolutionary trees, Journal of Combinatorial Optimization, 3,
183--197.
n
Wu,B.Y.
(2003) Constructing the maximum consensus tree from rooted triples, to appear
in Journal of Combinatorial Optimization.
The executable program was
generated by Microsoft Visual C++ 6.0. But it uses only standard C
instructions, and can be correctly compiled by DJGPP, the GNU C Compiler (gcc)
for DOS.
The executable is generated
for PC with Intel x86. The platform is WIN32. It requires a 32-bit system and
large memory. If you want to run it under the old MS-DOS, variable declarations
should be modified.
n
Download
the program: source in C. The zip file includes the
source code and a sample input. The
executable: for X86 and WIN32.
n
The
program is free for non-commercial uses. Please cite the original
papers.
n
Execution
Command: contrip [input-filename] [output-filename]
[if-exact]
When [if-exact]=1, it runs the exact algorithm.
Otherwise heuristic algorithms are used, and the best is output. But if the
number of species is more than 20, the heuristics will be used.
n
The
input file should be in the following format:
[num_species] [num_trees]
[tree in Phylip form]:[weight];
[tree in Phylip form]:[weight];
….
n
The
first two integers are the numbers of species (<128) and trees (<5000).
Followed the two numbers are the input trees in the Phylip format, e.g.,
(0,(1,2)), separated by ";". A "." indicates
the end of the input, and comments may be put after the period. The weight of
each tree should be positive integer.
n
The
labels of nodes should be 0,1,2,... ,
n
The
output is the tree in Phylip format.
n
A
sample input file :
6 3
((0,(1,2)),(4,5)):10;((2,3),(0,4)):3;
(5,(1,4)):5;
. the text after this period
will be ignored.
n
The first line => there are
6 species and labelled by 0,1,2,3,4,5. The number of input trees is 3.
The second line => the first and the second input trees with weight 10 and 3
respectively.
The 3rd line=> the 3rd input tree
with weight 5.
The 4th line => the period indicates the end of input.
n
The output for the sample
input: (exact solution)
((5,
4),
(0,
(3,
(2,
1))));
%Number of input trees: 3
%Number of species: 6
%Using method: Exact
%Total weight of satisfied triples: 106.
n
The output for the sample
input: (approximate solution)
(((0,
(1,
2)),
3),
(4,
5));
%Number of input triples: 3
%Number of species: 6
%Using method: Heuristic
%Total weight of satisfied triples: 103.
n
Currently
the maximum number of species is 20 for exact solution and 128 for approximate
solution. The data structure of the exact algorithm can be used for up to 32
species. However, it may take more than one day
to find the optimal solution for more than 20 species (PC with Intel P 1.8G). The exact algorithm uses an unsigned
int for a set. Therefore if you want to run the program under 16-bit
system, the maximum number of species will be 16. Otherwise the data structure
should be modified (such as the linear array used in the program for the
heuristic algorithms).
n
For
approximate solutions, the maximum number of specie can be easily modified for
more species since there is no limit by the data structure and it does not take
too much space and time.
n
The
exact algorithm requires a lot of memory. If you want to run the program under
a system of limited memory (such as the old MS-DOS), the number of species will
be also very limited.
n
For
approximate solutions, the program runs several heuristics and output only the
best. If you want to output all the solutions it found, you need only call the
output procedure when the solutions are obtained (in main procedure).