ConTrip: contructing the maximum consensus tree

By Bang Ye Wu, E-mail; Home Page


Introduction:

A triple specifies the relation of three species in an evolutionary tree. Given a species set S and some local trees of subsets of S, the maximum consensus tree is an evolutionary tree of S such that the number of satisfied triples of the local trees is maximized.

The problem has been show to be NP-hard. ConTrip is a computer program that constructs the maximum consensus tree exactly or approximately.

Algorithms

The algorithms implemented in ConTrip are:

n          The exact algorithm uses the dynamic programming strategy and is used to find the optimal solution.

n          The Best-One-Split-First (BOSF) algorithm (top-down splitting strategy).

n          The Min-Cut-Split-First (MCSF) algorithm (top-down splitting strategy), derived from the exact algorithm for compatible triples.

n          The Best-Pair-Merge-First (BPMF) algorithm (bottom-up merging strategy), including 6 scoring functions.

References

Details of the algorithms and performance report should be referred to the original papers.

n          Gasieniec,L., Jansson,J., Lingas,A. and \"{O}stlin,A. (1999) On the complexity of constructing evolutionary trees, Journal of Combinatorial Optimization, 3, 183--197.

n          Wu,B.Y. (2003) Constructing the maximum consensus tree from rooted triples, to appear in Journal of Combinatorial Optimization.


About the program

The executable program was generated by Microsoft Visual C++ 6.0. But it uses only standard C instructions, and can be correctly compiled by DJGPP, the GNU C Compiler (gcc) for DOS.

The executable is generated for PC with Intel x86. The platform is WIN32. It requires a 32-bit system and large memory. If you want to run it under the old MS-DOS, variable declarations should be modified.

Availability: 

n          Download the program: source in C. The zip file includes the source code and a sample input.  The executable: for X86 and WIN32.

n          The program is free for non-commercial uses. Please cite the original papers.

How to use

n          Execution Command: contrip [input-filename] [output-filename] [if-exact]
When [if-exact]=1, it runs the exact algorithm. Otherwise heuristic algorithms are used, and the best is output. But if the number of species is more than 20, the heuristics will be used.

n          The input file should be in the following format:
[num_species] [num_trees]
[tree in Phylip form]:[weight];
[tree in Phylip form]:[weight];
….

n          The first two integers are the numbers of species (<128) and trees (<5000). Followed the two numbers are the input trees in the Phylip format, e.g., (0,(1,2)), separated by ";". A "." indicates the end of the input, and comments may be put after the period. The weight of each tree should be positive integer.

n          The labels of nodes should be 0,1,2,... ,

n          The output is the tree in Phylip format.

n          A sample input file :
6 3
((0,(1,2)),(4,5)):10;((2,3),(0,4)):3;
(5,(1,4)):5;
.   the text after this period will be ignored.

n          The first line => there are 6 species and labelled by 0,1,2,3,4,5. The number of input trees is 3.
The second line => the first and the second input trees with weight 10 and 3 respectively.
The 3rd line=>
 the 3rd input tree with weight 5. 
The 4th line => the period indicates the end of input.

n          The output for the sample input: (exact solution)
((5,
4),
(0,
(3,
(2,
1))));
%Number of input trees: 3
%Number of species: 6
%Using method: Exact
%Total weight of satisfied triples: 106.

n          The output for the sample input: (approximate solution)
 (((0,
(1,
2)),
3),
(4,
5));
%Number of input triples: 3
%Number of species: 6
%Using method: Heuristic
%Total weight of satisfied triples: 103.


Technical Notes

n          Currently the maximum number of species is 20 for exact solution and 128 for approximate solution. The data structure of the exact algorithm can be used for up to 32 species. However, it may take more than one day to find the optimal solution for more than 20 species (PC with Intel P 1.8G).  The exact algorithm uses an unsigned int for a set. Therefore if you want to run the program under 16-bit system, the maximum number of species will be 16. Otherwise the data structure should be modified (such as the linear array used in the program for the heuristic algorithms).

n          For approximate solutions, the maximum number of specie can be easily modified for more species since there is no limit by the data structure and it does not take too much space and time.

n          The exact algorithm requires a lot of memory. If you want to run the program under a system of limited memory (such as the old MS-DOS), the number of species will be also very limited.

n          For approximate solutions, the program runs several heuristics and output only the best. If you want to output all the solutions it found, you need only call the output procedure when the solutions are obtained (in main procedure).