TY - JOUR
T1 - Streaming breakpoint graph analytics for accelerating and parallelizing the computation of DCJ median of three genomes
AU - Yin, Zhaoming
AU - Tang, Jijun
AU - Schaeffer, Stephen W.
AU - Bader, David A.
N1 - Funding Information:
1This Research was sponsored in part by the NSF PetaApps Grants OCI-0904461 (Bader), OCI-0904179 (Tang), OCI-0904166 (Schaeffer) and NSF OCI-1214504 (Bader). ∗ Corresponding author, URL: http://www.cc.gatech.edu/∼bader/ (David A. Bader )
PY - 2013
Y1 - 2013
N2 - The problem of finding the median of three genomes is the key process in building the most parsimonious phylogenetic trees from genome rearrangement data. The median problem using Double-Cut-and-Join (DCJ) distance is NP-hard and the best exact algorithm is based on a branch-and-bound best-first search strategy to explore sub-graph patterns in Multiple BreakPoint Graph (MBG). In this paper, by taking advantage of the "streaming" property of MBG, we introduce the "footprint-based" data structure to reduce the space requirement of a single search nodes from O(v2) to O(v); minimize the redundant computation in counting cycles/paths to update bounds, which leads to dramatically decrease of workload of a single search node. Additional heuristic of branching strategy is introduced to help reducing the searching space. Last but not least, the introduction of a multi-thread shared memory parallel algorithm with two load balancing strategies bring in additional benefit by distributing search work efficiently among different processors. We conduct extensive experiments on simulated datasets and our results show significant improvement on all datasets. And we test our DCJ median algorithm with GASTS, a state of the art software phylogenetic tree construction package. On the real high resolution Drosophila data set, our exact algorithm run as fast as the heuristic algorithm and help construct a better phylogenetic tree.
AB - The problem of finding the median of three genomes is the key process in building the most parsimonious phylogenetic trees from genome rearrangement data. The median problem using Double-Cut-and-Join (DCJ) distance is NP-hard and the best exact algorithm is based on a branch-and-bound best-first search strategy to explore sub-graph patterns in Multiple BreakPoint Graph (MBG). In this paper, by taking advantage of the "streaming" property of MBG, we introduce the "footprint-based" data structure to reduce the space requirement of a single search nodes from O(v2) to O(v); minimize the redundant computation in counting cycles/paths to update bounds, which leads to dramatically decrease of workload of a single search node. Additional heuristic of branching strategy is introduced to help reducing the searching space. Last but not least, the introduction of a multi-thread shared memory parallel algorithm with two load balancing strategies bring in additional benefit by distributing search work efficiently among different processors. We conduct extensive experiments on simulated datasets and our results show significant improvement on all datasets. And we test our DCJ median algorithm with GASTS, a state of the art software phylogenetic tree construction package. On the real high resolution Drosophila data set, our exact algorithm run as fast as the heuristic algorithm and help construct a better phylogenetic tree.
UR - http://www.scopus.com/inward/record.url?scp=84896925434&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84896925434&partnerID=8YFLogxK
U2 - 10.1016/j.procs.2013.05.220
DO - 10.1016/j.procs.2013.05.220
M3 - Conference article
AN - SCOPUS:84896925434
SN - 1877-0509
VL - 18
SP - 561
EP - 570
JO - Procedia Computer Science
JF - Procedia Computer Science
T2 - 13th Annual International Conference on Computational Science, ICCS 2013
Y2 - 5 June 2013 through 7 June 2013
ER -