Abstract
Motivation: de Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes).
Results: In this article, we present TwoPaCo, a simple and scalable low-memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in < 2 h, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes.
Availability and Implementation: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo.
Contact: [email protected].
Supplementary information: Supplementary data are available at Bioinformatics online.
Original language | English (US) |
---|---|
Pages (from-to) | 4024-4032 |
Number of pages | 9 |
Journal | Bioinformatics (Oxford, England) |
Volume | 33 |
Issue number | 24 |
State | Published - Dec 15 2017 |
All Science Journal Classification (ASJC) codes
- Statistics and Probability
- Biochemistry
- Molecular Biology
- Computer Science Applications
- Computational Theory and Mathematics
- Computational Mathematics