TY - GEN
T1 - Optimal omnitig listing for safe and complete contig assembly
AU - Cairo, Massimo
AU - Medvedev, Paul
AU - Acosta, Nidia Obscura
AU - Rizzi, Romeo
AU - Tomescu, Alexandru I.
N1 - Funding Information:
We acknowledge partial support from the Academy of Finland grant 274977 to A.I.T. and grant 284598 (CoECGR) to N.O.A., from the European Union's Horizon 2020 Marie Skłodowska-Curie grant agreement No 690941 to M.C. and A.I.T., and from NSF awards DBI-1356529, CCF-1439057, IIS-1453527, and IIS-1421908 to P.M.
Publisher Copyright:
© Massimo Cairo, Paul Medvedev, Nidia Obscura Acosta, Romeo Rizzi, and Alexandru I. Tomescu.
PY - 2017/7/1
Y1 - 2017/7/1
N2 - Genome assembly is the problem of reconstructing a genome sequence from a set of reads from a sequencing experiment. Typical formulations of the assembly problem admit in practice many genomic reconstructions, and actual genome assemblers usually output contigs, namely substrings that are promised to occur in the genome. To bridge the theory and practice, Tomescu and Medvedev [RECOMB 2016] reformulated contig assembly as finding all substrings common to all genomic reconstructions. They also gave a characterization of those walks (omnitigs) that are common to all closed edge-covering walks of a (directed) graph, a typical notion of genomic reconstruction. An algorithm for listing all maximal omnitigs was also proposed, by launching an exhaustive visit from every edge. In this paper, we prove new insights about the structure of omnitigs and solve several open questions about them. We combine these to achieve an O(nm)-time algorithm for outputting all the maximal omnitigs of a graph (with n nodes and m edges). This is also optimal, as we show families of graphs whose total omnitig length is Ω(nm). We implement this algorithm and show that it is 9-12 times faster in practice than the one of Tomescu and Medvedev [RECOMB 2016].
UR - http://www.scopus.com/inward/record.url?scp=85027272139&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85027272139&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.CPM.2017.29
DO - 10.4230/LIPIcs.CPM.2017.29
M3 - Conference contribution
AN - SCOPUS:85027272139
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017
A2 - Radoszewski, Jakub
A2 - Kärkkäinen, Juha
A2 - Rytter, Wojciech
PB - Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
T2 - 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017
Y2 - 4 July 2017 through 6 July 2017
ER -