TY - GEN
T1 - Prokrustean Graph
T2 - 29th International Conference on Research in Computational Molecular Biology, RECOMB 2025
AU - Park, Adam
AU - Koslicki, David
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - The widespread use of k-mers in bioinformatics has enabled efficient genomic sequence analysis across various tasks. However, the impact of k-mer size remains unclear, as complex bioinformatics pipelines obscure its influence with noise. The choice of k is often arbitrary and rarely justified in literature or tutorials. Moreover, methods using multiple k-mer sizes face substantial computational challenges. Most methods rely on well-defined k-mer objects like de Bruijn graphs, Jaccard similarity, Bray-Curtis dissimilarity, and k-mer spectra. The role of k-mer size in these objects is intuitive and described by various metrics. Exploring these objects across k-mer sizes enables robust analyses and new applications. However, how k-mer objects evolve with size remains elusive, and modern substring indices struggle to explore them across sizes. We introduce the Prokrustean graph, a novel substring index that tracks k-mer set transformations across sizes. Built on this index, our framework efficiently computes k-mer-based quantities for all sizes, with complexity independent of the range and dependent only on maximal repeats. For example, counting maximal simple paths in de Bruijn graphs for k=1,⋯,100 takes seconds on a gigabase-scale dataset. We demonstrate this with experiments in pangenomics and metagenomics. The Prokrustean graph is space-efficiently constructed from the Burrows-Wheeler Transform. Our implementation is available at: https://github.com/KoslickiLab/prokrustean. The full version of this manuscript, including construction and four application algorithms, is available at: https://doi.org/10.1101/2023.11.21.568151.
AB - The widespread use of k-mers in bioinformatics has enabled efficient genomic sequence analysis across various tasks. However, the impact of k-mer size remains unclear, as complex bioinformatics pipelines obscure its influence with noise. The choice of k is often arbitrary and rarely justified in literature or tutorials. Moreover, methods using multiple k-mer sizes face substantial computational challenges. Most methods rely on well-defined k-mer objects like de Bruijn graphs, Jaccard similarity, Bray-Curtis dissimilarity, and k-mer spectra. The role of k-mer size in these objects is intuitive and described by various metrics. Exploring these objects across k-mer sizes enables robust analyses and new applications. However, how k-mer objects evolve with size remains elusive, and modern substring indices struggle to explore them across sizes. We introduce the Prokrustean graph, a novel substring index that tracks k-mer set transformations across sizes. Built on this index, our framework efficiently computes k-mer-based quantities for all sizes, with complexity independent of the range and dependent only on maximal repeats. For example, counting maximal simple paths in de Bruijn graphs for k=1,⋯,100 takes seconds on a gigabase-scale dataset. We demonstrate this with experiments in pangenomics and metagenomics. The Prokrustean graph is space-efficiently constructed from the Burrows-Wheeler Transform. Our implementation is available at: https://github.com/KoslickiLab/prokrustean. The full version of this manuscript, including construction and four application algorithms, is available at: https://doi.org/10.1101/2023.11.21.568151.
UR - https://www.scopus.com/pages/publications/105004255011
UR - https://www.scopus.com/pages/publications/105004255011#tab=citedBy
U2 - 10.1007/978-3-031-90252-9_14
DO - 10.1007/978-3-031-90252-9_14
M3 - Conference contribution
AN - SCOPUS:105004255011
SN - 9783031902512
T3 - Lecture Notes in Computer Science
SP - 227
EP - 249
BT - Research in Computational Molecular Biology - 29th International Conference, RECOMB 2025, Proceedings
A2 - Sankararaman, Sriram
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 26 April 2025 through 29 April 2025
ER -