Prokrustean Graph: A Substring Index for Rapid K-Mer Size Analysis

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

The widespread use of k-mers in bioinformatics has enabled efficient genomic sequence analysis across various tasks. However, the impact of k-mer size remains unclear, as complex bioinformatics pipelines obscure its influence with noise. The choice of k is often arbitrary and rarely justified in literature or tutorials. Moreover, methods using multiple k-mer sizes face substantial computational challenges. Most methods rely on well-defined k-mer objects like de Bruijn graphs, Jaccard similarity, Bray-Curtis dissimilarity, and k-mer spectra. The role of k-mer size in these objects is intuitive and described by various metrics. Exploring these objects across k-mer sizes enables robust analyses and new applications. However, how k-mer objects evolve with size remains elusive, and modern substring indices struggle to explore them across sizes. We introduce the Prokrustean graph, a novel substring index that tracks k-mer set transformations across sizes. Built on this index, our framework efficiently computes k-mer-based quantities for all sizes, with complexity independent of the range and dependent only on maximal repeats. For example, counting maximal simple paths in de Bruijn graphs for k=1,⋯,100 takes seconds on a gigabase-scale dataset. We demonstrate this with experiments in pangenomics and metagenomics. The Prokrustean graph is space-efficiently constructed from the Burrows-Wheeler Transform. Our implementation is available at: https://github.com/KoslickiLab/prokrustean. The full version of this manuscript, including construction and four application algorithms, is available at: https://doi.org/10.1101/2023.11.21.568151.

Original languageEnglish (US)
Title of host publicationResearch in Computational Molecular Biology - 29th International Conference, RECOMB 2025, Proceedings
EditorsSriram Sankararaman
PublisherSpringer Science and Business Media Deutschland GmbH
Pages227-249
Number of pages23
ISBN (Print)9783031902512
DOIs
StatePublished - 2025
Event29th International Conference on Research in Computational Molecular Biology, RECOMB 2025 - Seoul, Korea, Republic of
Duration: Apr 26 2025Apr 29 2025

Publication series

NameLecture Notes in Computer Science
Volume15647 LNBI
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference29th International Conference on Research in Computational Molecular Biology, RECOMB 2025
Country/TerritoryKorea, Republic of
CitySeoul
Period4/26/254/29/25

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science

Fingerprint

Dive into the research topics of 'Prokrustean Graph: A Substring Index for Rapid K-Mer Size Analysis'. Together they form a unique fingerprint.

Cite this