On the Maximal Independent Sets of k-mers with the Edit Distance

Leran Ma, Ke Chen, Mingfu Shao

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

In computational biology, k-mers and edit distance are fundamental concepts. However, little is known about the metric space of all k-mers equipped with the edit distance. In this work, we explore the structure of the k-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all k-mers with nice theoretical properties, and therefore admits critical applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly those with high error-rates. Finding an MIS is a challenging problem, as the size of a k-mer space grows geometrically with respect to k. We propose three algorithms for this problem. The first and the most intuitive one uses a greedy strategy. The second method implements two techniques to avoid redundant comparisons by taking advantage of the locality-property of the k-mer space and the estimated bounds on the edit distance. The last algorithm avoids expensive calculations of the edit distance by translating the edit distance into the shortest path in a specifically designed graph. These algorithms are implemented and the calculated MISs of k-mer spaces and their statistical properties are reported and analyzed for k up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace.

Original languageEnglish (US)
Title of host publicationACM-BCB 2023 - 14th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
PublisherAssociation for Computing Machinery, Inc
ISBN (Electronic)9798400701269
DOIs
StatePublished - Sep 3 2023
Event14th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2023 - Houston, United States
Duration: Sep 3 2023Sep 6 2023

Publication series

NameACM-BCB 2023 - 14th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

Conference

Conference14th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2023
Country/TerritoryUnited States
CityHouston
Period9/3/239/6/23

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Software
  • Biomedical Engineering
  • Health Informatics

Fingerprint

Dive into the research topics of 'On the Maximal Independent Sets of k-mers with the Edit Distance'. Together they form a unique fingerprint.

Cite this