Sequence Similarity Estimation by Random Subsequence Sketching

Ke Chen, Vinamratha Pattar, Mingfu Shao

Research output: Chapter in Book/Report/Conference proceedingConference contribution

1 Scopus citations

Abstract

Sequence similarity estimation is essential for many bioinformatics tasks, including functional annotation, phylogenetic analysis, and overlap graph construction. Alignment-free methods aim to solve large-scale sequence similarity estimation by mapping sequences to more easily comparable features that can approximate edit distances efficiently. Substrings or k-mers, as the dominant choice of features, face an unavoidable compromise between sensitivity and specificity when selecting the proper k-value. Recently, subsequence-based features have shown improved performance, but they are computationally demanding, and determining the ideal subsequence length remains an intricate art. In this work, we introduce SubseqSketch, a novel alignment-free scheme that maps a sequence to an integer vector, where the entries correspond to dynamic, rather than fixed, lengths of random subsequences. The cosine similarity between these vectors exhibits a strong correlation with the edit similarity between the original sequences. Through experiments on benchmark datasets, we demonstrate that SubseqSketch is both efficient and effective across various alignment-free tasks, including nearest neighbor search and phylogenetic clustering. A C++ implementation of SubseqSketch is openly available at https://github.com/Shao-Group/SubseqSketch.

Original languageEnglish (US)
Title of host publication25th International Conference on Algorithms for Bioinformatics, WABI 2025
EditorsBrona Brejova, Rob Patro
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
ISBN (Electronic)9783959773867
DOIs
StatePublished - Aug 15 2025
Event25th International Conference on Algorithms for Bioinformatics, WABI 2025 - College Park, United States
Duration: Aug 20 2025Aug 22 2025

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume344
ISSN (Print)1868-8969

Conference

Conference25th International Conference on Algorithms for Bioinformatics, WABI 2025
Country/TerritoryUnited States
CityCollege Park
Period8/20/258/22/25

All Science Journal Classification (ASJC) codes

  • Software

Fingerprint

Dive into the research topics of 'Sequence Similarity Estimation by Random Subsequence Sketching'. Together they form a unique fingerprint.

Cite this