TY - GEN
T1 - Sequence Similarity Estimation by Random Subsequence Sketching
AU - Chen, Ke
AU - Pattar, Vinamratha
AU - Shao, Mingfu
N1 - Publisher Copyright:
© Ke Chen, Vinamratha Pattar, and Mingfu Shao.
PY - 2025/8/15
Y1 - 2025/8/15
N2 - Sequence similarity estimation is essential for many bioinformatics tasks, including functional annotation, phylogenetic analysis, and overlap graph construction. Alignment-free methods aim to solve large-scale sequence similarity estimation by mapping sequences to more easily comparable features that can approximate edit distances efficiently. Substrings or k-mers, as the dominant choice of features, face an unavoidable compromise between sensitivity and specificity when selecting the proper k-value. Recently, subsequence-based features have shown improved performance, but they are computationally demanding, and determining the ideal subsequence length remains an intricate art. In this work, we introduce SubseqSketch, a novel alignment-free scheme that maps a sequence to an integer vector, where the entries correspond to dynamic, rather than fixed, lengths of random subsequences. The cosine similarity between these vectors exhibits a strong correlation with the edit similarity between the original sequences. Through experiments on benchmark datasets, we demonstrate that SubseqSketch is both efficient and effective across various alignment-free tasks, including nearest neighbor search and phylogenetic clustering. A C++ implementation of SubseqSketch is openly available at https://github.com/Shao-Group/SubseqSketch.
AB - Sequence similarity estimation is essential for many bioinformatics tasks, including functional annotation, phylogenetic analysis, and overlap graph construction. Alignment-free methods aim to solve large-scale sequence similarity estimation by mapping sequences to more easily comparable features that can approximate edit distances efficiently. Substrings or k-mers, as the dominant choice of features, face an unavoidable compromise between sensitivity and specificity when selecting the proper k-value. Recently, subsequence-based features have shown improved performance, but they are computationally demanding, and determining the ideal subsequence length remains an intricate art. In this work, we introduce SubseqSketch, a novel alignment-free scheme that maps a sequence to an integer vector, where the entries correspond to dynamic, rather than fixed, lengths of random subsequences. The cosine similarity between these vectors exhibits a strong correlation with the edit similarity between the original sequences. Through experiments on benchmark datasets, we demonstrate that SubseqSketch is both efficient and effective across various alignment-free tasks, including nearest neighbor search and phylogenetic clustering. A C++ implementation of SubseqSketch is openly available at https://github.com/Shao-Group/SubseqSketch.
UR - https://www.scopus.com/pages/publications/105013847658
UR - https://www.scopus.com/inward/citedby.url?scp=105013847658&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.WABI.2025.7
DO - 10.4230/LIPIcs.WABI.2025.7
M3 - Conference contribution
AN - SCOPUS:105013847658
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 25th International Conference on Algorithms for Bioinformatics, WABI 2025
A2 - Brejova, Brona
A2 - Patro, Rob
PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
T2 - 25th International Conference on Algorithms for Bioinformatics, WABI 2025
Y2 - 20 August 2025 through 22 August 2025
ER -