TY - GEN
T1 - On the Maximal Independent Sets of k-mers with the Edit Distance
AU - Ma, Leran
AU - Chen, Ke
AU - Shao, Mingfu
N1 - Publisher Copyright:
© 2023 Owner/Author(s).
PY - 2023/9/3
Y1 - 2023/9/3
N2 - In computational biology, k-mers and edit distance are fundamental concepts. However, little is known about the metric space of all k-mers equipped with the edit distance. In this work, we explore the structure of the k-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all k-mers with nice theoretical properties, and therefore admits critical applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly those with high error-rates. Finding an MIS is a challenging problem, as the size of a k-mer space grows geometrically with respect to k. We propose three algorithms for this problem. The first and the most intuitive one uses a greedy strategy. The second method implements two techniques to avoid redundant comparisons by taking advantage of the locality-property of the k-mer space and the estimated bounds on the edit distance. The last algorithm avoids expensive calculations of the edit distance by translating the edit distance into the shortest path in a specifically designed graph. These algorithms are implemented and the calculated MISs of k-mer spaces and their statistical properties are reported and analyzed for k up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace.
AB - In computational biology, k-mers and edit distance are fundamental concepts. However, little is known about the metric space of all k-mers equipped with the edit distance. In this work, we explore the structure of the k-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all k-mers with nice theoretical properties, and therefore admits critical applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly those with high error-rates. Finding an MIS is a challenging problem, as the size of a k-mer space grows geometrically with respect to k. We propose three algorithms for this problem. The first and the most intuitive one uses a greedy strategy. The second method implements two techniques to avoid redundant comparisons by taking advantage of the locality-property of the k-mer space and the estimated bounds on the edit distance. The last algorithm avoids expensive calculations of the edit distance by translating the edit distance into the shortest path in a specifically designed graph. These algorithms are implemented and the calculated MISs of k-mer spaces and their statistical properties are reported and analyzed for k up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace.
UR - http://www.scopus.com/inward/record.url?scp=85175861905&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85175861905&partnerID=8YFLogxK
U2 - 10.1145/3584371.3612982
DO - 10.1145/3584371.3612982
M3 - Conference contribution
C2 - 38050580
AN - SCOPUS:85175861905
T3 - ACM-BCB 2023 - 14th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
BT - ACM-BCB 2023 - 14th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
PB - Association for Computing Machinery, Inc
T2 - 14th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2023
Y2 - 3 September 2023 through 6 September 2023
ER -