TY - GEN
T1 - Approximating LCS and Alignment Distance over Multiple Sequences
AU - Das, Debarati
AU - Saha, Barna
N1 - Publisher Copyright:
© Debarati Das and Barna Saha.
PY - 2022/9/1
Y1 - 2022/9/1
N2 - We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence (LCS) problem), or minimizes the number of unaligned symbols (the alignment distance aka the complement of LCS). Multiple sequence alignment is a well-studied problem in bioinformatics and is used routinely to identify regions of similarity among DNA, RNA, or protein sequences to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of LCS or alignment distance of m sequences each of length n requires Θ(nm) time unless the Strong Exponential Time Hypothesis is false. However, unlike the case of two strings, fast algorithms to approximate LCS and alignment distance of multiple sequences are lacking in the literature. A major challenge in this area is to break the triangle inequality. Specifically, by splitting m sequences into two (roughly) equal sized groups, then computing the alignment distance in each group and finally combining them by using triangle inequality, it is possible to achieve a 2-approximation in Õm(n⌈m2 ⌉) time. But, an approximation factor below 2 which would need breaking the triangle inequality barrier is not known in O(nαm) time for any α < 1. We make significant progress in this direction. First, we consider a semi-random model where, we show if just one out of m sequences is (p, B)-pseudorandom then, we can get a below-two approximation in Õm(nBm-1 + n⌊m2 ⌋+3) time. Such semi-random models are very well-studied for two strings scenario, however directly extending those works require one but all sequences to be pseudorandom, and would only give an O(p1) approximation. We overcome these with significant new ideas. Specifically an ingredient to this proof is a new algorithm that achives below 2 approximations when alignment distance is large in Õm(n⌊m2 ⌋+2) time. This could be of independent interest. Next, for LCS of m sequences each of length n, we show if the optimum LCS is λn for some λ ∈ [0, 1], then in Õm(n⌊m2 ⌋+1)1 time, we can return a common subsequence of length at least λ2+2nϵ for any arbitrary constant ϵ > 0. In contrast, for two strings, the best known subquadratic algorithm may return a common subsequence of length Θ(λ4n).
AB - We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence (LCS) problem), or minimizes the number of unaligned symbols (the alignment distance aka the complement of LCS). Multiple sequence alignment is a well-studied problem in bioinformatics and is used routinely to identify regions of similarity among DNA, RNA, or protein sequences to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of LCS or alignment distance of m sequences each of length n requires Θ(nm) time unless the Strong Exponential Time Hypothesis is false. However, unlike the case of two strings, fast algorithms to approximate LCS and alignment distance of multiple sequences are lacking in the literature. A major challenge in this area is to break the triangle inequality. Specifically, by splitting m sequences into two (roughly) equal sized groups, then computing the alignment distance in each group and finally combining them by using triangle inequality, it is possible to achieve a 2-approximation in Õm(n⌈m2 ⌉) time. But, an approximation factor below 2 which would need breaking the triangle inequality barrier is not known in O(nαm) time for any α < 1. We make significant progress in this direction. First, we consider a semi-random model where, we show if just one out of m sequences is (p, B)-pseudorandom then, we can get a below-two approximation in Õm(nBm-1 + n⌊m2 ⌋+3) time. Such semi-random models are very well-studied for two strings scenario, however directly extending those works require one but all sequences to be pseudorandom, and would only give an O(p1) approximation. We overcome these with significant new ideas. Specifically an ingredient to this proof is a new algorithm that achives below 2 approximations when alignment distance is large in Õm(n⌊m2 ⌋+2) time. This could be of independent interest. Next, for LCS of m sequences each of length n, we show if the optimum LCS is λn for some λ ∈ [0, 1], then in Õm(n⌊m2 ⌋+1)1 time, we can return a common subsequence of length at least λ2+2nϵ for any arbitrary constant ϵ > 0. In contrast, for two strings, the best known subquadratic algorithm may return a common subsequence of length Θ(λ4n).
UR - http://www.scopus.com/inward/record.url?scp=85139159584&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85139159584&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.APPROX/RANDOM.2022.54
DO - 10.4230/LIPIcs.APPROX/RANDOM.2022.54
M3 - Conference contribution
AN - SCOPUS:85139159584
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2022
A2 - Chakrabarti, Amit
A2 - Swamy, Chaitanya
PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
T2 - 25th International Conference on Approximation Algorithms for Combinatorial Optimization Problems and the 26th International Conference on Randomization and Computation, APPROX/RANDOM 2022
Y2 - 19 September 2022 through 21 September 2022
ER -