Approximating LCS and Alignment Distance over Multiple Sequences

Debarati Das, Barna Saha

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence (LCS) problem), or minimizes the number of unaligned symbols (the alignment distance aka the complement of LCS). Multiple sequence alignment is a well-studied problem in bioinformatics and is used routinely to identify regions of similarity among DNA, RNA, or protein sequences to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of LCS or alignment distance of m sequences each of length n requires Θ(nm) time unless the Strong Exponential Time Hypothesis is false. However, unlike the case of two strings, fast algorithms to approximate LCS and alignment distance of multiple sequences are lacking in the literature. A major challenge in this area is to break the triangle inequality. Specifically, by splitting m sequences into two (roughly) equal sized groups, then computing the alignment distance in each group and finally combining them by using triangle inequality, it is possible to achieve a 2-approximation in Õm(n⌈m2 ⌉) time. But, an approximation factor below 2 which would need breaking the triangle inequality barrier is not known in O(nαm) time for any α < 1. We make significant progress in this direction. First, we consider a semi-random model where, we show if just one out of m sequences is (p, B)-pseudorandom then, we can get a below-two approximation in Õm(nBm-1 + n⌊m2 ⌋+3) time. Such semi-random models are very well-studied for two strings scenario, however directly extending those works require one but all sequences to be pseudorandom, and would only give an O(p1) approximation. We overcome these with significant new ideas. Specifically an ingredient to this proof is a new algorithm that achives below 2 approximations when alignment distance is large in Õm(n⌊m2 ⌋+2) time. This could be of independent interest. Next, for LCS of m sequences each of length n, we show if the optimum LCS is λn for some λ ∈ [0, 1], then in Õm(n⌊m2 ⌋+1)1 time, we can return a common subsequence of length at least λ2+2nϵ for any arbitrary constant ϵ > 0. In contrast, for two strings, the best known subquadratic algorithm may return a common subsequence of length Θ(λ4n).

Original languageEnglish (US)
Title of host publicationApproximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2022
EditorsAmit Chakrabarti, Chaitanya Swamy
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
ISBN (Electronic)9783959772495
DOIs
StatePublished - Sep 1 2022
Event25th International Conference on Approximation Algorithms for Combinatorial Optimization Problems and the 26th International Conference on Randomization and Computation, APPROX/RANDOM 2022 - Virtual, Urbana-Champaign, United States
Duration: Sep 19 2022Sep 21 2022

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume245
ISSN (Print)1868-8969

Conference

Conference25th International Conference on Approximation Algorithms for Combinatorial Optimization Problems and the 26th International Conference on Randomization and Computation, APPROX/RANDOM 2022
Country/TerritoryUnited States
CityVirtual, Urbana-Champaign
Period9/19/229/21/22

All Science Journal Classification (ASJC) codes

  • Software

Fingerprint

Dive into the research topics of 'Approximating LCS and Alignment Distance over Multiple Sequences'. Together they form a unique fingerprint.

Cite this