## Abstract

We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence (LCS) problem) or minimizes the number of unaligned symbols (the alignment distance, i.e., the complement of LCS). Multiple sequence alignment is a well-studied problem in bioinformatics and is used routinely to identify regions of similarity among DNA, RNA, or protein sequences in order to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of the LCS or alignment distance of m sequences, each of length n, requires Θ(n^{m}) time unless the Strong Exponential Time Hypothesis is false. However, unlike the case of two strings, fast algorithms to approximate LCS and alignment distance of multiple sequences are lacking in the literature. A major challenge in this area is to break the triangle inequality barrier. Specifically, by splitting the m sequences into two (roughly) equal-sized groups, computing the alignment distance within each group, and finally combining the results using the triangle inequality, it is possible to achieve a 2-approximation in Õ_m(n^{⌈m/2⌉}) time. However, no approximation factor below 2, which would require breaking the triangle inequality barrier, is known in O(n^{αm}) time for any α < 1. We make significant progress in this direction. First, we consider a semi-random model in which we show that if just one of the m sequences is (p, B)-pseudorandom, then we can obtain a below-2 approximation in Õ_m(nB^{m-1} + n^{⌊m/2⌋+3}) time. Such semi-random models are very well studied in the two-string scenario; however, directly extending those works requires all but one of the sequences to be pseudorandom and would only give an O(1/p) approximation. We overcome these limitations with significant new ideas. Specifically, an ingredient of this proof is a new algorithm that achieves a below-2 approximation when the alignment distance is large, in Õ_m(n^{⌊m/2⌋+2}) time. This could be of independent interest. Next, for the LCS of m sequences each of length n, we show that if the optimum LCS is λn for some λ ∈ [0, 1], then in Õ_m(n^{⌊m/2⌋+1}) time we can return a common subsequence of length at least λ^{2}n/(2+ϵ) for any arbitrary constant ϵ > 0. In contrast, for two strings, the best known subquadratic algorithm may return a common subsequence of length Θ(λ^{4}n).
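To make the exact baseline concrete, the following is a minimal Python sketch of the textbook dynamic program for the LCS of m sequences, which explores up to n^{m} index tuples. It is not the approximation algorithm of the paper. The function names `lcs_length` and `unaligned_symbols` are illustrative, and counting unaligned symbols in total across all sequences is an assumed normalization of the alignment distance that may differ from the paper's definition.

```python
from functools import lru_cache
from typing import Sequence


def lcs_length(seqs: Sequence[str]) -> int:
    """Exact LCS length of several sequences via DP over prefix-length tuples."""
    seqs = tuple(seqs)
    if not seqs:
        return 0

    @lru_cache(maxsize=None)
    def solve(idx: tuple) -> int:
        # idx[k] is the length of the prefix of seqs[k] still under consideration.
        if any(i == 0 for i in idx):
            return 0
        last = [s[i - 1] for s, i in zip(seqs, idx)]
        if all(c == last[0] for c in last):
            # All prefixes end in the same symbol: align it and shrink every prefix.
            return 1 + solve(tuple(i - 1 for i in idx))
        # Otherwise at least one final symbol is unaligned: try dropping each in turn.
        return max(
            solve(idx[:k] + (idx[k] - 1,) + idx[k + 1:])
            for k in range(len(idx))
        )

    return solve(tuple(len(s) for s in seqs))


def unaligned_symbols(seqs: Sequence[str]) -> int:
    """Total symbols left unaligned across all sequences (assumed normalization)."""
    return sum(len(s) for s in seqs) - len(seqs) * lcs_length(seqs)
```

The memoized recursion is only practical for short inputs, since both the table size and the recursion depth grow with the sequence lengths. This exponential dependence on m is what motivates the approximation algorithms described in the abstract.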

Original language | English (US) |
---|---|
Title of host publication | Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2022 |
Editors | Amit Chakrabarti, Chaitanya Swamy |
Publisher | Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing |
ISBN (Electronic) | 9783959772495 |
DOIs | |
State | Published - Sep 1 2022 |
Event | 25th International Conference on Approximation Algorithms for Combinatorial Optimization Problems and the 26th International Conference on Randomization and Computation, APPROX/RANDOM 2022 - Virtual, Urbana-Champaign, United States. Duration: Sep 19 2022 → Sep 21 2022 |

### Publication series

Name | Leibniz International Proceedings in Informatics, LIPIcs |
---|---|
Volume | 245 |
ISSN (Print) | 1868-8969 |

### Conference

Conference | 25th International Conference on Approximation Algorithms for Combinatorial Optimization Problems and the 26th International Conference on Randomization and Computation, APPROX/RANDOM 2022 |
---|---|
Country/Territory | United States |
City | Virtual, Urbana-Champaign |
Period | 9/19/22 → 9/21/22 |

## All Science Journal Classification (ASJC) codes

- Software