TY - GEN
T1 - Fitting Tree Metrics and Ultrametrics in Data Streams
AU - Carmel, Amir
AU - Das, Debarati
AU - Kipouridis, Evangelos
AU - Pipis, Evangelos
N1 - Publisher Copyright:
© Amir Carmel, Debarati Das, Evangelos Kipouridis, and Evangelos Pipis.
PY - 2025/6/30
Y1 - 2025/6/30
N2 - Fitting distances to tree metrics and ultrametrics are two widely used methods in hierarchical clustering, primarily explored within the context of numerical taxonomy. Formally, given a positive distance function D : (V 2) → ℝ>0, the goal is to find a tree (or an ultrametric) T including all elements of set V, such that the difference between the distances among vertices in T and those specified by D is minimized. Numerical taxonomy was first introduced by Sneath and Sokal [Nature 1962], and since then it has been studied extensively in both biology and computer science. In this paper, we initiate the study of ultrametric and tree metric fitting problems in the semi-streaming model, where the distances between pairs of elements from V (with |V | “n), defined by the function D, can arrive in an arbitrary order. We study these problems under various distance norms; namely the ℓ0 objective, which aims to minimize the number of modified entries in D to fit a tree-metric or an ultrametric; the ℓ1 objective, which seeks to minimize the total sum of distance errors across all pairs of points in V ; and the ℓ∞ objective, which focuses on minimizing the maximum error incurred by any entries in D. Our first result addresses the ℓ0 objective. We provide a single-pass polynomial-time Õ(n)-space O(1) approximation algorithm for ultrametrics and prove that no single-pass exact algorithm exists, even with exponential time. Next, we show that the algorithm for ℓ0 implies an O(∆/δ) approximation for the ℓ1 objective, where ∆ is the maximum, and δ is the minimum absolute difference between distances in the input. This bound matches the best-known approximation for the RAM model using a combinatorial algorithm when ∆/δ “O(n). For the ℓ∞ objective, we provide a complete characterization of the ultrametric fitting problem. First, we present a single-pass polynomial-time Õ(n)-space 2-approximation algorithm and show that no better than 2-approximation is possible, even with exponential time. Furthermore, we show that with an additional pass, it is possible to achieve a polynomial-time exact algorithm for ultrametrics. Finally, we extend all these results to tree metrics by using only one additional pass through the stream and without asymptotically increasing the approximation factor.
AB - Fitting distances to tree metrics and ultrametrics are two widely used methods in hierarchical clustering, primarily explored within the context of numerical taxonomy. Formally, given a positive distance function D : (V 2) → ℝ>0, the goal is to find a tree (or an ultrametric) T including all elements of set V, such that the difference between the distances among vertices in T and those specified by D is minimized. Numerical taxonomy was first introduced by Sneath and Sokal [Nature 1962], and since then it has been studied extensively in both biology and computer science. In this paper, we initiate the study of ultrametric and tree metric fitting problems in the semi-streaming model, where the distances between pairs of elements from V (with |V | “n), defined by the function D, can arrive in an arbitrary order. We study these problems under various distance norms; namely the ℓ0 objective, which aims to minimize the number of modified entries in D to fit a tree-metric or an ultrametric; the ℓ1 objective, which seeks to minimize the total sum of distance errors across all pairs of points in V ; and the ℓ∞ objective, which focuses on minimizing the maximum error incurred by any entries in D. Our first result addresses the ℓ0 objective. We provide a single-pass polynomial-time Õ(n)-space O(1) approximation algorithm for ultrametrics and prove that no single-pass exact algorithm exists, even with exponential time. Next, we show that the algorithm for ℓ0 implies an O(∆/δ) approximation for the ℓ1 objective, where ∆ is the maximum, and δ is the minimum absolute difference between distances in the input. This bound matches the best-known approximation for the RAM model using a combinatorial algorithm when ∆/δ “O(n). For the ℓ∞ objective, we provide a complete characterization of the ultrametric fitting problem. First, we present a single-pass polynomial-time Õ(n)-space 2-approximation algorithm and show that no better than 2-approximation is possible, even with exponential time. Furthermore, we show that with an additional pass, it is possible to achieve a polynomial-time exact algorithm for ultrametrics. Finally, we extend all these results to tree metrics by using only one additional pass through the stream and without asymptotically increasing the approximation factor.
UR - https://www.scopus.com/pages/publications/105009945851
UR - https://www.scopus.com/pages/publications/105009945851#tab=citedBy
U2 - 10.4230/LIPIcs.ICALP.2025.42
DO - 10.4230/LIPIcs.ICALP.2025.42
M3 - Conference contribution
AN - SCOPUS:105009945851
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 52nd International Colloquium on Automata, Languages, and Programming, ICALP 2025
A2 - Censor-Hillel, Keren
A2 - Grandoni, Fabrizio
A2 - Ouaknine, Joel
A2 - Puppis, Gabriele
PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
T2 - 52nd EATCS International Colloquium on Automata, Languages, and Programming, ICALP 2025
Y2 - 8 July 2025 through 11 July 2025
ER -