Fitting Tree Metrics and Ultrametrics in Data Streams

  • Amir Carmel
  • , Debarati Das
  • , Evangelos Kipouridis
  • , Evangelos Pipis

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Fitting distances to tree metrics and ultrametrics are two widely used methods in hierarchical clustering, primarily explored within the context of numerical taxonomy. Formally, given a positive distance function D : (V 2) → ℝ>0, the goal is to find a tree (or an ultrametric) T including all elements of set V, such that the difference between the distances among vertices in T and those specified by D is minimized. Numerical taxonomy was first introduced by Sneath and Sokal [Nature 1962], and since then it has been studied extensively in both biology and computer science. In this paper, we initiate the study of ultrametric and tree metric fitting problems in the semi-streaming model, where the distances between pairs of elements from V (with |V | “n), defined by the function D, can arrive in an arbitrary order. We study these problems under various distance norms; namely the ℓ0 objective, which aims to minimize the number of modified entries in D to fit a tree-metric or an ultrametric; the ℓ1 objective, which seeks to minimize the total sum of distance errors across all pairs of points in V ; and the ℓ objective, which focuses on minimizing the maximum error incurred by any entries in D. Our first result addresses the ℓ0 objective. We provide a single-pass polynomial-time Õ(n)-space O(1) approximation algorithm for ultrametrics and prove that no single-pass exact algorithm exists, even with exponential time. Next, we show that the algorithm for ℓ0 implies an O(∆/δ) approximation for the ℓ1 objective, where ∆ is the maximum, and δ is the minimum absolute difference between distances in the input. This bound matches the best-known approximation for the RAM model using a combinatorial algorithm when ∆/δ “O(n). For the ℓ objective, we provide a complete characterization of the ultrametric fitting problem. First, we present a single-pass polynomial-time Õ(n)-space 2-approximation algorithm and show that no better than 2-approximation is possible, even with exponential time. Furthermore, we show that with an additional pass, it is possible to achieve a polynomial-time exact algorithm for ultrametrics. Finally, we extend all these results to tree metrics by using only one additional pass through the stream and without asymptotically increasing the approximation factor.

Original languageEnglish (US)
Title of host publication52nd International Colloquium on Automata, Languages, and Programming, ICALP 2025
EditorsKeren Censor-Hillel, Fabrizio Grandoni, Joel Ouaknine, Gabriele Puppis
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
ISBN (Electronic)9783959773720
DOIs
StatePublished - Jun 30 2025
Event52nd EATCS International Colloquium on Automata, Languages, and Programming, ICALP 2025 - Aarhus, Denmark
Duration: Jul 8 2025Jul 11 2025

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume334
ISSN (Print)1868-8969

Conference

Conference52nd EATCS International Colloquium on Automata, Languages, and Programming, ICALP 2025
Country/TerritoryDenmark
CityAarhus
Period7/8/257/11/25

All Science Journal Classification (ASJC) codes

  • Software

Fingerprint

Dive into the research topics of 'Fitting Tree Metrics and Ultrametrics in Data Streams'. Together they form a unique fingerprint.

Cite this