Fast Algorithms for Distributed k-Clustering with Outliers

Junyu Huang, Qilong Feng, Ziyun Huang, Jinhui Xu, Jianxin Wang

Research output: Contribution to journalConference articlepeer-review

1 Scopus citations

Abstract

In this paper, we study the k-clustering problems with outliers in distributed setting. The current best results for the distributed k-center problem with outliers have quadratic local running time with communication cost dependent on the aspect ratio ∆ of the given instance, which may constraint the scalability of the algorithms for handling large-scale datasets. To achieve better communication cost for the problem with faster local running time, we propose an inliers-recalling sampling method, which avoids guessing the optimal radius of the given instance, and can achieve a 4-round bi-criteria (14(1 + ε), 1 + ε)-approximation with linear local running time in the data size and communication cost independent of the aspect ratio. To obtain a more practical algorithm for the problem, we propose another space-narrowing sampling method, which automatically adjusts the sample size to adapt to different outliers distributions on each machine, and can achieve a 2-round bi-criteria (14(1 + ε), 1 + ε)-approximation with communication cost independent of the number of outliers. We show that, if the data points are randomly partitioned across machines, our proposed sampling-based methods can be extended to the k-median/means problems with outliers, and can achieve (O(ε 12 ), 1 + ε)-approximation with communication cost independent of the number of outliers. Empirical experiments suggest that the proposed 2-round distributed algorithms outperform other state-of-the-art algorithms.

Original languageEnglish (US)
Pages (from-to)13845-13868
Number of pages24
JournalProceedings of Machine Learning Research
Volume202
StatePublished - 2023
Event40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States
Duration: Jul 23 2023Jul 29 2023

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability

Cite this