Abstract
In this paper, we study the k-clustering problems with outliers in distributed setting. The current best results for the distributed k-center problem with outliers have quadratic local running time with communication cost dependent on the aspect ratio ∆ of the given instance, which may constraint the scalability of the algorithms for handling large-scale datasets. To achieve better communication cost for the problem with faster local running time, we propose an inliers-recalling sampling method, which avoids guessing the optimal radius of the given instance, and can achieve a 4-round bi-criteria (14(1 + ε), 1 + ε)-approximation with linear local running time in the data size and communication cost independent of the aspect ratio. To obtain a more practical algorithm for the problem, we propose another space-narrowing sampling method, which automatically adjusts the sample size to adapt to different outliers distributions on each machine, and can achieve a 2-round bi-criteria (14(1 + ε), 1 + ε)-approximation with communication cost independent of the number of outliers. We show that, if the data points are randomly partitioned across machines, our proposed sampling-based methods can be extended to the k-median/means problems with outliers, and can achieve (O(ε 12 ), 1 + ε)-approximation with communication cost independent of the number of outliers. Empirical experiments suggest that the proposed 2-round distributed algorithms outperform other state-of-the-art algorithms.
| Original language | English (US) |
|---|---|
| Pages (from-to) | 13845-13868 |
| Number of pages | 24 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 202 |
| State | Published - 2023 |
| Event | 40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States Duration: Jul 23 2023 → Jul 29 2023 |
All Science Journal Classification (ASJC) codes
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability
Fingerprint
Dive into the research topics of 'Fast Algorithms for Distributed k-Clustering with Outliers'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver