Near-Linear Time Approximation Algorithms for k-means with Outliers

Junyu Huang, Qilong Feng, Ziyun Huang, Jinhui Xu, Jianxin Wang

Research output: Contribution to journalConference articlepeer-review

Abstract

The k-means with outliers problem is one of the most extensively studied clustering problems in the field of machine learning, where the goal is to discard up to z outliers and identify a minimum k-means clustering on the remaining data points. Most previous results for this problem have running time dependent on the aspect ratio ∆ (the ratio between the maximum and the minimum pairwise distances) to achieve fast approximations. To address the issue of aspect ratio dependency on the running time, we propose sampling-based algorithms with almost linear running time in the data size, where a crucial component of our approach is an algorithm called Fast-Sampling. Fast-Sampling algorithm can find inliers that well approximate the optimal clustering centers without relying on a guess for the optimal clustering costs, where a 4-approximate solution can be obtained in time O(ndk log log n/∊2) with O(k/∊) centers opened and (1 + ∊)z outliers discarded. To reduce the number of centers opened, we propose a center reduction algorithm, where an O(1/∊)-approximate solution can be obtained in time O(ndk log log n/∊2 + dpoly(k, 1/∊) log(n∆)) with (1 + ∊)z outliers discarded and exactly k centers opened. Empirical experiments suggest that our proposed sampling-based algorithms outperform state-of-the-art algorithms for the k-means with outliers problem.

Original languageEnglish (US)
Pages (from-to)19723-19756
Number of pages34
JournalProceedings of Machine Learning Research
Volume235
StatePublished - 2024
Event41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: Jul 21 2024Jul 27 2024

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability

Fingerprint

Dive into the research topics of 'Near-Linear Time Approximation Algorithms for k-means with Outliers'. Together they form a unique fingerprint.

Cite this