TY - GEN

T1 - Improved algorithms for clustering with outliers

AU - Feng, Qilong

AU - Zhang, Zhen

AU - Huang, Ziyun

AU - Xu, Jinhui

AU - Wang, Jianxin

N1 - Publisher Copyright:
© Qilong Feng, Zhen Zhang, Ziyun Huang, Jinhui Xu, and Jianxin Wang; licensed under Creative Commons License CC-BY

PY - 2019/12

Y1 - 2019/12

N2 - Clustering is a fundamental problem in unsupervised learning. In many real-world applications, the to-be-clustered data often contains various types of noise that need to be removed from the learning process. To address this issue, we consider in this paper two variants of such clustering problems, called k-median with m outliers and k-means with m outliers. Existing techniques for both problems either incur relatively large approximation ratios or can only efficiently deal with a small number of outliers. In this paper, we present an improved solution to each of them for the case where k is a fixed number and m could be quite large. In particular, we give the first PTAS for the k-median problem with outliers in Euclidean space R^d for possibly large m and d. Our algorithm runs in O(nd((1/ε)(k + m))^((k/ε)^O(1))) time, which considerably improves the previous result (with running time O(nd(m + k)^O(m+k) + ((1/ε)k log n)^O(1))) given by [Feldman and Schulman, SODA 2012]. For the k-means with outliers problem, we introduce a (6 + ε)-approximation algorithm for general metric spaces with running time O(n(β(1/ε)(k + m))^k) for some constant β > 1. Our algorithm first uses the k-means++ technique to sample O((1/ε)(k + m)) points from the input and then selects the k centers from them. Compared to the more involved existing techniques, our algorithms are much simpler, using only random sampling, while achieving better approximation ratios.

AB - Clustering is a fundamental problem in unsupervised learning. In many real-world applications, the to-be-clustered data often contains various types of noise that need to be removed from the learning process. To address this issue, we consider in this paper two variants of such clustering problems, called k-median with m outliers and k-means with m outliers. Existing techniques for both problems either incur relatively large approximation ratios or can only efficiently deal with a small number of outliers. In this paper, we present an improved solution to each of them for the case where k is a fixed number and m could be quite large. In particular, we give the first PTAS for the k-median problem with outliers in Euclidean space R^d for possibly large m and d. Our algorithm runs in O(nd((1/ε)(k + m))^((k/ε)^O(1))) time, which considerably improves the previous result (with running time O(nd(m + k)^O(m+k) + ((1/ε)k log n)^O(1))) given by [Feldman and Schulman, SODA 2012]. For the k-means with outliers problem, we introduce a (6 + ε)-approximation algorithm for general metric spaces with running time O(n(β(1/ε)(k + m))^k) for some constant β > 1. Our algorithm first uses the k-means++ technique to sample O((1/ε)(k + m)) points from the input and then selects the k centers from them. Compared to the more involved existing techniques, our algorithms are much simpler, using only random sampling, while achieving better approximation ratios.

UR - http://www.scopus.com/inward/record.url?scp=85076374553&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85076374553&partnerID=8YFLogxK

U2 - 10.4230/LIPIcs.ISAAC.2019.61

DO - 10.4230/LIPIcs.ISAAC.2019.61

M3 - Conference contribution

AN - SCOPUS:85076374553

T3 - Leibniz International Proceedings in Informatics, LIPIcs

BT - 30th International Symposium on Algorithms and Computation, ISAAC 2019

A2 - Lu, Pinyan

A2 - Zhang, Guochuan

PB - Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing

T2 - 30th International Symposium on Algorithms and Computation, ISAAC 2019

Y2 - 8 December 2019 through 11 December 2019

ER -