TY - JOUR
T1 - Variable selection for clustering by separability based on ridgelines
AU - Lee, Hyangmin
AU - Li, Jia
N1 - Funding Information:
We thank Bruce Lindsay and Don Richards for valuable discussions. The research is supported by NSF DMS-0705210 and CCF-0936948. We ran most of the experiments using the CyberSTAR cluster computers at Penn State, supported by NSF OCI-0821527. The final revision of the manuscript was done when Jia Li was a Program Director at the National Science Foundation in 2011. Any opinions, findings, and conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of the Foundation.
PY - 2012/6
Y1 - 2012/6
N2 - A new variable selection algorithm is developed for clustering based on mode association. In conventional mixture-model-based clustering, each mixture component is treated as one cluster and the separation between clusters is usually measured by the ratio of between- and within-component dispersion. In this article, we allow one cluster to contain several components depending on whether they merge into one mode. The extent of separation between clusters is quantified using critical points on the ridgeline between two modes, which reflects the exact geometry of the density function. The computational foundation consists of the recently developed Modal expectation-maximization (MEM) algorithm which solves the modes of a Gaussian mixture density, and the Ridgeline expectation-maximization (REM) algorithm which solves the ridgeline passing through the critical points of the mixed density of two unimode clusters. Forward selection is used to find a subset of variables that maximizes an aggregated index of pairwise cluster separability. Theoretical analysis of the procedure is provided. We experiment with both simulated and real datasets and compare with several state-of-the-art variable selection algorithms. Supplemental materials including an R-package, datasets, and appendices for proofs are available online.
AB - A new variable selection algorithm is developed for clustering based on mode association. In conventional mixture-model-based clustering, each mixture component is treated as one cluster and the separation between clusters is usually measured by the ratio of between- and within-component dispersion. In this article, we allow one cluster to contain several components depending on whether they merge into one mode. The extent of separation between clusters is quantified using critical points on the ridgeline between two modes, which reflects the exact geometry of the density function. The computational foundation consists of the recently developed Modal expectation-maximization (MEM) algorithm which solves the modes of a Gaussian mixture density, and the Ridgeline expectation-maximization (REM) algorithm which solves the ridgeline passing through the critical points of the mixed density of two unimode clusters. Forward selection is used to find a subset of variables that maximizes an aggregated index of pairwise cluster separability. Theoretical analysis of the procedure is provided. We experiment with both simulated and real datasets and compare with several state-of-the-art variable selection algorithms. Supplemental materials including an R-package, datasets, and appendices for proofs are available online.
UR - http://www.scopus.com/inward/record.url?scp=84862528405&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84862528405&partnerID=8YFLogxK
U2 - 10.1080/10618600.2012.679226
DO - 10.1080/10618600.2012.679226
M3 - Article
AN - SCOPUS:84862528405
SN - 1061-8600
VL - 21
SP - 315
EP - 336
JO - Journal of Computational and Graphical Statistics
JF - Journal of Computational and Graphical Statistics
IS - 2
ER -