K-means clustering and principal components analysis of microarray data of L1000 landmark genes

Research output: Contribution to journal › Conference article › peer-review

21 Scopus citations


Dimensionality reduction methods such as principal component analysis (PCA) are used to select relevant features, and k-means clustering performs well when applied to data with low effective dimensionality. This study integrated PCA and k-means clustering using the L1000 dataset, which contains gene microarray data from 978 landmark genes that have previously been shown to predict expression of ~81% of the remaining 21,290 target genes with low error. Groups within the L1000 dataset were characterized using both microarray and clinical metadata to assess whether the 978 landmark genes would improve clustering results compared with a random set of 978 genes. The role of clinical variables, including morphological diagnosis, was assessed across k-means clustering groups within homogeneous tissue samples in the L1000 dataset. Results show that the 978 landmark genes produced better differentiated k-means clusters than 978 randomly selected non-landmark genes. K-means clusters generated from the landmark genes showed more separation of cluster groups when plotted against the first two principal components, which capture a greater proportion of variation for the 978 landmark genes. These results suggest that the 978 landmark genes better represent the overall genetic profile of these heterogeneous samples. Future studies will implement predictive analytics techniques to further investigate the interaction of microarray data and clinical variables such as cancer stage.
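
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic stand-ins (no L1000 values are used), the matrix sizes, the cluster count `k = 3`, and the helper names `pca_scores` and `kmeans` are all assumptions made for illustration, and NumPy is assumed to be available. The abstract does not say whether clustering was run on the expression values or on the PC scores; the sketch clusters on the full matrix and uses the first two principal components only to summarize variation.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Project samples (rows) onto the top principal components via SVD.

    Returns the PC scores and the fraction of total variance they capture.
    """
    Xc = X - X.mean(axis=0)  # center each gene (column)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = S ** 2  # proportional to the variance along each component
    ratio = var[:n_components].sum() / var.sum()
    return Xc @ Vt[:n_components].T, ratio


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with random-sample initialization."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids


# Synthetic stand-in data (assumption): 60 samples, 50 "genes", 3 groups.
rng = np.random.default_rng(42)
n_samples, n_genes, k = 60, 50, 3
groups = np.repeat(np.arange(k), n_samples // k)
centers = rng.normal(0.0, 3.0, size=(k, n_genes))

gene_sets = {
    # Genes whose expression tracks the sample groups (informative set).
    "informative": centers[groups] + rng.normal(0.0, 1.0, size=(n_samples, n_genes)),
    # Genes with no group structure (stand-in for a random gene set).
    "random": rng.normal(0.0, 1.0, size=(n_samples, n_genes)),
}

results = {}
for name, X in gene_sets.items():
    scores, ratio = pca_scores(X, n_components=2)
    labels, _ = kmeans(X, k)
    results[name] = {"scores": scores, "ratio": ratio, "labels": labels}
    print(f"{name}: first two PCs capture {ratio:.1%} of variance; "
          f"cluster sizes = {np.bincount(labels, minlength=k).tolist()}")
```

On real data, the two matrices would be replaced by the landmark-gene and random-gene expression matrices, and the `scores` columns could be scatter-plotted with points colored by `labels` to inspect cluster separation visually.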

Original language: English (US)
Pages (from-to): 97-104
Number of pages: 8
Journal: Procedia Computer Science
State: Published - 2020
Event: 2020 Complex Adaptive Systems Conference, CAS 2019 - Malvern, United States
Duration: Nov 13 2019 → Nov 15 2019

All Science Journal Classification (ASJC) codes

  • General Computer Science

