TY - JOUR
T1 - K-means clustering and principal components analysis of microarray data of L1000 landmark genes
AU - Clayman, Carly L.
AU - Srinivasan, Satish M.
AU - Sangwan, Raghvinder S.
N1 - Funding Information:
This study was supported by funding from the School of Mathematics and Physical Sciences, University of Technology Sydney.
Publisher Copyright:
© 2020 The Authors.
Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2020
Y1 - 2020
N2 - Dimensionality reduction methods such as principal component analysis (PCA) are used to select relevant features, and k-means clustering performs well when applied to data with low effective dimensionality. This study integrated PCA and k-means clustering using the L1000 dataset, containing gene microarray data from 978 landmark genes, which have been previously shown to predict expression of ~81% of the remaining 21,290 target genes with low error. Groups within the L1000 dataset were characterized using both microarray and clinical metadata to assess whether the 978 landmark genes would improve clustering results, compared to a random set of 978 genes. The role of clinical variables, including morphological diagnosis, was assessed across k-means clustering groups within homogeneous tissue samples in the L1000 dataset. Results show that the landmark genes yielded better-differentiated k-means clusters relative to 978 randomly selected non-landmark genes. K-means clusters generated from the landmark genes showed more separation of cluster groups when plotted against the first two principal components, which capture a greater proportion of variation for the 978 landmark genes. These results suggest that the 978 landmark genes better represent the overall genetic profile of these heterogeneous samples. Future studies will implement predictive analytics techniques to further investigate the interaction of microarray data and clinical variables such as cancer stage.
AB - Dimensionality reduction methods such as principal component analysis (PCA) are used to select relevant features, and k-means clustering performs well when applied to data with low effective dimensionality. This study integrated PCA and k-means clustering using the L1000 dataset, containing gene microarray data from 978 landmark genes, which have been previously shown to predict expression of ~81% of the remaining 21,290 target genes with low error. Groups within the L1000 dataset were characterized using both microarray and clinical metadata to assess whether the 978 landmark genes would improve clustering results, compared to a random set of 978 genes. The role of clinical variables, including morphological diagnosis, was assessed across k-means clustering groups within homogeneous tissue samples in the L1000 dataset. Results show that the landmark genes yielded better-differentiated k-means clusters relative to 978 randomly selected non-landmark genes. K-means clusters generated from the landmark genes showed more separation of cluster groups when plotted against the first two principal components, which capture a greater proportion of variation for the 978 landmark genes. These results suggest that the 978 landmark genes better represent the overall genetic profile of these heterogeneous samples. Future studies will implement predictive analytics techniques to further investigate the interaction of microarray data and clinical variables such as cancer stage.
UR - http://www.scopus.com/inward/record.url?scp=85093087241&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85093087241&partnerID=8YFLogxK
U2 - 10.1016/j.procs.2020.02.265
DO - 10.1016/j.procs.2020.02.265
M3 - Conference article
AN - SCOPUS:85093087241
SN - 1877-0509
VL - 168
SP - 97
EP - 104
JO - Procedia Computer Science
JF - Procedia Computer Science
T2 - 2020 Complex Adaptive Systems Conference, CAS 2019
Y2 - 13 November 2019 through 15 November 2019
ER -