TY - JOUR
T1 - Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to electronic health records
AU - The electronic Medical Records and Genomics (eMERGE) Network
AU - Crosslin, David R.
AU - Tromp, Gerard
AU - Burt, Amber
AU - Kim, Daniel Seung
AU - Verma, Shefali S.
AU - Lucas, Anastasia M.
AU - Bradford, Yuki
AU - Crawford, Dana C.
AU - Armasu, Sebastian M.
AU - Heit, John A.
AU - Hayes, M. Geoffrey
AU - Kuivaniemi, Helena
AU - Ritchie, Marylyn D.
AU - Jarvik, Gail P.
AU - De Andrade, Mariza
PY - 2014
Y1 - 2014
AB - Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of a multi-ethnic cohort with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach to generating principal components controlled for site and platform bias, in addition to ancestry differences, with the advantage of fewer covariates and degrees of freedom.
UR - http://www.scopus.com/inward/record.url?scp=84917734018&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84917734018&partnerID=8YFLogxK
U2 - 10.3389/fgene.2014.00352
DO - 10.3389/fgene.2014.00352
M3 - Article
C2 - 25414722
AN - SCOPUS:84917734018
SN - 1664-8021
VL - 5
JO - Frontiers in Genetics
JF - Frontiers in Genetics
IS - SEP
M1 - 352
ER -