TY - JOUR
T1 - Imputation and quality control steps for combining multiple genome-wide datasets
AU - Verma, Shefali S.
AU - de Andrade, Mariza
AU - Tromp, Gerard
AU - Kuivaniemi, Helena
AU - Pugh, Elizabeth
AU - Namjou-Khales, Bahram
AU - Mukherjee, Shubhabrata
AU - Jarvik, Gail P.
AU - Kottyan, Leah C.
AU - Burt, Amber
AU - Bradford, Yuki
AU - Armstrong, Gretta D.
AU - Derr, Kimberly
AU - Crawford, Dana C.
AU - Haines, Jonathan L.
AU - Li, Rongling
AU - Crosslin, David
AU - Ritchie, Marylyn D.
N1 - Publisher Copyright: © 2014 Verma, de Andrade, Tromp, Kuivaniemi, Pugh, Namjou-Khales, Mukherjee, Jarvik, Kottyan, Burt, Bradford, Armstrong, Derr, Crawford, Haines, Li, Crosslin and Ritchie.
PY - 2014
Y1 - 2014
N2 - The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
UR - http://www.scopus.com/inward/record.url?scp=84917732232&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84917732232&partnerID=8YFLogxK
U2 - 10.3389/fgene.2014.00370
DO - 10.3389/fgene.2014.00370
M3 - Article
C2 - 25566314
AN - SCOPUS:84917732232
SN - 1664-8021
VL - 5
JO - Frontiers in Genetics
JF - Frontiers in Genetics
IS - DEC
M1 - 370
ER -