TY - JOUR
T1 - Estimation of allele-specific fitness effects across human protein-coding sequences and implications for disease
AU - Huang, Yi Fei
AU - Siepel, Adam
N1 - Funding Information:
We thank Fernando Racimo and Joshua Schraiber for sharing their source code for inference of selection coefficients, Stephen Wright for providing access to a GPU server at the University of Toronto, Noah Dukler for help with figure preparation, and members of the Siepel laboratory for helpful discussions. This research was supported by U.S. National Institutes of Health grant R35-GM127070. The content is solely the responsibility of the authors and does not necessarily represent the official views of the U.S. National Institutes of Health.
Funding Information:
We thank Fernando Racimo and Joshua Schraiber for sharing their source code for inference of selection coefficients, StephenWright for providing access to a GPU server at the University of Toronto, Noah Dukler for help with figure preparation, and members of the Siepel laboratory for helpful discussions. This researchwas supported by U.S. National Institutes of Health grant R35-GM127070. The content is solely the responsibility of the authors and does not necessarily represent the official views of the U.S. National Institutes of Health.
Publisher Copyright:
© 2019 Huang and Siepel.
PY - 2019
Y1 - 2019
N2 - A central challenge in human genomics is to understand the cellular, evolutionary, and clinical significance of genetic variants. Here, we introduce a unified population-genetic and machine-learning model, called Linear Allele-Specific Selection InferencE (LASSIE), for estimating the fitness effects of all observed and potential single-nucleotide variants, based on olymorphism data and predictive genomic features. We applied LASSIE to 51 high-coverage genome sequences annotated with 33 genomic features and constructed a map of allele-specific selection coefficients across all protein-coding sequences in the human genome. This map is generally consistent with previous inferences of the bulk distribution of fitness effects but reveals pervasive weak negative selection against synonymous mutations. In addition, the estimated selection coefficients are highly predictive of inherited pathogenic variants and cancer driver mutations, outperforming state-of-the-art variant prioritization methods. By contrasting our estimated model with ultrahigh coverage ExAC exome-sequencing data, we identified 1118 genes under unusually strong negative selection, which tend to be exclusively expressed in the central nervous system or associated with autism spectrum disorder, as well as 773 genes under unusually weak selection, which tend to be associated with metabolism. This combination of classical population genetic theory with modern machine-learning and large-scale genomic data is a powerful paradigm for the study of both human evolution and disease.
AB - A central challenge in human genomics is to understand the cellular, evolutionary, and clinical significance of genetic variants. Here, we introduce a unified population-genetic and machine-learning model, called Linear Allele-Specific Selection InferencE (LASSIE), for estimating the fitness effects of all observed and potential single-nucleotide variants, based on olymorphism data and predictive genomic features. We applied LASSIE to 51 high-coverage genome sequences annotated with 33 genomic features and constructed a map of allele-specific selection coefficients across all protein-coding sequences in the human genome. This map is generally consistent with previous inferences of the bulk distribution of fitness effects but reveals pervasive weak negative selection against synonymous mutations. In addition, the estimated selection coefficients are highly predictive of inherited pathogenic variants and cancer driver mutations, outperforming state-of-the-art variant prioritization methods. By contrasting our estimated model with ultrahigh coverage ExAC exome-sequencing data, we identified 1118 genes under unusually strong negative selection, which tend to be exclusively expressed in the central nervous system or associated with autism spectrum disorder, as well as 773 genes under unusually weak selection, which tend to be associated with metabolism. This combination of classical population genetic theory with modern machine-learning and large-scale genomic data is a powerful paradigm for the study of both human evolution and disease.
UR - http://www.scopus.com/inward/record.url?scp=85071057285&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85071057285&partnerID=8YFLogxK
U2 - 10.1101/gr.245522.118
DO - 10.1101/gr.245522.118
M3 - Article
C2 - 31249063
AN - SCOPUS:85071057285
SN - 1088-9051
VL - 29
SP - 1310
EP - 1321
JO - Genome research
JF - Genome research
IS - 8
ER -