Project Details
Description
Project Summary/Abstract
A fundamental goal in genomics is to understand natural selection on coding and noncoding sequences.
Signatures of natural selection encoded in polymorphism and divergence data not only elucidate the patterns of
evolution but also pinpoint deleterious genetic variants responsible for genetic disorders. While numerous
computational methods have been developed to infer sequences under various types of natural selection, the existing
methods suffer from two critical limitations. First, most of the methods for inferring natural selection focus on
analyzing individual loci. Due to the intrinsic sparsity of polymorphism and divergence data, the single-locus-based
approaches are often underpowered. Second, when multiple genomic features are correlated with signatures
of natural selection, the existing methods are incapable of distinguishing causal genomic features from correlated
confounders. Due to these limitations, we still lack powerful computational frameworks to identify loci and
genomic features responsible for natural selection. During the next five years, I will address the limitations of existing
methods by combining evolutionary models and flexible machine learning techniques. Specifically, I formulate
the inference of natural selection as a special regression problem in which genomic features are input covariates
whereas polymorphism and divergence data are response variables. Based on this idea, my lab will develop
a suite of evolution-guided machine learning models to infer negative, positive, and lineage-specific selection.
These customized machine learning models will boost the statistical power of selection inference by pooling data
across large numbers of loci, and will be able to distinguish genomic determinants from confounders. These
new models will be applied to investigate various types of natural selection in the human genome. In addition, a
genome-wide map of deleterious variants under strong negative selection will be developed for accurate variant
prioritization. The proposed research builds on my recent work for predicting functional noncoding sequences,
inferring selection coefficients of coding variants, and unifying variant-level and gene-level prioritization methods.
It will yield new insights into genomic determinants of functional sequences and human adaptive evolution,
and will provide powerful computational tools for identifying disease mutations. It could also serve as a basis for
the emerging paradigm of combining classical evolutionary theory and machine learning methods to address a
variety of questions in evolutionary biology.
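
The regression framing described in the abstract can be illustrated with a toy sketch. The code below is not the project's model: it assumes a simple Bernoulli (logistic) likelihood in which hypothetical per-site genomic features predict whether a site is polymorphic, and it uses simulated data only. The function names, feature layout, and parameter values are all illustrative assumptions. It shows two ideas from the text: pooling many sparse sites into one fit, and giving a correlated but non-causal feature a near-zero weight when it is modeled jointly with the causal one.

```python
import numpy as np


def simulate_sites(n_sites, weights, intercept, rng):
    """Simulate hypothetical per-site features and a 0/1 polymorphism indicator.

    Feature 1 is built to be correlated with feature 0 (a stand-in for a
    confounder); feature 2 is independent of both.
    """
    z = rng.normal(size=(n_sites, 3))
    features = z.copy()
    features[:, 1] = 0.8 * z[:, 0] + 0.6 * z[:, 1]  # correlation 0.8, unit variance
    logits = features @ weights + intercept
    prob_polymorphic = 1.0 / (1.0 + np.exp(-logits))
    is_polymorphic = rng.binomial(1, prob_polymorphic)
    return features, is_polymorphic


def fit_logistic(features, response, n_iter=2000, lr=0.5):
    """Fit logistic regression by full-batch gradient descent on the mean log-loss."""
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        err = p - response
        w -= lr * (features.T @ err) / n
        b -= lr * err.mean()
    return w, b


rng = np.random.default_rng(0)
# Assumed ground truth: feature 0 lowers the chance of polymorphism (as a
# constrained site might), feature 1 has no direct effect, feature 2 raises it.
true_w = np.array([-1.5, 0.0, 0.8])
true_b = -2.0

X, y = simulate_sites(40000, true_w, true_b, rng)
est_w, est_b = fit_logistic(X, y)

# Marginal association of the confounder with the response, ignoring feature 0.
marginal_corr = np.corrcoef(X[:, 1], y)[0, 1]

print("estimated weights:", est_w)
print("estimated intercept:", est_b)
print("marginal correlation of feature 1 with polymorphism:", marginal_corr)
```

Each simulated site carries a single binary observation, so no individual site is informative on its own; the weights become estimable only because all sites share one set of parameters. Feature 1 is associated with the response when examined alone, but its fitted weight stays close to zero once feature 0 is included in the same model.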
| Status | Finished |
|---|---|
| Effective start/end date | 8/10/21 → 6/30/24 |
Funding
- National Institute of General Medical Sciences: $382,268.00
- National Institute of General Medical Sciences: $373,053.00
- National Institute of General Medical Sciences: $373,274.00