This award is funded in whole or in part under the American Rescue Plan Act of 2021 (Public Law 117-2).
Many organisms, including humans, have two sets of chromosomes, one from the mother and one from the father, that exist as homologous pairs. It is commonly observed that the two distinct copies of a gene (alleles) at the same location on two homologous chromosomes may produce imbalanced amounts of gene product (i.e., mRNA). This phenomenon is called allele-specific expression (ASE). ASE is known to be closely related to multiple phenotypes and can contribute to cancer susceptibility. ASE offers an important source of biomarkers that could potentially be used for phenotyping or disease diagnosis. Additionally, ASE analysis serves as a powerful analytical tool to identify expression quantitative trait loci (eQTLs) and to study a variety of biological processes such as imprinting, protein-truncating variants, and X-chromosome inactivation. RNA-sequencing technology (RNA-seq) provides an accurate and efficient way to quantitatively measure ASE. However, the sequencing reads generated by current technologies are not full-length. Hence, computational methods are needed to reconstruct the full-length mRNAs expressed from the two different alleles that exist on homologous chromosomes, a problem referred to as allele-specific transcript assembly. Allele-specific transcript assembly is exceedingly difficult.
Allele-specific transcript assembly is difficult because it requires simultaneously threading mutations and splice junctions while inferring an unknown number of full-length transcripts and their abundances. This project aims to develop accurate allele-specific transcript assembly methods that are applicable to short-read, long-read, and single-cell RNA-seq data. Specifically, these investigators first tackle how to use phased SNPs (i.e., mutations) to improve allele-specific assembly. They show that phased SNPs can be equivalently represented as incompatible pairs of vertices in a so-called variant splice graph. Heuristics are then proposed to solve a formulation that includes these incompatible pairs. Long-range information in paired-/multi-end RNA-seq data will be used to improve allele-specific assembly, with an algorithm that decomposes the variant splice graph into paths while fully preserving the paired-/multi-end constraints. Allele-specific assembly will also be solved in the presence of structural variations, a crucial scenario in studying cancer. A new data structure will model SNPs, alternative splicing, and structural variations together. A new algorithm is also proposed to identify allele-specific structural variations, which is of independent interest and also leads to a two-step algorithm for allele-specific assembly. Covariate-adaptive multiple hypothesis testing will control false positive rates. Implementing these algorithms will result in accurate allele-specific transcript assemblers for various types of data and a new ASE analysis pipeline for broader use. The proposed research is well integrated with educational activities. High-school curricula will be developed that focus on using graph structures, a key abstraction in mathematics and computer science, to model biological data. High-school teachers will be provided opportunities to conduct interdisciplinary research related to transcript assembly.
A new undergraduate course will be developed with a focus on enhancing students' ability to model and solve real-world problems. Efforts will also be made to engage undergraduates and graduate students from underrepresented groups in research opportunities. The results of the project can be found at the PI's website: https://sites.psu.edu/mxs2589.
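As a rough illustration of the idea that phased SNPs can be represented as incompatible pairs of vertices in a variant splice graph, the toy sketch below enumerates source-to-sink paths that never contain both vertices of an incompatible pair. It is not the investigators' algorithm: the graph, the vertex names, and the function `compatible_paths` are hypothetical, and brute-force enumeration stands in for the heuristics mentioned in the abstract.

```python
def compatible_paths(edges, source, sink, incompatible):
    """Enumerate source-to-sink paths in a DAG that avoid every incompatible vertex pair.

    Hypothetical helper for illustration only; exhaustive search, not a heuristic.
    """
    bad = {frozenset(pair) for pair in incompatible}
    paths = []

    def extend(path):
        last = path[-1]
        if last == sink:
            paths.append(list(path))
            return
        for nxt in edges.get(last, []):
            # Skip a vertex that conflicts with any vertex already on the path.
            if any(frozenset((v, nxt)) in bad for v in path):
                continue
            path.append(nxt)
            extend(path)
            path.pop()

    extend([source])
    return paths


# Toy graph (assumed): two SNP sites, each with a reference (_r) and an
# alternate (_a) vertex, joined by a shared exon segment e2.
edges = {
    "s": ["x1_r", "x1_a"],
    "x1_r": ["e2"],
    "x1_a": ["e2"],
    "e2": ["x3_r", "x3_a"],
    "x3_r": ["t"],
    "x3_a": ["t"],
}

# Assumed phasing: x1_r travels with x3_r and x1_a with x3_a,
# so the two cross-allele combinations are incompatible pairs.
incompatible = [("x1_r", "x3_a"), ("x1_a", "x3_r")]

paths = compatible_paths(edges, "s", "t", incompatible)
for p in paths:
    print(" -> ".join(p))
```

Without the incompatible pairs, the toy graph has four source-to-sink paths; with them, only the two paths that respect the phasing remain, one per allele.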
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Effective start/end date: 7/1/22 → 6/30/27
- National Science Foundation: $602,510.00