Leveraging k-mer sketching statistics to enhance metagenomic methods and alignment algorithms

Project: Research project

Project Details

Description

Project Summary: In the face of increasing data sizes, sketching techniques such as MinHash sketching and its winnowed version have been among the most effective in facilitating scalable analysis. Frequently, though, bioinformatics algorithms using these techniques do not account for the randomness inherent in both the sketching process and the mutation processes that generate the data (e.g., sequencing errors or evolutionary mutations). This project directly addresses this limitation by laying the statistical foundations for how these sketching approaches interact with mutation processes and k-mer-based techniques, resulting in new algorithms for important biomedical problems.
Aim 1 derives, for the first time, confidence and prediction intervals for frequently used sketching-based bioinformatics quantities that until now existed only as point estimates. To do so, it relies on sophisticated techniques from probability theory. The mathematical foundations laid by Aim 1 will not only help achieve the biological aims of this proposal, but will also serve as a basis for quantifying the performance of future sketching-based bioinformatics algorithms.
Aim 2 will then use these results to develop the first metagenomic taxonomic profiling algorithm that accounts for the uncertainty present when predicting the presence and relative abundance of microorganisms in a sample. This will resolve a long-standing issue in this field by giving researchers an informed way to filter their noisy data without sacrificing sensitivity, thereby facilitating biomedical discoveries (e.g., novel CRISPR systems). In addition, this aim will result in the first scalable method to quickly estimate the fraction of a metagenomic sample that is not described by current reference databases, thus illuminating which datasets contain the highest quantity of novel genetic material and hence the greatest potential for biological discovery (e.g., novel antibiotics). Aim 2 will be achieved using techniques from compressive sensing as well as probability theory.
Aim 3 will both use and extend the results of Aim 1 to quantifiably improve one of the most fundamental tools in a computational biologist’s toolkit: sequence alignment. This will equip modern sequence aligners with much-needed significance scores and confidence intervals, and will allow parameter settings to be selected automatically to achieve a desired precision or recall. Because aligners are ubiquitous in biomedical research, even a small improvement in their accuracy and features will have tremendous impact. Aim 3 will be achieved using techniques from probabilistic algorithms.
Finally, the long-term objective of this proposal is to provide researchers with a toolkit that enables the development of scalable k-mer-based sketching algorithms without sacrificing their ability to quantify statistical significance.
Status: Active
Effective start/end date: 8/2/22 → 5/31/25

Funding

  • National Institute of General Medical Sciences: $443,474.00
  • National Institute of General Medical Sciences: $443,080.00
  • National Institute of General Medical Sciences: $443,474.00
