Project Details
Description
Project Summary
In the face of increasing data sizes, sketching techniques such as MinHash sketching and its winnowed
version have been among the most effective in facilitating scalable analysis. Frequently, though, bioinformatic
algorithms using these techniques do not account for the randomness inherent in both the sketching process
and in the mutation processes that generate the data (e.g. sequencing errors or evolutionary mutations). This
project directly addresses this limitation by laying the statistical foundations for how these sketching
approaches interact with mutation processes and k-mer based techniques, resulting in new algorithms for
important biomedical problems. Aim 1 derives, for the first time, confidence and prediction intervals for
frequently utilized sketching-based bioinformatics quantities that until now have existed only as point estimates. To
do so, it relies on sophisticated techniques from probability theory. The mathematical foundations laid by Aim 1
will not only help us achieve the biological aims of this proposal, but will also serve as a basis for quantifying
the performance of future sketching-based bioinformatics algorithms. Aim 2 will then use these results to
develop the first metagenomic taxonomic profiling algorithm that accounts for the uncertainty present when
predicting the presence and relative abundance of microorganisms in a sample. This will resolve a
long-standing issue in this field by providing researchers an informed way to filter their noisy data without
sacrificing sensitivity, thereby facilitating biomedical discoveries (e.g. novel CRISPR systems). In addition, this
aim will result in the first scalable method to quickly estimate the fraction of a metagenomic sample that is not
described by current reference databases, thus illuminating which datasets contain the highest quantity of
novel genetic material and hence the greatest potential for biological discovery (e.g. novel antibiotics). Aim 2 will be
achieved using techniques from compressive sensing as well as probability theory. Aim 3 will both use and
extend the results of Aim 1 to quantifiably improve one of the most fundamental tools in a computational
biologist’s toolkit: sequence alignment. This will equip modern sequence aligners with much-needed
significance scores and confidence intervals, as well as allow for the automatic selection of parameter settings
to achieve a desired precision or recall. Because aligners are ubiquitous in biomedical research, even a small improvement
in the accuracy and features of an aligner will have a tremendous impact. Aim 3 will be achieved using
techniques from probabilistic algorithms. Finally, the long-term objective of this proposal is to provide
researchers a toolkit that enables the development of scalable k-mer-based sketching algorithms without
sacrificing their ability to quantify statistical significance.
Status | Active |
---|---|
Effective start/end date | 8/2/22 → 5/31/25 |
Funding
- National Institute of General Medical Sciences: $443,474.00
- National Institute of General Medical Sciences: $443,080.00
- National Institute of General Medical Sciences: $443,474.00