Project Details
Description
Sequence count data (e.g., 16S rRNA sequencing or single-cell RNA-seq) are ubiquitous in modern
biomedical research. Yet even in the absence of measurement noise and limitations of experimental
design, these data convey limited information about the underlying biological system being measured.
Beyond familiar limitations such as inappropriate study design, two other forms of limitations have been
shown to impact or even dominate study conclusions. Scale limitations arise because the scale of the
system under study (e.g., the total number of bacteria in a person's gut) is typically independent of the scale
of the data. In contrast, measurement bias skews the observed distribution of counts as some entities are
systematically underrepresented compared to others. Despite an appreciation of these problems, we lack
tools for performing and evaluating analyses of sequence count data in light of these limitations. Here we
develop new statistical theory and tools for addressing measurement bias and scale limitations. This
proposal has three aims. (1) Develop a theoretical framework for objectively evaluating existing approaches in
light of these limitations. (2) Develop Simulated Inference as a new theoretical and computational
framework that allows analysts to use their preferred models and software while incorporating uncertainty
stemming from these data limitations. (3) Validate these tools through application to three case studies of
real sequence count data. In total, these aims provide new theoretical and computational tools for
evaluating and performing analyses of sequence count data that are robust to these data limitations. The
proposed work is also a substantial departure from the status quo. In contrast to existing methods that
address these data limitations through assumptions that are often implicit, we develop statistical theory and
tools that explicitly model uncertainty and potential error in those assumptions. We demonstrate that this
approach can lead to lower Type-I and Type-II errors both in theory and in practice. Overall, these tools will
enhance the reproducibility and rigor of sequence count data analysis, which is central to projects across
the NIH.
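
The two limitations named above can be illustrated with a small deterministic sketch. It is not the proposal's method; the abundances, sequencing depth, and detection efficiencies are hypothetical numbers, and `expected_counts` is an illustrative helper that assumes reads are allocated in proportion to what is detected.

```python
def proportions(values):
    """Normalize non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def expected_counts(abundances, depth, efficiencies=None):
    """Expected read counts for one sample at a fixed sequencing depth.

    abundances:   true absolute abundances (hypothetical values)
    depth:        total reads, set by the sequencer and not by the system
    efficiencies: assumed per-taxon detection efficiencies (measurement bias)
    """
    if efficiencies is None:
        efficiencies = [1.0] * len(abundances)
    detected = [a * e for a, e in zip(abundances, efficiencies)]
    return [depth * p for p in proportions(detected)]


# Two hypothetical communities with the same composition;
# community B carries ten times the total load of community A.
community_a = [100.0, 300.0, 600.0]
community_b = [1000.0, 3000.0, 6000.0]

# Scale limitation: both are sequenced to the same depth,
# so the expected counts carry no trace of the tenfold difference.
counts_a = expected_counts(community_a, depth=10_000)
counts_b = expected_counts(community_b, depth=10_000)

# Measurement bias: assume the first taxon is detected half as
# efficiently as the others, which lowers its share of the reads.
biased_a = expected_counts(community_a, depth=10_000,
                           efficiencies=[0.5, 1.0, 1.0])

print("A, unbiased:", counts_a)
print("B, unbiased:", counts_b)
print("A, biased:  ", biased_a)
```

Under these assumptions the expected counts for A and B coincide even though B's total load is ten times larger, and the biased run underrepresents the first taxon relative to its true proportion.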
RELEVANCE:
DNA sequencing is used to profile the abundance of different bacteria or the expression of different genes
within an organism. Yet limitations of the measurement process (e.g., measurement bias) restrict our ability
to use these data. This work will develop new statistical methods that enable scientists to account for these
data limitations and therefore to increase our understanding of human health and disease.
Status | Active |
---|---|
Effective start/end date | 9/20/22 → 8/31/25 |
Funding
- National Institute of General Medical Sciences: $199,851.00
- National Institute of General Medical Sciences: $199,851.00