Project Details


Sequence count data (e.g., 16S rRNA sequencing or single-cell RNA-seq) are ubiquitous in modern biomedical research. Yet even in the absence of measurement noise and limitations of experimental design, these data convey limited information about the underlying biological system being measured. Beyond familiar limitations such as inappropriate study design, two other forms of limitation have been shown to impact or even dominate study conclusions. Scale limitations arise because the scale of the system under study (e.g., the total number of bacteria in a person's gut) is typically independent of the scale of the data. In contrast, measurement bias skews the observed distribution of counts because some entities are systematically underrepresented relative to others. Despite an appreciation of these problems, we lack tools for performing and evaluating analyses of sequence count data in light of these limitations. Here we develop new statistical theory and tools for addressing measurement bias and scale limitations. This proposal has three aims. (1) Develop a theoretical framework for objectively evaluating existing approaches in light of these limitations. (2) Develop Simulated Inference as a new theoretical and computational framework that allows analysts to use their preferred models and software while incorporating uncertainty stemming from these data limitations. (3) Validate these tools through application to three case studies of real sequence count data. In total, these aims provide new theoretical and computational tools for evaluating and performing analyses of sequence count data that are robust to these data limitations. The proposed work is also a substantial departure from the status quo. In contrast to existing methods, which address these data limitations through assumptions that are often implicit, we develop statistical theory and tools that explicitly model uncertainty and potential error in those assumptions.
We demonstrate that this approach can lead to lower Type-I and Type-II error rates both in theory and in practice. Overall, these tools will enhance the reproducibility and rigor of sequence count data analysis, which is central to projects across the NIH.

RELEVANCE: DNA sequencing is used to profile the abundance of different bacteria or the expression of different genes within an organism. Yet limitations of the measurement process (e.g., measurement bias) restrict our ability to use these data. This work will develop new statistical methods that enable scientists to account for these data limitations and therefore to increase our understanding of human health and disease.
Effective start/end date: 9/20/22 to 8/31/24


  • National Institute of General Medical Sciences: $199,851.00
  • National Institute of General Medical Sciences: $199,851.00

