TY - JOUR
T1 - Reproducibility of mass spectrometry based metabolomics data
AU - Ghosh, Tusharkanti
AU - Philtron, Daisy
AU - Zhang, Weiming
AU - Kechris, Katerina
AU - Ghosh, Debashis
N1 - Funding Information:
Data files for all benchmark data sets are available from Metabolomics Workbench at https://www.metabolomicsworkbench.org with Project ID PR000438 (BioTech data set) and Project ID PR000907 (Bio data set). The Bio data set can be accessed directly via its Project DOI: 10.21228/M8FQ2X. The BioTech data set can be accessed directly via its Project DOI: 10.21228/M8FC7C. This work is supported by NIH grant U2C DK119886. The methods described in this paper are implemented in the open-source R package marr, which is freely available from Bioconductor at http://bioconductor.org/packages/marr. The marr package includes comprehensive help files for each function, as well as a package vignette demonstrating a complete example study design. A flowchart of the marr Bioconductor package is presented in the Additional file: Fig. S26. Code scripts to reproduce all data preparation and simulation steps and to generate all figures and tables are available from GitHub at https://github.com/Ghoshlab/marr_paper_evaluations. The development version of the marr R package is available from GitHub at https://github.com/Ghoshlab/marr. In terms of computational time, on a desktop with a 3.2 GHz processor and 16 GB memory, marr takes approximately 12 min 47 s to analyze an MS-Metabolomics data set with 662 metabolites (features) and 77028 sample pairs. marr takes approximately 13 min 02 s to analyze an MS-Metabolomics data set with 2860 metabolites (features) and 77028 sample pairs. The computational complexity depends mainly on the number of sample pairs, not on the number of metabolites (features). We have developed a Shiny-based Web application, called marr Shiny, for dynamic interaction with MS-Metabolomics data that can run on any Web browser and requires no prior programming knowledge (https://maxmcgrath.shinyapps.io/marr/). Illustrative screenshots of the marr Shiny app pipeline are in the Additional file: Figs. S27 and S28.
Funding Information:
Research reported in this paper was supported by NCI and NHLBI of the NIH under award numbers U01 CA235488, P20 HL113445, U01 HL089897 and U01 HL089856. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Publisher Copyright:
© 2021, The Author(s).
PY - 2021/12
Y1 - 2021/12
N2 - Background: Assessing the reproducibility of measurements is an important first step for improving the reliability of downstream analyses of high-throughput metabolomics experiments. We define a metabolite to be reproducible when it demonstrates consistency across replicate experiments. Similarly, metabolites that are not consistent across replicates can be labeled as irreproducible. In this work, we introduce and evaluate the use of (Ma)ximum (R)ank (R)eproducibility (MaRR) to examine reproducibility in mass spectrometry-based metabolomics experiments. We examine reproducibility across technical or biological samples in three different mass spectrometry metabolomics (MS-Metabolomics) data sets. Results: We apply MaRR, a nonparametric approach that detects the change from reproducible to irreproducible signals using a maximal rank statistic. The advantage of using MaRR over model-based methods is that it does not make parametric assumptions on the underlying distributions or dependence structures of reproducible metabolites. Using three MS-Metabolomics data sets generated in the multi-center Genetic Epidemiology of Chronic Obstructive Pulmonary Disease (COPD) study, we applied the MaRR procedure after data processing to explore reproducibility across technical or biological samples. Under realistic settings of MS-Metabolomics data, the MaRR procedure effectively controls the False Discovery Rate (FDR) when there is a gradual reduction in correlation between replicate pairs for less highly ranked signals. Simulation studies also show that the MaRR procedure tends to have high power for detecting reproducible metabolites in most situations, except when the proportion of reproducible metabolites is small. Bias values in the simulations (i.e., the difference between the estimated and the true value of reproducible signal proportions) are also close to zero. The results from the real data show a higher level of reproducibility for technical replicates than for biological replicates across all three data sets. In summary, we demonstrate that the MaRR procedure can be adapted to various experimental designs, and that the nonparametric approach performs consistently well. Conclusions: This research was motivated by the challenge of reproducibility, which has proven to be a major obstacle in the use of genomic findings to advance clinical practice. In this paper, we developed a data-driven approach to assess the reproducibility of MS-Metabolomics data sets. The methods described in this paper are implemented in the open-source R package marr, which is freely available from Bioconductor at http://bioconductor.org/packages/marr.
AB - Background: Assessing the reproducibility of measurements is an important first step for improving the reliability of downstream analyses of high-throughput metabolomics experiments. We define a metabolite to be reproducible when it demonstrates consistency across replicate experiments. Similarly, metabolites that are not consistent across replicates can be labeled as irreproducible. In this work, we introduce and evaluate the use of (Ma)ximum (R)ank (R)eproducibility (MaRR) to examine reproducibility in mass spectrometry-based metabolomics experiments. We examine reproducibility across technical or biological samples in three different mass spectrometry metabolomics (MS-Metabolomics) data sets. Results: We apply MaRR, a nonparametric approach that detects the change from reproducible to irreproducible signals using a maximal rank statistic. The advantage of using MaRR over model-based methods is that it does not make parametric assumptions on the underlying distributions or dependence structures of reproducible metabolites. Using three MS-Metabolomics data sets generated in the multi-center Genetic Epidemiology of Chronic Obstructive Pulmonary Disease (COPD) study, we applied the MaRR procedure after data processing to explore reproducibility across technical or biological samples. Under realistic settings of MS-Metabolomics data, the MaRR procedure effectively controls the False Discovery Rate (FDR) when there is a gradual reduction in correlation between replicate pairs for less highly ranked signals. Simulation studies also show that the MaRR procedure tends to have high power for detecting reproducible metabolites in most situations, except when the proportion of reproducible metabolites is small. Bias values in the simulations (i.e., the difference between the estimated and the true value of reproducible signal proportions) are also close to zero. The results from the real data show a higher level of reproducibility for technical replicates than for biological replicates across all three data sets. In summary, we demonstrate that the MaRR procedure can be adapted to various experimental designs, and that the nonparametric approach performs consistently well. Conclusions: This research was motivated by the challenge of reproducibility, which has proven to be a major obstacle in the use of genomic findings to advance clinical practice. In this paper, we developed a data-driven approach to assess the reproducibility of MS-Metabolomics data sets. The methods described in this paper are implemented in the open-source R package marr, which is freely available from Bioconductor at http://bioconductor.org/packages/marr.
UR - http://www.scopus.com/inward/record.url?scp=85114405347&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85114405347&partnerID=8YFLogxK
U2 - 10.1186/s12859-021-04336-9
DO - 10.1186/s12859-021-04336-9
M3 - Article
C2 - 34493210
AN - SCOPUS:85114405347
SN - 1471-2105
VL - 22
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - 1
M1 - 423
ER -