TY - JOUR
T1 - Family reunion via error correction
T2 - An efficient analysis of duplex sequencing data
AU - Stoler, Nicholas
AU - Arbeithuber, Barbara
AU - Povysil, Gundula
AU - Heinzl, Monika
AU - Salazar, Renato
AU - Makova, Kateryna D.
AU - Tiemann-Boege, Irene
AU - Nekrutenko, Anton
N1 - Funding Information:
This study was funded by the Eberly College of Science at the Pennsylvania State University, NIH Grants U41 HG006620 and R01 AI134384–01, and NSF ABI Grant 1661497. Funding for RS, MH, GP, and ITB was provided by the Linz Institute of Technology (LIT213201001) and the Austrian Science Fund (FWFP30867000). This project was also supported by a grant from NIH (R01GM116044) for K.D.M. and a Schrödinger Fellowship from the Austrian Science Fund (FWF) for B.A.: J-4096. Additional funding was provided by the Office of Science Engagement, Eberly College of Science, The Huck Institute of Life Sciences and the Institute for CyberScience at Penn State, as well as, in part, under grants from the Pennsylvania Department of Health using Tobacco Settlement and CURE Funds. The Department specifically disclaims responsibility for any analyses, interpretations or conclusions. Funding bodies did not participate in the collection, analysis and interpretation of data or in writing the manuscript.
Publisher Copyright:
© 2020 The Author(s).
PY - 2020/3/4
Y1 - 2020/3/4
AB - Background: Duplex sequencing is the most accurate approach for identification of sequence variants present at very low frequencies. Its power comes from pooling together multiple descendants of both strands of original DNA molecules, which allows distinguishing true nucleotide substitutions from PCR amplification and sequencing artifacts. This strategy comes at a cost: sequencing the same molecule multiple times increases dynamic range but significantly diminishes coverage, making whole-genome duplex sequencing prohibitively expensive. Furthermore, every duplex experiment produces a substantial proportion of singleton reads that cannot be used in the analysis and are thrown away. Results: In this paper we demonstrate that a significant fraction of these reads contains PCR or sequencing errors within duplex tags. Correction of such errors allows "reuniting" these reads with their respective families, increasing the output of the method and making it more cost-effective. Conclusions: We combine an error correction strategy with a number of algorithmic improvements in a new version of the duplex analysis software, Du Novo 2.0. It is written in Python, C, AWK, and Bash. It is open source and readily available through Galaxy, Bioconda, and GitHub: https://github.com/galaxyproject/dunovo.
UR - http://www.scopus.com/inward/record.url?scp=85081123033&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85081123033&partnerID=8YFLogxK
U2 - 10.1186/s12859-020-3419-8
DO - 10.1186/s12859-020-3419-8
M3 - Article
C2 - 32131723
AN - SCOPUS:85081123033
SN - 1471-2105
VL - 21
JO - BMC bioinformatics
JF - BMC bioinformatics
IS - 1
M1 - 96
ER -