TY - GEN
T1 - Parallel and memory-efficient preprocessing for metagenome assembly
AU - Rengasamy, Vasudevan
AU - Medvedev, Paul
AU - Madduri, Kamesh
N1 - Funding Information:
This research is supported in part by NSF awards #1453527, #1356529, and #1439057. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. We thank Chita Das for providing access to the Ganga cluster and the reviewers for their helpful comments.
Publisher Copyright:
© 2017 IEEE.
PY - 2017/6/30
Y1 - 2017/6/30
N2 - The analysis of high-throughput metagenomic sequencing data poses significant computational challenges. Most current de novo assembly tools use the de Bruijn graph-based methodology. In prior work, a connected components decomposition of the de Bruijn graph and subsequent partitioning of sequence read data was shown to be an effective memory-reducing preprocessing step for de novo assembly of large metagenomic datasets. In this paper, we present METAPREP, a new end-to-end parallel implementation of a similar preprocessing step. METAPREP has efficient implementations of several computational subroutines (e.g., k-mer enumeration and counting, parallel sorting, graph connectivity) that occur in other genomic data analysis problems, and we show that our implementations are comparable to the state-of-the-art. METAPREP is primarily designed to execute on large shared-memory multicore servers, but scales gracefully to use multiple compute nodes and clusters with parallel I/O capabilities. With METAPREP, we can process the Iowa Continuous Corn soil metagenomics dataset, comprising 1.13 billion reads totaling 223 billion base pairs, in around 14 minutes, using just 16 nodes of the NERSC Edison supercomputer. We also evaluate the performance impact of METAPREP on MEGAHIT, a parallel metagenome assembler.
AB - The analysis of high-throughput metagenomic sequencing data poses significant computational challenges. Most current de novo assembly tools use the de Bruijn graph-based methodology. In prior work, a connected components decomposition of the de Bruijn graph and subsequent partitioning of sequence read data was shown to be an effective memory-reducing preprocessing step for de novo assembly of large metagenomic datasets. In this paper, we present METAPREP, a new end-to-end parallel implementation of a similar preprocessing step. METAPREP has efficient implementations of several computational subroutines (e.g., k-mer enumeration and counting, parallel sorting, graph connectivity) that occur in other genomic data analysis problems, and we show that our implementations are comparable to the state-of-the-art. METAPREP is primarily designed to execute on large shared-memory multicore servers, but scales gracefully to use multiple compute nodes and clusters with parallel I/O capabilities. With METAPREP, we can process the Iowa Continuous Corn soil metagenomics dataset, comprising 1.13 billion reads totaling 223 billion base pairs, in around 14 minutes, using just 16 nodes of the NERSC Edison supercomputer. We also evaluate the performance impact of METAPREP on MEGAHIT, a parallel metagenome assembler.
UR - http://www.scopus.com/inward/record.url?scp=85028080789&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85028080789&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2017.159
DO - 10.1109/IPDPSW.2017.159
M3 - Conference contribution
AN - SCOPUS:85028080789
T3 - Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
SP - 283
EP - 292
BT - Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 31st IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
Y2 - 29 May 2017 through 2 June 2017
ER -