Parallel and memory-efficient preprocessing for metagenome assembly

Vasudevan Rengasamy, Paul Medvedev, Kamesh Madduri

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Scopus citations

Abstract

The analysis of high-throughput metagenomic sequencing data poses significant computational challenges. Most current de novo assembly tools use the de Bruijn graph-based methodology. In prior work, a connected components decomposition of the de Bruijn graph and subsequent partitioning of sequence read data was shown to be an effective memory-reducing preprocessing step for de novo assembly of large metagenomic datasets. In this paper, we present METAPREP, a new end-to-end parallel implementation of a similar preprocessing step. METAPREP has efficient implementations of several computational subroutines (e.g., k-mer enumeration and counting, parallel sorting, graph connectivity) that occur in other genomic data analysis problems, and we show that our implementations are comparable to the state-of-the-art. METAPREP is primarily designed to execute on large shared-memory multicore servers, but scales gracefully to use multiple compute nodes and clusters with parallel I/O capabilities. With METAPREP, we can process the Iowa Continuous Corn soil metagenomics dataset, comprising 1.13 billion reads totaling 223 billion base pairs, in around 14 minutes, using just 16 nodes of the NERSC Edison supercomputer. We also evaluate the performance impact of METAPREP on MEGAHIT, a parallel metagenome assembler.
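
To make the preprocessing idea in the abstract concrete, the following is a minimal serial sketch of partitioning reads by de Bruijn graph connectivity: reads that share a k-mer (treating a k-mer and its reverse complement as the same) end up in the same component. It is not METAPREP's implementation, which the abstract describes as parallel and memory-efficient; the function names `canonical`, `find`, and `partition_reads`, the use of a dictionary plus union-find, and the toy reads are all assumptions made for illustration.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (other characters pass through).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def partition_reads(reads, k):
    """Group read indices so that reads sharing any canonical k-mer are together.

    Hypothetical helper for illustration: a serial, in-memory stand-in for a
    connected components decomposition followed by read partitioning.
    """
    parent = list(range(len(reads)))
    first_read_with = {}  # canonical k-mer -> index of first read containing it

    for i, read in enumerate(reads):
        # Enumerate every k-mer of the read (none if the read is shorter than k).
        for j in range(len(read) - k + 1):
            km = canonical(read[j:j + k])
            if km in first_read_with:
                a = find(parent, i)
                b = find(parent, first_read_with[km])
                if a != b:
                    # Keep the smaller index as the component representative.
                    parent[max(a, b)] = min(a, b)
            else:
                first_read_with[km] = i

    components = defaultdict(list)
    for i in range(len(reads)):
        components[find(parent, i)].append(i)
    return sorted(components.values())


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGG", "TTTTTT", "AAAAAA", "CCCCCC"]
    for component in partition_reads(toy_reads, k=4):
        print(component)
```

In this toy input, the first two reads share the k-mer GTAC, and the third and fourth are linked only through reverse complementation (TTTT and AAAA), so each pair forms one partition while the last read stays on its own. Each partition could then be handed to an assembler separately, which is the memory-reducing effect the abstract refers to.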

Original language: English (US)
Title of host publication: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 283-292
Number of pages: 10
ISBN (Electronic): 9781538634080
State: Published - Jun 30 2017
Event: 31st IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017 - Orlando, United States
Duration: May 29 2017 → Jun 2 2017

Publication series

Name: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017

Other

Other: 31st IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
Country/Territory: United States
City: Orlando
Period: 5/29/17 → 6/2/17

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
  • Computer Networks and Communications
  • Information Systems
