Informed and automated k-mer size selection for genome assembly

Rayan Chikhi, Paul Medvedev

Research output: Contribution to journal › Article › peer-review

520 Scopus citations

Abstract

Motivation: Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision.

Results: We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies.

Availability: Our tool KmerGenie is freely available at: http://kmergenie.bx.psu.edu/.
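To make the notion of a k-mer abundance histogram concrete, the following is a minimal sketch and not KmerGenie's implementation. It counts every k-mer exactly and, optionally, keeps only a hash-selected subset. The function names, the `sample_every` parameter, and the CRC32-based selection rule are assumptions made for illustration; the abstract does not specify the sampling scheme, and reverse complements are not merged here.

```python
from collections import Counter
import zlib


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def abundance_histogram(reads, k, sample_every=1):
    """Map abundance -> number of distinct k-mers seen that many times.

    With sample_every > 1, only k-mers whose CRC32 checksum is divisible
    by sample_every are counted. This selection rule is a hypothetical
    stand-in for a sampling scheme, chosen because it is deterministic.
    """
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            if zlib.crc32(kmer.encode("ascii")) % sample_every == 0:
                counts[kmer] += 1
    # Invert the table: how many distinct k-mers share each abundance value.
    return Counter(counts.values())


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTA"]
    hist = abundance_histogram(reads, k=4)
    print(dict(sorted(hist.items())))  # {1: 2, 2: 2}
```

Because a k-mer is selected by its own hash, every occurrence of a selected k-mer is counted, so the sampled histogram keeps the same shape as the exact one over a smaller set of distinct k-mers. Computing such a histogram for several candidate values of k is the kind of input a k-selection heuristic could then compare.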

Original language: English (US)
Pages (from-to): 31-37
Number of pages: 7
Journal: Bioinformatics
Volume: 30
Issue number: 1
DOIs
State: Published - Jan 1 2014

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics
