TY - JOUR
T1 - RecoverY
T2 - K-mer-based read classification for Y-chromosome-specific sequencing and assembly
AU - Rangavittal, Samarth
AU - Harris, Robert S.
AU - Cechova, Monika
AU - Tomaszkiewicz, Marta
AU - Chikhi, Rayan
AU - Makova, Kateryna D.
AU - Medvedev, Paul
N1 - Publisher Copyright: © The Author 2017. Published by Oxford University Press. All rights reserved.
PY - 2018/4/1
Y1 - 2018/4/1
N2 - Motivation: The haploid mammalian Y chromosome is usually under-represented in genome assemblies due to high repeat content and low depth due to its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies. Results: We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternate strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection. Availability and implementation: Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY. Contact: [email protected] or [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
AB - Motivation: The haploid mammalian Y chromosome is usually under-represented in genome assemblies due to high repeat content and low depth due to its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies. Results: We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternate strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection. Availability and implementation: Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY. Contact: [email protected] or [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
UR - http://www.scopus.com/inward/record.url?scp=85045830143&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85045830143&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btx771
DO - 10.1093/bioinformatics/btx771
M3 - Article
C2 - 29194476
AN - SCOPUS:85045830143
SN - 1367-4803
VL - 34
SP - 1125
EP - 1131
JO - Bioinformatics
JF - Bioinformatics
IS - 7
ER -