TY - GEN
T1 - De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm
AU - Sahlin, Kristoffer
AU - Medvedev, Paul
N1 - Funding Information:
Acknowledgements. This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM.
Publisher Copyright:
© 2019, Springer Nature Switzerland AG.
PY - 2019
Y1 - 2019
N2 - Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches in terms of overall accuracy, scalability to large datasets, or both. Our tool is available at https://github.com/ksahlin/isONclust.
UR - http://www.scopus.com/inward/record.url?scp=85065538920&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85065538920&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-17083-7_14
DO - 10.1007/978-3-030-17083-7_14
M3 - Conference contribution
AN - SCOPUS:85065538920
SN - 9783030170820
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 227
EP - 242
BT - Research in Computational Molecular Biology - 23rd Annual International Conference, RECOMB 2019, Proceedings
A2 - Cowen, Lenore J.
PB - Springer Verlag
T2 - 23rd International Conference on Research in Computational Molecular Biology, RECOMB 2019
Y2 - 5 May 2019 through 8 May 2019
ER -