TY - JOUR
T1 - AllSome Sequence Bloom Trees
AU - Sun, Chen
AU - Harris, Robert S.
AU - Chikhi, Rayan
AU - Medvedev, Paul
N1 - Funding Information:
This work was supported, in part, by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to P.M.
Publisher Copyright:
© Mary Ann Liebert, Inc.
PY - 2018/5
Y1 - 2018/5
N2 - The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples that have a transcript of interest potentially expressed. In this article, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39%-85%, with a price of up to 3× memory consumption during queries. Notably, it can query a batch of 198,074 queries in <8 hours (compared with around 2 days previously) and a whole set of k-mers from a sequencing experiment (about 27 million k-mers) in <11 minutes.
AB - The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples that have a transcript of interest potentially expressed. In this article, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39%-85%, with a price of up to 3× memory consumption during queries. Notably, it can query a batch of 198,074 queries in <8 hours (compared with around 2 days previously) and a whole set of k-mers from a sequencing experiment (about 27 million k-mers) in <11 minutes.
UR - http://www.scopus.com/inward/record.url?scp=85046906307&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85046906307&partnerID=8YFLogxK
U2 - 10.1089/cmb.2017.0258
DO - 10.1089/cmb.2017.0258
M3 - Article
C2 - 29620920
AN - SCOPUS:85046906307
SN - 1066-5277
VL - 25
SP - 467
EP - 479
JO - Journal of Computational Biology
JF - Journal of Computational Biology
IS - 5
ER -