TY - JOUR
T1 - Improving MinHash via the containment index with applications to metagenomic analysis
AU - Koslicki, David
AU - Zabeti, Hooman
N1 - Funding Information:
This work is partially supported by the NSF under the DMS CDS&E-MSS award number 1664803.
Publisher Copyright:
© 2019 Elsevier Inc.
PY - 2019/8/1
Y1 - 2019/8/1
N2 - MinHash is a probabilistic method for estimating the similarity of two sets in terms of their Jaccard index, defined as the ratio of the size of their intersection to their union. We demonstrate that this method performs best when the sets under consideration are of similar size and the performance degrades considerably when the sets are of very different size. We introduce a new and efficient approach, called the containment MinHash approach, that is more suitable for estimating the Jaccard index of sets of very different size. We accomplish this by leveraging another probabilistic method (in particular, Bloom filters) for fast membership queries. We derive bounds on the probability of estimate errors for the containment MinHash approach and show it significantly improves upon the classical MinHash approach. We also show significant improvements in terms of time and space complexity. As an application, we use this method to detect the presence/absence of organisms in a metagenomic data set, showing that it can detect the presence of very small, low abundance microorganisms.
AB - MinHash is a probabilistic method for estimating the similarity of two sets in terms of their Jaccard index, defined as the ratio of the size of their intersection to their union. We demonstrate that this method performs best when the sets under consideration are of similar size and the performance degrades considerably when the sets are of very different size. We introduce a new and efficient approach, called the containment MinHash approach, that is more suitable for estimating the Jaccard index of sets of very different size. We accomplish this by leveraging another probabilistic method (in particular, Bloom filters) for fast membership queries. We derive bounds on the probability of estimate errors for the containment MinHash approach and show it significantly improves upon the classical MinHash approach. We also show significant improvements in terms of time and space complexity. As an application, we use this method to detect the presence/absence of organisms in a metagenomic data set, showing that it can detect the presence of very small, low abundance microorganisms.
UR - http://www.scopus.com/inward/record.url?scp=85062208053&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062208053&partnerID=8YFLogxK
U2 - 10.1016/j.amc.2019.02.018
DO - 10.1016/j.amc.2019.02.018
M3 - Article
AN - SCOPUS:85062208053
SN - 0096-3003
VL - 354
SP - 206
EP - 215
JO - Applied Mathematics and Computation
JF - Applied Mathematics and Computation
ER -