TY - GEN
T1 - Mining a search engine's corpus
T2 - 2011 ACM SIGMOD and 30th PODS 2011 Conference
AU - Zhang, Mingyang
AU - Zhang, Nan
AU - Das, Gautam
PY - 2011
Y1 - 2011
AB - Search engines over document corpora typically provide keyword-search interfaces. Examples include search engines over the web as well as those over enterprise and government websites. The corpus of such a search engine forms a rich source of information of analytical interest to third parties, but the only available access is by issuing search queries through its interface. To support data analytics over a search engine's corpus, one needs to address two main problems: the sampling of documents (for offline analytics) and the direct (online) estimation of aggregates, both while issuing only a small number of queries through the keyword-search interface. Existing work on sampling produces samples with unknown bias and may incur an extremely high query cost. Existing aggregate estimation techniques suffer from a similar problem, as the estimation error and query cost can both be large for certain aggregates. We propose novel techniques that produce unbiased samples as well as unbiased aggregate estimates with small variances, while incurring a query cost an order of magnitude smaller than that of existing techniques. We present theoretical analysis and extensive experiments to illustrate the effectiveness of our approach.
UR - http://www.scopus.com/inward/record.url?scp=79959940475&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79959940475&partnerID=8YFLogxK
U2 - 10.1145/1989323.1989406
DO - 10.1145/1989323.1989406
M3 - Conference contribution
AN - SCOPUS:79959940475
SN - 9781450306614
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 793
EP - 804
BT - Proceedings of SIGMOD 2011 and PODS 2011
PB - Association for Computing Machinery
Y2 - 12 June 2011 through 16 June 2011
ER -