TY - GEN
T1 - Designing clustering-based web crawling policies for search engine crawlers
AU - Tan, Qingzhao
AU - Mitra, Prasenjit
AU - Giles, C. Lee
N1 - Copyright 2010 Elsevier B.V., All rights reserved.
PY - 2007
Y1 - 2007
AB - The World Wide Web is growing and changing at an astonishing rate. Web information systems such as search engines have to keep up with the growth and change of the Web. Due to resource constraints, search engines usually have difficulty keeping the local database completely synchronized with the Web. In this paper, we study how to make good use of limited system resources and detect as many changes as possible. Towards this goal, a crawler for a Web search engine should be able to predict the change behavior of webpages. We propose a clustering-based sampling approach. Specifically, we first group all the local webpages into clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster. Finally, we let the crawler revisit clusters containing webpages with higher change frequency with a higher probability. To evaluate the performance of an incremental crawler for a Web search engine, we measure both the freshness and the quality of the query results provided by the search engine. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results demonstrate that our clustering algorithm effectively clusters pages with similar change patterns, and that our solution significantly outperforms the existing methods in that it can detect more changed webpages and improve the quality of the user experience for those who query the search engine.
UR - http://www.scopus.com/inward/record.url?scp=63449100809&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=63449100809&partnerID=8YFLogxK
U2 - 10.1145/1321440.1321516
DO - 10.1145/1321440.1321516
M3 - Conference contribution
AN - SCOPUS:63449100809
SN - 9781595938039
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 535
EP - 544
BT - CIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management
T2 - 16th ACM Conference on Information and Knowledge Management, CIKM 2007
Y2 - 6 November 2007 through 9 November 2007
ER -