Designing clustering-based web crawling policies for search engine crawlers

Qingzhao Tan, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceedingConference contribution

8 Scopus citations

Abstract

The World Wide Web is growing and changing at an astonishing rate. Web information systems such as search engines have to keep up with the growth and change of the Web. Due to resource constraints, search engines usually have difficulties keeping the local database completely synchronized with the Web. In this paper, we study how to make good use of the limited system resource and detect as many changes as possible. Towards this goal, a crawler for the Web search engine should be able to predict the change behavior of the webpages. We propose applying clustering- based sampling approach. Specifically, we first group all the local webpages into different clusters such that each cluster contains webpages with similar change pattern. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster. Finally, we let the crawler re-visit the cluster containing webpages with higher change frequency with a higher probability. To evaluate the performance of an incremental crawler for a Web search engine, we measure both the freshness and the quality of the query results provided by the search engine. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results demonstrate that our clustering algorithm effectively clusters the pages with similar change patterns, and our solution significantly outperforms the existing methods in that it can detect more changed webpages and improve the quality of the user experience for those who query the search engine.

Original languageEnglish (US)
Title of host publicationCIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management
Pages535-544
Number of pages10
DOIs
StatePublished - 2007
Event16th ACM Conference on Information and Knowledge Management, CIKM 2007 - Lisboa, Portugal
Duration: Nov 6 2007Nov 9 2007

Publication series

NameInternational Conference on Information and Knowledge Management, Proceedings

Other

Other16th ACM Conference on Information and Knowledge Management, CIKM 2007
Country/TerritoryPortugal
CityLisboa
Period11/6/0711/9/07

All Science Journal Classification (ASJC) codes

  • General Business, Management and Accounting
  • General Decision Sciences

Fingerprint

Dive into the research topics of 'Designing clustering-based web crawling policies for search engine crawlers'. Together they form a unique fingerprint.

Cite this