Evolving Strategies for Focused Web Crawling

Judy Johnson, Kostas Tsioutsiouliklis, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

41 Scopus citations

Abstract

The rapid growth of the World Wide Web has created many challenges for general-purpose crawlers, search engines, and web directories alike, making it difficult to find, index, and classify web pages by topic. Topic-driven crawlers can complement search engines because they pre-classify the pages retrieved by the crawl. To implement such a focused crawler, a strategy for ordering the crawl frontier is required. Such a strategy can use only information gleaned from previously crawled pages to estimate the relevance of a newly observed URL. Because the best strategy for ranking URLs in the crawl frontier is not immediately apparent, we discover strategies by evolving them with a genetic algorithm. Strategies are learned by evaluating the results of crawls simulated on a database generated by a previous, more general crawl. We conclude that a rank function that combines analysis of text and link structure yields effective strategies. The evolved strategies perform better than the commonly used Best First strategy.
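The approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy page database `PAGES`, the two frontier features (best relevance among crawled parents, and number of crawled in-links), the weight encoding, and the genetic operators are all assumptions made for illustration. Setting the weights to `(1.0, 0.0)` gives a Best First-like ordering that ranks a URL by the relevance of its most relevant crawled parent.

```python
import heapq
import random

# Hypothetical stand-in for a database built by an earlier, more general crawl:
# url -> (text relevance of the page in [0, 1], outgoing links)
PAGES = {
    "seed": (1.0, ["a", "b", "c"]),
    "a": (0.9, ["d", "e"]),
    "b": (0.1, ["f"]),
    "c": (0.2, ["d", "g"]),
    "d": (0.8, []),
    "e": (0.7, []),
    "f": (0.0, []),
    "g": (0.1, []),
}


def simulate_crawl(weights, seed="seed", budget=4):
    """Simulate a crawl of `budget` pages and return the mean relevance fetched.

    A frontier URL is ranked only from pages already crawled:
    rank = w_text * (best parent relevance) + w_links * (crawled in-links).
    """
    w_text, w_links = weights
    parent_scores = {}        # url -> relevance of each crawled page linking to it
    frontier = [(0.0, seed)]  # min-heap of (-rank, url)
    visited = set()
    total, crawled = 0.0, 0
    while frontier and crawled < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:    # stale entry left by an earlier, lower rank
            continue
        visited.add(url)
        relevance, links = PAGES[url]
        total += relevance
        crawled += 1
        for link in links:
            if link in visited:
                continue
            scores = parent_scores.setdefault(link, [])
            scores.append(relevance)
            rank = w_text * max(scores) + w_links * len(scores)
            heapq.heappush(frontier, (-rank, link))
    return total / crawled if crawled else 0.0


def evolve(generations=10, pop_size=8, rng=None):
    """Evolve (w_text, w_links) with a small elitist genetic algorithm."""
    rng = rng or random.Random(0)
    population = [(rng.random(), rng.random()) for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness is the outcome of a simulated crawl under each strategy.
        population.sort(key=simulate_crawl, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            child = (mother[0], father[1])  # one-point crossover
            # Gaussian mutation, clamped so weights stay non-negative.
            child = tuple(max(0.0, g + rng.gauss(0, 0.1)) for g in child)
            children.append(child)
        population = survivors + children
    return max(population, key=simulate_crawl)


if __name__ == "__main__":
    print("Best First-like:", simulate_crawl((1.0, 0.0)))
    best = evolve()
    print("Evolved weights:", best, "fitness:", simulate_crawl(best))
```

Because fitness is computed against a fixed, previously crawled database, many candidate strategies can be compared without fetching any live pages. On a graph this small the evolved weights say nothing about real crawl performance; the sketch only shows how the pieces fit together.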

Original language: English (US)
Title of host publication: Proceedings, Twentieth International Conference on Machine Learning
Editors: T. Fawcett, N. Mishra
Pages: 298-305
Number of pages: 8
State: Published - 2003
Event: Proceedings, Twentieth International Conference on Machine Learning - Washington, DC, United States
Duration: Aug 21 2003 – Aug 24 2003

Publication series

Name: Proceedings, Twentieth International Conference on Machine Learning
Volume: 1

Other

Other: Proceedings, Twentieth International Conference on Machine Learning
Country/Territory: United States
City: Washington, DC
Period: 8/21/03 – 8/24/03

All Science Journal Classification (ASJC) codes

  • General Engineering
