TY - GEN
T1 - Evolving Strategies for Focused Web Crawling
AU - Johnson, Judy
AU - Tsioutsiouliklis, Kostas
AU - Giles, C. Lee
PY - 2003
Y1 - 2003
N2 - The rapid growth of the World Wide Web has created many challenges for general-purpose crawling, search engines, and web directories, making it difficult to find, index, and classify web pages by topic. Topic-driven crawlers can complement search engines because they pre-classify the pages retrieved by the crawl. To implement such a focused crawler, a strategy for ordering the crawl frontier is required. Such a strategy can use only information gleaned from previously crawled pages to estimate the relevance of a newly observed URL. Because the best strategy for ranking URLs in the crawl frontier is not immediately apparent, we discover strategies by evolving them using a genetic algorithm. Strategies are learned by evaluating the results of crawls simulated using a database generated by a previous, more general crawl. We conclude that a rank function that combines analysis of text and link structure yields effective strategies. The evolved strategies perform better than the commonly used Best First strategy.
AB - The rapid growth of the World Wide Web has created many challenges for general-purpose crawling, search engines, and web directories, making it difficult to find, index, and classify web pages by topic. Topic-driven crawlers can complement search engines because they pre-classify the pages retrieved by the crawl. To implement such a focused crawler, a strategy for ordering the crawl frontier is required. Such a strategy can use only information gleaned from previously crawled pages to estimate the relevance of a newly observed URL. Because the best strategy for ranking URLs in the crawl frontier is not immediately apparent, we discover strategies by evolving them using a genetic algorithm. Strategies are learned by evaluating the results of crawls simulated using a database generated by a previous, more general crawl. We conclude that a rank function that combines analysis of text and link structure yields effective strategies. The evolved strategies perform better than the commonly used Best First strategy.
UR - http://www.scopus.com/inward/record.url?scp=1942484949&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=1942484949&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:1942484949
SN - 1577351894
T3 - Proceedings, Twentieth International Conference on Machine Learning
SP - 298
EP - 305
BT - Proceedings, Twentieth International Conference on Machine Learning
A2 - Fawcett, T.
A2 - Mishra, N.
T2 - Proceedings, Twentieth International Conference on Machine Learning
Y2 - 21 August 2003 through 24 August 2003
ER -