Graph-based seed selection for web-scale crawlers

Shuyi Zheng, Pavel Dmitriev, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceedingConference contribution

13 Scopus citations

Abstract

One of the most important steps in web crawling is determining the starting points, or seed selection. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is not a trivial but very important problem. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a repository with more "good" and less "bad" pages. We propose a graph-based framework for crawler seed selection, and present several algorithms within this framework. Evaluation on real web data showed significant improvements over heuristic seed selection approaches.

Original languageEnglish (US)
Title of host publicationACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Pages1967-1970
Number of pages4
DOIs
StatePublished - 2009
EventACM 18th International Conference on Information and Knowledge Management, CIKM 2009 - Hong Kong, China
Duration: Nov 2 2009Nov 6 2009

Publication series

NameInternational Conference on Information and Knowledge Management, Proceedings

Other

OtherACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Country/TerritoryChina
CityHong Kong
Period11/2/0911/6/09

All Science Journal Classification (ASJC) codes

  • Decision Sciences(all)
  • Business, Management and Accounting(all)

Fingerprint

Dive into the research topics of 'Graph-based seed selection for web-scale crawlers'. Together they form a unique fingerprint.

Cite this