TY - GEN
T1 - Graph-based seed selection for web-scale crawlers
AU - Zheng, Shuyi
AU - Dmitriev, Pavel
AU - Giles, C. Lee
PY - 2009
Y1 - 2009
N2 - One of the most important steps in web crawling is determining the starting points, or seed selection. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is a nontrivial and very important problem. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a repository with more "good" and fewer "bad" pages. We propose a graph-based framework for crawler seed selection, and present several algorithms within this framework. Evaluation on real web data showed significant improvements over heuristic seed selection approaches.
AB - One of the most important steps in web crawling is determining the starting points, or seed selection. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is a nontrivial and very important problem. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a repository with more "good" and fewer "bad" pages. We propose a graph-based framework for crawler seed selection, and present several algorithms within this framework. Evaluation on real web data showed significant improvements over heuristic seed selection approaches.
UR - http://www.scopus.com/inward/record.url?scp=74549193422&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=74549193422&partnerID=8YFLogxK
U2 - 10.1145/1645953.1646277
DO - 10.1145/1645953.1646277
M3 - Conference contribution
AN - SCOPUS:74549193422
SN - 9781605585123
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1967
EP - 1970
BT - ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
T2 - ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Y2 - 2 November 2009 through 6 November 2009
ER -