TY - GEN
T1 - The evolution of a crawling strategy for an academic document search engine
T2 - 4th Annual ACM Web Science Conference, WebSci 2012
AU - Wu, Jian
AU - Teregowda, Pradeep
AU - Ramírez, Juan Pablo Fernández
AU - Mitra, Prasenjit
AU - Zheng, Shuyi
AU - Giles, C. Lee
PY - 2012
Y1 - 2012
N2 - We present a preliminary study of the evolution of a crawling strategy for an academic document search engine, in particular CiteSeerX. CiteSeerX actively crawls the web for academic and research documents, primarily in computer and information sciences, and then performs unique information extraction and indexing, extracting information such as OAI metadata, citations, and tables. As such, CiteSeerX could be considered a specialty or vertical search engine. To improve precision in resources expended, we replace a blacklist with a whitelist and compare the crawling efficiencies before and after this change. With a blacklist, the crawl is forbidden from a certain list of URLs, such as publisher domains, but is otherwise unlimited. With a whitelist, only certain domains are considered and others are not crawled. The whitelist is generated based on domain ranking scores of approximately five million parent URLs harvested by the CiteSeerX crawler over the past four years. We calculate the F1 scores for each domain by applying equal weights to document numbers and citation rates. The whitelist is then generated by re-ordering parent URLs based on their domain ranking scores. We found that crawling the whitelist significantly increases the crawl precision by reducing the number of irrelevant requests and downloads.
UR - http://www.scopus.com/inward/record.url?scp=84869071720&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84869071720&partnerID=8YFLogxK
U2 - 10.1145/2380718.2380762
DO - 10.1145/2380718.2380762
M3 - Conference contribution
AN - SCOPUS:84869071720
SN - 9781450312288
T3 - Proceedings of the 4th Annual ACM Web Science Conference, WebSci'12
SP - 340
EP - 343
BT - Proceedings of the 4th Annual ACM Web Science Conference, WebSci'12
PB - Association for Computing Machinery
Y2 - 22 June 2012 through 24 June 2012
ER -