TY - GEN
T1 - Determining bias to search engines from robots.txt
AU - Sun, Yang
AU - Zhuang, Ziming
AU - Councill, Isaac G.
AU - Giles, C. Lee
PY - 2007/12/1
Y1 - 2007/12/1
N2 - Search engines largely rely on robots (i.e., crawlers or spiders) to collect information from the Web. Such crawling activities can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Ethical robots will follow the rules specified in robots.txt. Websites can explicitly specify an access preference for each robot by name. Such preferences may lead to a "rich get richer" situation, in which a few popular search engines ultimately dominate the Web because they have preferred access to resources that are inaccessible to others. This issue is seldom addressed, although the robots.txt convention has become a de facto standard for robot regulation and search engines have become an indispensable tool for information access. We propose a metric to evaluate the degree of bias to which specific robots are subjected. We have investigated 7,593 websites covering the education, government, news, and business domains, and collected 2,925 distinct robots.txt files. Results of content and statistical analysis of the data confirm that the robots of popular search engines and information portals, such as Google, Yahoo, and MSN, are generally favored by most of the websites we sampled. The results also show a strong correlation between search engine market share and the bias toward particular search engine robots.
UR - http://www.scopus.com/inward/record.url?scp=48349091114&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=48349091114&partnerID=8YFLogxK
DO - 10.1109/WI.2007.45
M3 - Conference contribution
AN - SCOPUS:48349091114
SN - 0769530265
SN - 9780769530260
T3 - Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, WI 2007
SP - 149
EP - 155
BT - Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, WI 2007
T2 - IEEE/WIC/ACM International Conference on Web Intelligence, WI 2007
Y2 - 2 November 2007 through 5 November 2007
ER -