TY - GEN
T1 - A large-scale study of robots.txt
AU - Sun, Yang
AU - Zhuang, Ziming
AU - Giles, C. Lee
N1 - Copyright 2008 Elsevier B.V., All rights reserved.
PY - 2007
Y1 - 2007
AB - Search engines largely rely on Web robots to collect information from the Web. Due to the unregulated open-access nature of the Web, robot activities are extremely diverse. Such crawling activities can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Although it is not an enforcement standard, ethical robots (and many commercial) will follow the rules specified in robots.txt. With our focused crawler, we investigate 7,593 websites from education, government, news, and business domains. Five crawls have been conducted in succession to study the temporal changes. Through statistical analysis of the data, we present a survey of the usage of Web robots rules at the Web scale. The results also show that the usage of robots.txt has increased over time.
UR - http://www.scopus.com/inward/record.url?scp=35348856355&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35348856355&partnerID=8YFLogxK
U2 - 10.1145/1242572.1242726
DO - 10.1145/1242572.1242726
M3 - Conference contribution
AN - SCOPUS:35348856355
SN - 1595936548
SN - 9781595936547
T3 - 16th International World Wide Web Conference, WWW2007
SP - 1123
EP - 1124
BT - 16th International World Wide Web Conference, WWW2007
T2 - 16th International World Wide Web Conference, WWW2007
Y2 - 8 May 2007 through 12 May 2007
ER -