TY - GEN
T1 - Measuring the web crawler ethics
AU - Giles, C. Lee
AU - Sun, Yang
AU - Councill, Isaac G.
PY - 2010
Y1 - 2010
N2 - Web crawlers are highly automated and seldom regulated manually. The diversity of crawler activities often leads to ethical problems such as spam and service attacks. In this research, quantitative models are proposed to measure the web crawler ethics based on their behaviors on web servers. We investigate and define rules to measure crawler ethics, referring to the extent to which web crawlers respect the regulations set forth in robots.txt configuration files. We propose a vector space model to represent crawler behavior and measure the ethics of web crawlers based on the behavior vectors. The results show that ethicality scores vary significantly among crawlers. Most commercial web crawlers' behaviors are ethical. However, many commercial crawlers still consistently violate or misinterpret certain robots.txt rules. We also measure the ethics of big search engine crawlers in terms of return on investment. The results show that Google has a higher score than other search engines for a US website but has a lower score than Baidu for Chinese websites.
UR - http://www.scopus.com/inward/record.url?scp=77954581912&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954581912&partnerID=8YFLogxK
U2 - 10.1145/1772690.1772824
DO - 10.1145/1772690.1772824
M3 - Conference contribution
AN - SCOPUS:77954581912
SN - 9781605587998
T3 - Proceedings of the 19th International Conference on World Wide Web, WWW '10
SP - 1101
EP - 1102
BT - Proceedings of the 19th International Conference on World Wide Web, WWW '10
T2 - 19th International World Wide Web Conference, WWW2010
Y2 - 26 April 2010 through 30 April 2010
ER -