TY - GEN
T1 - Designing efficient sampling techniques to detect webpage updates
AU - Tan, Qingzhao
AU - Zhuang, Ziming
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2007
Y1 - 2007
AB - Due to resource constraints, Web archiving systems and search engines usually have difficulty keeping the entire local repository synchronized with the Web. We advance the state of the art in sampling-based synchronization techniques by answering a challenging question: Given a sampled webpage and its change status, which other webpages are also likely to change? We present a study of various downloading granularities and policies, and propose an adaptive model based on the update history and the popularity of the webpages. We run extensive experiments on a large dataset of approximately 300,000 webpages to demonstrate that more updated webpages are most likely to be found in the current or upper directories of the changed samples. Moreover, the adaptive strategies outperform the non-adaptive one in detecting important changes.
UR - http://www.scopus.com/inward/record.url?scp=35348878593&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35348878593&partnerID=8YFLogxK
U2 - 10.1145/1242572.1242738
DO - 10.1145/1242572.1242738
M3 - Conference contribution
AN - SCOPUS:35348878593
SN - 1595936548
SN - 9781595936547
T3 - 16th International World Wide Web Conference, WWW2007
SP - 1147
EP - 1148
BT - 16th International World Wide Web Conference, WWW2007
T2 - 16th International World Wide Web Conference, WWW2007
Y2 - 8 May 2007 through 12 May 2007
ER -