TY - GEN
T1 - SearchGen
T2 - 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment
AU - Li, Huajing
AU - Lee, Wang Chien
AU - Sivasubramaniam, Anand
AU - Giles, Lee
PY - 2007
Y1 - 2007
N2 - Due to the popularity of web applications and their heavy usage, it is important to obtain a good understanding of their workloads in order to improve the performance of search services. Existing work has typically focused on generic web workloads without putting emphasis on specific domains. In this paper, we analyze the usage logs of CiteSeer, a scientific literature digital library and search engine, to characterize workloads for both robots and users. Essential ingredients that contribute to workloads are proposed. Among them, we find that access intervals show high variance and thus cannot be predicted well with time-series models. On the other hand, client visiting paths and semantics can be captured well with probabilistic models and Zipf's law. Based on these findings, we propose SearchGen, a synthetic workload generator that outputs traces for scientific literature digital libraries and search engines. A comparison between synthetic workloads and actual logged traces suggests that the synthetic workload fits well.
AB - Due to the popularity of web applications and their heavy usage, it is important to obtain a good understanding of their workloads in order to improve the performance of search services. Existing work has typically focused on generic web workloads without putting emphasis on specific domains. In this paper, we analyze the usage logs of CiteSeer, a scientific literature digital library and search engine, to characterize workloads for both robots and users. Essential ingredients that contribute to workloads are proposed. Among them, we find that access intervals show high variance and thus cannot be predicted well with time-series models. On the other hand, client visiting paths and semantics can be captured well with probabilistic models and Zipf's law. Based on these findings, we propose SearchGen, a synthetic workload generator that outputs traces for scientific literature digital libraries and search engines. A comparison between synthetic workloads and actual logged traces suggests that the synthetic workload fits well.
UR - http://www.scopus.com/inward/record.url?scp=36349033571&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=36349033571&partnerID=8YFLogxK
U2 - 10.1145/1255175.1255203
DO - 10.1145/1255175.1255203
M3 - Conference contribution
AN - SCOPUS:36349033571
SN - 1595936440
SN - 9781595936448
T3 - Proceedings of the ACM International Conference on Digital Libraries
SP - 137
EP - 146
BT - Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007
Y2 - 18 June 2007 through 23 June 2007
ER -