TY - GEN
T1 - Learning contextual bandits in a non-stationary environment
AU - Wu, Qingyun
AU - Iyer, Naveen
AU - Wang, Hongning
N1 - Publisher Copyright:
© 2018 ACM.
PY - 2018/6/27
Y1 - 2018/6/27
N2 - Multi-armed bandit algorithms have become a reference solution for handling the explore/exploit dilemma in recommender systems and many other important real-world problems, such as display advertising. However, such algorithms usually assume a stationary reward distribution, which hardly holds in practice as users' preferences are dynamic. This inevitably leads to consistently suboptimal performance in a recommender system. In this paper, we consider the situation where the underlying reward distribution remains unchanged over (possibly short) epochs and shifts at unknown time instants. Accordingly, we propose a contextual bandit algorithm that detects possible changes of environment based on its reward estimation confidence and updates its arm-selection strategy accordingly. A rigorous upper regret bound analysis of the proposed algorithm demonstrates its learning effectiveness in such a non-trivial environment. Extensive empirical evaluations on both synthetic and real-world recommendation datasets confirm its practical utility in a changing environment.
AB - Multi-armed bandit algorithms have become a reference solution for handling the explore/exploit dilemma in recommender systems and many other important real-world problems, such as display advertising. However, such algorithms usually assume a stationary reward distribution, which hardly holds in practice as users' preferences are dynamic. This inevitably leads to consistently suboptimal performance in a recommender system. In this paper, we consider the situation where the underlying reward distribution remains unchanged over (possibly short) epochs and shifts at unknown time instants. Accordingly, we propose a contextual bandit algorithm that detects possible changes of environment based on its reward estimation confidence and updates its arm-selection strategy accordingly. A rigorous upper regret bound analysis of the proposed algorithm demonstrates its learning effectiveness in such a non-trivial environment. Extensive empirical evaluations on both synthetic and real-world recommendation datasets confirm its practical utility in a changing environment.
UR - http://www.scopus.com/inward/record.url?scp=85051466151&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85051466151&partnerID=8YFLogxK
U2 - 10.1145/3209978.3210051
DO - 10.1145/3209978.3210051
M3 - Conference contribution
AN - SCOPUS:85051466151
T3 - 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018
SP - 495
EP - 504
BT - 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018
PB - Association for Computing Machinery, Inc
T2 - 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018
Y2 - 8 July 2018 through 12 July 2018
ER -