TY - GEN
T1 - Online Learning with Diverse User Preferences
AU - Gan, Chao
AU - Yang, Jing
AU - Zhou, Ruida
AU - Shen, Cong
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/7
Y1 - 2019/7
N2 - In this paper, we investigate the impact of diverse user preferences on learning under the stochastic multi-armed bandit (MAB) framework. We aim to show that when the user preferences are sufficiently diverse and each arm is optimal for certain users, the O(log T) regret incurred by exploring the sub-optimal arms under the standard stochastic MAB setting can be reduced to a constant. Our intuition is that to achieve sub-linear regret, the number of times an optimal arm is pulled should scale linearly with time; when all arms are optimal for certain users and pulled frequently, the estimated arm statistics quickly converge to their true values, dramatically reducing the need for exploration. We cast the problem into a stochastic linear bandit model, where both user preferences and arm states are modeled as independent and identically distributed (i.i.d.) d-dimensional random vectors. After receiving a user preference vector at the beginning of each time slot, the learner pulls an arm and receives a reward equal to the inner product of the preference vector and the arm state vector. We also assume that the state of the pulled arm is revealed to the learner once it is pulled. We propose a Weighted Upper Confidence Bound (W-UCB) algorithm and show that it achieves constant regret when the user preferences are sufficiently diverse. The performance of W-UCB under general setups is also fully characterized and validated with synthetic data.
AB - In this paper, we investigate the impact of diverse user preferences on learning under the stochastic multi-armed bandit (MAB) framework. We aim to show that when the user preferences are sufficiently diverse and each arm is optimal for certain users, the O(log T) regret incurred by exploring the sub-optimal arms under the standard stochastic MAB setting can be reduced to a constant. Our intuition is that to achieve sub-linear regret, the number of times an optimal arm is pulled should scale linearly with time; when all arms are optimal for certain users and pulled frequently, the estimated arm statistics quickly converge to their true values, dramatically reducing the need for exploration. We cast the problem into a stochastic linear bandit model, where both user preferences and arm states are modeled as independent and identically distributed (i.i.d.) d-dimensional random vectors. After receiving a user preference vector at the beginning of each time slot, the learner pulls an arm and receives a reward equal to the inner product of the preference vector and the arm state vector. We also assume that the state of the pulled arm is revealed to the learner once it is pulled. We propose a Weighted Upper Confidence Bound (W-UCB) algorithm and show that it achieves constant regret when the user preferences are sufficiently diverse. The performance of W-UCB under general setups is also fully characterized and validated with synthetic data.
UR - http://www.scopus.com/inward/record.url?scp=85073150075&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073150075&partnerID=8YFLogxK
U2 - 10.1109/ISIT.2019.8849646
DO - 10.1109/ISIT.2019.8849646
M3 - Conference contribution
AN - SCOPUS:85073150075
T3 - IEEE International Symposium on Information Theory - Proceedings
SP - 2539
EP - 2543
BT - 2019 IEEE International Symposium on Information Theory, ISIT 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 IEEE International Symposium on Information Theory, ISIT 2019
Y2 - 7 July 2019 through 12 July 2019
ER -