TY - GEN
T1 - Cost-aware cascading bandits
AU - Zhou, Ruida
AU - Gan, Chao
AU - Yang, Jing
AU - Shen, Cong
N1 - Publisher Copyright:
© 2018 International Joint Conferences on Artificial Intelligence. All right reserved.
PY - 2018
Y1 - 2018
N2 - In this paper, we propose a cost-aware cascading bandits model, a new variant of multi-armed bandits with cascading feedback, by considering the random cost of pulling arms. In each step, the learning agent chooses an ordered list of items and examines them sequentially, until certain stopping condition is satisfied. Our objective is then to maximize the expected net reward in each step, i.e., the reward obtained in each step minus the total cost incurred in examining the items, by deciding the ordered list of items, as well as when to stop examination. We study both the offline and online settings, depending on whether the state and cost statistics of the items are known beforehand. For the offline setting, we show that the Unit Cost Ranking with Threshold 1 (UCR-T1) policy is optimal. For the online setting, we propose a Cost-aware Cascading Upper Confidence Bound (CC-UCB) algorithm, and show that the cumulative regret scales in O(log T). We also provide a lower bound for all α-consistent policies, which scales in Ω(log T) and matches our upper bound. The performance of the CC-UCB algorithm is evaluated with both synthetic and real-world data.
AB - In this paper, we propose a cost-aware cascading bandits model, a new variant of multi-armed bandits with cascading feedback, by considering the random cost of pulling arms. In each step, the learning agent chooses an ordered list of items and examines them sequentially, until certain stopping condition is satisfied. Our objective is then to maximize the expected net reward in each step, i.e., the reward obtained in each step minus the total cost incurred in examining the items, by deciding the ordered list of items, as well as when to stop examination. We study both the offline and online settings, depending on whether the state and cost statistics of the items are known beforehand. For the offline setting, we show that the Unit Cost Ranking with Threshold 1 (UCR-T1) policy is optimal. For the online setting, we propose a Cost-aware Cascading Upper Confidence Bound (CC-UCB) algorithm, and show that the cumulative regret scales in O(log T). We also provide a lower bound for all α-consistent policies, which scales in Ω(log T) and matches our upper bound. The performance of the CC-UCB algorithm is evaluated with both synthetic and real-world data.
UR - http://www.scopus.com/inward/record.url?scp=85055717265&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85055717265&partnerID=8YFLogxK
U2 - 10.24963/ijcai.2018/448
DO - 10.24963/ijcai.2018/448
M3 - Conference contribution
AN - SCOPUS:85055717265
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 3228
EP - 3234
BT - Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
A2 - Lang, Jerome
PB - International Joint Conferences on Artificial Intelligence
T2 - 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
Y2 - 13 July 2018 through 19 July 2018
ER -