TY - GEN
T1 - Learning from Delayed Semi-Bandit Feedback under Strong Fairness Guarantees
AU - Steiger, Juaren
AU - Li, Bin
AU - Lu, Ning
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Multi-armed bandit frameworks, including combinatorial semi-bandits and sleeping bandits, are commonly employed to model problems in communication networks and other engineering domains. In such problems, feedback to the learning agent is often delayed (e.g., communication delays in a wireless network or conversion delays in online advertising). Moreover, arms in a bandit problem often represent entities that must be treated fairly, i.e., each arm should be played at least a required fraction of the time. In contrast to the previously studied asymptotic fairness, many real-time systems require such fairness guarantees to hold even in the short term (e.g., ensuring the credibility of information flows in an industrial Internet of Things (IoT) system). To that end, we develop the Learning with Delays under Fairness (LDF) algorithm for combinatorial semi-bandit problems with sleeping arms and delayed feedback, and we prove that it guarantees strong (short-term) fairness. While previous theoretical work on bandit problems with delayed feedback typically derives instance-dependent regret bounds, this approach proves challenging when fairness is considered simultaneously. We instead derive a novel instance-independent regret bound in this setting that agrees with state-of-the-art bounds. We verify our theoretical results with extensive simulations on both synthetic and real-world datasets.
AB - Multi-armed bandit frameworks, including combinatorial semi-bandits and sleeping bandits, are commonly employed to model problems in communication networks and other engineering domains. In such problems, feedback to the learning agent is often delayed (e.g., communication delays in a wireless network or conversion delays in online advertising). Moreover, arms in a bandit problem often represent entities that must be treated fairly, i.e., each arm should be played at least a required fraction of the time. In contrast to the previously studied asymptotic fairness, many real-time systems require such fairness guarantees to hold even in the short term (e.g., ensuring the credibility of information flows in an industrial Internet of Things (IoT) system). To that end, we develop the Learning with Delays under Fairness (LDF) algorithm for combinatorial semi-bandit problems with sleeping arms and delayed feedback, and we prove that it guarantees strong (short-term) fairness. While previous theoretical work on bandit problems with delayed feedback typically derives instance-dependent regret bounds, this approach proves challenging when fairness is considered simultaneously. We instead derive a novel instance-independent regret bound in this setting that agrees with state-of-the-art bounds. We verify our theoretical results with extensive simulations on both synthetic and real-world datasets.
UR - http://www.scopus.com/inward/record.url?scp=85133227374&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85133227374&partnerID=8YFLogxK
U2 - 10.1109/INFOCOM48880.2022.9796683
DO - 10.1109/INFOCOM48880.2022.9796683
M3 - Conference contribution
AN - SCOPUS:85133227374
T3 - Proceedings - IEEE INFOCOM
SP - 1379
EP - 1388
BT - INFOCOM 2022 - IEEE Conference on Computer Communications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 41st IEEE Conference on Computer Communications, INFOCOM 2022
Y2 - 2 May 2022 through 5 May 2022
ER -