TY - GEN
T1 - Towards Off-Policy Learning for Ranking Policies with Logged Feedback
AU - Xiao, Teng
AU - Wang, Suhang
N1 - Publisher Copyright:
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2022/6/30
Y1 - 2022/6/30
N2 - Probabilistic learning to rank (LTR) has been the dominant approach for optimizing the ranking metric, but it cannot maximize long-term rewards. Reinforcement learning models have been proposed to maximize user long-term rewards by formulating the recommendation as a sequential decision-making problem, but they achieve inferior accuracy compared to LTR counterparts, primarily due to the lack of online interactions and the characteristics of ranking. In this paper, we propose a new off-policy value ranking (VR) algorithm that can simultaneously maximize user long-term rewards and optimize the ranking metric offline for improved sample efficiency in a unified Expectation-Maximization (EM) framework. We theoretically and empirically show that the EM process guides the learned policy to benefit from integrating the future reward and the ranking metric, and to learn without any online interactions. Extensive offline and online experiments demonstrate the effectiveness of our methods.
AB - Probabilistic learning to rank (LTR) has been the dominant approach for optimizing the ranking metric, but it cannot maximize long-term rewards. Reinforcement learning models have been proposed to maximize user long-term rewards by formulating the recommendation as a sequential decision-making problem, but they achieve inferior accuracy compared to LTR counterparts, primarily due to the lack of online interactions and the characteristics of ranking. In this paper, we propose a new off-policy value ranking (VR) algorithm that can simultaneously maximize user long-term rewards and optimize the ranking metric offline for improved sample efficiency in a unified Expectation-Maximization (EM) framework. We theoretically and empirically show that the EM process guides the learned policy to benefit from integrating the future reward and the ranking metric, and to learn without any online interactions. Extensive offline and online experiments demonstrate the effectiveness of our methods.
UR - http://www.scopus.com/inward/record.url?scp=85147715748&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147715748&partnerID=8YFLogxK
U2 - 10.1609/aaai.v36i8.20849
DO - 10.1609/aaai.v36i8.20849
M3 - Conference contribution
AN - SCOPUS:85147715748
T3 - Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022
SP - 8700
EP - 8707
BT - AAAI-22 Technical Tracks 8
PB - Association for the Advancement of Artificial Intelligence
T2 - 36th AAAI Conference on Artificial Intelligence, AAAI 2022
Y2 - 22 February 2022 through 1 March 2022
ER -