Abstract: Scaling reinforcement learning (RL) to recommender systems (RS) is promising since maximizing the expected cumulative rewards for RL agents meets the objective of RS, i.e., improving customers' long-term satisfaction. A key approach to this goal is offline RL, which aims to learn policies from logged data rather than expensive online interactions. In this paper, we propose Value Penalized Q-learning (VPQ), a novel uncertainty-based offline RL algorithm that penalizes the unstable Q-values in the regression target using uncertainty-aware weights, achieving the conservative Q-function without the need of estimating the behavior policy, suitable for RS with a large number of items. Experiments on two real-world datasets show the proposed method serves as a gain plug-in for existing RS models.
0 Replies
Loading