Rethinking Offline Reinforcement Learning for Sequential Recommendation from A Pair-Wise Q-Learning Perspective

Runqi Yang, Liu Yu, Zhi Li, Shaohui Li, Likang Wu

Published: 2024, Last Modified: 15 Jan 2026, IJCNN 2024, CC BY-SA 4.0

Abstract: Sequential recommendation has attracted great attention in recent years. Because user interactions naturally form a sequential decision process, many researchers have explored integrating Reinforcement Learning (RL) into sequential recommendation. However, the limited exploration available in recommender systems, unlike in other RL scenarios, poses significant challenges. Offline RL, which learns from expert experiences in offline datasets, has emerged as a potential solution. Nevertheless, using offline RL to model user preferences is non-trivial because of severe model overestimation and the implicit nature of user feedback. To address these challenges, we propose a novel offline RL framework called Pair-Wise Q-Learning (PQL). Our approach comprises four components: a state encoder module that generates a unified representation of user interaction states; a conservative Q-Learning approach with double Q-Networks that reduces overestimation and enhances robustness; a dynamic negative sampling strategy based on Q-values; and a pair-wise learning module that handles implicit feedback. We jointly optimize the model using conservative Q-Learning and pair-wise learning. Experimental results on two real-world e-commerce datasets, covering clicks and purchases, demonstrate that PQL outperforms existing methods.
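To make the joint objective more concrete, the following is a minimal sketch of how a conservative Q-Learning term, a double-Q target, Q-value-based negative sampling, and a pair-wise loss could be combined for a single logged interaction. It is not the authors' implementation: the abstract does not give the exact losses, so the function names, the clipped double-Q target, the BPR-style pair-wise loss, the "highest Q-value among random candidates" sampling rule, and the weights `alpha` and `beta` are all assumptions chosen for illustration. Q-values are passed in as plain lists, standing in for the output of the state encoder and Q-networks.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_row, action):
    """Conservative term (assumed form): lowers Q over all items, raises the logged item."""
    return logsumexp(q_row) - q_row[action]


def td_loss(q_row, action, reward, next_q1, next_q2, gamma=0.5):
    """Squared TD error with a clipped double-Q target (assumed form):
    the target uses the element-wise minimum of two next-state Q estimates."""
    best_next = max(min(a, b) for a, b in zip(next_q1, next_q2))
    target = reward + gamma * best_next
    return (q_row[action] - target) ** 2


def sample_hard_negative(q_row, positive, num_candidates, rng):
    """Dynamic negative sampling (assumed rule): draw random non-interacted
    items and keep the one with the highest current Q-value."""
    pool = [i for i in range(len(q_row)) if i != positive]
    candidates = rng.sample(pool, min(num_candidates, len(pool)))
    return max(candidates, key=lambda i: q_row[i])


def pairwise_loss(q_row, positive, negative):
    """BPR-style loss -log(sigmoid(Q(s, pos) - Q(s, neg))), computed stably."""
    diff = q_row[positive] - q_row[negative]
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def joint_loss(q_row, action, reward, next_q1, next_q2, rng,
               alpha=1.0, beta=1.0, num_candidates=3):
    """Weighted sum of the three terms; alpha and beta are hypothetical weights."""
    negative = sample_hard_negative(q_row, action, num_candidates, rng)
    return (td_loss(q_row, action, reward, next_q1, next_q2)
            + alpha * cql_penalty(q_row, action)
            + beta * pairwise_loss(q_row, action, negative))


# Toy example: 4 items, the user interacted with item 1 and received reward 1.0.
rng = random.Random(0)
q_current = [1.0, 2.0, 0.5, -1.0]
q_next_a = [1.0, 3.0, 0.0, 0.0]
q_next_b = [2.0, 1.0, 0.0, 0.0]
loss = joint_loss(q_current, 1, 1.0, q_next_a, q_next_b, rng)
```

In a real model the three terms would be computed on batched tensors and back-propagated through the encoder and both Q-networks; the scalar version here only shows how the terms relate to one another.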