Abstract: Reinforcement learning (RL)-based recommendation formulates the recommendation task as a Markov decision process (MDP) and trains an agent to learn the optimal recommendation policy from interaction trajectories through trial and error and reward feedback. However, most existing RL-based approaches overlook the correlation between items and the dynamics of user interests implied in temporally close interactions. Therefore, in this paper, we propose a reinforcement learning method that incorporates a “recent-k items” distribution to capture users' local preferences. Specifically, we design the output layer with two distinct branches. The “recent-k items” branch, trained with a Kullback-Leibler (KL) divergence loss, learns users' recent interests, whereas the other branch uses a one-step temporal difference (TD) error to capture long-term preferences. The proposed structure is integrated into deep Q-learning and actor-critic methods, resulting in two enhanced methods named R$k$Q and R$k$AC, respectively. Furthermore, a novel soft inter-reward is carefully designed to strengthen the proposed method, and we theoretically prove the convergence of the proposed algorithm. We conduct extensive experiments on two large real-world datasets and further analyze the influence of different action sequences and time intervals, as well as the method's ability to enhance state-of-the-art models. The experimental results demonstrate the efficacy of our proposed methods.
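To make the two-branch output structure described above concrete, the following is a minimal sketch (not the authors' implementation) of a Q-network whose output layer splits into a “recent-k items” branch trained with a KL-divergence loss and a Q-value branch trained with a one-step TD error. All names (`RkQNetwork`, `recent_k_dist`, `kl_weight`, hidden sizes) are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RkQNetwork(nn.Module):
    """Two-branch output layer: recent-k item distribution + Q-values (a sketch)."""

    def __init__(self, state_dim: int, num_items: int, hidden_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        # Branch 1: logits over items, modeling recent (local) user interests.
        self.recent_k_head = nn.Linear(hidden_dim, num_items)
        # Branch 2: Q-values over items, modeling long-term preferences.
        self.q_head = nn.Linear(hidden_dim, num_items)

    def forward(self, state):
        h = self.backbone(state)
        return self.recent_k_head(h), self.q_head(h)


def combined_loss(net, target_net, state, action, reward, next_state, done,
                  recent_k_dist, gamma=0.99, kl_weight=0.5):
    """recent_k_dist: empirical item distribution built from the user's last k
    interactions (an assumed preprocessing step)."""
    recent_logits, q_values = net(state)

    # KL divergence between the predicted item distribution and the recent-k target.
    log_pred = F.log_softmax(recent_logits, dim=-1)
    kl_loss = F.kl_div(log_pred, recent_k_dist, reduction="batchmean")

    # One-step TD error for the Q-value branch (standard DQN-style target).
    with torch.no_grad():
        _, next_q = target_net(next_state)
        td_target = reward + gamma * (1.0 - done) * next_q.max(dim=-1).values
    q_taken = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_taken, td_target)

    return td_loss + kl_weight * kl_loss
```

Under this reading, the KL term pulls the recommendation distribution toward items that are consistent with the user's last k interactions, while the TD term preserves the standard value-learning signal for long-term preferences; the weighting between the two losses is an assumed hyperparameter.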