SlateFree: a Model-Free Decomposition for Reinforcement Learning with Slate ActionsDownload PDF

16 May 2022 (modified: 05 May 2023)NeurIPS 2022 SubmittedReaders: Everyone
Keywords: reinforcement learning, slate, MDP, decomposition, recommender system
TL;DR: The paper studies Reinforcement Learning with slates of N items as actions, and proves that Q-values for slates can be decomposed using Q-values for individual items.
Abstract: We consider the problem of sequential recommendations, where at each step an agent proposes some slate of $N$ distinct items to a user from a much larger catalog of size $K>>N$. The user has unknown preferences towards the recommendations and the agent takes sequential actions that optimise (in our case minimise) some action-related cost, with the help of Reinforcement Learning. The possible item combinations for a slate is $\binom{K}{N}$, an enormous number rendering value iteration methods intractable. We prove that the slate-MDP can actually be decomposed using just $K$ item-related $Q$ functions per state, which describe the problem in a more compact and efficient way. Based on this, we propose a novel model-free SARSA and Q-learning algorithm that performs $N$ parallel iterations per step, without any prior user knowledge. We call this method SlateFree, i.e. free-of-slates, and we show numerically that it converges very fast to the exact optimum for arbitrary user profiles, and that it outperforms alternatives from the literature.
14 Replies

Loading