Keywords: reinforcement learning; recommender systems; offline evaluation; simulation; DQN; reproducibility
TL;DR: A tuned DQN ties a simple heuristic in a cleaned offline recommender simulator, revealing a performance ceiling imposed by the evaluation substrate.
Abstract: Deep Reinforcement Learning (RL) offers a promising framework for learning adaptive policies in recommender systems, particularly for the cold-start problem, where balancing precision and discovery is crucial. In this work, we provide a transparent and reproducible benchmark to investigate this challenge, training a highly optimized Deep Q-Network (DQN) agent within a high-fidelity offline simulation to learn a dynamic recommendation policy. Contrary to expectations, our final evaluation reveals that the trained agent’s performance is in a statistical tie with a simple, static heuristic baseline across a suite of key metrics, including cumulative reward and NDCG@10. However, we show that this statistical parity in outcomes masks a fundamental divergence in behavior: the heuristic employs a conservative, exploitation-heavy strategy, while the RL agent learns a radically different and more exploratory policy. We argue that this result is not a failure of the agent but rather a crucial insight into the limitations of offline evaluation. The finding provides powerful empirical evidence that the simulation environment itself can create a “performance ceiling,” lacking the fidelity to distinguish between a good policy and a potentially great one. Our work thus serves as a benchmark and cautionary tale, signaling an urgent need for the community to develop richer offline evaluation environments or to prioritize hybrid online-offline methods that bridge the gap between simulation and real-world impact.
Primary Area: reinforcement learning
Submission Number: 22636