POLICY DEGENERACY IN DEEP REINFORCEMENT LEARNING FOR RECOMMENDATIONS: AN EMPIRICAL STUDY

ICLR 2026 Conference Submission 22636 Authors

20 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | CC BY 4.0
Keywords: reinforcement learning; recommender systems; offline evaluation; simulation; DQN; reproducibility
TL;DR: A tuned DQN ties a simple heuristic in a cleaned offline recommender simulator, revealing a performance ceiling imposed by the evaluation substrate.
Abstract: Deep reinforcement learning (RL) offers a promising framework for learning adaptive policies in recommender systems, particularly for the cold-start problem, where balancing precision and discovery is crucial. In this work, we provide a transparent and reproducible benchmark for investigating this challenge: we train a highly optimized Deep Q-Network (DQN) agent within a high-fidelity offline simulation to learn a dynamic recommendation policy. Contrary to expectations, our final evaluation shows that the trained agent is statistically tied with a simple, static heuristic baseline across a suite of key metrics, including cumulative reward and NDCG@10. However, this parity in outcomes masks a fundamental divergence in behavior: the heuristic employs a conservative, exploitation-heavy strategy, while the RL agent learns a radically different, more exploratory policy. We argue that this result is not a failure of the agent but an insight into the limitations of offline evaluation. The finding provides empirical evidence that the simulation environment itself can impose a "performance ceiling," lacking the fidelity to distinguish a good policy from a potentially great one. Our work thus serves as a benchmark and a cautionary tale, signaling an urgent need for the community to develop richer offline evaluation environments or to prioritize hybrid online-offline methods that bridge the gap between simulation and real-world impact.
Primary Area: reinforcement learning
Submission Number: 22636
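
The abstract names NDCG@10 as one of the metrics on which the DQN agent and the heuristic tie. For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition, not the authors' evaluation code. The function names `dcg_at_k` and `ndcg_at_k` are hypothetical, and the sketch assumes linear gain and an ideal ranking computed only from the relevance labels of the list being scored; the submission may use a different gain or ideal-ranking convention.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    # Position 0 is the top of the list, so the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG@k divided by the DCG@k of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        # Assumption: a list with no relevant items scores 0 rather than NaN.
        return 0.0
    return dcg_at_k(relevances, k) / ideal


# Hypothetical relevance labels for one recommended list, in ranked order.
example_score = ndcg_at_k([0, 1, 0, 1, 0], k=10)
```

Because the score depends only on where relevant items fall in the top k, two policies with very different recommendation behavior can reach similar NDCG@10 values, which is the kind of outcome parity the abstract describes.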