Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

Submitted to ICLR 2026 · 20 Sept 2025 (modified: 11 Feb 2026) · CC BY 4.0
Keywords: Exploration, Policy gradient
Abstract: In reinforcement learning (RL), agents benefit from exploration because they repeatedly encounter the same or similar states, where trying different actions can improve performance or reduce uncertainty; otherwise, a greedy policy would be optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples ($M \in \mathbb{N}$), while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration without bonuses and outperforms entropy-regularized PPO on the MinAtar benchmark.
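The abstract does not spell out how the ReMax value is computed, but the expected maximum of $m$ i.i.d. draws from an empirical return distribution has a closed form via order statistics, which also extends naturally to non-integer $m > 0$. The sketch below is an illustrative assumption, not the paper's estimator; the function name `remax_value` is hypothetical.

```python
import numpy as np

def remax_value(returns, m):
    """Illustrative sketch (not the paper's method): expected maximum of
    m i.i.d. draws from the empirical distribution of `returns`,
    where m > 0 may be non-integer."""
    x = np.sort(np.asarray(returns, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    # Probability that the max of m draws lands on the i-th order statistic:
    # F(x_(i))^m - F(x_(i-1))^m under the empirical CDF F.
    w = (i / n) ** m - ((i - 1) / n) ** m
    return float(np.dot(w, x))
```

At $m = 1$ this reduces to the ordinary mean return (the standard RL objective), and as $m \to \infty$ it approaches the best observed return, so $m$ interpolates between average-case and optimistic evaluation, consistent with the abstract's claim that $m$ gives fine-grained control of exploration pressure.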
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 23121