Keywords: Exploration, Policy gradient
Abstract: In reinforcement learning (RL), agents benefit from exploration because they repeatedly encounter the same or similar states, where trying different actions can improve performance or reduce uncertainty; otherwise, a greedy policy would be optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples ($M \in \mathbb{N}$), while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration without bonuses and outperforms entropy-regularized PPO on the MinAtar benchmark.
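The core quantity in the abstract, the expected maximum return over $M$ samples, can be estimated by Monte Carlo resampling of observed episode returns. The sketch below is illustrative only; the function name, the resampling scheme, and the group count are assumptions, not the paper's implementation, and the continuous relaxation to $m > 0$ is not shown.

```python
import numpy as np

def remax_objective(returns, M, num_groups=10_000, rng=None):
    """Monte Carlo estimate of the ReMax objective: the expected
    maximum return over M i.i.d. rollouts of the same policy.

    `returns` is an empirical sample of per-episode returns under the
    policy; we resample groups of M returns and average each group's max.
    """
    rng = np.random.default_rng(rng)
    groups = rng.choice(np.asarray(returns), size=(num_groups, M), replace=True)
    return groups.max(axis=1).mean()

# A policy whose returns have spread is valued more highly as M grows,
# which is the mechanism by which optimizing ReMax rewards exploration.
returns = [0.0, 1.0]                     # return is 0 or 1 with equal odds
remax_objective(returns, M=1, rng=0)     # analytically 0.5 (plain expected return)
remax_objective(returns, M=4, rng=0)     # analytically 1 - 0.5**4 = 0.9375
```

With $M = 1$ the objective reduces to the ordinary expected return, so the greedy-optimal regime mentioned in the abstract is recovered as a special case.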
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 23121