Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

Soichiro Nishimori; Paavo Parmas; Sotetsu Koyamada; Tadashi Kozuno; Toshinori Kitamura; Shin Ishii; Yutaka Matsuo

Published: 30 Apr 2026; Last Modified: 24 Jun 2026; ICML 2026 regular; CC BY 4.0

TL;DR: An RL objective that leads to greater policy stochasticity even though it greedily maximizes rewards.

Abstract: In reinforcement learning (RL), agents benefit from exploration *only* because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with **ReMax**, an objective that evaluates a policy by the expected maximum return over $M$ samples ($M \in \mathbb{N}$), while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce **Re**Max **PPO** (**RePPO**), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration—without any explicit exploration bonuses—on the MinAtar and Craftax benchmarks. The official code is available at https://github.com/nissymori/remax-rl.
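To make the retry intuition concrete, here is a minimal toy sketch, not the paper's formal definition of ReMax or its policy gradient. It assumes a hypothetical two-armed problem in which each arm pays either 1 or 0, the payoff is unknown to the agent but fixed across retries, and the agent holds independent beliefs `b0` and `b1` that each arm pays 1. The function names, the belief values, and the grid search are all illustrative assumptions. The sketch computes the expected best outcome over `M` retries in closed form and searches for the probability `q` of picking arm 0 that maximizes it.

```python
def remax_value(q, M, b0=0.6, b1=0.5):
    """Expected best payoff over M retries in a toy two-armed problem.

    q      : probability that the policy picks arm 0 on each retry
    M      : number of retries (positive integer)
    b0, b1 : hypothetical beliefs that arm 0 / arm 1 pays 1 (otherwise 0)

    Assumption: each arm's payoff is uncertain but identical on every
    retry, so the best outcome is 1 iff at least one tried arm pays 1.
    """
    only0 = q ** M                  # every retry picked arm 0
    only1 = (1.0 - q) ** M          # every retry picked arm 1
    both = 1.0 - only0 - only1      # each arm was tried at least once
    p_any_good = 1.0 - (1.0 - b0) * (1.0 - b1)  # at least one arm pays 1
    return only0 * b0 + only1 * b1 + both * p_any_good


def best_q(M, grid=1000):
    """Grid search for the q in [0, 1] that maximizes remax_value."""
    candidates = (i / grid for i in range(grid + 1))
    return max(candidates, key=lambda q: remax_value(q, M))


if __name__ == "__main__":
    for M in (1, 2, 3):
        q = best_q(M)
        print(f"M={M}: best q={q:.3f}, value={remax_value(q, M):.4f}")
```

Under these assumed beliefs, a single attempt (`M = 1`) is best served by the greedy choice `q = 1`, while with two attempts the maximizer is `q = 0.6`, with value 0.68 against 0.6 for the greedy policy. The stochasticity comes purely from maximizing the best-of-`M` payoff, with no bonus term added.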

Lay Summary: Why do reinforcement learning agents need to try actions that do not currently seem best? We argue that exploration is useful because agents are uncertain about the environment and often have multiple chances to act. If an agent had only one chance, the rational choice would be to choose the action currently believed to be best. With repeated attempts, however, trying different actions can reveal better options and improve future decisions. We turn this intuition into a new training objective called ReMax. ReMax evaluates a policy by the best reward obtained over multiple attempts, while taking uncertainty about rewards into account. Unlike conventional exploration methods, which often add explicit exploration bonuses to the reward, ReMax encourages exploratory behavior purely through reward maximization. This makes it easier to incorporate ReMax into existing reinforcement learning pipelines. We study the properties of ReMax and show that the number of attempts controls how much the agent explores. We also develop a practical algorithm based on ReMax, called RePPO, and show that it promotes exploration and improves performance on several benchmark environments without using explicit exploration bonuses.

Originally Submitted Supplementary Material: zip

Link To Code: https://github.com/nissymori/remax-rl

Primary Area: Reinforcement Learning->Policy Search

Keywords: Exploration, Policy gradient


Submission Number: 34403
