Maximum-Entropy Exploration with Future State-Action Visitation Measures

Adrien Bolland; Gaspard Lambrechts; Damien Ernst

13 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. Readers: Everyone. License: CC BY 4.0

Keywords: Reinforcement Learning, Maximum Entropy RL, Exploration

TL;DR: We propose maximum-entropy exploration using intrinsic rewards proportional to the entropy of the discounted distribution of future features; we compare to existing approaches and discuss behaviors of exploration policies.

Abstract: Maximum-entropy reinforcement learning motivates agents to explore states and actions by providing intrinsic rewards proportional to the entropy of some distribution. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that this new objective is a lower bound on the standard objective, which provides intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during full trajectories, i.e., starting from initial states. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator. The intrinsic reward can therefore be computed off-policy. We quantify and compare the exploration effectiveness of different maximum-entropy objectives. Experiments show that the new objective leads to feature exploration competitive with that of the alternative methods. In expectation over trajectories, features are typically visited less often, as suggested by the lower bound, but over individual trajectories, features are visited more often than with the competing approaches. All methods lead to similar control performance on the considered benchmarks.
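
To make the fixed-point claim concrete, here is a minimal tabular sketch. It is not the authors' implementation and rests on several assumptions: the features are the state-action pairs themselves, the future visitation measure starts at the next time step, and the operator takes the form T(D) = (1 - gamma) M + gamma M D, where M is the state-action transition kernel induced by a policy. All function names and the random MDP are hypothetical. The sketch uses a known model, whereas the off-policy computation mentioned in the abstract would rely on samples.

```python
import math
import random


def state_action_kernel(P, pi):
    """Build M with M[s*A + a][s2*A + a2] = P[s][a][s2] * pi[s2][a2]."""
    S, A = len(P), len(P[0])
    M = [[0.0] * (S * A) for _ in range(S * A)]
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                for a2 in range(A):
                    M[s * A + a][s2 * A + a2] = P[s][a][s2] * pi[s2][a2]
    return M


def apply_operator(D, M, gamma):
    """Assumed operator: T(D) = (1 - gamma) * M + gamma * (M @ D)."""
    n = len(M)
    return [
        [(1 - gamma) * M[i][j]
         + gamma * sum(M[i][k] * D[k][j] for k in range(n))
         for j in range(n)]
        for i in range(n)
    ]


def future_visitation(M, gamma, tol=1e-10, max_iter=10_000):
    """Iterate T from a uniform guess until successive iterates agree."""
    n = len(M)
    D = [[1.0 / n] * n for _ in range(n)]
    for _ in range(max_iter):
        D_next = apply_operator(D, M, gamma)
        gap = max(abs(D_next[i][j] - D[i][j])
                  for i in range(n) for j in range(n))
        D = D_next
        if gap < tol:
            break
    return D


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def random_simplex(rng, k):
    """Hypothetical helper: a random probability vector of length k."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]


# Small random MDP and policy, for illustration only.
rng = random.Random(0)
S, A, gamma = 4, 2, 0.9
P = [[random_simplex(rng, S) for _ in range(A)] for _ in range(S)]
pi = [random_simplex(rng, A) for _ in range(S)]

M = state_action_kernel(P, pi)
D = future_visitation(M, gamma)

# One intrinsic reward per state-action pair: the entropy of its
# discounted distribution over future state-action pairs.
intrinsic_reward = [entropy(row) for row in D]
```

Because every row of M sums to one, the assumed operator shrinks the largest entrywise gap between two candidates by at least a factor gamma, which is why the iteration converges from any starting guess.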

Primary Area: reinforcement learning

Submission Number: 4736
