Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures

Published: 17 Jul 2025, Last Modified: 07 Oct 2025 · EWRL 2025 Poster · CC BY 4.0
Keywords: Reinforcement learning, Exploration, Maximum entropy RL, Policy gradient, Off-policy
TL;DR: We propose a maximum entropy reinforcement learning framework whose intrinsic rewards are based on the relative entropy of the distribution of future state-action pairs, yielding high-performing control policies and efficient off-policy learning.
Abstract: Maximum entropy reinforcement learning integrates exploration into policy learning by providing additional intrinsic rewards proportional to the entropy of some distribution. In this paper, we propose a novel approach in which the intrinsic reward function is the relative entropy of the discounted distribution of states and actions (or of features derived from these states and actions) visited during future time steps. This approach is motivated by two results. First, a policy maximizing the expected discounted sum of intrinsic rewards also maximizes a lower bound on the state-action value function of the decision process. Second, the distribution used in the intrinsic reward definition is the fixed point of a contraction operator. Existing algorithms can therefore be adapted to learn this fixed point off-policy and to compute the intrinsic rewards. We finally introduce an algorithm maximizing our new objective, and we show that the resulting policies have good state-action space coverage and achieve high-performance control.
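For concreteness, a minimal LaTeX sketch of one common way to write such a discounted future visitation measure and an entropy-based intrinsic reward is given below. The notation d^pi_gamma, the reference measure q, the weight lambda, and the (1 - gamma) normalization are illustrative assumptions and need not match the paper's exact definitions.

\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Sketch under assumptions: a discounted future state-action visitation
% measure and an entropy-based intrinsic reward. The reference measure q
% and the trade-off weight \lambda are hypothetical, not the paper's notation.
\begin{align*}
  % Discounted distribution of states and actions visited after (s, a):
  d^{\pi}_{\gamma}(s', a' \mid s, a)
    &= (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t}\,
       \Pr\left(s_{t} = s',\, a_{t} = a' \mid s_{0} = s,\, a_{0} = a;\, \pi\right), \\
  % Intrinsic reward as a relative entropy of this measure with respect to q
  % (with q uniform, this reduces to the entropy of the measure up to a constant):
  r_{\mathrm{int}}(s, a)
    &= -\, D_{\mathrm{KL}}\left( d^{\pi}_{\gamma}(\cdot, \cdot \mid s, a) \,\|\, q \right), \\
  % Augmented objective combining extrinsic and intrinsic rewards:
  J(\pi)
    &= \mathbb{E}_{\pi}\left[ \sum_{t = 0}^{\infty} \gamma^{t}
       \left( r(s_{t}, a_{t}) + \lambda\, r_{\mathrm{int}}(s_{t}, a_{t}) \right) \right].
\end{align*}
\end{document}

Under these assumptions, maximizing the expected discounted sum of r_int encourages each state-action pair to spread its future visitations broadly, which is consistent with the lower-bound and fixed-point results stated in the abstract.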
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Adrien_Bolland1
Track: Regular Track: unpublished work
Submission Number: 115