Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation

19 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: reinforcement learning, maximum entropy rl, entropy regularization, policy optimization, generalization
TL;DR: We present a practical on-policy MaxEnt RL algorithm.
Abstract: Entropy regularisation is a widely adopted technique that enhances the performance and stability of policy optimisation. Many practical on-policy methods add an entropy regularisation term to the policy gradient, thereby maximising policy entropy at visited states. Another form of entropy regularisation, maximum entropy reinforcement learning (MaxEnt RL), augments the standard objective with an entropy term, aiming to maximise both the cumulative reward and the entropy of the trajectories induced by a policy. However, despite its empirical and theoretical successes, MaxEnt RL remains relatively underexplored in on-policy actor-critic settings. In this work, we propose an on-policy actor-critic algorithm based on the MaxEnt RL framework. A key aspect of our approach is separating the entropy objective from the MaxEnt RL objective. This delineation allows us to introduce an additional critic for the entropy objective alongside the conventional value critic. It also offers finer control over the optimisation process: a discount factor applied specifically to the entropy term provides a distinct way to balance the original and entropy objectives. Our empirical evaluations demonstrate that extending Proximal Policy Optimisation (PPO) with the proposed method in place of its entropy regularisation significantly improves PPO's performance on continuous control tasks and across 16 Procgen environments. The results also underline MaxEnt RL's capacity to enhance generalisation.
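The abstract describes two critics, a conventional value critic and a separate entropy critic, each with its own discount factor, whose advantage estimates are combined for the policy update. The following is a minimal sketch of one plausible reading of that structure; the helper `gae`, the entropy discount `gamma_H`, the temperature `tau`, and all numbers are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gae(rewards, values, gamma, lam=0.95):
    """Generalised advantage estimation over a single trajectory.

    `values` has length len(rewards) + 1 (bootstrap value appended).
    """
    advantages = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

# Hypothetical rollout: per-step task rewards and policy entropies H(pi(.|s_t)),
# the latter treated as an auxiliary reward signal for the entropy critic.
rewards   = np.array([0.0, 0.0, 1.0])
entropies = np.array([1.2, 0.9, 0.7])
v_reward  = np.array([0.5, 0.6, 0.8, 0.0])   # conventional value critic (+ bootstrap)
v_entropy = np.array([2.0, 1.5, 0.8, 0.0])   # additional entropy critic (+ bootstrap)

# A separate discount for the entropy objective (assumed names/values).
gamma, gamma_H, tau = 0.99, 0.95, 0.01

a_reward  = gae(rewards, v_reward, gamma)
a_entropy = gae(entropies, v_entropy, gamma_H)

# Combined advantage for a PPO-style policy-gradient update.
a_total = a_reward + tau * a_entropy
print(a_total)
```

Estimating the entropy advantage with its own critic and discount, rather than adding an entropy bonus directly to the gradient, is what would let `gamma_H` and `tau` trade off the original and entropy objectives independently, as the abstract suggests.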
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1554