S$2$AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic

Safa Messaoud; Billel Mokeddem; Zhenghai Xue; Linsey Pang; Bo An; Haipeng Chen; Sanjay Chawla

S$2$AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic

Safa Messaoud, Billel Mokeddem, Zhenghai Xue, Linsey Pang, Bo An, Haipeng Chen, Sanjay Chawla

Published: 16 Jan 2024, Last Modified: 11 Mar 2024ICLR 2024 posterEveryoneRevisionsBibTeX

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Max-Entropy RL, Entropy, Energy-Based-Models

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: We derive a closed for expression of the entropy of Energy-Based-Models and use it to learn multi-modal RL policies.

Abstract: Learning expressive stochastic policies instead of deterministic ones has been proposed to achieve better stability, sample complexity and robustness. Notably, in Maximum Entropy reinforcement learning (MaxEnt RL), the policy is modeled as an expressive energy-based model (EBM) over the Q-values. However, this formulation requires the estimation of the entropy of such EBM distributions which is an open problem. To address this, previous MaxEnt RL methods either implicitly estimate the entropy, yielding high computational complexity and variance (SQL), or follow a variational inference approach that fits simplified distributions (e.g., Gaussian) for tractability (SAC). We propose Sein Soft Actor-Critic (S$^2$AC), a MaxEnt RL algorithm that learns expressive policies without compromising efficiency. S$^2$AC uses parameterized Stein Variational Gradient Descent (SVGD) as the underlying policy. At the core of S$^2$AC is a new solution to the above open challenge of entropy computation for EBMs. Our entropy formula is computationally efficient and only depends on first-order derivatives and vector products. Empirical results show that S$^2$AC yields more optimal solutions to the MaxEnt objective than SQL and SAC in the multi-goal environment, and outperforms SAC and SQL on the MuJoCo benchmark. Our code is available at: https://anonymous.4open.science/r/Stein-Soft-Actor-Critic/

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: pdf

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: reinforcement learning

Submission Number: 5997

Loading