TL;DR: We employ a diffusion model as the policy representation to fulfill the Maximum Entropy RL objective, enabling efficient exploration and bringing the policy closer to the optimal MaxEnt policy.
Abstract: The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on MuJoCo benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP.
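As background (standard MaxEnt RL notation, not text or notation taken from the submission itself), the objective referenced above augments the expected return with a policy entropy bonus weighted by a temperature \alpha:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\big].

The corresponding optimal MaxEnt policy, \pi^{*}(a \mid s) \propto \exp\big(Q^{\text{soft}}(s, a)/\alpha\big), is in general multimodal; this is the gap that a multimodal policy representation such as a diffusion model aims to close relative to a unimodal Gaussian.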
Lay Summary: The Soft Actor-Critic (SAC) algorithm is a popular method in reinforcement learning, which is a type of artificial intelligence where agents learn by trial and error. SAC helps agents learn better by encouraging them to explore different actions instead of always choosing the same ones. This is done using a technique called "maximum entropy," which promotes more varied and flexible decision-making. Traditionally, SAC uses something called a Gaussian policy to decide what actions to take. This works well in simple situations, but it struggles with more complex tasks that involve multiple goals, because it's only good at focusing on one option at a time. In our work, we introduce a new method called MaxEntDP. Instead of using the traditional approach, we use a powerful AI tool known as a diffusion model. This model can represent many different possibilities at once, helping the agent explore more efficiently and make better decisions. We test our method in simulated environments and find that it performs better than the standard approach and even matches the performance of some of the best modern methods. Our code is available at: https://github.com/diffusionyes/MaxEntDP.
Link To Code: https://github.com/diffusionyes/MaxEntDP
Primary Area: Reinforcement Learning->Deep RL
Keywords: Diffusion models, online reinforcement learning, maximum entropy reinforcement learning, soft actor-critic
Submission Number: 10645