DIME: Diffusion-Based Maximum Entropy Reinforcement Learning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We present DIME, a method for training diffusion-based policies within the maximum entropy reinforcement learning framework.
Abstract: Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges, primarily because their marginal entropy is intractable to compute. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion-based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.
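For concreteness, here is a minimal sketch of the setup the abstract refers to, in our own notation; the symbols and the auxiliary process below are illustrative assumptions, not the paper's exact formulation. The MaxEnt-RL objective augments the return with a policy-entropy bonus,

\[
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\Big(r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big].
\]

For a diffusion policy, the action \(a = a^{0}\) is produced by a learned reverse chain \(p_\theta(a^{0:T} \mid s)\), so the marginal density \(\pi_\theta(a^{0} \mid s) = \int p_\theta(a^{0:T} \mid s)\, da^{1:T}\), and hence its entropy, is intractable. One standard variational route, assuming an auxiliary noising process \(q(a^{1:T} \mid a^{0}, s)\), lower-bounds the entropy as

\[
\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big) \;\ge\; \mathbb{E}_{p_\theta(a^{0:T} \mid s)}\!\left[\log q(a^{1:T} \mid a^{0}, s) \;-\; \log p_\theta(a^{0:T} \mid s)\right],
\]

which follows from the non-negativity of \(\mathrm{KL}\big(p_\theta(a^{1:T} \mid a^{0}, s)\,\|\,q(a^{1:T} \mid a^{0}, s)\big)\). Substituting a bound of this form into \(J(\pi)\) yields a tractable lower bound on the maximum entropy objective; the specific bound and auxiliary process used by DIME are detailed in the paper and may differ from this sketch.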
Lay Summary: Reinforcement-learning agents learn by exploring many possible actions, yet most systems still rely on simple Gaussian noise to create that exploration. This narrow choice can stunt learning on complex, high-dimensional tasks such as making a simulated dog run or a robotic hand twirl a pen. We present DIME (Diffusion-Based Maximum Entropy RL). DIME swaps the Gaussian policy for a more expressive diffusion model (the same technology behind modern image generators) and embeds it inside the maximum-entropy RL objective that explicitly rewards exploration. We derive a new mathematical lower bound that makes the normally intractable objective computable and implement a practical version that trains end-to-end with standard deep-learning tools. Across 13 demanding simulated locomotion and manipulation benchmarks, DIME shows favorable performance over other diffusion-based baselines and outperforms leading Gaussian-policy methods on 10 of the tasks.
Link To Code: https://alrhub.github.io/dime-website/
Primary Area: Reinforcement Learning->Deep RL
Keywords: Reinforcement Learning, Diffusion Models, Diffusion Based Reinforcement Learning, Maximum Entropy Reinforcement Learning
Submission Number: 12014