TL;DR: We generalize denoising score matching by reweighting the loss function, enabling efficient online RL with diffusion policies.
Abstract: Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the conventional diffusion training procedure requires samples from the target distribution, which is impossible in online RL because we cannot sample from the optimal policy. Backpropagating the policy gradient through the diffusion process incurs large computational cost and instability, making it expensive and hard to scale. To enable efficient training of diffusion policies in online RL, we generalize conventional denoising score matching by reweighting the loss function. The resulting Reweighted Score Matching (RSM) preserves the optimal solution and the low computational cost of denoising score matching, while eliminating the need to sample from the target distribution and allowing the policy to be learned by optimizing value functions. We introduce two tractable reweighted loss functions for two commonly used policy optimization problems, policy mirror descent and max-entropy policy optimization, yielding two practical algorithms: Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC). We conduct comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent diffusion-policy online RL methods on most tasks, and DPMD improves over Soft Actor-Critic by more than 120% on Humanoid and Ant.
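For a concrete picture of the reweighting idea, here is a minimal sketch in notation of our own choosing; the precise weights used by DPMD and SDAC are defined in the paper. Standard denoising score matching fits a state-conditioned score network $s_\theta$ using actions drawn from the target policy $\pi^*$:
$$\mathcal{L}_{\mathrm{DSM}}(\theta)=\mathbb{E}_{t,\;a_0\sim\pi^*(\cdot\mid s),\;a_t\sim q_t(\cdot\mid a_0)}\Big[\lambda(t)\,\big\|s_\theta(a_t,s,t)-\nabla_{a_t}\log q_t(a_t\mid a_0)\big\|^2\Big].$$
A reweighted variant of the kind described above instead draws $a_0$ from a distribution we can actually sample (e.g., the current policy or a replay buffer) and multiplies the per-sample loss by a weight $w(s,a_0)$ derived from the learned value function, for instance $w(s,a_0)\propto\exp\!\big(Q(s,a_0)/\alpha\big)$ in the max-entropy case, so that the weighted objective shares its minimizer with DSM under the target distribution while never requiring samples from $\pi^*$.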
Lay Summary: Diffusion models are a type of AI that learns to gradually turn noise into realistic data, such as images or actions, by studying many examples. They have recently been used to build decision-making systems that learn from past experience. While they work well when good examples are already available, they are hard to use when the system must learn on its own, as in online learning, because the “best” actions are not known in advance. This research introduces a more efficient training method that does not need such examples and is much cheaper to run. As a result, the two new techniques developed here, called DPMD and SDAC, learn better and faster than previous methods on many robot control tasks.
Link To Code: https://github.com/mahaitongdae/diffusion_policy_online_rl
Primary Area: Reinforcement Learning->Deep RL
Keywords: reinforcement learning, diffusion models, diffusion policy
Submission Number: 9196