Diffusion Guidance Is a Controllable Policy Improvement Operator

ICLR 2026 Conference Submission 14486 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: reinforcement learning, diffusion, guidance
Abstract: At the core of reinforcement learning is the idea of improving beyond the performance present in the data. However, scaling such systems has proven notoriously tricky. In contrast, techniques from generative modeling have proven remarkably scalable and simple to train. In this work, we combine these strengths by deriving a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, is a policy improvement operator that is trained with the simplicity of supervised learning, yet is more effective than typically used weighted policy extraction strategies. On offline RL tasks, we observe a reliable trend---increased guidance weighting leads to increased performance. Additionally, the CFGRL framework can be adapted to "directly" extract policies from offline data *without* running a full end-to-end RL algorithm, allowing us to generalize simple supervised methods (e.g. goal-conditioned behavior cloning) to further prioritize optimality, gaining performance across the board without additional cost.
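A minimal sketch of the guidance-as-policy-improvement idea described in the abstract, assuming a classifier-free-guidance-style combination of an unconditional (behavior-cloning) noise predictor and an optimality-conditioned one inside a DDPM-style sampler; the function names, the linear beta schedule, and the epsilon-prediction parameterization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sample_action(eps_uncond_fn, eps_cond_fn, w, action_dim, n_steps=50, seed=0):
    """Ancestral DDPM-style sampling of an action with classifier-free guidance.

    eps_uncond_fn(a_t, t): assumed noise predictor for the unconditional
        (behavior) diffusion policy.
    eps_cond_fn(a_t, t):   assumed noise predictor for the optimality-conditioned
        diffusion policy.
    w: guidance weight; w = 0 recovers behavior cloning, w = 1 samples the
        conditioned policy, and larger w extrapolates further toward it.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 2e-2, n_steps)      # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    a = rng.standard_normal(action_dim)           # start from Gaussian noise
    for t in reversed(range(n_steps)):
        # Classifier-free-guidance combination of the two noise predictions:
        # the guidance weight acts as the policy-improvement knob.
        eps = eps_uncond_fn(a, t) + w * (eps_cond_fn(a, t) - eps_uncond_fn(a, t))
        # Standard DDPM reverse-step mean.
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            a = a + np.sqrt(betas[t]) * rng.standard_normal(action_dim)
    return a
```

The abstract's reported trend corresponds to raising `w` in this sketch: larger guidance weights push sampled actions further from the behavior distribution toward the conditioned (improved) policy.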
Primary Area: reinforcement learning
Submission Number: 14486