Keywords: offline RL, consistency models, diffusion models
TL;DR: A method that makes diffusion planners dramatically faster in offline RL by distilling them into single-step consistency trajectory models that directly optimize for rewards, achieving both better performance and significant inference speedups
Abstract: Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, existing applications to decision-making either struggle with suboptimal demonstrations under behavior cloning or rely on complex concurrent training of multiple networks under the actor-critic framework. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method achieves single-step sampling while generating higher-reward action trajectories through decoupled training and noise-free reward signals. Empirical evaluations on the Gym MuJoCo, FrankaKitchen, and long-horizon planning benchmarks demonstrate that our approach achieves a $9.7\%$ improvement over the previous state-of-the-art while offering up to a $142\times$ speedup in inference time over diffusion counterparts.
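The abstract's core idea, distilling a diffusion teacher into a one-step student while folding a reward term into the distillation loss rather than relying on pure behavior cloning, can be illustrated with a toy sketch. Everything here is hypothetical (the `teacher_denoise`, `student`, and `reward` functions, the weight `lam`): a minimal stand-in for the paper's objective, not its actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_denoise(x_t, t):
    # Hypothetical teacher: one diffusion/ODE step toward the clean trajectory.
    return x_t * (1.0 - 0.1 * t)

def student(x_t, t, w):
    # Single-step consistency student, here just a scalar-parameterized map
    # so the loss landscape is easy to inspect.
    return w * x_t

def reward(traj):
    # Noise-free reward on the generated action trajectory (toy: negative L2,
    # standing in for a learned or environment-derived reward signal).
    return -np.mean(traj ** 2)

def distill_loss(w, x_t, t, lam=0.1):
    # Consistency-distillation term (match the teacher's one-step target)
    # plus a reward-maximization term weighted by lam, mirroring the idea of
    # optimizing rewards directly inside the distillation objective.
    target = teacher_denoise(x_t, t)
    pred = student(x_t, t, w)
    consistency = np.mean((pred - target) ** 2)
    return consistency - lam * reward(pred)

x = rng.normal(size=8)          # noised toy "trajectory"
loss_far = distill_loss(0.0, x, t=0.5)    # student ignores the teacher
loss_near = distill_loss(0.95, x, t=0.5)  # student matches the teacher's target
```

A student that tracks the teacher's one-step target incurs a lower combined loss (`loss_near < loss_far`); the `lam`-weighted reward term then tilts the distilled policy toward higher-reward trajectories instead of pure imitation.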
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 12390