Track: tiny / short paper (up to 4 pages)
Keywords: Efficiency, Consistency Model, Probability Flow-ODE, Reinforcement Learning, Policy Gradient
TL;DR: Methods for efficiently distilling an improved policy into a policy represented by a Consistency Model
Abstract: This paper proposes an efficient consistency model (CM) training scheme tailored to the policy distillation step common in reinforcement learning (RL). Specifically, we leverage the Probability Flow ODE (PF-ODE) and introduce two novel training objectives designed to improve CM training efficiency when target-policy log-probabilities are available for a limited set of reference actions. We propose Importance Weighting (IW) and Gumbel-Based Sampling (GBS) as strategies to refine the learning signal under such limited sampling budgets. Our approach enables more efficient training by directly incorporating target probability estimates, with the aim of reducing variance and improving sample efficiency compared to standard CM training that relies solely on samples. Numerical experiments in a controlled setting demonstrate that our proposed methods, particularly IW, outperform conventional CM training, achieving more accurate policy representations with limited reference data.
These findings highlight the potential of CMs, trained with our proposed objectives, as an efficient alternative for the policy distillation component of RL algorithms.
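To make the Importance Weighting idea concrete, below is a minimal sketch, not the paper's implementation, of how per-sample consistency losses over reference actions could be reweighted by self-normalized importance weights computed from the available target log-probabilities. The names (`ConsistencyPolicy`, `iw_consistency_loss`), the EMA teacher, and the coupled-noise shortcut used in place of an actual PF-ODE solver step are illustrative assumptions.

```python
# Hypothetical sketch: importance-weighted consistency training for policy distillation.
# All module and function names are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class ConsistencyPolicy(nn.Module):
    """Toy consistency model f_theta(a_sigma, sigma, s): maps a noised action toward a clean action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noised_action, sigma, state):
        # Simple skip-connection parameterization so that f(a, sigma -> 0) stays close to a.
        h = torch.cat([noised_action, sigma, state], dim=-1)
        return noised_action + sigma * self.net(h)

def iw_consistency_loss(model, ema_model, states, ref_actions,
                        target_logp, ref_logp, sigmas):
    """
    Importance-weighted consistency loss (sketch).

    ref_actions are drawn from a reference policy q(.|s); target_logp holds
    log pi_target(a|s) and ref_logp holds log q(a|s) for those same actions.
    Self-normalized importance weights re-focus the consistency loss on
    actions favored by the target policy.
    """
    # Self-normalized importance weights w_i proportional to pi_target(a_i|s_i) / q(a_i|s_i).
    log_w = target_logp - ref_logp
    w = torch.softmax(log_w, dim=0).detach()

    # Pick a random pair of adjacent noise levels from the discretized PF-ODE schedule.
    idx = torch.randint(0, len(sigmas) - 1, (ref_actions.shape[0],))
    sig_lo, sig_hi = sigmas[idx, None], sigmas[idx + 1, None]

    # Couple both noise levels through the same Gaussian draw; this is a cheap
    # stand-in for taking an actual PF-ODE solver step between the two levels.
    noise = torch.randn_like(ref_actions)
    a_hi = ref_actions + sig_hi * noise   # heavier corruption
    a_lo = ref_actions + sig_lo * noise   # lighter corruption

    pred_hi = model(a_hi, sig_hi, states)
    with torch.no_grad():
        pred_lo = ema_model(a_lo, sig_lo, states)  # EMA "teacher" target

    per_sample = ((pred_hi - pred_lo) ** 2).mean(dim=-1)
    return (w * per_sample).sum()  # weighted instead of uniform average
```

In this sketch, standard consistency training corresponds to replacing the weights with a uniform average over the reference actions; the importance weights are the only place where the target log-probabilities enter the objective.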
Submission Number: 108