Q-Guided Flow Q-Learning

Published: 16 Sept 2025, Last Modified: 17 Sept 2025
Venue: CoRL 2025 Poster
License: CC BY 4.0
Keywords: Offline Reinforcement Learning, Flow Matching, Generative Policies, Actor--Critic, Value Guidance
Abstract: Generative policies improve expressivity over Gaussian actors but often come with entangled training pipelines (e.g., joint actor--critic training, student--teacher distillation, or sequence-to-sequence planners). We introduce \emph{Q-Guided Flow Q-Learning (QFQL)}, an actor--critic framework in which the actor is trained \emph{independently} via conditional flow matching for behavior cloning, and the critic is trained \emph{independently} via temporal-difference (TD) learning. At inference, actions are produced by integrating the learned flow field and adding a value-seeking correction proportional to the action-gradient of the critic, i.e., a guidance term $\beta \nabla_a Q(s,a)$. This decoupled design simplifies optimization, reduces the instability associated with joint updates, and enables a controllable trade-off between behavior realism and value seeking at test time. Empirically, QFQL achieves strong performance and stable training across offline RL tasks without auxiliary student models or policy regularizers.
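The following is a minimal sketch, not the authors' implementation, of the guided inference step described in the abstract: Euler integration of a learned flow field from noise to an action, with a value-seeking correction $\beta \nabla_a Q(s,a)$ added at each step. The names `velocity_net` (conditional flow field), `q_net` (critic), `beta`, `num_steps`, and `action_dim` are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def sample_action(velocity_net, q_net, state, beta=0.1, num_steps=10, action_dim=4):
    """Sketch of Q-guided flow sampling: Euler-integrate the flow ODE and
    add a correction proportional to the critic's action gradient."""
    a = torch.randn(state.shape[0], action_dim)   # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((state.shape[0], 1), k * dt)
        v = velocity_net(a, t, state)             # behavior-cloned flow field (assumed signature)
        with torch.enable_grad():                 # gradient w.r.t. the action only
            a_req = a.detach().requires_grad_(True)
            q = q_net(state, a_req).sum()
            grad_q = torch.autograd.grad(q, a_req)[0]
        a = a + dt * (v + beta * grad_q)          # flow step plus value-seeking guidance
    return a
```

Setting `beta` to zero recovers plain behavior-cloned sampling, while larger values trade behavior realism for higher predicted value, matching the test-time trade-off the abstract describes.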
Submission Number: 17