Q-Guided Flow Q-Learning

Published: 16 Sept 2025 · Last Modified: 26 Sept 2025 · CoRL 2025 Poster · CC BY 4.0
Keywords: Offline Reinforcement Learning, Flow Matching, Generative Policies, Actor--Critic, Value Guidance
Abstract: Generative policies improve expressivity over Gaussian actors but often come with entangled training pipelines (e.g., joint actor--critic training, student--teacher distillation, or sequence-to-sequence planners). We introduce \emph{Q-Guided Flow Q-Learning (QFQL)}, an actor--critic framework where the actor is trained \emph{independently} via conditional flow matching for behavior cloning, and the critic is trained \emph{separately} via temporal-difference (TD) learning. At inference, actions are produced by integrating the flow field and adding a value-seeking correction proportional to the action-gradient of the critic, i.e., a guidance term $\beta\nabla_a Q(s,a)$. This decoupled design simplifies optimization, reduces instability from joint updates, and enables controllable trade-offs between behavioral realism and value-seeking at test time. Empirically, QFQL achieves strong performance and stable training across offline reinforcement learning (RL) tasks without auxiliary student models or policy regularizers, making it a strong candidate for offline RL.
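As a rough illustration of the test-time procedure described in the abstract, the sketch below integrates a learned flow field with Euler steps and adds a Q-gradient guidance term scaled by $\beta$ at each step. This is a minimal sketch under assumptions: the names `flow_actor`, `critic`, `action_dim`, the call signatures, and the step count are illustrative and not taken from the paper.

```python
import torch


def q_gradient(critic, state, action):
    # Compute grad_a Q(s, a) in a local autograd-enabled scope so the
    # sampler itself can run under torch.no_grad().
    with torch.enable_grad():
        action = action.detach().requires_grad_(True)
        q = critic(state, action).sum()          # assumed critic: (s, a) -> Q values
        (grad,) = torch.autograd.grad(q, action)
    return grad


@torch.no_grad()
def sample_action(flow_actor, critic, state, beta=0.1, num_steps=10):
    # Start from Gaussian noise and integrate the behavior-cloned flow field
    # from t = 0 to t = 1 with Euler steps, adding a value-seeking correction
    # proportional to grad_a Q(s, a) at every step (the beta * grad_a Q term).
    batch = state.shape[0]
    action = torch.randn(batch, flow_actor.action_dim, device=state.device)  # assumed attribute
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((batch, 1), k * dt, device=state.device)
        velocity = flow_actor(state, action, t)        # assumed actor: (s, a_t, t) -> velocity
        guidance = q_gradient(critic, state, action)   # value-seeking direction
        action = action + dt * (velocity + beta * guidance)
    return action
```

In this sketch, `beta` controls the trade-off between behavioral realism (following the flow field) and value-seeking (following the critic's action-gradient), and can be tuned at test time without retraining either network.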
Submission Number: 17