Keywords: Offline Reinforcement Learning, Flow Policy
Abstract: Flow-matching policies have recently emerged as a powerful class of generative models for offline reinforcement learning (RL), capable of capturing complex, multi-modal action distributions from static datasets. However, standard training objectives are largely agnostic to the global properties of the generative path, permitting learned vector fields that are irregular and unstable, which can hinder performance. In this work, we introduce PDE-regularized Q-Learning (PQL), a novel algorithm that addresses this limitation by imposing a principled structure on the entire probability flow. PQL makes two synergistic contributions. First, a regularizer derived from the continuity equation, the partial differential equation (PDE) governing the probability flow, enforces global smoothness and stability on the learned vector field. Second, to solve the harder optimization problem this regularizer introduces, we propose a Beta-distributed timestep sampling strategy that focuses learning on the critical trajectory segments where the trade-off between imitation and smoothness is most acute. Through extensive experiments, we demonstrate that by structuring the generative journey, not just its destination, PQL achieves state-of-the-art performance on a wide range of challenging offline RL tasks.
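As a rough illustration of the first contribution: the abstract does not give the exact form of the regularizer, but assuming it penalizes the squared residual of the continuity equation along the learned flow, one plausible sketch (with v_theta the learned velocity field and p_t the time-marginal action density, both hypothetical notation) is:

    % Hypothetical sketch; the exact regularizer is not stated in the abstract.
    % Continuity equation the flow should satisfy:
    \partial_t p_t(a) + \nabla_a \cdot \big( p_t(a)\, v_\theta(a, t) \big) = 0
    % One plausible PDE regularizer: penalize its squared residual,
    % averaged over timesteps t and actions a drawn along the flow:
    \mathcal{R}(\theta) = \mathbb{E}_{t,\; a \sim p_t}
      \Big[ \big| \partial_t p_t(a) + \nabla_a \cdot \big( p_t(a)\, v_\theta(a, t) \big) \big|^2 \Big]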
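For the second contribution, below is a minimal sketch of Beta-distributed timestep sampling, assuming NumPy; the shape parameters (alpha, beta) and the function name sample_timesteps are illustrative assumptions, as the abstract specifies neither:

    # Hypothetical sketch of Beta-distributed timestep sampling for
    # flow-matching training; alpha and beta values are illustrative.
    import numpy as np

    def sample_timesteps(batch_size, alpha=2.0, beta=2.0, rng=None):
        """Draw flow-matching timesteps t in (0, 1) from a Beta distribution.

        A symmetric Beta(2, 2) concentrates mass on mid-trajectory timesteps,
        where (per the abstract) the imitation-vs-smoothness trade-off is most
        acute; skewed (alpha, beta) would instead emphasize either endpoint.
        """
        rng = np.random.default_rng() if rng is None else rng
        return rng.beta(alpha, beta, size=batch_size)

    # Usage: draw a batch of timesteps at which the flow-matching and
    # regularization losses are evaluated.
    t = sample_timesteps(256)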
Primary Area: reinforcement learning
Submission Number: 5408