Keywords: Offline Reinforcement Learning, Behavior Cloning, Flow Matching
Abstract: Offline reinforcement learning often relies on behavior regularization that constrains the policy to remain close to the dataset distribution.
However, such approaches fail to distinguish between high-value and low-value actions.
We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor.
The actor directs the flow policy to focus on cloning high-value actions from the dataset rather than imitating all state-action pairs indiscriminately.
In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic.
This mutual guidance enables GFP to achieve state-of-the-art performance across 129 tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks.
Primary Area: reinforcement learning
Submission Number: 20125
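To make the described coupling concrete, below is a minimal, hypothetical PyTorch sketch of how such mutual guidance could be wired up: a velocity-field flow policy trained with a value-weighted flow-matching loss, and a one-step actor that maximizes the critic while being distilled toward the flow policy's samples. The exponential-advantage weighting, the L2 distillation term, the Euler integration, and all network and hyperparameter choices are illustrative assumptions based only on the abstract, not GFP's actual losses.

```python
# Hypothetical sketch of the mutual-guidance idea described in the abstract.
# The weighting and distillation terms are assumptions for illustration only.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class GuidedFlowSketch(nn.Module):
    def __init__(self, state_dim, action_dim, steps=8, temp=1.0):
        super().__init__()
        self.velocity = mlp(state_dim + action_dim + 1, action_dim)  # flow policy v(s, a_t, t)
        self.actor = mlp(state_dim, action_dim)                      # one-step actor mu(s)
        self.critic = mlp(state_dim + action_dim, 1)                 # Q(s, a)
        self.action_dim, self.steps, self.temp = action_dim, steps, temp

    def flow_sample(self, s):
        # Multi-step action generation: integrate the learned velocity field
        # from Gaussian noise to an action with a simple Euler scheme.
        a = torch.randn(s.shape[0], self.action_dim, device=s.device)
        dt = 1.0 / self.steps
        for i in range(self.steps):
            t = torch.full((s.shape[0], 1), i * dt, device=s.device)
            a = a + dt * self.velocity(torch.cat([s, a, t], dim=-1))
        return a

    def flow_loss(self, s, a_data):
        # Conditional flow-matching regression toward dataset actions,
        # reweighted so high-value actions dominate (the "actor guides the
        # flow policy" direction). The exp-advantage weight is an assumption.
        t = torch.rand(s.shape[0], 1, device=s.device)
        noise = torch.randn_like(a_data)
        a_t = (1 - t) * noise + t * a_data          # linear interpolation path
        target_v = a_data - noise                   # constant velocity of that path
        pred_v = self.velocity(torch.cat([s, a_t, t], dim=-1))
        with torch.no_grad():
            adv = self.critic(torch.cat([s, a_data], dim=-1)) - \
                  self.critic(torch.cat([s, self.actor(s)], dim=-1))
            w = torch.exp(adv / self.temp).clamp(max=100.0)
        return (w * (pred_v - target_v).pow(2).sum(-1, keepdim=True)).mean()

    def actor_loss(self, s, distill_coef=1.0):
        # Maximize the critic while staying close to the multi-step flow
        # policy's samples (the "flow policy constrains the actor" direction).
        a_pi = self.actor(s)
        with torch.no_grad():
            a_flow = self.flow_sample(s)
        q = self.critic(torch.cat([s, a_pi], dim=-1))
        return -q.mean() + distill_coef * (a_pi - a_flow).pow(2).sum(-1).mean()
```

In this sketch, flow_loss implements the "actor directs the flow policy" direction and actor_loss the "flow policy constrains the actor" direction; a complete method would also train the critic (e.g., by TD learning), which is omitted here.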