Keywords: Offline Reinforcement Learning, Behavior Cloning, Flow Matching
Abstract: Offline reinforcement learning often relies on behavior regularization that constrains the policy to remain close to the dataset distribution.
However, such approaches fail to distinguish between high-value and low-value actions.
We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor.
The actor directs the flow policy to focus on cloning high-value actions from the dataset rather than imitating all state-action pairs indiscriminately.
In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic.
This mutual guidance enables GFP to achieve state-of-the-art performance across 129 tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks.
Primary Area: reinforcement learning
Submission Number: 20125
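To make the described coupling concrete, below is a minimal, hypothetical PyTorch sketch of how such mutual guidance could be wired up: a velocity-field flow policy trained with a value-weighted flow-matching loss, and a one-step actor that maximizes the critic while being distilled toward the flow policy's samples. The exponential-advantage weighting, the L2 distillation term, the Euler integration, and all network and hyperparameter choices are illustrative assumptions based only on the abstract, not GFP's actual losses.

```python
# Hypothetical sketch of the mutual-guidance idea described in the abstract.
# The weighting and distillation terms are assumptions for illustration only.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class GuidedFlowSketch(nn.Module):
    def __init__(self, state_dim, action_dim, steps=8, temp=1.0):
        super().__init__()
        self.velocity = mlp(state_dim + action_dim + 1, action_dim)  # flow policy v(s, a_t, t)
        self.actor = mlp(state_dim, action_dim)                      # one-step actor mu(s)
        self.critic = mlp(state_dim + action_dim, 1)                 # Q(s, a)
        self.action_dim, self.steps, self.temp = action_dim, steps, temp

    def flow_sample(self, s):
        # Multi-step action generation: integrate the learned velocity field
        # from Gaussian noise to an action with a simple Euler scheme.
        a = torch.randn(s.shape[0], self.action_dim, device=s.device)
        dt = 1.0 / self.steps
        for i in range(self.steps):
            t = torch.full((s.shape[0], 1), i * dt, device=s.device)
            a = a + dt * self.velocity(torch.cat([s, a, t], dim=-1))
        return a

    def flow_loss(self, s, a_data):
        # Conditional flow-matching regression toward dataset actions,
        # reweighted so high-value actions dominate (the "actor guides the
        # flow policy" direction). The exp-advantage weight is an assumption.
        t = torch.rand(s.shape[0], 1, device=s.device)
        noise = torch.randn_like(a_data)
        a_t = (1 - t) * noise + t * a_data          # linear interpolation path
        target_v = a_data - noise                   # constant velocity of that path
        pred_v = self.velocity(torch.cat([s, a_t, t], dim=-1))
        with torch.no_grad():
            adv = self.critic(torch.cat([s, a_data], dim=-1)) - \
                  self.critic(torch.cat([s, self.actor(s)], dim=-1))
            w = torch.exp(adv / self.temp).clamp(max=100.0)
        return (w * (pred_v - target_v).pow(2).sum(-1, keepdim=True)).mean()

    def actor_loss(self, s, distill_coef=1.0):
        # Maximize the critic while staying close to the multi-step flow
        # policy's samples (the "flow policy constrains the actor" direction).
        a_pi = self.actor(s)
        with torch.no_grad():
            a_flow = self.flow_sample(s)
        q = self.critic(torch.cat([s, a_pi], dim=-1))
        return -q.mean() + distill_coef * (a_pi - a_flow).pow(2).sum(-1).mean()
```

In this sketch, flow_loss implements the "actor directs the flow policy" direction and actor_loss the "flow policy constrains the actor" direction; a complete method would also train the critic (e.g., by TD learning), which is omitted here.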