Keywords: Reinforcement Learning, Inverse Reinforcement Learning, Generative Model, Online Exploration, Imitation Learning
TL;DR: This paper proposes a framework that addresses the lack-of-exploration limitation of Flow Matching (FM) policies via Inverse Reinforcement Learning (IRL).
Abstract: Flow Matching (FM) has shown a remarkable ability to model complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite this expressiveness, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with the environment. Unfortunately, optimizing FM policies through online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a "student" policy with a simple MLP structure explore the environment and be updated online via an RL algorithm with a reward model. This reward model is associated with a "teacher" FM model that contains rich information about the expert data distribution. Furthermore, the same "teacher" FM model is used to regularize the "student" policy's behavior and stabilize policy learning. Owing to the student's simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.
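To make the teacher-student setup concrete, the following is a minimal sketch of one plausible instantiation of the update described in the abstract: an MLP "student" policy is trained to maximize a reward associated with the teacher FM model while being regularized toward the teacher's actions. The names teacher_sample, reward_fn, and the squared-distance regularizer are assumptions for illustration, not the paper's actual formulation.

import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Simple MLP Gaussian policy (the 'student')."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        h = self.net(obs)
        return self.mu(h), self.log_std.exp()

def student_update(policy, optimizer, obs, reward_fn, teacher_sample, beta=1.0):
    """One gradient step: maximize a teacher-derived reward while keeping the
    student close to the teacher FM model's actions (regularization)."""
    mu, std = policy(obs)
    dist = torch.distributions.Normal(mu, std)
    act = dist.rsample()                         # reparameterized action sample
    r = reward_fn(obs, act)                      # reward associated with the teacher FM (assumed callable)
    teacher_act = teacher_sample(obs).detach()   # teacher FM action, no gradient
    reg = ((act - teacher_act) ** 2).sum(-1)     # keep student behavior near the teacher's
    loss = (-r + beta * reg).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with placeholder stand-ins for the teacher FM sampler and reward model.
obs_dim, act_dim = 8, 2
policy = StudentPolicy(obs_dim, act_dim)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
obs = torch.randn(32, obs_dim)
dummy_reward = lambda o, a: -(a ** 2).sum(-1)              # placeholder reward model
dummy_teacher = lambda o: torch.zeros(o.shape[0], act_dim) # placeholder FM sampler
student_update(policy, opt, obs, dummy_reward, dummy_teacher)

Because the student is a plain MLP, only a single forward pass is needed per action and gradients never flow through the FM sampling chain, which is the efficiency and stability argument made in the abstract.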
Primary Area: reinforcement learning
Submission Number: 1377