PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning

ICLR 2026 Conference Submission 5095 Authors

14 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reinforcement Learning, Continuous Normalizing Flow, Entropy Regularization, Proximal Policy Optimization, Multimodal Policy
TL;DR: We present PolicyFlow, an on-policy reinforcement learning algorithm that unites continuous normalizing flows with PPO-style optimization and a novel Brownian entropy regularizer for expressive and stable multimodal policies.
Abstract: Among on-policy reinforcement learning algorithms, Proximal Policy Optimization (PPO) is widely favored for its simplicity, numerical stability, and strong empirical performance. Standard PPO relies on surrogate objectives defined via importance ratios, which require evaluating policy likelihoods; this is typically straightforward when the policy is modeled as a Gaussian distribution. However, extending PPO to more expressive, high-capacity policy models such as continuous normalizing flows (CNFs), also known as flow-matching models, is challenging because likelihood evaluation along the full flow trajectory is computationally expensive and often numerically unstable. To resolve this issue, we propose PolicyFlow, a novel on-policy CNF-based reinforcement learning algorithm that integrates expressive CNF policies with PPO-style objectives without requiring likelihood evaluation along the full flow path. PolicyFlow approximates importance ratios using velocity field variations along a simple interpolation path, reducing computational overhead without compromising training stability. To prevent mode collapse and further encourage diverse behaviors, we propose the Brownian Regularizer, an implicit policy entropy regularizer inspired by Brownian motion, which is conceptually elegant and computationally lightweight. Experiments on diverse tasks across various environments, including MultiGoal, PointMaze, IsaacLab, and MuJoCo Playground, show that PolicyFlow achieves competitive or superior performance compared to PPO using Gaussian policies and flow-based baselines including FPO and DPPO. Notably, results on MultiGoal highlight PolicyFlow's ability to capture richer multimodal action distributions.
Primary Area: reinforcement learning
Submission Number: 5095
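For readers unfamiliar with flow-matching policies, the sketch below illustrates the two high-level ideas the abstract describes: sampling an action from a CNF policy by integrating a learned velocity field, and forming a PPO-style importance ratio from velocity-field variations along a simple linear interpolation path rather than from exact flow likelihoods. This is not the authors' code; the network architecture, the straight-line path, and the exact ratio formula are assumptions made for illustration, and the paper's actual estimator and Brownian Regularizer may differ.

```python
# Illustrative sketch only (not PolicyFlow's implementation). Assumes PyTorch.
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Small MLP v_theta(x_t, t, s): predicts flow velocity for state s."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x_t, t, state):
        return self.net(torch.cat([x_t, t, state], dim=-1))


@torch.no_grad()
def sample_action(v_theta, state, action_dim, steps: int = 10):
    """Euler integration of the velocity field from noise x_0 ~ N(0, I) to an action x_1."""
    x = torch.randn(state.shape[0], action_dim)
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((state.shape[0], 1), k * dt)
        x = x + dt * v_theta(x, t, state)
    return x


def approx_ratio(v_theta, v_old, state, action, n_times: int = 4):
    """Proxy importance ratio from velocity variations on the linear path
    x_t = (1 - t) * x_0 + t * x_1 (an assumed form, not the paper's exact
    estimator). A smaller flow-matching error for the current policy than
    for the old one yields a ratio above 1, mimicking PPO's likelihood ratio."""
    x0 = torch.randn_like(action)
    loss_new, loss_old = 0.0, 0.0
    for _ in range(n_times):
        t = torch.rand(action.shape[0], 1)
        x_t = (1.0 - t) * x0 + t * action
        target_v = action - x0  # velocity of the straight interpolation path
        loss_new = loss_new + ((v_theta(x_t, t, state) - target_v) ** 2).mean(-1)
        with torch.no_grad():
            loss_old = loss_old + ((v_old(x_t, t, state) - target_v) ** 2).mean(-1)
    return torch.exp((loss_old - loss_new) / n_times)
```

In a PPO-style loop, such a ratio could be clipped and multiplied by advantages exactly as the standard surrogate objective does; the appeal noted in the abstract is that only velocity-field evaluations at interpolated points are needed, not a likelihood computed by integrating the full flow trajectory.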