A Survey of Flow Matching in Reinforcement Learning

23 Mar 2026 (modified: 11 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: Flow Matching (FM) has recently emerged as a principled and efficient generative modeling framework for reinforcement learning (RL), enabling expressive, multimodal policy parameterizations via deterministic probability transport. Whereas diffusion-based policies rely on stochastic denoising chains, FM samples by integrating ordinary differential equations (ODEs) defined by learned velocity fields, which can substantially reduce inference latency and simplify the incorporation of RL objectives. As research on flow-based RL rapidly accelerates across offline continuous control, online fine-tuning, and foundation model alignment, the literature has become highly fragmented. In this survey, we provide a comprehensive taxonomy of flow-matching approaches in RL. We organize the literature along two axes: the target distribution being modeled (e.g., action policies, value critics, transition dynamics) and the mechanism of RL signal integration (e.g., energy-weighted regression, flow-based policy gradients, and group relative policy optimization). We further survey emerging frontiers such as discrete and non-Euclidean action spaces, provide a systematic comparative analysis against Gaussian and diffusion baselines, and outline critical open problems. Ultimately, this survey serves as a foundational roadmap for the next generation of generative reinforcement learning and alignment.
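To make the abstract's sampling claim concrete, the following is a minimal sketch of ODE-based action sampling with a learned velocity field, using fixed-step Euler integration. The `velocity_field(x, t, obs)` interface, the step count, and the standard-normal source distribution are illustrative assumptions, not the construction of any specific surveyed method.

```python
import torch

@torch.no_grad()
def sample_action(velocity_field, obs, action_dim, num_steps=10):
    """Draw an action by integrating the flow-matching ODE
    dx/dt = v_theta(x, t | obs) from t = 0 to t = 1 with Euler steps.

    `velocity_field` is a hypothetical callable (x, t, obs) -> dx/dt;
    its interface and the step count are illustrative only.
    """
    x = torch.randn(action_dim)      # source sample at t = 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.tensor(k * dt)
        x = x + dt * velocity_field(x, t, obs)  # one deterministic Euler step
    return x                         # approximate sample from the flow policy
```

Because the transport is deterministic given the initial noise, fewer integration steps (a lower NFE) directly translate into lower per-action latency, which is the efficiency advantage over stochastic denoising chains that the abstract highlights.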
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank all three reviewers for their helpful and constructive feedback. The revised manuscript includes the following changes.

- **New 2D taxonomy table (Reviewer zYVw, Point 1).** We added Table 1 in Section 3 that explicitly crosses the two axes of our taxonomy. Rows represent what the flow models (actions, critics, dynamics, preferences) and columns represent how the RL signal enters (weighted regression, ODE/pathwise gradients, GRPO, other mechanisms). Empty cells identify under-explored combinations. An accompanying paragraph highlights gaps visible in the 2D view, for example that GRPO-style updates have not yet been applied to critic or dynamics flows.
- **Improved method descriptions (Reviewer zYVw, Point 2).** In Section 4.2 (QIPO/EFM), we added a "Why this works" paragraph between Equations 16 and 17 that states the importance-reweighting argument informally and points readers to Theorems 4.1 and 4.3 in Zhang et al. (2025b). In Section 4.5 (ReFORM), we now clearly separate the "Theoretical construction" (reflected ODE with local-time process $L_t$) from the "Practical algorithm" (the projected/reflected Euler step), stating explicitly that the latter is a numerical surrogate that preserves the support-containment invariant without exactly reproducing the continuous dynamics; an illustrative sketch of such a step appears after this list. We point readers to Eq. 9, Eq. 12, and Theorem 1 in Zhang et al. (2026).
- **Benchmark comparison tables (Reviewers zYVw and 2L9P).** We added three empirical comparison tables in the new Section 10 (Comparative Analysis). Table 5 consolidates offline RL results across D4RL Gym-Loco (9-task average), AntMaze (6-task average), and OGBench (50-task average) for Gaussian, diffusion, and flow-matching methods, reporting normalized return, NFE per action, and per-action generation time. Table 6 covers online RL results on MuJoCo Playground, DMC, and OpenAI Gym. Table 7 covers the GRPO family on text-to-image alignment (GenEval, OCR Accuracy, PickScore). All captions note these are not head-to-head reproductions. Each table is followed by a "Reading Table" paragraph that synthesizes key trends.
- **Benchmarks and evaluation protocols (Reviewer 2L9P).** We added a new Section 9 (Benchmarks and Evaluation Protocols) covering offline RL benchmarks (D4RL, OGBench), online RL benchmarks (Gymnasium MuJoCo, MuJoCo Playground, DeepMind Control Suite), alignment benchmarks (GenEval, PickScore, OCR accuracy), and evaluation metrics beyond return (NFE, wall-clock time). The section concludes with a reporting caveat articulating that fair comparisons require considering task performance, inference cost, and benchmark family simultaneously.
- **Section 7 synthesis (Reviewer zYVw, Point 4).** We reorganized Section 7 so each subsection groups papers into thematic clusters with explicit tradeoff comparisons rather than listing papers individually. Section 7.1 identifies shared design logic across discrete, combinatorial, and Riemannian extensions. Section 7.2 organizes architectural innovations along a spectrum from backbone redesign to residual correction to single-step collapse, with explicit tradeoff discussion. Section 7.3 synthesizes distributional flow critics around the tension between ODE integration cost and each method's solution. Section 7.4 frames intermediate credit assignment as the unifying problem behind alignment and trajectory-level works. A closing paragraph notes cross-cutting connections across all four subsections.
- **Reduced overlap between Sections 2.4 and 3.2 (Reviewer zYVw, Point 5).** Section 3.2 was shortened to a single paragraph that names each integration category, gives a one-sentence characterization, and cross-references the corresponding Sections 4 and 5 where the technical content is fully developed. This eliminates the redundancy with Section 2.4 while preserving parallel structure with Section 3.1.
- **Applications section (Reviewer XPfe).** We added a new Section 8 (Applications of Flow Matching in RL) that synthesizes how the surveyed methods are deployed across application domains, with separate paragraphs on robotics and visuomotor control, image and video generation alignment, speech and audio, language model reasoning, autonomous driving and traffic simulation, multi-agent coordination and goal-conditioned navigation, and transfer learning under dynamics shift.
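As referenced in the ReFORM item above, here is a minimal sketch of a projected/reflected Euler step for a box-constrained action support $[\text{low}, \text{high}]$. The reflection operator, interfaces, and box assumption are illustrative only and do not reproduce ReFORM's actual implementation; see Eq. 12 and Theorem 1 in Zhang et al. (2026) for the exact construction.

```python
import torch

def reflect_into_box(x, low, high):
    """Mirror-fold coordinates of x back into [low, high].

    Illustrative reflection for a box support; replacing this with a
    clamp gives the simpler 'projected' variant mentioned above.
    """
    span = high - low
    y = (x - low) % (2 * span)                  # fold onto one mirror period
    y = torch.where(y > span, 2 * span - y, y)  # reflect the overshoot
    return low + y

@torch.no_grad()
def reflected_euler_step(x, t, dt, velocity_field, obs, low, high):
    """One Euler step followed by reflection, so every iterate stays
    inside the support. `velocity_field(x, t, obs)` is a hypothetical
    callable returning dx/dt."""
    x_next = x + dt * velocity_field(x, t, obs)
    return reflect_into_box(x_next, low, high)
```

The point of the surrogate, as stated above, is invariant preservation rather than exactness: every iterate remains in the support even when the continuous reflected dynamics (with local-time term $L_t$) are not reproduced step for step.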
Assigned Action Editor: ~Shangtong_Zhang1
Submission Number: 8049