What Flow-Matching Brings To TD-Learning

Published: 03 Mar 2026, Last Modified: 07 Apr 2026 · ICLR 2026 DeLTa Workshop Poster · CC BY 4.0
Keywords: reinforcement learning, flow-matching, plasticity, offline RL, online RL, value functions
TL;DR: Understanding why flow-matching Q-functions work well in off-policy RL
Abstract: Recent work shows that flow-matching can be effective for value estimation in RL, but it remains unclear why these models work well or whether flow-matching Q-functions differ fundamentally from standard critics. We show that their success is not explained by distributional RL: explicitly modeling return distributions often degrades performance. Instead, we argue that flow-matching Q-functions are effective because they couple a learned velocity field with an integration procedure used both during training and to read out Q-values at inference. This coupling enables robust value prediction through \emph{test-time recovery} from imperfect intermediate estimates, where errors dampen out as more integration steps are performed. This mechanism is absent in monolithic critics. Beyond test-time recovery, training with the integration procedure induces more \emph{plastic} representations, allowing critics to represent non-stationary future TD targets without overwriting previous features. We formalize these effects and validate them empirically, showing that flow-matching critics outperform monolithic critics by over $2\times$ in performance and achieve $5$–$10\times$ higher sample efficiency in high-UTD regimes.
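The readout mechanism the abstract describes, integrating a learned velocity field to produce a Q-value, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the toy linear velocity field (a straight-line transport toward a fixed target) stands in for a learned network.

```python
def flow_matching_q_readout(velocity_fn, state, action, num_steps=10, x0=0.0):
    """Read out a Q-value by Euler-integrating a velocity field.

    Hypothetical sketch: the critic is a velocity field v(x, t | s, a)
    that transports a scalar x from a fixed start x0 at t=0 toward the
    Q-value estimate at t=1. More steps = finer integration.
    """
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, state, action)  # Euler step
    return x

# Toy stand-in for a learned field: straight-line transport toward a
# fixed target q_target = 3.0, i.e. v(x, t) = (q_target - x) / (1 - t).
# Note the contraction toward q_target at every step: perturbations to
# intermediate x shrink as integration proceeds, illustrating the
# "test-time recovery" property the abstract attributes to this readout.
q_target = 3.0
v = lambda x, t, s, a: (q_target - x) / (1.0 - t)
print(flow_matching_q_readout(v, state=None, action=None, num_steps=100))
# → 3.0
```

Even if an intermediate `x` is knocked off course (an imperfect intermediate estimate), the remaining steps pull it back toward the target, which a monolithic critic's single forward pass has no analogue of.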
Submission Number: 52