Keywords: reinforcement learning, flow-matching, plasticity, offline RL, online RL, value functions
TL;DR: Understanding why flow-matching Q-functions work well in off-policy RL
Abstract: Recent work shows that flow-matching networks can be effective for value function estimation in reinforcement learning, but it remains unclear why they work well or whether flow-matching Q-functions differ fundamentally from standard critics. We show that their success is not explained by distributional RL: explicitly modeling return distributions often degrades performance. Instead, we argue that flow-matching Q-functions are effective because they couple a learned velocity field with an integration procedure that is used both during training and to read out Q-values at inference time. This coupling enables robust value prediction through \emph{test-time recovery} from imperfect intermediate estimates, in which errors are damped out as more integration steps are performed; this mechanism is absent in monolithic critics. Beyond test-time recovery, training with the integration procedure induces more \emph{plastic} representations, allowing critics to represent non-stationary future TD targets without overwriting previously learned features. We formalize these effects and validate them empirically, showing that flow-matching critics achieve over $2\times$ the performance of monolithic critics and $5$–$10\times$ higher sample efficiency in high-UTD regimes.
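To make the mechanism described above concrete, here is a minimal, hypothetical sketch (not the authors' code) of a flow-matching critic: a learned velocity field conditioned on state and action, with Euler integration used to read out Q(s, a), and a conditional flow-matching regression toward a TD target. All names and hyperparameters (FlowMatchingCritic, num_steps, the linear interpolation path) are illustrative assumptions; the abstract does not specify the exact parameterization, and in practice the TD target itself would be produced by integrating a target critic, which is how integration also enters training.

```python
import torch
import torch.nn as nn


class FlowMatchingCritic(nn.Module):
    """Hypothetical flow-matching critic: Q-values are read out by integrating a velocity field."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256, num_steps: int = 10):
        super().__init__()
        self.num_steps = num_steps
        # Velocity field v(x_t, t | s, a): input is (state, action, intermediate value x_t, time t).
        self.velocity = nn.Sequential(
            nn.Linear(state_dim + action_dim + 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def _v(self, s, a, x, t):
        return self.velocity(torch.cat([s, a, x, t], dim=-1))

    def q_value(self, s, a):
        """Read out Q(s, a) by Euler-integrating the velocity field from noise at t=0 to t=1."""
        x = torch.randn(s.shape[0], 1, device=s.device)   # x_0 ~ N(0, 1)
        dt = 1.0 / self.num_steps
        for k in range(self.num_steps):
            t = torch.full((s.shape[0], 1), k * dt, device=s.device)
            x = x + dt * self._v(s, a, x, t)               # later steps can correct earlier errors
        return x                                            # x_1 is the Q-value estimate

    def loss(self, s, a, td_target):
        """Conditional flow-matching loss with linear paths toward the (non-stationary) TD target."""
        x0 = torch.randn_like(td_target)                    # noise endpoint of the path
        t = torch.rand(td_target.shape[0], 1, device=s.device)
        x_t = (1.0 - t) * x0 + t * td_target                # point on the straight path at time t
        target_velocity = td_target - x0                    # constant velocity of the linear path
        pred = self._v(s, a, x_t, t)
        return ((pred - target_velocity) ** 2).mean()
```

In this sketch, a monolithic critic would map (s, a) to a Q-value in a single forward pass, whereas the readout above distributes the prediction over `num_steps` integration steps, which is where the test-time-recovery argument applies.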
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 97