Keywords: reinforcement learning, flow-matching, plasticity, offline RL, online RL, value functions
TL;DR: Understanding why flow-matching Q-functions work well in off-policy RL
Abstract: Recent work shows that flow-matching networks can be effective for value function estimation in reinforcement learning, but it remains unclear why they work well or whether flow-matching Q-functions differ fundamentally from standard critics. We show that their success is not explained by distributional RL: explicitly modeling return distributions often degrades performance. Instead, we argue that flow-matching Q-functions are effective because they couple a learned velocity field with an integration procedure that is used both during training and to read out Q-values at inference time. This coupling enables robust value prediction through \emph{test-time recovery} from imperfect intermediate estimates, in which errors are damped out as more integration steps are performed; this mechanism is absent in monolithic critics. Beyond test-time recovery, training with the integration procedure induces more \emph{plastic} representations, allowing critics to represent non-stationary future TD targets without overwriting previously learned features. We formalize these effects and validate them empirically, showing that flow-matching critics achieve over $2\times$ the performance of monolithic critics and $5$–$10\times$ higher sample efficiency in high-UTD regimes.
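To make the mechanism described above concrete, here is a minimal, hypothetical sketch (not the authors' code) of a flow-matching critic: a learned velocity field conditioned on state and action, with Euler integration used to read out Q(s, a), and a conditional flow-matching regression toward a TD target. All names and hyperparameters (FlowMatchingCritic, num_steps, the linear interpolation path) are illustrative assumptions; the abstract does not specify the exact parameterization, and in practice the TD target itself would be produced by integrating a target critic, which is how integration also enters training.

```python
import torch
import torch.nn as nn


class FlowMatchingCritic(nn.Module):
    """Hypothetical flow-matching critic: Q-values are read out by integrating a velocity field."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256, num_steps: int = 10):
        super().__init__()
        self.num_steps = num_steps
        # Velocity field v(x_t, t | s, a): input is (state, action, intermediate value x_t, time t).
        self.velocity = nn.Sequential(
            nn.Linear(state_dim + action_dim + 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def _v(self, s, a, x, t):
        return self.velocity(torch.cat([s, a, x, t], dim=-1))

    def q_value(self, s, a):
        """Read out Q(s, a) by Euler-integrating the velocity field from noise at t=0 to t=1."""
        x = torch.randn(s.shape[0], 1, device=s.device)   # x_0 ~ N(0, 1)
        dt = 1.0 / self.num_steps
        for k in range(self.num_steps):
            t = torch.full((s.shape[0], 1), k * dt, device=s.device)
            x = x + dt * self._v(s, a, x, t)               # later steps can correct earlier errors
        return x                                            # x_1 is the Q-value estimate

    def loss(self, s, a, td_target):
        """Conditional flow-matching loss with linear paths toward the (non-stationary) TD target."""
        x0 = torch.randn_like(td_target)                    # noise endpoint of the path
        t = torch.rand(td_target.shape[0], 1, device=s.device)
        x_t = (1.0 - t) * x0 + t * td_target                # point on the straight path at time t
        target_velocity = td_target - x0                    # constant velocity of the linear path
        pred = self._v(s, a, x_t, t)
        return ((pred - target_velocity) ** 2).mean()
```

In this sketch, a monolithic critic would map (s, a) to a Q-value in a single forward pass, whereas the readout above distributes the prediction over `num_steps` integration steps, which is where the test-time-recovery argument applies.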
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 97