Value-as-Return: A Two-Stage Framework to Align on the Optimal Score Function

05 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Diffusion Models, RL
Abstract: Reinforcement learning with diffusion models has shown strong potential, but existing approaches such as variants of Direct Preference Optimization (DPO) often rely on an inaccurate simplification: they equate trajectory likelihoods with final-state probabilities. This mismatch leads to suboptimal alignment. We address this limitation with a principled framework that leverages the optimal value function as the return for short trajectory segments. Our approach follows a two-stage procedure: (i) learning a value-distribution function to estimate segment-level returns, and (ii) applying our VRPO objective to refine the score function. We prove that, under sufficient model capacity, the resulting model is equivalent to a diffusion model trained on the tilted distribution proportional to $p(x)\exp(\eta r(x))$. Experiments on large-scale diffusion models validate our analysis and show stable and consistent improvements over prior methods.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 2221
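
To make the two-stage procedure summarized in the abstract concrete, here is a minimal, hypothetical Python sketch of the general pattern: stage 1 fits a value model that estimates segment-level returns, and stage 2 refines the score model using those value estimates, pushing it toward the tilted distribution $p(x)\exp(\eta r(x))$. This is not the paper's exact VRPO objective or architecture; the names (`ValueNet`-style MLPs, `reward`, `eta`), the toy 2-D data, and the simple value-weighted denoising loss in stage 2 are all illustrative assumptions, and a point-estimate value network stands in for the paper's value-distribution function.

```python
# Hypothetical two-stage sketch (not the paper's VRPO): value model, then
# value-weighted refinement of the score model toward p(x) * exp(eta * r(x)).
import torch
import torch.nn as nn

torch.manual_seed(0)
eta = 1.0                      # assumed reward temperature in p(x) * exp(eta * r(x))
T = 50                         # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def reward(x):                 # toy reward: prefer samples near (1, 1)
    return -((x - 1.0) ** 2).sum(dim=-1)

class MLP(nn.Module):          # shared architecture for score and value models
    def __init__(self, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, out_dim))
    def forward(self, x, t):
        t_emb = (t.float() / T).unsqueeze(-1)
        return self.net(torch.cat([x, t_emb], dim=-1))

score_net = MLP(out_dim=2)     # predicts the noise eps (score up to scaling)
value_net = MLP(out_dim=1)     # stage-1 value model: segment-level return estimate

data = torch.randn(2048, 2)    # stand-in for samples from the base model p(x)

def noisy(x0, t):
    ab = alphas_bar[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps

# ---- Stage 1: fit a value model so V(x_t, t) tracks the return of the segment ----
opt_v = torch.optim.Adam(value_net.parameters(), lr=1e-3)
for _ in range(500):
    x0 = data[torch.randint(0, len(data), (256,))]
    t = torch.randint(0, T, (256,))
    xt, _ = noisy(x0, t)
    # regression target: reward of the clean endpoint reached from x_t
    loss_v = ((value_net(xt, t).squeeze(-1) - reward(x0)) ** 2).mean()
    opt_v.zero_grad()
    loss_v.backward()
    opt_v.step()

# ---- Stage 2: refine the score model with value-weighted denoising ----
opt_s = torch.optim.Adam(score_net.parameters(), lr=1e-3)
for _ in range(500):
    x0 = data[torch.randint(0, len(data), (256,))]
    t = torch.randint(0, T, (256,))
    xt, eps = noisy(x0, t)
    with torch.no_grad():
        # exp(eta * V) weights, normalized over the batch
        w = torch.softmax(eta * value_net(xt, t).squeeze(-1), dim=0)
    loss_s = (w * ((score_net(xt, t) - eps) ** 2).sum(dim=-1)).sum()
    opt_s.zero_grad()
    loss_s.backward()
    opt_s.step()

print("stage-1 value loss:", loss_v.item(), "stage-2 weighted score loss:", loss_s.item())
```

The softmax weighting in stage 2 is just one simple way to bias training toward high-value segments; the paper's actual objective may instead compare segments through the learned value distribution in a DPO-style fashion, which this sketch does not attempt to reproduce.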