Efficient Off-Policy RL for Video Generation via Forward-Consistent Reward Matching

Hongzheng Yang; Mengyang LIU; Haoxuan Wu; Kun Li; Yuzhi Zhao; Wei Liu

Published: 25 May 2026, Last Modified: 27 May 2026. Venue: DEMO 2026 Poster. License: CC BY 4.0

Keywords: RL post-training; Video Generation; Human Preference Alignment

Abstract: Reinforcement learning (RL) post-training aligns diffusion-based generators with human preferences, yet existing RL methods are poorly compatible with off-policy learning and with few-step distilled models. These limitations are especially severe in video generation, where practical pipelines often rely on few-step distilled generators. Furthermore, because of complex spatio-temporal dynamics and high dimensionality, near-on-policy video rollouts are both expensive to collect and often imperfect. Relying on such rollouts alone can amplify artifacts and is prone to reward hacking. To address these issues, we propose Forward-Consistent Reward Matching (FCRM), an efficient off-policy RL framework for video generation. FCRM converts the forward denoising loss into a positive loss-induced score and formulates reward alignment as a one-step GFlowNet matching problem. The resulting residual is pointwise in the clean-sample space, which naturally supports off-policy learning and few-step generators. To avoid biased gradients, we introduce a double-sampling estimator for the squared residual objective. Theoretically, minimizing the proposed matching residual bounds the KL divergence between the learned distribution and the optimal reward-tilted distribution. Experiments on standard video generation benchmarks validate FCRM across online, replay, offline, and few-step settings and show that it outperforms state-of-the-art methods.
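
The abstract does not give the form of the double-sampling estimator, so the following is only a generic illustration of why double sampling removes bias from a squared objective, not FCRM's actual estimator. It assumes a toy setting in which the residual can only be observed through noisy draws z with mean mu and standard deviation sigma (all names here are hypothetical). Squaring a single draw estimates mu^2 + sigma^2, whereas the product of two independent draws estimates mu^2.

```python
import numpy as np


def squared_mean_estimates(mu, sigma, n, seed=0):
    """Compare two Monte Carlo estimates of (E[z])^2 for z ~ N(mu, sigma^2).

    Toy illustration only: mu, sigma, and the Gaussian noise model are
    assumptions made for this sketch, not quantities from the paper.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.normal(mu, sigma, size=n)  # first noisy evaluation
    z2 = rng.normal(mu, sigma, size=n)  # independent second evaluation

    # Single-sample square: E[z1^2] = mu^2 + sigma^2, biased by the variance.
    naive = float(np.mean(z1 * z1))

    # Double sampling: E[z1 * z2] = E[z1] * E[z2] = mu^2, unbiased.
    double = float(np.mean(z1 * z2))
    return naive, double


naive, double = squared_mean_estimates(mu=0.5, sigma=2.0, n=200_000)
print(f"target (mu^2):        {0.5 ** 2:.3f}")
print(f"single-sample square: {naive:.3f}")
print(f"double-sample product: {double:.3f}")
```

In this toy case the single-sample estimate is inflated by the noise variance, while the two-sample product concentrates around the true squared mean, at the cost of a second independent evaluation per data point.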

Submission Number: 126
