Keywords: RL post-training; Video Generation; Human Preference Alignment
Abstract: Reinforcement learning (RL) post-training aligns diffusion-based generators with human preferences, yet existing RL methods are poorly compatible with off-policy learning and few-step distilled models. These limitations are especially severe in video generation, where practical pipelines often rely on few-step distilled generators. Furthermore, because of complex spatio-temporal dynamics and high dimensionality, near-on-policy video rollouts are both expensive to collect and often imperfect. Relying on such rollouts alone can amplify artifacts and invites reward hacking. To address these issues, we propose Forward-Consistent Reward Matching (FCRM), an efficient off-policy RL framework for video generation. FCRM converts the forward denoising loss into a positive loss-induced score and formulates reward alignment as a one-step GFlowNet matching problem. The resulting residual is pointwise in clean-sample space, which naturally supports off-policy learning and few-step generators. To avoid biased gradients, we introduce a double-sampling estimator for the squared residual objective. Theoretically, minimizing the proposed matching residual bounds the KL divergence between the learned distribution and the optimal reward-tilted distribution. Experiments on standard video generation benchmarks validate FCRM across online, replay, offline, and few-step settings and show that it outperforms state-of-the-art methods.
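The abstract does not spell out the double-sampling estimator, so the following is only a generic illustration of the idea it names, not FCRM's actual objective. When a residual is itself an expectation over noise, squaring a single noisy draw estimates the squared mean plus the variance, whereas multiplying two independent draws estimates the squared mean without that extra term. The function names and the toy Gaussian residual below are hypothetical.

```python
import random


def naive_squared_mean(sample, n):
    """Biased for mean**2: averages x**2, whose expectation is mean**2 + variance."""
    return sum(sample() ** 2 for _ in range(n)) / n


def double_sample_squared_mean(sample, n):
    """Unbiased for mean**2: averages the product of two independent draws."""
    return sum(sample() * sample() for _ in range(n)) / n


# Toy stand-in for a noisy residual (assumption): mean 0.5, variance 1.0.
rng = random.Random(0)


def sample():
    return rng.gauss(0.5, 1.0)


naive = naive_squared_mean(sample, 200_000)
double = double_sample_squared_mean(sample, 200_000)
print(f"target mean^2 = 0.25, naive = {naive:.3f}, double-sample = {double:.3f}")
```

In this toy setting the naive estimate concentrates near 1.25 (squared mean plus variance), while the double-sample estimate concentrates near the target 0.25. The same independence argument is what removes the variance term from gradients of a squared-expectation loss.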
Submission Number: 126