Unifying Reinforcement Learning and Distillation via Distribution Matching for Video Generation

18 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: video generation, reinforcement learning, distillation, generative model
TL;DR: We unify RL alignment and distillation via distribution matching for efficient, high-quality video generation in one stage.
Abstract: Reinforcement learning (RL) for aligning visual generative models faces two challenges: (1) reward evaluation typically requires generating samples via extensive multi-step sampling (20-40 steps), and (2) existing GRPO-based methods require complex conversions to adapt the ODE-based sampling of flow matching to a Markov Decision Process formulation. While distillation techniques enable few-step generation (e.g., 4 steps for a video), applying RL after distillation often leads to model collapse, while the conventional RL-before-distillation workflow incurs prohibitive computational cost. We address these limitations with a simple yet efficient unified framework that jointly optimizes alignment and distillation in a single stage. Inspired by Distribution Matching Distillation (DMD), our approach implements alignment directly via distribution matching (DM) through two novel losses: a DPO-inspired DM-PairLoss and a GRPO-inspired DM-GroupLoss. This eliminates the need for reverse-SDE conversions while enabling direct reward evaluation on few-step generations. Comprehensive experiments on the Wan 2.1 text-to-video model demonstrate that our unified approach preserves distillation quality while achieving better human preference alignment, outperforming the raw base model, a standalone distilled variant, and two-stage alignment-distillation alternatives on both VBench metrics and human evaluations. The synergistic optimization improves human preference alignment and distillation quality simultaneously. We will release code and pretrained models to facilitate community research.
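To make the DPO-inspired DM-PairLoss concrete, below is a minimal PyTorch sketch of how a DMD-style distribution-matching loss might be paired across a preferred and a rejected few-step generation. All names (`dm_pair_loss`, `real_score`, `fake_score`, `beta`) and the exact weighting are illustrative assumptions, not the paper's released implementation; the GRPO-inspired DM-GroupLoss would analogously weight per-sample DM losses by group-normalized rewards.

```python
# Hypothetical sketch: a DPO-style pairwise objective built on a DMD-style
# distribution-matching gradient. Not the authors' code.
import torch
import torch.nn.functional as F

def dm_pair_loss(x_win, x_lose, real_score, fake_score, sigma, beta=0.1):
    """x_win / x_lose: few-step generations for the same prompt, ranked by a
    reward model. real_score / fake_score: score networks approximating the
    teacher (data) and student (fake) distributions, as in DMD."""
    per_sample = []
    for x in (x_win, x_lose):
        noise = torch.randn_like(x)
        x_t = x + sigma * noise  # perturb the generation to the score networks' noise level
        with torch.no_grad():
            # DMD update direction: difference of the two score estimates.
            grad = fake_score(x_t, sigma) - real_score(x_t, sigma)
            target = (x - grad).detach()
        # Standard DMD trick: the MSE to this detached target has gradient
        # `grad` w.r.t. x, i.e. it moves the student toward the teacher.
        per_sample.append(((x - target) ** 2).flatten(1).mean(dim=1))
    dm_win, dm_lose = per_sample
    # DPO-style pairing with temperature beta: treat the negative DM loss as
    # an implicit reward, pushing the preferred sample's DM loss below the
    # rejected sample's.
    return -F.logsigmoid(-beta * (dm_win - dm_lose)).mean()
```

In training, `x_win` and `x_lose` would come from the few-step student generator for the same prompt and be ranked by a reward model, so rewards are evaluated directly on 4-step outputs rather than full 20-40-step rollouts.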
Supplementary Material: zip
Primary Area: generative models
Submission Number: 11201