Scale-Adapter: Reversed Distillation Adapter for Efficient Training of Large Video Diffusion Models

17 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Adapter, Diffusion Model, ControlNet, Video Generation
Abstract: We propose Scale-Adapter, a plug-and-play adapter designed to efficiently bridge conditional knowledge from small adapted models to large video diffusion transformers. Existing controllable video DiT methods face critical inefficiencies: full fine-tuning of billion-parameter models is prohibitively expensive, while cascaded ControlNets introduce significant parameter overhead and offer limited flexibility for novel multi-condition compositions. To overcome these issues, Scale-Adapter introduces a novel reversed distillation method that allows a large video diffusion model to inherit precise control capabilities from efficiently tuned small video diffusion models, avoiding full fine-tuning entirely. Moreover, recognizing the intrinsic relationships among different conditions, we replace the cascaded ControlNet design with a Mixture of Condition Experts (MCE) layer, which dynamically routes diverse conditional inputs within a unified architecture and thereby supports both single-condition control and multi-condition combinations without additional training cost. To achieve cross-scale knowledge transfer, we further develop a Feature Propagation Module that ensures efficient and temporally consistent feature propagation across video frames. Experiments demonstrate that Scale-Adapter enables high-fidelity multi-condition video synthesis, making advanced controllable video generation feasible on low-resource hardware and setting a new efficiency standard for the field.
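The abstract describes an MCE layer that routes several condition embeddings through a shared set of experts rather than through stacked, per-condition ControlNets. Below is a minimal PyTorch sketch of that routing idea, assuming per-token softmax gating over a small expert pool; all names (`ConditionExpert`, `MCELayer`, the `forward` signature) are hypothetical illustrations and may differ from the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionExpert(nn.Module):
    """One expert: a small MLP applied to a condition embedding (e.g. depth, pose)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return self.proj(cond)


class MCELayer(nn.Module):
    """Routes one or several condition embeddings through a shared expert pool
    and fuses the outputs with softmax gating, so single- and multi-condition
    inputs reuse the same parameters (a sketch of the MCE idea, not the paper's code)."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(ConditionExpert(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)  # per-token gating logits

    def forward(self, conds: list[torch.Tensor]) -> torch.Tensor:
        # conds: list of (batch, tokens, dim) embeddings, one per active condition
        fused = 0.0
        for cond in conds:
            gates = F.softmax(self.router(cond), dim=-1)                        # (B, T, E)
            expert_out = torch.stack([e(cond) for e in self.experts], dim=-1)   # (B, T, D, E)
            fused = fused + torch.einsum("btde,bte->btd", expert_out, gates)
        return fused / max(len(conds), 1)


# Usage: two conditions (e.g. depth and pose embeddings) fused in a single pass.
if __name__ == "__main__":
    layer = MCELayer(dim=64)
    depth = torch.randn(2, 16, 64)
    pose = torch.randn(2, 16, 64)
    out = layer([depth, pose])
    print(out.shape)  # torch.Size([2, 16, 64])
```

Because the gating is computed per condition over a shared expert pool, adding or dropping a condition at inference changes only the inputs to the layer, not its parameters, which matches the claim that new condition combinations need no additional training.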
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9334