Scale-Adapter: Reversed Distillation Adapter for Efficient Training of Large Video Diffusion Models

17 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Adapter, Diffusion Model, ControlNet, Video Generation
Abstract: We propose Scale-Adapter, a plug-and-play adapter designed to efficiently bridge conditional knowledge from small adapted models to large video diffusion transformers. Existing controllable video DiT methods face critical inefficiencies: full fine-tuning of billion-parameter models is prohibitively expensive, while cascaded ControlNets introduce significant parameter overhead and offer limited flexibility for novel multi-condition compositions. To overcome these issues, Scale-Adapter introduces a novel reversed distillation method that allows a large video diffusion model to inherit precise control capabilities from efficiently tuned small video diffusion models, avoiding full fine-tuning entirely. Moreover, recognizing the intrinsic relationships among different conditions, we replace the cascaded ControlNet design with a Mixture of Condition Experts (MCE) layer, which dynamically routes diverse conditional inputs within a unified architecture and thereby supports both single-condition control and multi-condition combinations without additional training cost. To achieve cross-scale knowledge transfer, we further develop a Feature Propagation Module that ensures efficient and temporally consistent feature propagation across video frames. Experiments demonstrate that Scale-Adapter enables high-fidelity multi-condition video synthesis, making advanced controllable video generation feasible on low-resource hardware and setting a new efficiency standard for the field.
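The abstract describes an MCE layer that routes several condition embeddings through a shared set of experts rather than through stacked, per-condition ControlNets. Below is a minimal PyTorch sketch of that routing idea, assuming per-token softmax gating over a small expert pool; all names (`ConditionExpert`, `MCELayer`, the `forward` signature) are hypothetical illustrations and may differ from the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionExpert(nn.Module):
    """One expert: a small MLP applied to a condition embedding (e.g. depth, pose)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return self.proj(cond)


class MCELayer(nn.Module):
    """Routes one or several condition embeddings through a shared expert pool
    and fuses the outputs with softmax gating, so single- and multi-condition
    inputs reuse the same parameters (a sketch of the MCE idea, not the paper's code)."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(ConditionExpert(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)  # per-token gating logits

    def forward(self, conds: list[torch.Tensor]) -> torch.Tensor:
        # conds: list of (batch, tokens, dim) embeddings, one per active condition
        fused = 0.0
        for cond in conds:
            gates = F.softmax(self.router(cond), dim=-1)                        # (B, T, E)
            expert_out = torch.stack([e(cond) for e in self.experts], dim=-1)   # (B, T, D, E)
            fused = fused + torch.einsum("btde,bte->btd", expert_out, gates)
        return fused / max(len(conds), 1)


# Usage: two conditions (e.g. depth and pose embeddings) fused in a single pass.
if __name__ == "__main__":
    layer = MCELayer(dim=64)
    depth = torch.randn(2, 16, 64)
    pose = torch.randn(2, 16, 64)
    out = layer([depth, pose])
    print(out.shape)  # torch.Size([2, 16, 64])
```

Because the gating is computed per condition over a shared expert pool, adding or dropping a condition at inference changes only the inputs to the layer, not its parameters, which matches the claim that new condition combinations need no additional training.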
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9334