Mobile Video Diffusion

Published: 31 Dec 2024 · Last Modified: 05 Mar 2025 · OpenReview Archive Direct Upload · License: CC BY 4.0
Abstract: Video diffusion models have achieved impressive realism and controllability, but their high computational demands restrict their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. By optimizing the spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational requirements. We achieve this by lowering the resolution to 512 × 256 px, incorporating multi-scale temporal representations, and introducing two novel pruning schemes that reduce the number of channels and temporal blocks in the UNet. Furthermore, we employ adversarial finetuning to reduce denoising to a single step. Our model, MobileVD, is 523× more efficient (1817.2 vs. 4.34 TFLOPs) with only a slight quality drop (FVD 149 vs. 171), generating latents for a 14 × 512 × 256 px clip in 1.7 seconds on a Xiaomi 14 Pro.
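The abstract does not specify the criterion behind the two pruning schemes. As a generic illustration only, magnitude-based channel pruning of a convolution weight could be sketched as follows; the `prune_channels` function and the L1-norm criterion are assumptions for exposition, not the paper's method:

```python
import numpy as np

def prune_channels(weight, keep_ratio):
    """Rank output channels of a conv weight (O, I, kH, kW) by L1 norm
    and keep the top fraction. Illustrative sketch only -- the paper's
    actual pruning schemes are not described in the abstract."""
    out_ch = weight.shape[0]
    n_keep = max(1, int(round(out_ch * keep_ratio)))
    # L1 magnitude per output channel
    scores = np.abs(weight).reshape(out_ch, -1).sum(axis=1)
    # Indices of the highest-scoring channels, kept in original order
    keep = np.sort(np.argsort(scores)[::-1][:n_keep])
    return weight[keep], keep

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4, 3, 3))   # toy conv weight: 8 output channels
pruned, kept = prune_channels(w, 0.5)
print(pruned.shape)  # (4, 4, 3, 3)
```

In practice, structurally removing channels this way shrinks both the parameter count and the FLOPs of every layer that consumes the pruned activations, which is the kind of saving the abstract's TFLOP reduction refers to.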