Abstract: Recently, video diffusion models (VDMs) have garnered significant attention for their notable advances in generating coherent and realistic video content. However, processing multiple frame features concurrently, coupled with their considerable model size, results in high latency and extensive memory consumption, hindering broader application. Post-training quantization (PTQ) is an effective technique for reducing memory footprint and improving computational efficiency. Unlike in image diffusion models, we observe that the temporal features, which are integrated into all frame features, exhibit pronounced skewness. Furthermore, we identify significant inter-channel disparities and asymmetries in the activations of video diffusion models, which leave individual channels covering only a small fraction of the quantization levels and make quantization more challenging. To address these issues, we introduce the first PTQ strategy tailored for video diffusion models, dubbed QVD. Specifically, we propose High Temporal Discriminability Quantization (HTDQ), designed for temporal features, which preserves the high discriminability of the quantized features and thus provides precise temporal guidance for all video frames. In addition, we present Scattered Channel Range Integration (SCRI), which improves the coverage of quantization levels across individual channels. Experimental validation across various models, datasets, and bit-width settings demonstrates the effectiveness of QVD on diverse metrics. In particular, we achieve near-lossless performance under W8A8, outperforming current methods by 205.12 in FVD.
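The low-coverage issue mentioned in the abstract can be illustrated with a minimal, generic sketch (this is not QVD, HTDQ, or SCRI; the tensor shapes, random ranges, and variable names below are hypothetical). Assuming standard asymmetric per-tensor INT8 quantization with a single shared scale and zero point, it measures how many of the 256 available levels each channel of a synthetic activation tensor actually occupies when channel ranges differ strongly:

```python
import numpy as np

# Illustrative sketch only (not the paper's method): synthetic activations with
# strongly differing per-channel ranges and offsets, mimicking the inter-channel
# disparities and asymmetries described in the abstract.
rng = np.random.default_rng(0)
num_channels, num_tokens = 8, 1024
scales = rng.uniform(0.05, 4.0, size=(num_channels, 1))
offsets = rng.uniform(-2.0, 2.0, size=(num_channels, 1))
x = rng.normal(size=(num_channels, num_tokens)) * scales + offsets

# Asymmetric per-tensor quantization to 8 bits: one scale/zero point for all channels.
qmin, qmax = 0, 255
scale = (x.max() - x.min()) / (qmax - qmin)
zero_point = np.round(qmin - x.min() / scale)
q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int32)

# Coverage: fraction of the available quantization levels each channel occupies.
for c in range(num_channels):
    used = np.unique(q[c]).size
    print(f"channel {c}: uses {used}/256 levels ({used / 256:.1%})")
```

In such a setup, channels with a small dynamic range collapse onto only a handful of levels because the shared range is dominated by the widest channel; raising per-channel level coverage is the kind of problem SCRI is described as addressing.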
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: As a critical medium in today's information dissemination, video is an essential component of multimedia. Video generation, particularly with video diffusion models, has emerged as a recent research focus due to its superior performance in producing realistic and seamless videos. In our experiments, we provide prompts from various modalities, such as text, motion, sketches, and images, to control the video diffusion process. These multi-modal inputs allow finer-grained manipulation of video generation, enabling the creation of highly customized and contextually relevant videos. However, the substantial computational resource demands and high inference latency of these models have limited their wider application. This work is dedicated to compressing and accelerating video diffusion models to lower the barriers to their use. It focuses on post-training quantization of video diffusion models for video generation tasks guided by motion, sketches, text, and images, achieving 8-bit quantization of weights and activations with near-lossless performance.
Supplementary Material: zip
Submission Number: 2320