SplitQuant: Efficient Low-Bit Quantization for Diffusion Transformers via In-Channel Dimension Splitting
Keywords: Diffusion models, quantization, diffusion transformers, low-bit quantization, image generation, video generation, model acceleration, memory optimization, post-training quantization
Abstract: Diffusion models currently dominate the field of image generation. However, generating high-resolution images requires larger-scale diffusion models that consume substantial computational resources and memory during inference. While post-training quantization offers a promising solution to reduce computational costs and memory usage through low-precision representations, existing approaches face significant challenges when applied to diffusion models. Unlike large language models that are memory-bound, \textbf{Di}ffusion \textbf{T}ransformers (DiT) are compute-intensive during inference. Consequently, current methods that rely on additional parameters to recover the performance of extremely low-bit quantized models achieve minimal acceleration benefits, as they introduce non-negligible computational overhead.
To address these challenges, we propose \textbf{SplitQuant}, a novel approach that reduces additional computational overhead while improving low-bit quantization performance by strategically splitting the in-channel dimension of linear layers and activations. Recognizing that diffusion transformer architectures differ fundamentally from large language models, we develop a specialized optimization pipeline tailored specifically for diffusion models, which significantly enhances the generation quality of low-bit quantized models. Additionally, we implement custom-optimized CUDA kernels for SplitQuant that render the preprocessing overhead from additional parameters and quantization processes negligible, achieving single-operator performance comparable to W$4$A$4$ QGeMM across various tensor shapes.
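The abstract does not spell out how the in-channel split is chosen, so the following is a minimal, hedged sketch of the general idea: partition the input-channel dimension of a linear layer into groups, quantize each group with its own scale, and accumulate the per-group integer matmuls. The magnitude-based outlier split, the symmetric per-tensor int4 scheme, and the function names (`quantize_int4`, `split_linear`) are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def quantize_int4(t):
    """Symmetric per-tensor int4 quantization (assumed scheme).
    Returns the quantized tensor and its scale."""
    scale = np.abs(t).max() / 7.0 + 1e-8
    q = np.clip(np.round(t / scale), -8, 7)
    return q, scale

def split_linear(x, w, n_outlier=4):
    """Hedged sketch of in-channel splitting: quantize the few
    largest-magnitude input channels separately from the rest, so a
    single shared scale no longer has to cover outlier channels.
    The magnitude-based split criterion is an assumption."""
    # rank input channels by mean activation magnitude
    mag = np.abs(x).mean(axis=0)
    idx = np.argsort(mag)[::-1]
    groups = (idx[:n_outlier], idx[n_outlier:])
    y = np.zeros((x.shape[0], w.shape[1]))
    for ch in groups:
        if len(ch) == 0:
            continue
        xq, sx = quantize_int4(x[:, ch])
        wq, sw = quantize_int4(w[ch, :])
        # integer matmul per group, then dequantize and accumulate
        y += (xq @ wq) * (sx * sw)
    return y
```

In an optimized kernel, the two group matmuls would run as low-bit integer GEMMs with the dequantization fused into the epilogue; the sketch above only illustrates the numerics of the split.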
Extensive experiments on FLUX.1, PixArt-$\Sigma$, and Wan2.1 demonstrate SplitQuant's effectiveness in both image and video generation scenarios. Our method achieves $2.7\times$ acceleration on linear layer operators across different shapes, with SplitQuant kernels delivering performance that approaches Int4 QGeMM acceleration. The code is available at this \href{https://anonymous.4open.science/status/SplitQuant-23537iclrAnonymous}{anonymous link}.
Primary Area: generative models
Submission Number: 23537