Quantization Meets Sparsification for Faster Image Generation

ICLR 2026 Conference Submission 19461 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Diffusion Transformer, Pruning and Sparsity, Quantization, Model Inference
Abstract: With the advancement of deep learning, various generative models have emerged in recent years. Diffusion models, notably exemplified by the Diffusion Transformer (DiT), have become a key component in generative modeling, demonstrating outstanding performance in vision generation tasks. In inference for generative tasks, quantization and sparsification are widely used to reduce memory consumption and computation cost. However, existing methods adopt only one of these techniques, and whether their strengths can be combined remains underexplored. To fill this gap, this work develops a new acceleration framework that applies offline sparsification and quantization to DiT models, enabling faster image generation while preserving generation quality. Furthermore, we develop a novel and efficient matrix multiplication kernel that leverages the low-bit and sparse computing capabilities of Tensor Cores. We conduct experiments on both the 12B open-source FLUX.1-dev model and the 18B closed-source MoE model from our industrial partner. Empirical results show that our kernel achieves a speedup of 1.64-2.16×, delivering a 1.09-1.35× improvement to the end-to-end workflow, while incurring negligible degradation in generation quality.
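To make the offline preprocessing concrete, below is a minimal sketch of how a DiT weight matrix could be sparsified and quantized before being handed to a sparse low-bit GEMM kernel. The specific choices here (2:4 structured sparsity, symmetric per-channel int8 quantization) and the function names are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative offline preprocessing: 2:4 structured sparsification followed by
# per-channel int8 quantization of a weight matrix. This is a conceptual sketch,
# not the authors' implementation.
import torch

def sparsify_2to4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every group of 4 along the input dim."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity needs the input dim to be a multiple of 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Zero out the two smallest-magnitude entries in each group of four.
    _, drop_idx = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, drop_idx, False)
    return (groups * mask).reshape(out_features, in_features)

def quantize_int8_per_channel(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization; returns (int8 weights, fp scales)."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale.squeeze(1)

# Offline pipeline: sparsify first, then quantize the surviving weights.
w = torch.randn(4096, 4096)
w_sparse = sparsify_2to4(w)
w_int8, scales = quantize_int8_per_channel(w_sparse)
# At inference time, a fused sparse low-bit GEMM kernel (e.g. on Tensor Cores)
# would consume w_int8 plus its 2:4 metadata and rescale outputs with `scales`.
```

In such a pipeline, sparsifying before quantization keeps the quantization scales matched to the weights that actually survive pruning; the paper's kernel design and calibration details may differ.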
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 19461