Keywords: Diffusion Transformer, Pruning and Sparsity, Quantization, Model Inference
Abstract: With the advancement of deep learning, a variety of generative models have emerged in
recent years. Diffusion models, exemplified by the Diffusion Transformer (DiT), have become a key
component of generative modeling, demonstrating outstanding performance in vision generation
tasks. In generative inference, quantization and sparsification are widely used to reduce memory
consumption and computation cost. However, existing methods apply only one of these techniques,
and it remains underexplored whether the strengths of both can be leveraged together.
To fill this gap, this work develops a new acceleration framework that applies offline
sparsification and quantization to DiT models, enabling faster image generation while
preserving generation quality. Furthermore, we develop a novel and efficient matrix
multiplication kernel that exploits the low-bit and sparse computing capabilities of Tensor
Cores. We conduct experiments on both the 12B open-source FLUX.1-dev model and the 18B
closed-source MoE model from our industrial partner. Empirical results show that our kernel
achieves a 1.64-2.16× speedup and delivers a 1.09-1.35× efficiency improvement for the
end-to-end workflow, while incurring negligible degradation in generation quality.
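As a rough illustration of the offline sparsify-then-quantize pipeline the abstract describes, the following is a minimal PyTorch sketch: it prunes a weight matrix to the 2:4 structured-sparsity pattern supported by sparse Tensor Cores, then applies symmetric per-output-channel INT8 quantization. The layer shape, the 2:4 pattern, and the quantization scheme are illustrative assumptions rather than the paper's exact method, and the final matmul merely emulates in floating point what a fused low-bit sparse kernel would compute.

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Assumed 2:4 structured sparsity: in every group of 4 consecutive
    weights along the input dimension, keep the 2 largest magnitudes."""
    out_features, in_features = weight.shape
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    _, drop_idx = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_features, in_features)

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization (assumed scheme)."""
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    scale = scale.clamp_min(1e-8)  # guard against all-zero rows
    q = torch.clamp((weight / scale).round(), -127, 127).to(torch.int8)
    return q, scale

# Offline: sparsify, then quantize a hypothetical DiT linear-layer weight.
w = torch.randn(3072, 3072)            # e.g. an attention projection
w_sparse = prune_2_4(w)
w_q, w_scale = quantize_int8(w_sparse)

# At inference, a fused kernel would consume w_q (plus sparsity metadata)
# directly on Tensor Cores; here we only emulate the dequantized matmul.
x = torch.randn(16, 3072)
y = x @ (w_q.float() * w_scale).t()
```

In practice the pruning mask and quantization scales would be computed once offline and stored with the model, so no extra work is incurred at generation time.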
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 19461