Keywords: GPU, Sparse Tensor Core, Diffusion Model
Abstract: Diffusion models have emerged as a powerful class of generative models that excel at capturing complex data distributions and producing realistic, high-fidelity samples. However, these benefits come at the cost of expensive computation and memory requirements due to their iterative denoising process. The cost is especially significant for high-resolution images, videos, 3D data, or long sequences. In this paper, we propose CoDM, a co-design framework that seamlessly integrates model compression techniques with the sparse tensor cores of NVIDIA Hopper H100 GPUs. By leveraging specialized hardware capabilities and jointly optimizing the model compression scheme and storage format, CoDM achieves significant model speedup while maintaining data generation quality. Specifically, our approach enhances diffusion models through several key strategies: reducing inference steps and model weights through a novel hierarchical pruning scheme, improving memory efficiency via a new sparse storage format, and leveraging TensorRT optimization and the specialized cores of GPU hardware accelerators. This co-design approach addresses the computational challenges of diffusion models, making them more accessible for real-world applications. Experimental results on a text-to-image application demonstrate that our approach surpasses the state of the art, achieving a 7.4-fold speedup on the ImageNet (256×256) dataset and an 11.5-fold speedup on the CIFAR-10 (32×32) dataset, all while preserving the quality of the generated images with a similar or lower Fréchet Inception Distance (FID) score.
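The abstract does not include code, but the hardware pattern it targets is concrete: NVIDIA's sparse tensor cores accelerate matrices pruned to 2:4 structured sparsity (at most 2 nonzeros in every group of 4 consecutive weights). As a minimal sketch of that pattern only — the function name `prune_2_to_4` and the magnitude-based selection are our illustration, not the paper's hierarchical pruning scheme — one could enforce it like this:

```python
import numpy as np

def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude entries in every group of 4 along
    the last axis, producing the 2:4 structured-sparsity pattern that
    Ampere/Hopper sparse tensor cores can execute at up to 2x throughput."""
    assert w.shape[-1] % 4 == 0, "last dimension must be a multiple of 4"
    groups = w.reshape(-1, 4)
    # indices of the two smallest-magnitude entries per group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
w_sparse = prune_2_to_4(w)
# every group of 4 now holds exactly 2 nonzeros -> 50% sparsity
assert np.all((w_sparse.reshape(-1, 4) != 0).sum(axis=1) == 2)
```

In practice the pruned weights would then be stored in a compressed format (the paper proposes its own) and dispatched through TensorRT, which recognizes 2:4 sparsity and maps the matmuls onto the sparse tensor cores.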
Submission Number: 38