AQUATIC-Diff: Additive Quantization for Truly Tiny Compressed Diffusion Models

28 Sept 2024 (modified: 12 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Diffusion Models, Codebook Quantization, Model Quantization, Model Compression
TL;DR: We take inspiration from recent advances in the field of LLM quantization to improve on the state-of-the-art for compression of diffusion models across several metrics.
Abstract: Tremendous investments have been made towards the commodification of diffusion models for the generation of diverse media. Their mass-market adoption is, however, still hobbled by the intense hardware resource requirements of diffusion model inference. Model quantization strategies tailored specifically to diffusion models have seen considerable success in easing this burden, yet without exception they have explored only the Uniform Scalar Quantization (USQ) family of quantization methods. In contrast, Vector Quantization (VQ) methods, which operate on groups of multiple related weights as the basic unit of compression, have recently taken the parallel field of Large Language Model (LLM) quantization by storm. In this work, we apply codebook-based additive vector quantization algorithms to the problem of diffusion model compression for the first time, adapting prior work on the quantization-aware fine-tuning of transformer-based LLMs to account for the special structure of convolutional weight tensors, the heterogeneity in the kinds of operations performed by the layers of a diffusion model, and the momentum-invalidating discontinuities encountered between successive batches during quantization-aware fine-tuning of diffusion models. The result is a data-free distillation framework which achieves, to the best of our knowledge, state-of-the-art results for extremely low-bit weight quantization on the standard class-conditional benchmark of LDM-4 on ImageNet at 20 inference time steps. Notably, we report sFID 1.93 points lower than the full-precision model at W4A8, the best reported FID, sFID and ISC at W2A8, and the first-ever successful quantization to W1.5A8 (less than 1.5 bits stored per weight) via a layer-wise heterogeneous quantization strategy. We thus establish a new Pareto frontier for diffusion model inference under low-memory conditions. Furthermore, our method allows for a dynamic trade-off between quantization-time GPU hours and inference-time savings, aligning with the recent trend of approaches that combine the best aspects of both Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). We also demonstrate FLOPs savings on arbitrary hardware via an efficient inference kernel, as opposed to BOPs (bit-wise operations) savings that result from small-integer operations and may lack broad support across hardware of interest. Code is released via an anonymized download link: https://osf.io/3uf8v/?view_only=ffbc957d6ce941d7b47bef09b628adcd
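
For readers unfamiliar with codebook-based additive vector quantization, the sketch below illustrates the basic weight-reconstruction step: each group of g consecutive weights is stored as M small integer indices, and the group is recovered as the sum of the M indexed codebook vectors. The function name, tensor shapes, and parameter choices here are illustrative assumptions for exposition, not the released AQUATIC-Diff implementation.

```python
import torch

def dequantize_additive(codes: torch.Tensor, codebooks: torch.Tensor) -> torch.Tensor:
    """Reconstruct weight groups from additive codebook indices (illustrative sketch).

    codes:     (num_groups, M) integer indices, one per codebook
    codebooks: (M, K, g) learned codebook vectors with group size g
    returns:   (num_groups, g) reconstructed weight groups
    """
    M = codebooks.shape[0]
    # Each group is the sum of one selected vector from each of the M codebooks.
    parts = [codebooks[m, codes[:, m]] for m in range(M)]
    return torch.stack(parts, dim=0).sum(dim=0)

# Example budget: group size g=8, K=256 entries per codebook (8-bit indices),
# M=2 codebooks -> 2 * 8 = 16 index bits per 8 weights = 2 bits/weight stored
# (codebook storage overhead ignored), i.e. a W2-style weight budget.
g, K, M, num_groups = 8, 256, 2, 1024
codebooks = torch.randn(M, K, g)
codes = torch.randint(0, K, (num_groups, M))
w_groups = dequantize_additive(codes, codebooks)  # shape (1024, 8)
w_flat = w_groups.reshape(-1)                     # flattened weight tensor
```

In the convolutional setting the abstract describes, the weight tensor would first be reshaped so that related weights fall into the same group; that adaptation to convolutional weight structure is part of the paper's contribution and is not shown in this sketch.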
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13576