Rate/Distortion Constrained Model Quantization for Efficient Storage and Inference

Alexander Conzelmann; Robert Bamler

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. CC BY 4.0

Keywords: quantization, model compression, rate-distortion theory, compression

TL;DR: We propose a compression method for pre-trained neural networks that combines quantization with entropy-based neural network compression.

Abstract: The proliferation of large pre-trained neural networks has recently revived research both in the quantization of network weights (for faster inference) and in their compression (to reduce file sizes). However, there has so far been little transfer of ideas between the two lines of research. In this paper, we combine techniques from quantization and compression to propose an efficient and highly effective post-training compression method for large neural networks. Our method extends the recently published quantization method OPTQ (Frantar et al., 2023) with a tunable rate/distortion trade-off by introducing a cost per bit into OPTQ's rounding operation. Crucially, we estimate the bit rate based on the predictive model used in the state-of-the-art neural network compression method NNCodec (Becking et al., 2023). In our experiments with several standard pre-trained networks from the computer vision community, our method leads to significantly (up to 2.7x) smaller file sizes than NNCodec at equal model performance, generally compressing to less than half a bit per network weight and implicitly pruning insignificant weights. Additionally, and in contrast to NNCodec, our method offers the same opportunities for inference speed-ups as OPTQ. By demonstrating that file size and inference cost can be reduced simultaneously, we hope that our contribution shows a path towards deploying large neural networks on end-user devices, alleviating privacy concerns, regulatory constraints, and dependency on large service providers.
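
To illustrate the idea of adding a cost per bit to a rounding operation, the following is a minimal sketch and not the authors' implementation. It assumes a fixed quantization grid and a static, hypothetical probability for each grid point; the abstract instead describes estimating the bit rate with NNCodec's predictive model inside OPTQ's rounding, neither of which is reproduced here. The names `rd_round`, `grid`, `probs`, and `lam` are illustrative assumptions.

```python
import math


def rd_round(w, grid, probs, lam):
    """Return the grid point minimizing (w - q)^2 + lam * estimated bits.

    The bit estimate is the information content -log2(p) of each grid
    point under a static probability model (an assumption of this sketch).
    """
    best_q, best_cost = None, math.inf
    for q, p in zip(grid, probs):
        distortion = (w - q) ** 2      # squared rounding error
        rate_bits = -math.log2(p)      # estimated cost in bits of coding q
        cost = distortion + lam * rate_bits
        if cost < best_cost:
            best_q, best_cost = q, cost
    return best_q


# Hypothetical grid and symbol probabilities; zero is the cheapest symbol.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
probs = [0.05, 0.15, 0.60, 0.15, 0.05]
weights = [0.30, -0.20, 0.70, 0.05]

# lam = 0 reduces to ordinary nearest-neighbor rounding.
nearest = [rd_round(w, grid, probs, lam=0.0) for w in weights]
# lam > 0 trades rounding error against estimated bit rate.
penalized = [rd_round(w, grid, probs, lam=0.05) for w in weights]

print(nearest)
print(penalized)
```

With these toy numbers, the weight 0.30 rounds to 0.5 when `lam` is 0 but to 0.0 when `lam` is 0.05, because coding a zero is cheaper under the assumed probabilities. This is one way a bit-rate penalty can end up pruning small weights, as the abstract mentions.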

Supplementary Material: zip

Primary Area: other topics in machine learning (i.e., none of the above)

Submission Number: 11036
