Keywords: quantization, model compression, rate-distortion theory, compression
TL;DR: We propose a compression method for pre-trained neural networks that combines quantization with entropy-based neural network compression.
Abstract: The proliferation of large pre-trained neural networks has recently revived research both in the quantization of network weights (for faster inference) and in their
compression (to reduce file sizes). However, there has so far been little transfer of ideas between the two lines of research. In this paper, we combine techniques from
quantization and compression to propose an efficient and highly effective post-training compression method for large neural networks. Our method extends the
recently published quantization method OPTQ (Frantar et al., 2023) with a tunable
rate/distortion trade-off by introducing a cost per bit into OPTQ's rounding
operation. Crucially, we estimate the bit rate based on the predictive model used
in the state-of-the-art neural network compression method NNCodec (Becking
et al., 2023). In our experiments with several standard pre-trained networks from
the computer vision community, our method leads to significantly (up to 2.7x)
smaller file sizes than NNCodec at equal model performance, generally compressing to less than half a bit per network weight and implicitly pruning insignificant weights.
Additionally, and in contrast to NNCodec, our method offers the same opportunities for inference speed-ups as OPTQ. By demonstrating that file size and inference
cost can be reduced simultaneously, we hope that our contribution shows a path
towards deploying large neural networks on end-user devices, alleviating privacy
concerns, regulatory constraints, and dependency on large service providers.
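The idea of adding a cost per bit to a rounding operation can be illustrated with a minimal sketch. It is not the method from the paper: it rounds a single weight in isolation, omits OPTQ's error compensation, and replaces NNCodec's predictive model with a fixed, made-up probability table. The names `rd_round`, `step`, `lam`, and `prob` are hypothetical and chosen only for this illustration.

```python
import math


def rd_round(w, step, lam, prob):
    """Return the grid index k that minimizes distortion + lam * rate.

    w    : weight to quantize
    step : quantization grid spacing (assumed uniform grid k * step)
    lam  : cost per bit (lam = 0 reduces to nearest-neighbor rounding)
    prob : dict {grid index: probability}, a stand-in for an entropy model
    """
    best_k, best_cost = None, math.inf
    for k, p in prob.items():
        distortion = (w - k * step) ** 2   # squared quantization error
        rate = -math.log2(p)               # ideal code length in bits
        cost = distortion + lam * rate
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


# Hypothetical entropy model that strongly favors the zero grid point.
prob = {-2: 0.05, -1: 0.15, 0: 0.6, 1: 0.15, 2: 0.05}

nearest = rd_round(0.06, step=0.1, lam=0.0, prob=prob)   # plain rounding
pruned = rd_round(0.06, step=0.1, lam=0.01, prob=prob)   # rate-aware rounding
print(nearest, pruned)
```

With `lam = 0` the weight rounds to its nearest grid point, whereas a positive `lam` pulls small weights toward the cheap-to-encode zero symbol. This is one way to read the abstract's remark that insignificant weights are pruned implicitly.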
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11036