A new framework for LLM post-training weight quantization

Published: 23 Jun 2025, Last Modified: 23 Jun 2025 | Greeks in AI 2025 Poster | CC BY 4.0
Keywords: quantization, LLM
Abstract: Despite their success, the accessibility of Large Language Models (LLMs) is curbed by their significant memory and computational requirements. Weight quantization, a technique that represents model weights at lower precision, promises to alleviate these demands while preserving model performance. Quantization not only reduces the memory footprint but also accelerates inference, since memory fetch latency is reduced. As a result, quantized models benefit from lower energy consumption and running costs. In this talk, we will present ICQuant, a new post-training weight quantization framework designed to handle outliers, namely, weights with exceptionally high magnitudes that reside in the tails of the weight distribution. Our work leverages statistical properties of outliers as well as index coding to compress efficiently to very low bit rates while maintaining LLM performance. We conduct extensive experiments showing that ICQuant significantly improves quantization quality in the 2-4 bit regime, even for simple scalar quantizers, achieving results comparable to state-of-the-art schemes that rely on computationally intensive vector quantization and expensive fine-tuning procedures. This work has been published on arXiv and is under submission at a conference.
Submission Number: 108
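To make the outlier-handling idea in the abstract concrete, here is a minimal, hypothetical Python sketch of the general approach: set aside the heaviest-tail weights, scalar-quantize the rest over the narrower inlier range, and record outlier positions compactly. This is not the published ICQuant algorithm; the outlier fraction, the gap-based position encoding, and all function names are illustrative assumptions.

```python
# Illustrative sketch only: a toy outlier-aware scalar quantizer.
# NOT the ICQuant method from the abstract; outlier_fraction, the
# gap-based index encoding, and all names are assumptions.
import numpy as np

def quantize_with_outliers(weights, bits=3, outlier_fraction=0.01):
    """Uniformly quantize weights after setting aside the largest-magnitude
    outliers; outlier positions are stored as gaps between sorted indices,
    a simple stand-in for an index-coding scheme."""
    w = np.asarray(weights, dtype=np.float64)
    k = max(1, int(len(w) * outlier_fraction))

    # Treat the k largest-magnitude weights as outliers.
    outlier_idx = np.sort(np.argsort(np.abs(w))[-k:])
    inlier_mask = np.ones(len(w), dtype=bool)
    inlier_mask[outlier_idx] = False

    # Uniform scalar quantization over the inlier range only, so the
    # quantization step is not inflated by the heavy tails.
    lo, hi = w[inlier_mask].min(), w[inlier_mask].max()
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((np.clip(w, lo, hi) - lo) / step).astype(np.int32)

    # Keep outliers at full precision; encode their positions as index
    # gaps, which are cheap to compress when outliers are sparse.
    index_gaps = np.diff(outlier_idx, prepend=0)
    return codes, lo, step, index_gaps, w[outlier_idx]

def dequantize(codes, lo, step, index_gaps, outlier_vals):
    """Reconstruct weights: inliers from codes, outliers patched back in."""
    w_hat = codes.astype(np.float64) * step + lo
    outlier_idx = np.cumsum(index_gaps)
    w_hat[outlier_idx] = outlier_vals
    return w_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(4096)
    w[rng.choice(4096, 40, replace=False)] *= 20.0  # inject heavy-tail outliers
    codes, lo, step, gaps, vals = quantize_with_outliers(w, bits=3)
    err = np.abs(dequantize(codes, lo, step, gaps, vals) - w).max()
    print(f"max reconstruction error: {err:.4f}")
```

Because the quantization grid covers only the inlier range, the step size stays small even when a few weights are an order of magnitude larger, which is the intuition behind treating outliers separately at low bit rates.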