NestQuant: nested lattice quantization for matrix products and LLMs

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Nested lattice codes for LLM quantization
Abstract: Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on WikiText-2. This represents more than a 55% reduction in the perplexity gap with respect to the unquantized model (perplexity of 6.14) compared to state-of-the-art methods: Meta's SpinQuant (perplexity 7.3), OstQuant (7.3), and QuaRot (8.2). Comparisons on larger models (up to 70B) and on various LLM evaluation benchmarks confirm the uniform superiority of NestQuant.
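To illustrate the kind of lattice rounding the abstract refers to, below is a minimal Python sketch of nearest-point quantization in the Gosset (E8) lattice, using the classical Conway–Sloane decoder for D8 and the decomposition E8 = D8 ∪ (D8 + 1/2). The function names (`_closest_d8`, `quantize_e8`) are illustrative and not taken from the NestQuant codebase; the sketch also omits the nesting (coarse/fine lattice), scaling, and overload handling that the actual scheme requires.

```python
import numpy as np

def _closest_d8(x):
    """Nearest point of D8 = {z in Z^8 : sum(z) even} to x (Conway-Sloane decoder)."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Parity is wrong: re-round the coordinate with the largest rounding
        # error in the opposite direction to restore an even coordinate sum.
        k = int(np.argmax(np.abs(x - f)))
        f[k] += 1.0 if x[k] > f[k] else -1.0
    return f

def quantize_e8(x):
    """Nearest point of the Gosset lattice E8 = D8 ∪ (D8 + 1/2) to x in R^8."""
    c0 = _closest_d8(x)                 # candidate from the integer coset
    c1 = _closest_d8(x - 0.5) + 0.5     # candidate from the half-integer coset
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

# Example: quantize one 8-dimensional block of a weight matrix.
block = np.random.randn(8)
print(quantize_e8(block))
```

In a quantizer of this type, a matrix is split into 8-dimensional blocks and each block is rounded to its nearest lattice point; the denser packing of E8 relative to the integer grid is what reduces the rounding error compared to round-to-nearest.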
Lay Summary: Large language models (LLMs) like ChatGPT are powerful but require a lot of memory and energy to run, making them expensive and difficult to deploy on everyday hardware. One way to make them more efficient is to compress their internal data using fewer bits, similar to how JPEG reduces a multi-MB high-resolution photo to a few hundred KB without losing too much detail. In this work, we propose a new lossy compression method for LLMs called NestQuant. Most existing methods use simple rounding-to-nearest (RTN) integer quantization, whereas NestQuant is built on the mathematical idea of nested lattices. The improvement from NestQuant is similar to why one stacks oranges in a pyramid rather than in a cubic grid: in our case, the tighter packing of points corresponds to less information being lost during compression. Our method significantly improves the state of the art in LLM quantization. For example, when tested on a popular LLM (Llama-3, with sizes ranging from 8B to 70B) at a target of 4 bits, NestQuant universally outperformed all contemporary algorithms, sometimes by a factor of 2 or 3. Better quantization enables much more efficient deployment of advanced AI (in energy spent, cost of hardware, etc.), thus broadening access to it.
Link To Code: https://github.com/cookiedoth/nestquant
Primary Area: Deep Learning->Large Language Models
Keywords: large language models, quantization
Submission Number: 13986