TL;DR: Matryoshka Quantization (MatQuant) trains a single model that can operate at multiple bit-widths (e.g., int8, int4, int2) simultaneously by leveraging the nested structure of integer data types, outperforming models quantized independently at each bit-width.
Abstract: Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models, especially to low precisions like int4 or int2, requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment. Furthermore, leveraging MatQuant’s co-training and co-distillation, int2 precision models extracted by MatQuant outperform standard int2 quantization by up to 4% and 7% with OmniQuant and QAT as base algorithms, respectively. Finally, we demonstrate that by using an extra bit to represent outliers, a model with an effective precision of 2.05-bit improves further by 6% with OmniQuant as the base algorithm.
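The nested-integer observation can be illustrated with plain integer arithmetic: keeping only the most significant bits of an int8 code yields a valid lower bit-width code for the same weight. Below is a minimal NumPy sketch of this idea; the helper names and the simple unsigned affine quantization are illustrative assumptions, not the paper's exact quantizer or training procedure.

```python
# Minimal sketch (not the authors' code) of the nested-integer idea:
# the top `bits` most significant bits of an int8 code form a valid
# lower-precision code for the same weight.
import numpy as np

def slice_msb(codes8: np.ndarray, bits: int) -> np.ndarray:
    """Keep the `bits` most significant bits of unsigned 8-bit codes."""
    assert 1 <= bits <= 8
    return codes8 >> (8 - bits)  # bits=4 -> int4 codes, bits=2 -> int2 codes

def dequantize(codes: np.ndarray, bits: int, scale: float, zero: float) -> np.ndarray:
    """Map sliced codes back to real values (illustrative affine dequantization)."""
    # Re-align the sliced codes to the int8 grid before applying the int8 scale.
    return codes.astype(np.float32) * 2 ** (8 - bits) * scale + zero

# Toy example: quantize weights to unsigned int8, then "serve" them at 8, 4, or 2 bits.
weights = np.random.randn(64).astype(np.float32)
zero = float(weights.min())
scale = (float(weights.max()) - zero) / 255.0
codes8 = np.clip(np.round((weights - zero) / scale), 0, 255).astype(np.uint8)

for b in (8, 4, 2):
    approx = dequantize(slice_msb(codes8, b), b, scale, zero)
    print(f"{b}-bit mean abs error: {np.abs(weights - approx).mean():.4f}")
```

Slicing to fewer bits coarsens the quantization grid, so the reconstruction error grows as the bit-width shrinks, which is the quality-latency trade-off the abstract refers to; MatQuant's co-training is what keeps the sliced low-bit models accurate.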
Lay Summary: Large artificial intelligence models require substantial memory to store their weight parameters. Quantization, which represents these weights with lower bit precision, reduces this memory footprint but often at the cost of accuracy. Consequently, practitioners frequently maintain multiple distinct model versions, each tailored to a specific trade-off between accuracy and computational speed, which poses a practical challenge.
Matryoshka Quantization (MatQuant) presents an innovative solution: a single, unified model trained with nested precision levels, conceptually akin to Russian Matryoshka dolls. In this framework, models with lower bit precision are intrinsically embedded within their higher-precision counterparts. These more compact versions can be readily accessed by selectively "slicing" the appropriate weight parameters from the larger model.
This approach provides deployment flexibility, enabling the model to operate at various precision levels for either high-quality results or faster, compact execution. Notably, MatQuant's joint training process surprisingly improves the performance of the lower-bit precision models compared to when they are trained independently, offering a significant advantage beyond mere efficiency.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Algorithms
Keywords: Matryoshka, Quantization, LLMs, Deployment
Submission Number: 442