TL;DR: Matryoshka Quantization (MatQuant) trains a single model that can operate at multiple bit-widths (e.g., int8, int4, int2) simultaneously by leveraging the nested structure of integer data types, outperforming models quantized independently at each bit-width.
Abstract: Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models, especially to low precisions like int4 or int2, requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment. Furthermore, leveraging MatQuant’s co-training and co-distillation, int2 precision models extracted by MatQuant outperform standard int2 quantization by up to 4% and 7% with OmniQuant and QAT as base algorithms, respectively. Finally, we demonstrate that by using an extra bit to represent outliers, a model with an effective precision of 2.05-bit improves further by 6% with OmniQuant as the base algorithm.
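The nested-integer observation can be illustrated with plain integer arithmetic: keeping only the most significant bits of an int8 code yields a valid lower bit-width code for the same weight. Below is a minimal NumPy sketch of this idea; the helper names and the simple unsigned affine quantization are illustrative assumptions, not the paper's exact quantizer or training procedure.

```python
# Minimal sketch (not the authors' code) of the nested-integer idea:
# the top `bits` most significant bits of an int8 code form a valid
# lower-precision code for the same weight.
import numpy as np

def slice_msb(codes8: np.ndarray, bits: int) -> np.ndarray:
    """Keep the `bits` most significant bits of unsigned 8-bit codes."""
    assert 1 <= bits <= 8
    return codes8 >> (8 - bits)  # bits=4 -> int4 codes, bits=2 -> int2 codes

def dequantize(codes: np.ndarray, bits: int, scale: float, zero: float) -> np.ndarray:
    """Map sliced codes back to real values (illustrative affine dequantization)."""
    # Re-align the sliced codes to the int8 grid before applying the int8 scale.
    return codes.astype(np.float32) * 2 ** (8 - bits) * scale + zero

# Toy example: quantize weights to unsigned int8, then "serve" them at 8, 4, or 2 bits.
weights = np.random.randn(64).astype(np.float32)
zero = float(weights.min())
scale = (float(weights.max()) - zero) / 255.0
codes8 = np.clip(np.round((weights - zero) / scale), 0, 255).astype(np.uint8)

for b in (8, 4, 2):
    approx = dequantize(slice_msb(codes8, b), b, scale, zero)
    print(f"{b}-bit mean abs error: {np.abs(weights - approx).mean():.4f}")
```

Slicing to fewer bits coarsens the quantization grid, so the reconstruction error grows as the bit-width shrinks, which is the quality-latency trade-off the abstract refers to; MatQuant's co-training is what keeps the sliced low-bit models accurate.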
Lay Summary: Large artificial intelligence models require substantial memory to store their weight parameters. Quantization, which represents these weights with lower bit precision, reduces this memory footprint but often at the cost of accuracy. Consequently, practitioners frequently maintain multiple distinct model versions, each tailored to a specific trade-off between accuracy and computational speed, which poses a practical challenge.
Matryoshka Quantization (MatQuant) presents an innovative solution: a single, unified model trained with nested precision levels, conceptually akin to Russian Matryoshka dolls. In this framework, models with lower bit precision are intrinsically embedded within their higher-precision counterparts. These more compact versions can be readily accessed by selectively "slicing" the appropriate weight parameters from the larger model.
This approach provides deployment flexibility, enabling the model to operate at various precision levels for either high-quality results or faster, compact execution. Notably, MatQuant's joint training process surprisingly improves the performance of the lower-bit precision models compared to when they are trained independently, offering a significant advantage beyond mere efficiency.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Algorithms
Keywords: Matryoshka, Quantization, LLMs, Deployment
Submission Number: 442