Submission Type: Regular Long Paper
Submission Track: Theme Track: Large Language Models and the Future of NLP
Submission Track 2: Efficient Methods for NLP
Keywords: model quantization
Abstract: Large language models (LLMs) have proven far superior to conventional methods across a wide range of tasks.
However, their heavy computational and memory requirements make them prohibitively expensive to deploy.
Model quantization is an effective way to reduce this overhead. However, in most
previous works, the quantized model is calibrated with a small number of samples from the training data, which
may harm the generalization of the quantized LLMs to unseen cases and tasks. Hence, in this work,
we explore an important question: Can we design a data-independent quantization method for LLMs
whose generalization performance is guaranteed?
In this work, we propose EasyQuant, a training-free and data-independent weight-only quantization
algorithm for LLMs. We observe that two factors, outliers in the weights and the quantization
ranges, are essential for reducing the quantization error. Therefore, in EasyQuant, we leave the outliers
(less than 1%) unchanged and optimize the quantization range to reduce the reconstruction error. With
these techniques, we surprisingly find that EasyQuant achieves performance comparable to the original model.
Since EasyQuant does not depend on any training data, the generalization performance of quantized
LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel, so the quantized
model can be obtained within a few minutes even for LLMs with over 100B parameters. To the best of our knowledge, this is the
first work to achieve almost lossless quantization performance for LLMs in a data-independent
setting, and our algorithm runs over 10 times faster than data-dependent methods.
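To illustrate the two ingredients named in the abstract (keeping the largest-magnitude weights unquantized and searching for a quantization range that minimizes reconstruction error), here is a minimal NumPy sketch. It is not the authors' implementation: the 4-bit setting, the 1% outlier fraction, and the grid search over shrunken ranges are assumptions made for illustration (the paper may optimize the range differently).

```python
# Minimal sketch: data-free weight-only quantization with outlier
# preservation and quantization-range search (illustrative only).
import numpy as np

def quantize_weight(w: np.ndarray, bits: int = 4, outlier_frac: float = 0.01,
                    n_grid: int = 100) -> np.ndarray:
    """Quantize a weight vector, keeping the largest-magnitude entries in full precision."""
    # Keep the top `outlier_frac` of entries (by magnitude) unquantized.
    k = max(1, int(outlier_frac * w.size))
    outlier_idx = np.argpartition(np.abs(w), -k)[-k:]
    mask = np.zeros(w.shape, dtype=bool)
    mask[outlier_idx] = True

    inliers = w[~mask]
    qmax = 2 ** (bits - 1) - 1

    # Search over shrunken quantization ranges and keep the scale that
    # minimizes the reconstruction error on the inlier weights.
    best_err, best_scale = np.inf, None
    max_abs = np.abs(inliers).max()
    for frac in np.linspace(0.3, 1.0, n_grid):
        scale = frac * max_abs / qmax
        q = np.clip(np.round(inliers / scale), -qmax - 1, qmax)
        err = np.sum((q * scale - inliers) ** 2)
        if err < best_err:
            best_err, best_scale = err, scale

    # Reconstruct: quantized inliers plus untouched full-precision outliers.
    w_hat = w.copy()
    q = np.clip(np.round(inliers / best_scale), -qmax - 1, qmax)
    w_hat[~mask] = q * best_scale
    return w_hat

# Usage example: quantize one weight row to 4 bits and check the error.
rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
w[:5] *= 20.0  # inject a few large-magnitude outliers
w_hat = quantize_weight(w)
print("relative reconstruction error:",
      np.linalg.norm(w_hat - w) / np.linalg.norm(w))
```

Because each weight row (or column group) is processed independently of any calibration data, such a procedure can be run in parallel across the model's weight matrices, which is what makes a data-independent method fast.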
Submission Number: 1353