Keywords: Quantization, Rounding
TL;DR: Considering directional information during activation quantization reduces the quantization error of matrix multiplication.
Abstract: How can we accelerate matrix multiplications during inference while maintaining the performance of neural networks?
Weight-activation quantization reduces inference costs by quantizing both weights and activations, enabling cheaper matrix multiplications during inference.
Previous work on weight-activation quantization has focused on finding better weights to reduce quantization errors,
while simply applying round-to-nearest (RTN) to the activations during inference.
However, RTN has limitations in preserving the directional information of activations, which is crucial for accurately approximating matrix multiplications.
In this paper, we propose DiaQ, an accurate method for quantizing activations while preserving directional information.
DiaQ chooses the rounding direction for each value based on the direction of the activation vector as well as the value's distance from the quantization levels.
DiaQ also extends each vector to prevent it from collapsing during quantization, and corrects the output scale to compensate for the change in magnitude after quantization.
Extensive experiments show that DiaQ reduces the quantization error induced by activation quantization by up to 13.3% and 26.1% in terms of Euclidean and cosine distances, respectively, compared to RTN.
DiaQ also improves the task performance of LLMs and ViTs.
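To make the idea concrete, below is a minimal, hypothetical sketch of direction-aware rounding for a single activation vector, compared against RTN. The greedy cosine-similarity criterion, the function names, and the magnitude correction are assumptions made for illustration only; they are not the actual DiaQ algorithm described in the paper.

```python
# Minimal, hypothetical sketch of direction-aware rounding for one activation
# vector, compared with round-to-nearest (RTN). The greedy cosine criterion and
# the magnitude correction below are illustrative assumptions, not the actual
# DiaQ algorithm.
import numpy as np

def rtn_quantize(x, scale):
    """Round-to-nearest: quantize to the closest grid point q = round(x / scale)."""
    q = np.round(x / scale)
    return q, scale

def direction_aware_quantize(x, scale):
    """Greedily choose floor or ceil per element to better preserve the direction
    of x, then correct the output scale so the magnitude is preserved."""
    q = np.floor(x / scale)                       # start from the floor grid point
    order = np.argsort(-np.abs(x / scale - q))    # visit largest residuals first

    def cos_to_x(v):
        return np.dot(v, x) / (np.linalg.norm(v) * np.linalg.norm(x) + 1e-12)

    for i in order:
        up = q.copy()
        up[i] += 1.0                              # candidate: round element i up
        if cos_to_x(up) > cos_to_x(q):            # keep it if the direction improves
            q = up
    # correct the scale so that ||q * corrected_scale|| matches ||x||
    corrected_scale = scale * np.linalg.norm(x) / (np.linalg.norm(q) * scale + 1e-12)
    return q, corrected_scale

rng = np.random.default_rng(0)
x = rng.normal(size=16)
scale = 0.5
for name, (q, s) in [("RTN", rtn_quantize(x, scale)),
                     ("direction-aware", direction_aware_quantize(x, scale))]:
    xq = q * s                                    # dequantized activation
    cos = np.dot(x, xq) / (np.linalg.norm(x) * np.linalg.norm(xq) + 1e-12)
    print(f"{name:16s} L2 error: {np.linalg.norm(x - xq):.4f}  cosine: {cos:.4f}")
```

The sketch only reports Euclidean and cosine errors for the two rounding schemes on a random vector; any gains of the actual method rest on the criteria and corrections detailed in the paper, not on this toy greedy rule.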
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16759