Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Jinhao Li, Jiaming Xu, Shiyao Li, Shan Huang, Jun Liu, Yaoxiu Lian, Guohao Dai

Published: 27 Oct 2024, Last Modified: 25 Jan 2026 | Crossref | CC BY-SA 4.0

Abstract: Large language models (LLMs) have demonstrated impressive abilities in various domains, but their inference cost is high. Many previous studies use quantization to reduce LLM inference cost by lowering latency and memory consumption. Applying 2-bit single-precision weight quantization causes >3% accuracy loss, so state-of-the-art methods use mixed-precision quantization for LLMs (e.g., Llama2-7b) to improve accuracy. However, challenges remain: (1) Uneven distribution in the weight matrix. Weights are quantized in groups, and some groups contain weights with a large range. Previous methods apply inter-weight mixed-precision quantization and neglect the range differences inside each weight matrix, resulting in >2.7% accuracy loss (e.g., LLM-MQ and APTQ). (2) Large speed degradation from adding sparse outliers. Reserving sparse outliers improves accuracy but slows inference, and the slowdown depends on the outlier ratio (e.g., 1.5% outliers result in >30% speed degradation in SpQR). (3) Time-consuming dequantization operations on GPUs. Mainstream methods require a dequantization operation before computing on the quantized weights, and a 2-order dequantization operation is needed because the scales of groups are also quantized. These dequantization operations account for >50% of execution time. To tackle these challenges and enable fast and efficient LLM inference on GPUs, we propose the following techniques in this paper. (1) Intra-weight mixed-precision quantization. We quantize only a small fraction of groups, those with higher sensitivity (larger Hessian value and range variation), using 4 bits. We also take memory alignment on GPUs into consideration. (2) Exclusive 2-bit sparse outliers with minimum speed degradation. We reserve only a small fraction of large weights in 2-bit groups as 16-bit sparse outliers, which leads to a lower average bit increment and less speed degradation. (3) Asynchronous dequantization. We point out that calculating the scales of each group in 2-order dequantization is independent of loading the weights of each group in 1-order dequantization. Thus, we design asynchronous dequantization on GPUs. We conduct extensive experiments on different model families (e.g., Llama3) and model sizes. We achieve 2.91 bits per weight, counting all scales/zeros, for different models with negligible loss. As a result, with our 2/4/16 mixed-precision quantization for each weight matrix and asynchronous dequantization during inference, our design achieves an end-to-end speedup of 1.74× over the original model for Llama2-7b, and reduces both runtime cost and total cost by up to 2.53× and 2.29×, respectively, with fewer GPU requirements.
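The two dequantization levels mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' GPU kernel: the function names, the 4-bit width for quantized scales, the group size, and the choice to keep per-group zeros unquantized are all assumptions made for illustration. The sketch only shows why recovering the group scales (2-order step) needs the scale codes alone, so it could in principle proceed independently of loading the 2-bit weight codes (1-order step).

```python
def quantize_values(values, bits):
    """Uniform asymmetric quantization: returns (codes, scale, zero)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid a zero scale
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_values(codes, scale, zero):
    return [c * scale + zero for c in codes]


def quantize_row(row, group_size=4, bits=2, scale_bits=4):
    """Group-wise quantization where the group scales are quantized too.

    scale_bits=4 and unquantized zeros are illustrative assumptions.
    """
    levels = (1 << bits) - 1
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    zeros = [min(g) for g in groups]
    raw_scales = [(max(g) - min(g)) / levels or 1.0 for g in groups]
    # Second level: quantize the per-group scales themselves.
    scale_codes, s2, z2 = quantize_values(raw_scales, scale_bits)
    # Quantize weights against the scales the decoder will actually see.
    seen_scales = dequantize_values(scale_codes, s2, z2)
    codes = [
        [min(levels, max(0, round((w - z) / s))) for w in g]
        for g, z, s in zip(groups, zeros, seen_scales)
    ]
    return {"codes": codes, "zeros": zeros,
            "scale_codes": scale_codes, "s2": s2, "z2": z2}


def dequantize_row(q):
    # 2-order step: depends only on scale_codes, s2, z2 (not on weight codes).
    scales = dequantize_values(q["scale_codes"], q["s2"], q["z2"])
    # 1-order step: apply the recovered scales to the low-bit weight codes.
    out = []
    for codes, scale, zero in zip(q["codes"], scales, q["zeros"]):
        out.extend(dequantize_values(codes, scale, zero))
    return out


row = [0.0, 1.0, 2.0, 3.0, 0.0, 2.0, 4.0, 6.0]
q = quantize_row(row, group_size=4)
restored = dequantize_row(q)
```

In this toy row every weight sits exactly on a quantization level, so `restored` matches `row` up to floating-point error; real weights would incur rounding error at both levels.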

External IDs: doi:10.1145/3676536.3676796