A Frustratingly Easy Post-Training Quantization Scheme for LLMs

Yongkweon Jeon; Chungman Lee; Kyungphil Park; Ho-young Kim

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Main

Submission Type: Regular Long Paper

Submission Track: Efficient Methods for NLP

Keywords: Quantization, Efficient LLM, Model Compression

TL;DR: We propose a post-training quantization scheme for LLMs

Abstract: Efficient inference has become crucial for hyper-scale AI models, including large language models, as their parameter count continues to increase for enhanced performance. This necessity holds true regardless of the computing environment, whether it be mobile devices or cloud servers. Quantization emerges as a solution to alleviate the computational burden during inference. By representing models with a reduced bit-width, quantization minimizes the frequency of DRAM access while fully exploiting the parallelism of operations through a dense matrix format. Consequently, quantized models achieve low end-to-end latency and optimize resource utilization by addressing both memory and computing bottlenecks. In this paper, we propose a straightforward post-training quantization scheme, called Z-Fold, that fully utilizes the feature of the Transformer structure widely employed in large language models.
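
To make "representing models with a reduced bit-width" concrete, the following is a minimal sketch of generic symmetric round-to-nearest weight quantization. It is not the Z-Fold method, whose details are not given in the abstract; the function names `quantize_row` and `dequantize_row`, the per-row scale, and the 4-bit setting are all assumptions chosen for illustration.

```python
def quantize_row(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight row.

    Generic baseline for illustration only (NOT Z-Fold).
    Returns integer codes and the floating-point scale shared by the row.
    """
    qmax = 2 ** (bits - 1) - 1              # largest signed code, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest code and clamp to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Hypothetical weight row, made up for this example.
row = [0.7, -0.3, 0.0, 0.12]
codes, scale = quantize_row(row, bits=4)
recon = dequantize_row(codes, scale)

# Storage per weight falls from 16 bits (FP16) to 4 bits, plus one scale per row.
compression_ratio = 16 / 4
```

Each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`). Post-training quantization schemes differ mainly in how they choose the scales and rounding so that this error harms model accuracy as little as possible.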

Submission Number: 1882
