AWRQ: Activation-aware Weight Reformulation Quantizer for Large Language Models

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Desk Rejected Submission
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: LLMs, Quantization, Weight Reformulation, Low-bits, Block-wise
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a low-bit quantization method for both weights and activations and speed it up with a block-wise technique.
Abstract: Large Language Models (LLMs) have shown great potential in many fields thanks to their excellent performance on a wide range of tasks, but their use is limited by enormous computational and storage costs. Quantization is a promising way to address this challenge, and many insightful methods have been proposed, including GPTQ, a post-training quantization method that achieves state-of-the-art results for low-bit quantization of LLMs. However, GPTQ focuses only on the generative setting that predicts an output token given an input prompt; it does not consider the practical scenario of generating a sequence of tokens step by step, which requires activations to be quantized as well. In this paper, we extend GPTQ to the joint quantization of weights and activations and propose the Activation-aware Weight Reformulation Quantizer (AWRQ), which transfers activation quantization errors to the weights and then quantizes the weights by solving a series of minimization problems. GPTQ is inefficient during calibration because it quantizes only one column of weights per iteration; we speed this up with a block-wise technique that quantizes several columns in parallel at each step. Since low-bit activation quantization can cause an accuracy collapse for LLMs, we apply SmoothQuant before our experiments, which allows us to realize W4A6 (4-bit weights, 6-bit activations) quantization of LLMs for the first time. We evaluate our method on the OPT and LLaMA model families and show that the quantized models suffer only a slight accuracy loss at W4A8 and achieve state-of-the-art accuracy at W4A6. With the block-wise technique we obtain a 4$\times$ speedup with no accuracy degradation when quantizing OPT-13B. Our algorithm is one-shot and hardware-friendly, making it highly efficient for both quantization and deployment.
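To make the block-wise idea in the abstract concrete, below is a minimal NumPy sketch of a GPTQ-style second-order weight quantizer that processes columns in blocks rather than one at a time: each block is quantized, and the resulting error is propagated to the not-yet-quantized columns through the inverse Hessian of the calibration activations. All names (`quantize_gptq_blockwise`, `block_size`, the symmetric round-to-nearest quantizer) are illustrative assumptions, not the authors' AWRQ implementation; the activation-error transfer and SmoothQuant steps described in the abstract are omitted.

```python
# Minimal sketch (assumed, not the paper's code) of block-wise GPTQ-style
# weight quantization with error propagation via the inverse Hessian.
import numpy as np

def quantize_symmetric(w, bits=4):
    """Per-column symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def quantize_gptq_blockwise(W, X, bits=4, block_size=128, damp=0.01):
    """
    W: (rows, cols) weight matrix of one linear layer.
    X: (cols, n_samples) calibration activations feeding that layer.
    Quantizes W a block of columns at a time and compensates the remaining
    columns for the block's quantization error (OBS-style update).
    """
    W = W.copy().astype(np.float64)
    cols = W.shape[1]
    H = X @ X.T                                   # proxy Hessian of the layer
    H += damp * np.mean(np.diag(H)) * np.eye(cols)  # damping for stability
    Hinv = np.linalg.inv(H)

    for start in range(0, cols, block_size):
        end = min(start + block_size, cols)
        Wb = W[:, start:end]
        Qb = np.stack(
            [quantize_symmetric(Wb[:, k], bits) for k in range(end - start)],
            axis=1,
        )
        Eb = Wb - Qb                              # block quantization error
        if end < cols:
            # Optimal compensation of the remaining columns:
            # delta_r = -E_b @ inv(Hinv_bb) @ Hinv_br
            Hinv_bb = Hinv[start:end, start:end]
            Hinv_br = Hinv[start:end, end:]
            W[:, end:] -= Eb @ np.linalg.solve(Hinv_bb, Hinv_br)
        W[:, start:end] = Qb
    return W

# Toy usage with random weights and calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256))
X = rng.normal(size=(256, 512))
Wq = quantize_gptq_blockwise(W, X, bits=4, block_size=64)
```

In this sketch, larger `block_size` values amortize the error-propagation update over more columns per step, which is the source of the calibration speedup the abstract attributes to the block-wise technique; the paper's reported 4$\times$ speedup on OPT-13B refers to its own implementation, not this toy code.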
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5246