Keywords: Understanding Quantization, Int8, Empirical Investigation, Post-Training Quantization
TL;DR: This paper investigates the impact of post-training quantization on OPT and BLOOM model families, compares various PTQ methods, and introduces a novel Low Rank Compensation technique for improved model quality recovery with minimal size increase.
Abstract: Post-training quantization (PTQ) has recently been demonstrated as a viable method to reduce memory consumption and compute cost for large language models. However, a comprehensive study on the effects of different quantization schemes, model families, and quantization bit precisions has been lacking. In this work, we provide an extensive analysis of these components. We examine the impact of PTQ on weight-only, activation-only, and weight-and-activation quantization using various methods such as round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants, applied to two model families (OPT and BLOOM) with sizes ranging from 125M to 176B parameters. We contribute by: (1) conducting a sensitivity analysis, revealing that activation quantization is generally more sensitive than weight quantization, and that smaller models typically fare better than larger ones under activation quantization; (2) evaluating and comparing existing PTQ methods to maximize model size reduction while minimizing the accuracy impact, finding that current methods can hardly recover the original model quality with either INT4 weights or INT4 weights and INT8 activations; and (3) optimizing existing methods based on these insights and introducing a technique called Low Rank Compensation (LoRC), which uses low-rank matrices to enhance model quality recovery with a negligible increase in model size.
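The sketch below is a minimal, illustrative example (not the paper's implementation) of the two ideas the abstract names: round-to-nearest (RTN) weight quantization and a LoRC-style low-rank compensation of the resulting quantization error. The function names, the symmetric per-row scaling, and the choice of rank are assumptions made for this example.

```python
# Minimal sketch: RTN weight quantization plus low-rank compensation of the
# quantization error via truncated SVD (LoRC-style).  All names and defaults
# here are illustrative assumptions, not the paper's code.
import numpy as np

def rtn_quantize(w: np.ndarray, n_bits: int = 4):
    """Symmetric per-row round-to-nearest quantization of a 2-D weight matrix."""
    q_max = 2 ** (n_bits - 1) - 1                      # e.g. 7 for INT4
    scale = np.abs(w).max(axis=1, keepdims=True) / q_max
    scale = np.where(scale == 0, 1.0, scale)           # guard all-zero rows
    q = np.clip(np.round(w / scale), -q_max - 1, q_max)
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

def lorc_compensate(w: np.ndarray, n_bits: int = 4, rank: int = 8):
    """Approximate the quantization error W - W_hat with a rank-`rank` factorization."""
    q, scale = rtn_quantize(w, n_bits)
    err = w - dequantize(q, scale)                     # quantization error
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    u_k = u[:, :rank] * s[:rank]                       # absorb singular values into U
    v_k = vt[:rank, :]
    return q, scale, u_k, v_k                          # low-rank factors stored alongside INT weights

def effective_weight(q, scale, u_k, v_k) -> np.ndarray:
    """Reconstruct the compensated weight: dequant(q) + U_k @ V_k."""
    return dequantize(q, scale) + u_k @ v_k

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((256, 256)).astype(np.float32)
    q, scale, u_k, v_k = lorc_compensate(w, n_bits=4, rank=8)
    plain = np.abs(w - dequantize(q, scale)).mean()
    comp = np.abs(w - effective_weight(q, scale, u_k, v_k)).mean()
    print(f"mean |error| RTN only: {plain:.4f}  with low-rank compensation: {comp:.4f}")
```

Because the two factors have shapes (d, rank) and (rank, d), the extra storage grows only linearly in the hidden dimension, which is why the compensation adds a negligible amount to the overall model size.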
Supplementary Material: pdf
Submission Number: 9451