Keywords: Understanding Quantization, Int8, Empirical Investigation, Post-Training Quantization
TL;DR: This paper investigates the impact of post-training quantization on OPT and BLOOM model families, compares various PTQ methods, and introduces a novel Low Rank Compensation technique for improved model quality recovery with minimal size increase.
Abstract: Post-training quantization (PTQ) has recently been demonstrated as a viable method to reduce memory consumption and compute cost for large language models. However, a comprehensive study on the effects of different quantization schemes, model families, and quantization bit precisions has been lacking. In this work, we provide an extensive analysis of these components. We examine the impact of PTQ on weight-only, activation-only, and weight-and-activation quantization using various methods such as round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants, applied to two model families (OPT and BLOOM) with sizes ranging from 125M to 176B parameters. We contribute by: (1) conducting a sensitivity analysis, revealing that activation quantization is generally more sensitive than weight quantization, and that smaller models typically fare better than larger ones under activation quantization; (2) evaluating and comparing existing PTQ methods to maximize model size reduction while minimizing the accuracy impact, finding that current methods can hardly recover the original model quality with either INT4 weights or INT4 weights and INT8 activations; and (3) optimizing existing methods based on these insights and introducing a technique called Low Rank Compensation (LoRC), which uses low-rank matrices to enhance model quality recovery with a negligible increase in model size.
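The sketch below is a minimal, illustrative example (not the paper's implementation) of the two ideas the abstract names: round-to-nearest (RTN) weight quantization and a LoRC-style low-rank compensation of the resulting quantization error. The function names, the symmetric per-row scaling, and the choice of rank are assumptions made for this example.

```python
# Minimal sketch: RTN weight quantization plus low-rank compensation of the
# quantization error via truncated SVD (LoRC-style).  All names and defaults
# here are illustrative assumptions, not the paper's code.
import numpy as np

def rtn_quantize(w: np.ndarray, n_bits: int = 4):
    """Symmetric per-row round-to-nearest quantization of a 2-D weight matrix."""
    q_max = 2 ** (n_bits - 1) - 1                      # e.g. 7 for INT4
    scale = np.abs(w).max(axis=1, keepdims=True) / q_max
    scale = np.where(scale == 0, 1.0, scale)           # guard all-zero rows
    q = np.clip(np.round(w / scale), -q_max - 1, q_max)
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

def lorc_compensate(w: np.ndarray, n_bits: int = 4, rank: int = 8):
    """Approximate the quantization error W - W_hat with a rank-`rank` factorization."""
    q, scale = rtn_quantize(w, n_bits)
    err = w - dequantize(q, scale)                     # quantization error
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    u_k = u[:, :rank] * s[:rank]                       # absorb singular values into U
    v_k = vt[:rank, :]
    return q, scale, u_k, v_k                          # low-rank factors stored alongside INT weights

def effective_weight(q, scale, u_k, v_k) -> np.ndarray:
    """Reconstruct the compensated weight: dequant(q) + U_k @ V_k."""
    return dequantize(q, scale) + u_k @ v_k

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((256, 256)).astype(np.float32)
    q, scale, u_k, v_k = lorc_compensate(w, n_bits=4, rank=8)
    plain = np.abs(w - dequantize(q, scale)).mean()
    comp = np.abs(w - effective_weight(q, scale, u_k, v_k)).mean()
    print(f"mean |error| RTN only: {plain:.4f}  with low-rank compensation: {comp:.4f}")
```

Because the two factors have shapes (d, rank) and (rank, d), the extra storage grows only linearly in the hidden dimension, which is why the compensation adds a negligible amount to the overall model size.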
Supplementary Material: pdf
Submission Number: 9451