ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

Published: 31 Oct 2022, 18:00, Last Modified: 15 Dec 2022, 18:32
NeurIPS 2022 Accept
Readers: Everyone
Keywords: Post-Training Quantization, Layer-by-Layer Knowledge Distillation, BERT, GPT-3, GPT-NeoX-20B
TL;DR: Our cost-free INT8 post-training quantization speeds up inference on a variety of large NLP models, including GPT-NeoX-20B, for which we get 5.2x better efficiency compared to the FP16 model.
Abstract: Efficiently serving ever-larger trained natural language models in practice has become exceptionally challenging, even for powerful cloud servers, due to their prohibitive memory and computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained, hardware-friendly quantization scheme for both weights and activations; (2) a novel, affordable layer-by-layer knowledge distillation algorithm (LKD) that works even without access to the original training data; (3) a highly optimized quantization system backend that removes the quantization/dequantization overhead. As such, we are able to show that: (1) ZeroQuant can reduce the precision of weights and activations to INT8 in a cost-free way for both BERT and GPT-3-style models with minimal accuracy impact, which leads to up to 5.19x/4.16x speedup on BERT/GPT-3-style models, respectively, compared to FP16 inference; (2) ZeroQuant plus LKD can affordably quantize the weights in the fully-connected module to INT4, along with INT8 weights in the attention module and INT8 activations, resulting in a 3x memory footprint reduction compared to the FP16 model; (3) ZeroQuant can be directly applied to two of the largest open-sourced language models, including GPT-NeoX-20B, for which our INT8 model achieves accuracy similar to the FP16 model with 5.2x better efficiency. Our code is open-sourced at \cite{code_compression}.
Supplementary Material: pdf
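
To make the abstract's claims concrete, here is a minimal sketch of symmetric INT8 quantization with a separate scale per row, together with back-of-the-envelope arithmetic for the 3x memory figure. This is an illustration, not the authors' implementation. The function names, the choice of one scale per row as the "fine-grained" unit, and the layer shape (4h² attention weights and 8h² fully-connected weights per Transformer layer, ignoring embeddings, biases, and stored scales) are all assumptions not stated in the abstract.

```python
import numpy as np

def quantize_int8_rowwise(x):
    """Symmetric INT8 quantization with one scale per row (assumed granularity)."""
    x = np.asarray(x, dtype=np.float32)
    max_abs = np.max(np.abs(x), axis=1, keepdims=True)
    # Map the largest magnitude in each row to 127; all-zero rows get scale 1.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original values."""
    return q.astype(np.float32) * scale

def layer_weight_bytes(hidden, attn_bits, ffn_bits):
    """Weight bytes for one layer under the assumed 4h^2 + 8h^2 parameter split."""
    attn_params = 4 * hidden * hidden  # Q, K, V and output projections
    ffn_params = 8 * hidden * hidden   # two h x 4h fully-connected matrices
    return (attn_params * attn_bits + ffn_params * ffn_bits) // 8

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
w[0] *= 100.0  # one row with a much larger range than the others

q, scale = quantize_int8_rowwise(w)
err = np.abs(dequantize(q, scale) - w)  # bounded by about scale / 2 per row

fp16 = layer_weight_bytes(1024, attn_bits=16, ffn_bits=16)
mixed = layer_weight_bytes(1024, attn_bits=8, ffn_bits=4)
ratio = fp16 / mixed  # 24h^2 bytes vs 8h^2 bytes
```

Because each row has its own scale, the large-range row does not degrade the resolution of the others, which is the usual motivation for finer quantization granularity. Under the assumed parameter split, INT8 attention weights plus INT4 fully-connected weights take one third of the FP16 weight bytes, consistent with the 3x reduction quoted in the abstract.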
