ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

Zhewei Yao; Reza Yazdani Aminabadi; Minjia Zhang; Xiaoxia Wu; Conglong Li; Yuxiong He

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He

Published: 31 Oct 2022, Last Modified: 04 Aug 2025NeurIPS 2022 AcceptReaders: Everyone

Keywords: Post-Training Quantization, Layer-by-Layer Knowledge Distillation, BERT, GPT-3, GPT-Neox-20B

TL;DR: Our cost-free INT8 post-training quantization can achieve inference speedup on various of large NLP models, including GPT-Neox-20, for which we get 5.2x better efficiency as compared to the FP16 model.

Abstract: How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed as \OURS. \OURS is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained hardware-friendly quantization scheme for both weight and activations; (2) a novel affordable layer-by-layer knowledge distillation algorithm (\lwd) even without the original training data access; (3) a highly-optimized quantization system backend support to remove the quantization/dequantization overhead. As such, we are able to show that: (1) \OURS can reduce the precision for weight and activations to INT8 in a cost-free way for both \bert and \gpt-style models with minimal accuracy impact, which leads to up to 5.19x/4.16x speedup on \bert/\gpt-style models compared to FP16 inference, separately; (2) \OURS plus \lwd can affordably quantize the weights in the fully-connected module to INT4 along with INT8 weights in the attention module and INT8 activations, resulting in 3x memory footprint reduction compared to the FP16 model; (3) \OURS can be directly applied to two of the largest open-sourced language models, including \gptneox, for which our INT8 model achieves similar accuracy as the FP16 model but achieves 5.2x better efficiency. Our code is open-sourced at~\cite{code_compression}.

Supplementary Material: pdf

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 2 code implementations](https://www.catalyzex.com/paper/zeroquant-efficient-and-affordable-post/code)

17 Replies

Loading