Keywords: Post-Training Quantization, Layer-by-Layer Knowledge Distillation, BERT, GPT-3, GPT-Neox-20B
TL;DR: Our cost-free INT8 post-training quantization achieves inference speedups on various large NLP models, including GPT-NeoX-20B, for which we obtain 5.2x better efficiency compared to the FP16 model.
Abstract: Efficiently serving ever-larger trained natural language models in practice has become exceptionally challenging, even for powerful cloud servers, due to their prohibitive memory/computation requirements.
In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed \OURS.
\OURS is an end-to-end quantization and inference pipeline with three main components:
(1) a fine-grained, hardware-friendly quantization scheme for both weights and activations (see the first sketch after the abstract);
(2) a novel, affordable layer-by-layer knowledge distillation algorithm (\lwd) that does not require access to the original training data (see the second sketch after the abstract);
(3) highly optimized quantization system backend support that removes the quantization/dequantization overhead.
As such, we are able to show that:
(1) \OURS can reduce the precision of weights and activations to INT8 in a cost-free way for both \bert- and \gpt-style
models with minimal accuracy impact, which leads to up to 5.19x/4.16x speedup on \bert/\gpt-style models, respectively, compared to FP16 inference;
(2) \OURS plus \lwd can affordably quantize the weights in the fully-connected module to INT4 along with INT8 weights in the attention module and INT8 activations, resulting in 3x memory footprint reduction compared to the FP16 model;
(3) \OURS can be directly applied to two of the largest open-source language models, including \gptneox, for which our INT8 model achieves accuracy similar to the FP16 model while being 5.2x more efficient.
Our code is open-sourced at~\cite{code_compression}.
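To make component (1), the fine-grained quantization scheme, concrete, below is a minimal PyTorch-style sketch of symmetric group-wise INT8 weight quantization and dynamic token-wise INT8 activation quantization. The function names, default group count, and the fake-quantization (quantize-then-dequantize) formulation are illustrative assumptions for exposition, not the released implementation.

```python
import torch

def quantize_weight_groupwise(w: torch.Tensor, num_groups: int = 64, n_bits: int = 8):
    """Symmetric group-wise weight quantization: each weight group gets its own scale.
    Assumes w.numel() is divisible by num_groups (hypothetical helper for illustration)."""
    q_max = 2 ** (n_bits - 1) - 1                         # 127 for INT8
    w_grouped = w.reshape(num_groups, -1)                 # split the weight matrix into groups
    scale = w_grouped.abs().amax(dim=1, keepdim=True) / q_max
    w_int = torch.clamp(torch.round(w_grouped / scale), -q_max - 1, q_max)
    return (w_int * scale).reshape_as(w)                  # dequantized ("fake-quantized") tensor

def quantize_activation_tokenwise(x: torch.Tensor, n_bits: int = 8):
    """Symmetric token-wise activation quantization: one dynamic scale per token (row)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True) / q_max    # range computed on the fly per token
    x_int = torch.clamp(torch.round(x / scale), -q_max - 1, q_max)
    return x_int * scale
```

Per-group and per-token scales keep the quantization grid matched to the local dynamic range, which is what allows INT8 with minimal accuracy impact; the actual kernels fuse these steps to avoid the quantization/dequantization overhead mentioned in component (3).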
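Component (2), layer-by-layer knowledge distillation, can be pictured with the following hedged sketch: one Transformer layer at a time, the quantized (student) layer is trained to match the output of its full-precision (teacher) counterpart on unlabeled calibration inputs, so no original training data or labels are required. The loop structure, the `teacher_layers`/`student_layers` naming, the optimizer choice, and the assumption that each layer maps a hidden-state tensor to a hidden-state tensor are ours for illustration and are not taken from the released code.

```python
import torch

def layer_by_layer_distill(teacher_layers, student_layers, calib_inputs,
                           lr: float = 1e-5, steps: int = 100):
    """Distill one layer at a time by matching its full-precision output with an MSE loss."""
    hidden = calib_inputs                                   # e.g. embeddings of unlabeled calibration text
    for t_layer, s_layer in zip(teacher_layers, student_layers):
        opt = torch.optim.Adam(s_layer.parameters(), lr=lr)
        for _ in range(steps):
            with torch.no_grad():
                target = t_layer(hidden)                    # teacher (FP) output for this layer
            loss = torch.nn.functional.mse_loss(s_layer(hidden), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            hidden = t_layer(hidden)                        # next layer's input from the teacher path
    return student_layers
```

Because each layer is optimized in isolation against its own teacher output, only one extra layer needs to be held in memory at a time, which is what keeps the distillation affordable even for very large models.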
Supplementary Material: pdf
Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/zeroquant-efficient-and-affordable-post/code)