Keywords: Extreme Compression, Binary Quantization, Layer Reduction, BERT, Knowledge Distillation, Understanding Quantization, Empirical Investigation
TL;DR: We demonstrate that a simple yet effective pipeline for extreme Transformer compression can reduce the size of BERT by 50x with minimal accuracy impact.
Abstract: Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices.
However, to preserve the accuracy for such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning.
Also, they oftentimes focus less on smaller transformer models that have already been heavily compressed via knowledge distillation and lack a systematic study to show the effectiveness of their methods.
In this paper, we perform a comprehensive, systematic study to measure the impact of many key hyperparameters and training strategies from previous work.
As a result, we find that previous baselines for ultra-low bit precision quantization are significantly under-trained.
Based on our study, we propose a simple yet effective compression pipeline for extreme compression.
Our simplified pipeline demonstrates that
(1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, like TinyBERT;
(2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.
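
To make the binary/ternary quantization mentioned above concrete, the following is a minimal sketch of two commonly used weight quantizers: a sign-based binarizer with a mean-absolute-value scale, and a threshold-based ternarizer. It is an illustration under stated assumptions, not necessarily the quantizer used in this paper: the function names `binarize` and `ternarize` and the `threshold_factor` default of 0.7 are hypothetical choices made for this example.

```python
from typing import List


def binarize(weights: List[float]) -> List[float]:
    """Map every weight to +alpha or -alpha, where alpha = mean(|w|).

    Assumption: a sign-based scheme with a single per-tensor scale.
    """
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]


def ternarize(weights: List[float], threshold_factor: float = 0.7) -> List[float]:
    """Map every weight to -alpha, 0, or +alpha.

    Assumption: weights whose magnitude is at most
    threshold_factor * mean(|w|) are zeroed; alpha is the mean
    magnitude of the weights that survive the threshold.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_factor * mean_abs
    kept = [abs(w) for w in weights if abs(w) > delta]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return [
        0.0 if abs(w) <= delta else (alpha if w > 0 else -alpha)
        for w in weights
    ]


if __name__ == "__main__":
    example = [0.5, -1.5, 0.1, -0.1]  # toy weights, not from the paper
    print("binary: ", binarize(example))
    print("ternary:", ternarize(example))
```

In both cases, each stored weight needs only 1 or 2 bits plus one shared floating-point scale, which is where the bulk of the size reduction over 32-bit weights comes from; layer reduction then removes whole blocks of weights on top of that.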
Supplementary Material: pdf