Abstract: Transformer-based language models have shown outstanding performance on a variety of NLP tasks, but deploying them on edge devices is challenging because of their large memory footprint. To address this issue, this paper proposes a novel parameter quantization method for BERT that uses a Hessian-based sensitivity metric to quantize important parameters at a higher bit width and unimportant parameters at a lower bit width. Experimental results show that our method achieves 19.6× compression of the BERT model parameters with a 0.8% accuracy drop on MNLI relative to the BERT base model, generally outperforming existing layer-wise quantization methods.
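The idea of sensitivity-guided mixed-precision quantization can be illustrated with a minimal sketch. It is not the paper's implementation: the sensitivity scores are taken as given (the abstract does not specify how the Hessian-based metric is computed), and the function names, the 4-bit/2-bit widths, the 50% split, and the 32-bit baseline are all assumptions chosen for illustration.

```python
from typing import Dict, List


def assign_bit_widths(sensitivity: Dict[str, float],
                      high_bits: int = 4,
                      low_bits: int = 2,
                      high_fraction: float = 0.5) -> Dict[str, int]:
    """Give the most sensitive parameter groups the higher bit width.

    `sensitivity` maps a (hypothetical) group name to a score; in the
    paper the score would come from a Hessian-based metric.
    """
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_high = round(len(ranked) * high_fraction)
    return {name: (high_bits if i < n_high else low_bits)
            for i, name in enumerate(ranked)}


def quantize_uniform(values: List[float], bits: int) -> List[float]:
    """Symmetric uniform quantization followed by dequantization.

    Values are mapped to integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]
    and scaled back, so the result shows the rounding error introduced.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    max_abs = max(abs(v) for v in values) or 1.0
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def compression_ratio(sizes: Dict[str, int],
                      bits: Dict[str, int],
                      baseline_bits: int = 32) -> float:
    """Ratio of baseline storage to quantized storage (parameters only)."""
    original = sum(sizes.values()) * baseline_bits
    quantized = sum(sizes[name] * bits[name] for name in sizes)
    return original / quantized
```

Under these assumptions, four equally sized groups split evenly between 4 and 2 bits average 3 bits per parameter, roughly a 10.7× reduction from a 32-bit baseline; a real pipeline would also need to store scales and other metadata, which this sketch ignores.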