Beyond Weight-Only: Mixed-Precision Quantization for BERT Weights, Activations and Embeddings

17 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Quantization, Mixed-precision, Inference, Embedding quantization, Quantization-Aware Training, NLP, BERT
TL;DR: A quantization-aware training tool for mixed-precision inference of LLMs
Abstract: Pre-trained language models deliver strong performance across various Natural Language Processing (NLP) tasks but remain costly to deploy due to their memory and compute demands. To address this, model compression techniques such as pruning, knowledge distillation, and quantization have emerged, with quantization gaining traction thanks to hardware support for low-precision arithmetic. While uniform and extremely low-precision quantization have shown promise, mixed-precision approaches that assign variable bit-widths to weights and activations across the model offer a better balance between compression and accuracy. In this work, we evaluate the impact of mixed-precision quantization for inference on the BERT language model. Unlike prior work, which often neglects activation quantization, our study systematically explores both weights and activations in mixed-precision configurations. To further improve performance, we integrate knowledge distillation into the mixed-precision pipeline. We also evaluate the impact of quantization on the embedding layer, which prior work generally restricts to quantizing only the token embedding weights. Evaluated on the SQuAD and GLUE benchmarks, our approach achieves substantial memory and computational reductions without sacrificing accuracy.
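To make the abstract's ingredients concrete, here is a minimal PyTorch sketch (not the authors' code) of the two main pieces it describes: per-layer mixed-precision fake quantization of both weights and activations, and a knowledge-distillation loss folded into the quantization-aware training objective. The bit-width choices (4-bit weights, 8-bit activations), the temperature, and the loss weighting are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of mixed-precision quantization-aware training with distillation.
# Bit-widths, temperature T, and alpha below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    # Quantized values in the forward pass, identity gradient in the backward pass.
    return x + (q * scale - x).detach()


class MixedPrecisionLinear(nn.Module):
    """Linear layer whose weights and input activations are quantized
    to independently chosen bit-widths (mixed precision per layer)."""

    def __init__(self, in_features: int, out_features: int,
                 w_bits: int = 4, a_bits: int = 8):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.w_bits, self.a_bits = w_bits, a_bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quant(self.linear.weight, self.w_bits)
        x_q = fake_quant(x, self.a_bits)
        return F.linear(x_q, w_q, self.linear.bias)


def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    """Combine soft teacher targets with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

In a full pipeline one would swap BERT's linear (and, per the abstract, embedding) modules for quantized counterparts with a per-layer bit-width assignment, then fine-tune the quantized student against the full-precision teacher using a loss like the one above; the embedding case would apply `fake_quant` to the token embedding table as well.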
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8946