Keywords: compression, quantization, pruning, deep learning, vector quantization, quantization aware training, post training quantization, BERT
TL;DR: Structured lattice-based vector quantization enables stable and accurate quantization-aware training at low bit-rates
Abstract: Quantization is an effective approach for deploying deep learning models on resource-constrained hardware, but maintaining accuracy and training stability at extreme low precision remains a major challenge. In this work, we study lattice-based vector quantization (VQ) as a practical alternative to scalar quantization for low-bit quantization-aware training (QAT). We develop a unified quantization pipeline that integrates structured lattice projections into both QAT and post-training quantization (PTQ), supporting multiple lattice choices—including E8 and D4—via a fused projection operator with straight-through estimation.
Through extensive experiments across a wide range of bit-widths, lattice parameterizations, and training regimes, we show that lattice-based VQ consistently enables stable training and meaningful accuracy below 2 bits, where scalar quantization and existing PTQ methods typically underperform or are unavailable. In this low-bit regime, exploiting geometric structure across weight blocks improves robustness by reducing overload and stabilizing optimization, while at moderate and higher bit-widths, performance differences narrow and simpler quantization schemes become sufficient. We further analyze the role of lattice choice, dynamic-range scaling, and overload behavior, and demonstrate that explicit overload control is central to reliable low-bit performance. Finally, we show that lattice-based QAT extends beyond binary classification and weight-only quantization, supporting multi-class tasks, joint weight–activation quantization, and transformer encoders such as BERT, achieving substantial compression with controlled accuracy degradation.
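The abstract mentions a fused projection operator with straight-through estimation for lattices such as D4. The paper's actual implementation is not shown here, but a minimal illustrative sketch of the D4 case (the function names `nearest_d4_point` and `ste_quantize` are my own, not the paper's) could look like the following, using the classic round-then-fix rule for the D4 lattice (integer vectors with even coordinate sum):

```python
import numpy as np

def nearest_d4_point(x):
    """Project a length-4 vector onto the D4 lattice: integer vectors
    whose coordinates sum to an even number (round-then-fix rule)."""
    x = np.asarray(x, dtype=float)
    f = np.rint(x)  # round each coordinate to the nearest integer
    if int(f.sum()) % 2 != 0:
        # Odd parity: re-round the coordinate with the largest rounding
        # error in the opposite direction to restore even parity.
        i = int(np.argmax(np.abs(x - f)))
        f[i] += 1.0 if x[i] > f[i] else -1.0
    return f

def ste_quantize(w):
    """Straight-through estimator around the lattice projection: forward
    pass returns the quantized point; in an autograd framework the
    (q - w) term would be detached, e.g. w + (q - w).detach() in PyTorch,
    so gradients flow through w unchanged."""
    q = nearest_d4_point(w)
    return w + (q - w)
```

This is only a sketch of the projection for one lattice; the paper's fused operator presumably batches such projections over weight blocks and handles dynamic-range scaling before projection.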
Submission Number: 113