Keywords: Quantization, Compression, Deep Neural Network
Abstract: This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches—such as uniform quantization and binary coding quantization—that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a better trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2% and 59.1% on the ImageNet dataset under calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
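The abstract contrasts BQQ with first-order binary coding quantization, which approximates a real-valued matrix as a linear combination of binary bases, W ≈ Σᵢ αᵢBᵢ with Bᵢ ∈ {−1, +1}. As background, here is a minimal sketch of that conventional baseline using greedy residual binarization; the specific formulation of BQQ itself is not given in the abstract, so it is not implemented here, and the function name and greedy scheme are illustrative assumptions rather than the paper's method.

```python
import numpy as np

def binary_coding_quantize(W, k):
    """Greedy first-order binary coding: W ≈ sum_i alpha_i * B_i,
    where each B_i is a {-1,+1} matrix and alpha_i is a scalar.
    (Illustrative baseline; not the paper's BQQ formulation.)"""
    R = np.asarray(W, dtype=np.float64).copy()  # residual to binarize
    alphas, bases = [], []
    for _ in range(k):
        B = np.where(R >= 0, 1.0, -1.0)   # binary basis: sign of residual
        alpha = np.abs(R).mean()          # L2-optimal scale for a sign basis
        alphas.append(alpha)
        bases.append(B)
        R = R - alpha * B                 # peel off this term, repeat on residual
    approx = sum(a * B for a, B in zip(alphas, bases))
    return approx, alphas, bases
```

Each added basis costs one bit per matrix entry plus one scalar, so reconstruction error shrinks as the per-entry bit budget grows; BQQ's claim is that quadratic binary expressions reach a lower error at the same budget.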
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 17010