Keywords: Post-Training Quantization, Deep Learning, Image Classification, Convolutional Neural Networks
TL;DR: We develop a novel Error Reduction approach for Post-Training Quantization via Cluster-based Affine Transformation
Abstract: Post-Training Quantization (PTQ) reduces the memory footprint and computational overhead of deep neural networks by converting full-precision (FP) values into quantized and compressed data types.
While PTQ is more cost-efficient than Quantization-Aware Training (QAT), it is highly susceptible to accuracy degradation under a low-bit quantization (LQ) regime (e.g., 2-bit and 4-bit).
Affine transformation is a classical technique for reducing the discrepancy between the information processed by a quantized model and that processed by its full-precision counterpart. However, we find that plain affine transformation, which applies a single affine parameter set uniformly to all outputs, is ineffective in low-bit PTQ.
To address this, we propose Cluster-based Affine Transformation (CAT), an error reduction framework that applies cluster-specific affine transformation to align LQ and FP outputs.
CAT directly refines quantized outputs with only a negligible number of additional parameters.
Experiments on ImageNet-1K demonstrate that CAT consistently outperforms prior PTQ methods across diverse architectures and low-bit settings, achieving up to 53.18% Top-1 accuracy on W2A2 ResNet-18, and delivering improvements of more than 3% when combined with strong PTQ baselines.
We plan to release CAT’s code alongside the publication of this paper.
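To make the abstract's core idea concrete, here is a minimal sketch of cluster-specific affine refinement: quantized outputs are partitioned into clusters, and each cluster receives its own (scale, shift) pair fitted to match the full-precision outputs. The cluster assignment (a simple sign split), the least-squares fitting step, and all function names below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def fit_cluster_affine(y_q, y_fp, labels, n_clusters):
    """Fit per-cluster (scale, shift) by least squares so that
    scale * y_q + shift approximates y_fp within each cluster.
    This fitting rule is an assumption for illustration only."""
    params = np.zeros((n_clusters, 2))
    for c in range(n_clusters):
        mask = labels == c
        # Design matrix [y_q, 1] -> solve for [scale, shift]
        A = np.stack([y_q[mask], np.ones(mask.sum())], axis=1)
        params[c] = np.linalg.lstsq(A, y_fp[mask], rcond=None)[0]
    return params

def apply_cluster_affine(y_q, labels, params):
    """Refine each quantized output with its cluster's affine parameters."""
    scale = params[labels, 0]
    shift = params[labels, 1]
    return scale * y_q + shift

# Toy demo: coarse 1-bit-like quantization of synthetic activations.
rng = np.random.default_rng(0)
y_fp = rng.normal(size=1000)                  # full-precision outputs
y_q = np.sign(y_fp) * 0.5                     # crude quantized outputs
labels = (y_q > 0).astype(int)                # hypothetical 2-cluster split
params = fit_cluster_affine(y_q, y_fp, labels, n_clusters=2)
y_refined = apply_cluster_affine(y_q, labels, params)
```

Because the identity map (scale 1, shift 0) lies inside the affine family, per-cluster least squares can never increase the mean-squared error relative to the unrefined quantized outputs, which is the sense in which the transformation "aligns" LQ and FP outputs; the overhead is just two parameters per cluster.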
Supplementary Material: pdf
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 10973