Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, Dmitry Kalenichenko

2018 (modified: 13 Nov 2024)CVPR 2018Readers: Everyone

Abstract: The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based visual recognition models call for efficient on-device inference schemes. We propose a quantization scheme along with a co-designed training procedure allowing inference to be carried out using integer-only arithmetic while preserving an end-to-end model accuracy that is close to floating-point inference. Inference using integer-only arithmetic performs better than floating-point arithmetic on typical ARM CPUs and can be implemented on integer-arithmetic-only hardware such as mobile accelerators (e.g. Qualcomm Hexagon). By quantizing both activations and weights as 8-bit integers, we obtain a close to 4x memory footprint reduction compared to 32-bit floating-point representations. Even on MobileNets, a model family known for runtime efficiency, our quantization approach results in an improved tradeoff between latency and accuracy on popular ARM CPUs for ImageNet classification and COCO detection.

0 Replies