DC-MPQ: Distributional Clipping-based Mixed-Precision Quantization for Convolutional Neural Networks

Seungjin Lee, Hyun Kim

2022 (modified: 01 Feb 2023), AICAS 2022

Abstract: Quantization is a representative network compression technique that reduces the number of computational operations and memory accesses in convolutional neural networks (CNNs). Naïve quantization has a drawback: as precision decreases, fewer quantization points are available for near-zero values, so the quantization error increases. Recent studies have proposed various solutions to this problem, but few of them take the characteristics of hardware accelerator implementation into account. To address this gap, this study proposes using the standard deviation of each layer's weight distribution, a simple statistic, as the clipping point and deriving the scale factor from that clipping point to quantize the weights into a mixed-precision 4-bit/8-bit integer format. The proposed technique can be applied to any network without additional training, and only biasing and mapping are performed based on pre-stored standard deviation values; the computational complexity is therefore low, which makes the method hardware-friendly. Experimental results show that mixed-precision quantization of the ResNet-18 weights on ImageNet reduces the weight capacity by 84% with a 0.34% Top-1 accuracy drop compared to full precision. For YOLACT, an instance segmentation model with a ResNet-50 backbone, on MS COCO, the method reduces the weight capacity by 81.7% with drops of only 0.27% and 0.19% in box mean average precision (mAP) and mask mAP, respectively.
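
The idea of clipping at a standard-deviation-based point and deriving the scale factor from it can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the multiplier `k` applied to the standard deviation, the symmetric signed integer range, and the 32-bit full-precision baseline are all assumptions made for illustration, and the abstract's "biasing" step is not modeled.

```python
import statistics


def quantize_layer(weights, bits, k=3.0):
    """Symmetric uniform quantization with a std-based clipping point.

    `k` is a hypothetical multiplier (not taken from the paper);
    k=1.0 clips exactly at one standard deviation.
    """
    sigma = statistics.pstdev(weights)   # per-layer statistic, could be pre-stored
    clip = k * sigma                     # clipping point
    qmax = 2 ** (bits - 1) - 1           # 7 for 4-bit, 127 for 8-bit
    scale = clip / qmax if clip > 0 else 1.0
    quantized = []
    for w in weights:
        w_clipped = max(-clip, min(clip, w))
        q = int(round(w_clipped / scale))
        quantized.append(max(-qmax, min(qmax, q)))  # guard against float rounding
    return quantized, scale


def dequantize(quantized, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [q * scale for q in quantized]


def weight_reduction(layer_sizes, layer_bits, full_bits=32):
    """Fraction of weight storage saved versus an assumed 32-bit baseline."""
    total = sum(layer_sizes) * full_bits
    used = sum(n * b for n, b in zip(layer_sizes, layer_bits))
    return 1.0 - used / total
```

Under the assumed 32-bit baseline, quantizing every layer to 4 bits would save 87.5% of weight storage and quantizing every layer to 8 bits would save 75%, so the reported reductions of 84% and 81.7% are consistent with a mix of the two precisions. Which layers receive which precision is not stated in the abstract and is left to the caller in `weight_reduction`.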
