Abstract: Model compression accelerates deep neural networks by quantizing their full-precision weights into low-bit representations. However, most existing quantization schemes rely on simple thresholding operations, which incur serious precision loss. In this paper, we propose a new quantization framework combined with pruning, called Multiple Residual Quantization of Pruning (MRQP), to obtain a higher-precision quantized neural network (QNN). MRQP recursively quantizes the full-precision weights by combining a low-bit weight stem with multiple low-bit residual parts, minimizing the error between the quantized weights and the full-precision weights and thereby preserving accuracy. At the same time, MRQP prunes weights that have little impact on the loss function to further reduce model size.
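The abstract does not specify the quantizer or the pruning criterion, so the following is only a minimal sketch of the residual-quantization idea: quantize the weights once (the stem), then repeatedly quantize whatever error remains (the residual parts), so the sum of the low-bit parts approaches the full-precision weights. It assumes a uniform symmetric low-bit quantizer and uses magnitude-based pruning as a stand-in for the paper's loss-sensitivity criterion; all function names here are hypothetical, not from the paper.

```python
import numpy as np

def quantize_low_bit(w, bits=2):
    """Uniform symmetric quantizer onto a low-bit integer grid (assumption,
    not necessarily the quantizer used in the paper)."""
    levels = max(2 ** (bits - 1) - 1, 1)      # e.g. {-1, 0, +1} for 2 bits
    scale = np.abs(w).max() / levels
    if scale == 0.0:
        return np.zeros_like(w)
    return np.clip(np.round(w / scale), -levels, levels) * scale

def residual_quantize(w, bits=2, stages=3):
    """Approximate w as a sum of low-bit parts: a stem plus residuals.

    Each stage quantizes the error left by the previous stages, so the
    summed approximation error shrinks with every pass."""
    parts, residual = [], w.astype(np.float64).copy()
    for _ in range(stages):
        q = quantize_low_bit(residual, bits)
        parts.append(q)
        residual -= q                          # error the next stage must fit
    return parts, residual

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights; a simple proxy for the
    loss-impact pruning criterion described in the abstract."""
    k = int(w.size * sparsity)
    if k == 0:
        return w, np.ones_like(w, dtype=bool)
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask, mask

# Usage: prune, then residual-quantize the surviving weights.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
pruned, mask = magnitude_prune(w, sparsity=0.5)
parts, _ = residual_quantize(pruned, bits=2, stages=3)
approx = np.sum(parts, axis=0)
print("relative reconstruction error:",
      np.linalg.norm(pruned - approx) / np.linalg.norm(pruned))
```

Adding stages trades storage (one more low-bit tensor per stage) for a smaller reconstruction error, which is the precision/size trade-off the framework is built around.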