N-Ary Quantization for CNN Model Compression and Inference Acceleration

27 Sept 2018 (modified: 05 May 2023) · ICLR 2019 Conference Blind Submission
Abstract: The tremendous memory and computational complexity of Convolutional Neural Networks (CNNs) prevents inference deployment on resource-constrained systems. As a result, recent research has focused on CNN optimization techniques, in particular quantization, which allows weights and activations of layers to be represented with just a few bits while achieving impressive prediction performance. However, aggressive quantization techniques still fail to achieve full-precision prediction performance with state-of-the-art CNN architectures on large-scale classification tasks. In this work we propose a method for weight and activation quantization that is scalable in terms of quantization levels (n-ary representations) and easy to compute, while maintaining performance close to that of full-precision CNNs. Our weight quantization scheme is based on trainable scaling factors and a nested-means clustering strategy which is robust to weight updates and therefore exhibits good convergence properties. The flexibility of nested-means clustering enables exploration of various n-ary weight representations with the potential of high parameter compression. For activations, we propose a linear quantization strategy that takes the statistical properties of batch normalization into account. We demonstrate the effectiveness of our approach using state-of-the-art models on ImageNet.
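The abstract describes the weight quantizer only at a high level. Below is a minimal, hypothetical sketch of how a "nested-means" scheme could work, assuming it means recursively splitting the weight magnitudes at their mean to obtain quantization thresholds, then mapping each weight to a per-cluster scale with its original sign. The function names, the `depth` parameter, and the use of fixed (rather than trainable) per-cluster scales are my assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def nested_means_thresholds(weights, depth=2):
    """Recursively split the absolute weight values at their mean to obtain
    quantization thresholds (hypothetical reading of 'nested-means clustering').
    depth=1 yields one threshold, depth=2 yields three, and so on."""
    magnitudes = np.abs(weights).ravel()

    def split(values, level):
        if level == 0 or values.size == 0:
            return []
        mu = values.mean()
        lower = values[values < mu]
        upper = values[values >= mu]
        # Record the threshold at this level, then recurse into both halves.
        return split(lower, level - 1) + [mu] + split(upper, level - 1)

    return sorted(split(magnitudes, depth))

def quantize_weights(weights, depth=2):
    """Map each weight to the mean magnitude of its cluster, keeping the sign.
    The per-cluster means act as fixed scaling factors here; the paper makes
    the scaling factors trainable, which this sketch does not model."""
    thresholds = nested_means_thresholds(weights, depth)
    magnitudes = np.abs(weights)
    bins = np.digitize(magnitudes, thresholds)  # assign each weight to a cluster
    quantized = np.zeros_like(weights)
    for b in np.unique(bins):
        mask = bins == b
        quantized[mask] = np.sign(weights[mask]) * magnitudes[mask].mean()
    return quantized

# Example: quantize a random convolutional weight tensor to a small set of n-ary levels.
w = np.random.randn(4, 3, 3, 3).astype(np.float32)
w_q = quantize_weights(w, depth=2)
print("distinct quantization levels:", np.unique(np.round(w_q, 4)).size)
```

Increasing `depth` adds thresholds and therefore quantization levels, which matches the abstract's claim that the scheme is scalable in the number of n-ary levels.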
Keywords: low-resource deep neural networks, quantized weights, weight-clustering, resource efficient neural networks
TL;DR: We propose a quantization scheme for weights and activations of deep neural networks. This reduces the memory footprint substantially and accelerates inference.