Hybrid and Non-Uniform DNN quantization methods using Retro Synthesis data for efficient inference

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: quantization, dnn inference, data free quantization, synthetic data, model compression
Abstract: Existing post-training quantization methods attempt to compensate for the quantization loss by determining the quantized weights and activation ranges with the help of training data. Quantization-aware training methods, on the other hand, achieve accuracy close to that of FP32 models by training the quantized model, which consumes more time. Neither approach is effective for privacy-constrained applications, as both are tightly coupled to the training data. In contrast, this paper proposes a data-independent post-training quantization scheme that eliminates the need for training data. This is achieved by generating a faux dataset, hereafter called $\textit{‘Retro-Synthesis Data’}$, from the FP32 model's layer statistics and then using it for quantization. This approach outperformed state-of-the-art methods including, but not limited to, ZeroQ and DFQ on models with and without batch-normalization layers at 8-, 6- and 4-bit precisions. We also introduce two futuristic variants of post-training quantization, namely $\textit{‘Hybrid-Quantization’}$ and $\textit{‘Non-Uniform Quantization’}$. The Hybrid-Quantization scheme determines each layer's sensitivity to per-tensor versus per-channel quantization and thereby generates hybrid quantized models that are $10-20\%$ faster at inference while achieving the same or better accuracy than per-channel quantization. This method also exceeded FP32 accuracy when applied to models such as ResNet-18 and ResNet-50 on the ImageNet dataset. In the proposed Non-Uniform Quantization scheme, the weights are grouped into clusters, and each cluster is assigned a different number of quantization steps depending on the number of weights and their range within that cluster. This method yielded an accuracy improvement of $1\%$ over state-of-the-art quantization methods on the ImageNet dataset.
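
The core Retro-Synthesis idea of recovering calibration data from FP32 layer statistics can be illustrated with a short sketch. The code below is not the paper's exact procedure; it only shows the general batch-norm statistics-matching idea used by data-free quantization methods (ZeroQ-style distillation data follows a similar principle), and the function name, batch shape, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

def synthesize_calibration_batch(model: nn.Module, batch_shape=(32, 3, 224, 224),
                                 steps=500, lr=0.1):
    """Optimize random noise so that per-layer activation statistics match the
    FP32 model's stored BatchNorm running statistics (illustrative sketch only)."""
    model.eval()
    x = torch.randn(batch_shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)

    # Register forward hooks that capture the batch statistics seen by each BN layer.
    bn_layers, captured, handles = [], {}, []
    def make_hook(idx):
        def hook(module, inp, out):
            captured[idx] = (inp[0].mean(dim=(0, 2, 3)), inp[0].var(dim=(0, 2, 3)))
        return hook
    for idx, m in enumerate(model.modules()):
        if isinstance(m, nn.BatchNorm2d):
            bn_layers.append((idx, m))
            handles.append(m.register_forward_hook(make_hook(idx)))

    for _ in range(steps):
        optimizer.zero_grad()
        model(x)
        # Penalize the mismatch between captured batch stats and running stats.
        loss = x.new_zeros(())
        for idx, bn in bn_layers:
            mean, var = captured[idx]
            loss = loss + (mean - bn.running_mean).pow(2).mean() \
                        + (var - bn.running_var).pow(2).mean()
        loss.backward()
        optimizer.step()

    for h in handles:
        h.remove()
    return x.detach()
```

A batch synthesized this way can then be fed through the FP32 model to collect activation ranges for post-training quantization, in place of real training data.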
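For the Hybrid-Quantization scheme, the abstract describes choosing per-tensor or per-channel quantization for each layer according to its sensitivity. Below is a minimal sketch of one plausible sensitivity heuristic; the weight-reconstruction-error criterion, the `tolerance` threshold, and the function names are assumptions for illustration, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn

def quantize_weight(w: torch.Tensor, num_bits=8, per_channel=False):
    """Symmetric uniform fake-quantization of a weight tensor (sketch)."""
    qmax = 2 ** (num_bits - 1) - 1
    if per_channel:
        # One scale per output channel.
        max_abs = w.abs().flatten(1).max(dim=1).values.clamp(min=1e-8)
        scale = (max_abs / qmax).view(-1, *([1] * (w.dim() - 1)))
    else:
        # A single scale for the whole tensor.
        scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def choose_hybrid_scheme(model: nn.Module, num_bits=8, tolerance=1e-4):
    """Keep per-tensor quantization for layers whose reconstruction error is close
    to the per-channel error, and fall back to per-channel only for sensitive layers."""
    scheme = {}
    for name, m in model.named_modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            w = m.weight.data
            err_tensor = (quantize_weight(w, num_bits, per_channel=False) - w).pow(2).mean()
            err_channel = (quantize_weight(w, num_bits, per_channel=True) - w).pow(2).mean()
            scheme[name] = "per_tensor" if (err_tensor - err_channel) < tolerance else "per_channel"
    return scheme
```

Layers marked `per_tensor` avoid the per-channel scale bookkeeping at inference, which is the source of the latency savings the abstract reports.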
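The Non-Uniform Quantization scheme groups weights into clusters and assigns a varying number of quantization steps per cluster. The sketch below uses equal-width range clusters and allocates levels in proportion to each cluster's population and width; this specific partitioning and allocation rule is an assumption chosen for illustration, not the paper's exact method.

```python
import torch

def non_uniform_quantize(w: torch.Tensor, total_levels=256, num_clusters=4):
    """Partition the weight range into clusters and allocate quantization levels to
    each cluster in proportion to its population and width (illustrative sketch)."""
    flat = w.flatten()
    edges = torch.linspace(flat.min().item(), flat.max().item(), num_clusters + 1)

    # Boolean masks selecting the weights that fall into each cluster.
    masks = []
    for i in range(num_clusters):
        if i < num_clusters - 1:
            masks.append((flat >= edges[i]) & (flat < edges[i + 1]))
        else:
            masks.append((flat >= edges[i]) & (flat <= edges[i + 1]))

    # Allocate levels by (cluster population) x (cluster width), with at least 2 levels.
    counts = torch.stack([m.sum() for m in masks]).float()
    widths = edges[1:] - edges[:-1]
    scores = counts * widths
    levels = (scores / scores.sum() * total_levels).round().long().clamp(min=2)

    out = flat.clone()
    for i, mask in enumerate(masks):
        if mask.any():
            lo, hi = edges[i], edges[i + 1]
            step = ((hi - lo) / (levels[i] - 1).float()).clamp(min=1e-12)
            out[mask] = lo + torch.round((flat[mask] - lo) / step) * step
    return out.view_as(w)
```

For example, `non_uniform_quantize(conv.weight.data, total_levels=2 ** 8)` would produce a non-uniformly quantized copy of a layer's weights under an 8-bit level budget.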
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=U3Nouv1bg8
13 Replies