Prune or quantize? Strategy for Pareto-optimally low-cost and accurate CNN

Kengo Nakata; Daisuke Miyashita; Asuka Maki; Fumihiko Tachibana; Shinichi Sasaki; Jun Deguchi

25 Sept 2019 (modified: 05 May 2023); ICLR 2020 Conference Blind Submission; Readers: Everyone

Keywords: CNN, Quantization, Pruning, Accelerator, Computational cost

TL;DR: This paper shows that the "prune-then-quantize" method is the best strategy for achieving Pareto-optimal performance, using a proposed hardware-agnostic metric to measure computational cost.

Abstract: Pruning and quantization are typical approaches to reducing the computational cost of CNN inference. Although combining them seems natural, it is unexpectedly difficult to predict the effect of the combination without measuring performance on the specific hardware the user intends to use. This is because the benefits of pruning and quantization depend strongly on the hardware architecture on which the model is executed. For example, a CPU-like architecture without any parallelization may fully exploit the reduction in computation from unstructured pruning to gain speed, but a GPU-like massively parallel architecture would not. In addition, novel hardware architectures continue to be proposed, such as ones supporting variable bit-precision quantization. From an engineering viewpoint, optimizing for each hardware architecture is useful and important in practice, but it is a brute-force approach. Therefore, in this paper, we first propose a hardware-agnostic metric to measure computational cost. Using this metric, we demonstrate that Pareto-optimal performance, where the best accuracy is obtained at a given computational cost, is achieved when a slim model with a smaller number of parameters is quantized moderately, rather than when a fat model with a huge number of parameters is quantized to extremely low bit precision such as binary or ternary. Furthermore, we empirically find a possible quantitative relation between the proposed metric and the signal-to-noise ratio during SGD training, through which information obtained during SGD training provides the optimal policy for quantization and pruning. Based on these findings, we show that the Pareto frontier is improved by a factor of 4 in a post-training quantization scenario. These findings not only help improve the Pareto frontier for accuracy vs. computational cost but also give new insights into deep neural networks.
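The abstract describes Pareto-optimal performance as the best accuracy obtainable at a given computational cost. The following is a minimal sketch of how such a frontier could be extracted from a set of (cost, accuracy) measurements. It is not the authors' code: the function name `pareto_frontier` and the sample values are hypothetical, and the cost numbers are arbitrary placeholders rather than values of the paper's proposed metric, which the abstract does not define.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (computational cost, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    A point is dominated when another point has cost that is no higher
    and accuracy that is no lower, with at least one of the two strictly
    better. Lower cost and higher accuracy are both preferred.
    """
    frontier: List[Point] = []
    best_acc = float("-inf")
    # Sort by cost ascending; for equal cost, visit the most accurate first.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it beats every cheaper-or-equal point seen so far.
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier


# Hypothetical (cost, accuracy) pairs, e.g. one per pruning/quantization setting.
candidates = [(1.0, 0.60), (2.0, 0.70), (2.0, 0.65), (3.0, 0.68), (4.0, 0.75)]
frontier = pareto_frontier(candidates)
```

In this toy data, the setting at cost 3.0 is dropped because a cheaper setting already reaches higher accuracy. Comparing pruning and quantization strategies then amounts to asking whose configurations land on this frontier.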
