- TL;DR: Soft quantization approach to learn pure fixed-point representations of deep neural networks
- Abstract: Deep neural networks (DNNs) dominate current research in machine learning. Due to massive GPU parallelization DNN training is no longer a bottleneck, and large models with many parameters and high computational effort lead common benchmark tables. In contrast, embedded devices have a very limited capability. As a result, both model size and inference time must be significantly reduced if DNNs are to achieve suitable performance on embedded devices. We propose a soft quantization approach to train DNNs that can be evaluated using pure fixed-point arithmetic. By exploiting the bit-shift mechanism, we derive fixed-point quantization constraints for all important components, including batch normalization and ReLU. Compared to floating-point arithmetic, fixed-point calculations significantly reduce computational effort whereas low-bit representations immediately decrease memory costs. We evaluate our approach with different architectures on common benchmark data sets and compare with recent quantization approaches. We achieve new state of the art performance using 4-bit fixed-point models with an error rate of 4.98% on CIFAR-10.
- Keywords: Deep neural networks, fixed-point quantization, bit-shift, soft quantization
- Original Pdf: pdf