Abstract: Quantization is an effective technique for reducing memory usage and power consumption in deep neural networks (DNNs) by decreasing parameter size. However, conventional quantization methods often lead to significant accuracy loss when applied to compact architectures such as MobileNet. In particular, quantizing MobileNetV3 causes accuracy degradation due to the presence of large outliers. To address this challenge, we propose a hardware-friendly mixed-precision quantization approach. Unlike existing methods, which suffer from low memory and computational efficiency because they use diverse bit-widths that do not align with memory address space sizes, our approach applies 8-bit quantization to activations and selectively quantizes weights to 4-, 8-, or 16-bit, depending on the sensitivity of each layer. This strategy not only enhances memory and computational efficiency but also minimizes accuracy degradation. When evaluated on the ImageNet-1k dataset, our proposed method reduces the parameter size of MobileNetV3-small and MobileNetV3-large by 78.31% and 75.61%, respectively, while incurring accuracy drops of only 0.90% and 0.81%.
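
The sketch below is a minimal illustration (not the authors' implementation) of the general idea of sensitivity-driven mixed-precision weight quantization restricted to hardware-friendly bit-widths. It assumes activations are quantized to 8 bits elsewhere and assigns each convolutional or linear layer 4, 8, or 16 bits according to how strongly uniform quantization perturbs its weights. The sensitivity metric, the thresholds, and the helper names (`fake_quantize`, `assign_bit_widths`, `apply_plan`) are illustrative assumptions, not the paper's definitions.

```python
# Minimal sketch of per-layer bit-width selection for mixed-precision quantization.
# Sensitivity metric and thresholds are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    if scale == 0:
        return w.clone()
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale


def layer_sensitivity(w: torch.Tensor, bits: int) -> float:
    """Relative L2 error introduced by quantizing the weights to `bits` bits."""
    err = torch.norm(w - fake_quantize(w, bits)) / (torch.norm(w) + 1e-12)
    return err.item()


def assign_bit_widths(model: nn.Module,
                      low_thr: float = 0.02,
                      high_thr: float = 0.10) -> dict:
    """Assign 4, 8, or 16 bits per layer from its 4-bit sensitivity (assumed rule)."""
    plan = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            s = layer_sensitivity(module.weight.data, bits=4)
            if s < low_thr:
                plan[name] = 4      # robust layer: cheapest storage
            elif s < high_thr:
                plan[name] = 8      # moderately sensitive layer
            else:
                plan[name] = 16     # outlier-heavy / highly sensitive layer
    return plan


def apply_plan(model: nn.Module, plan: dict) -> None:
    """Replace weights in place with their fake-quantized counterparts."""
    for name, module in model.named_modules():
        if name in plan:
            module.weight.data = fake_quantize(module.weight.data, plan[name])


if __name__ == "__main__":
    from torchvision.models import mobilenet_v3_small
    model = mobilenet_v3_small(weights=None)   # random weights, for illustration only
    plan = assign_bit_widths(model)
    apply_plan(model, plan)
    print({b: sum(1 for v in plan.values() if v == b) for b in (4, 8, 16)})
```

Because every bit-width in the plan is a power of two no smaller than a nibble, packed weights stay aligned to byte and word boundaries, which is the hardware-friendliness property the abstract contrasts with arbitrary mixed bit-widths.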