HyQ: Hardware-Friendly Post-Training Quantization for CNN-Transformer Hybrid Networks

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · IJCAI 2024 · CC BY-SA 4.0
Abstract: Hybrid models that combine CNNs and ViTs have recently emerged as state-of-the-art computer vision models. To deploy these hybrid models efficiently on resource-constrained mobile/edge devices, quantization is a promising solution. However, post-training quantization (PTQ), which requires neither retraining nor labeled data, has not been extensively studied for hybrid models. In this study, we propose a novel PTQ technique specialized for CNN-transformer hybrid models that accounts for how such models are implemented on AI accelerators such as GPUs and FPGAs. First, we introduce quantization-aware distribution scaling to address the large outliers caused by inter-channel variance in convolution layers. Second, in the transformer block, we propose approximating the integer-only softmax with a linear function, which avoids costly FP32/INT32 multiplications and yields more efficient computation. Experimental results show that with INT8 precision the proposed quantization method incurs only a 0.39% accuracy drop relative to the FP32 baseline on MobileViT-s with the ImageNet-1k dataset. Moreover, when implemented on an FPGA platform, the proposed linear softmax achieves significant resource savings, reducing look-up table and flip-flop usage by 1.8x to 2.1x and 1.3x to 1.9x, respectively, compared with the existing second-order polynomial approximation. The code is available at https://github.com/IDSL-SeoulTech/HyQ.
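The abstract only sketches the two techniques; the exact formulations are in the paper and repository. To convey the flavor of the first idea, below is a minimal, hypothetical sketch of per-channel distribution scaling in the spirit of cross-layer equalization: outlier output channels of one convolution are scaled down before quantization, and the inverse scale is folded into the following layer's input channels. The function name, the equalization target (the mean channel range), and `eps` are all illustrative assumptions, not the authors' actual quantization-aware distribution scaling.

```python
import torch

def distribution_scaling(w1: torch.Tensor, w2: torch.Tensor, eps: float = 1e-8):
    """Illustrative per-channel scaling sketch (NOT the paper's exact method).
    w1: (C_out, C_in, kH, kW) weights of a conv whose outputs feed w2.
    w2: (C_out2, C_out, kH, kW) weights of the following conv."""
    ch_max = w1.abs().amax(dim=(1, 2, 3))        # per-output-channel magnitude
    s = (ch_max / ch_max.mean()).clamp(min=eps)  # assumed equalization target
    w1_scaled = w1 / s.view(-1, 1, 1, 1)         # shrink outlier channels
    w2_scaled = w2 * s.view(1, -1, 1, 1)         # compensate in the next layer
    return w1_scaled, w2_scaled
```

Similarly, a rough sketch of the second idea, replacing the exponential in softmax with a linear surrogate so that no FP32/INT32 multiplication is needed, might look like the following. The clipping constant `c` and the specific surrogate `max(x + c, 0)` are assumptions for illustration only; the paper's integer-only formulation differs.

```python
import numpy as np

def linear_softmax(x: np.ndarray, c: float = 8.0) -> np.ndarray:
    """Illustrative linear softmax surrogate (NOT the paper's exact function)."""
    x = x - x.max(axis=-1, keepdims=True)  # standard max subtraction, so x <= 0
    lin = np.clip(x + c, 0.0, None)        # exp(x) ~ max(x + c, 0), a linear ramp
    return lin / lin.sum(axis=-1, keepdims=True)
```

In a real INT8 pipeline such a surrogate would be evaluated entirely in integer arithmetic (adds, clips, and shifts); the float versions above only show the functional shape of each transformation.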
