Q-HyViT: Post-Training Quantization of Hybrid Vision Transformers With Bridge Block Reconstruction for IoT Systems
Abstract: Recently, vision transformers (ViTs) have superseded convolutional neural networks in numerous applications, including classification, detection, and segmentation. However, the high computational requirements of ViTs hinder their widespread deployment. To address this issue, researchers have proposed efficient hybrid transformer architectures that combine convolutional and transformer layers and use optimized, linear-complexity attention computation. Additionally, post-training quantization (PTQ) has been proposed as a means of mitigating computational demands. For mobile devices, achieving optimal acceleration for ViTs necessitates the strategic integration of quantization techniques and efficient hybrid transformer structures. However, no prior investigation has applied quantization to efficient hybrid transformers. In this article, we discover that applying existing PTQ methods for ViTs to efficient hybrid transformers leads to a drastic accuracy drop, attributed to the following four challenges: 1) highly dynamic ranges; 2) zero-point overflow; 3) diverse normalization; and 4) limited model parameters (<5M). To overcome these challenges, we propose a new PTQ method, which is the first to quantize efficient hybrid ViTs (MobileViTv1, MobileViTv2, Mobile-Former, EfficientFormerV1, and EfficientFormerV2). Compared with existing PTQ methods (EasyQuant, FQ-ViT, PTQ4ViT, and RepQ-ViT), we achieve average accuracy improvements of 17.73% at 8-bit and 29.75% at 6-bit. We plan to release our code at https://gitlab.com/ones-ai/q-hyvit.
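To make the "zero-point overflow" challenge listed above concrete, the following is a minimal sketch of textbook min-max asymmetric uniform quantization (not the paper's Q-HyViT calibration procedure); the function name and the synthetic activation range are illustrative assumptions. It shows how a highly one-sided activation range, as can occur in hybrid ViT blocks, drives the derived zero-point outside the representable 8-bit range.

```python
import numpy as np

def asymmetric_quant_params(x: np.ndarray, n_bits: int = 8):
    """Min-max asymmetric uniform quantization: derive scale and zero-point
    from the observed tensor range (textbook formulation, for illustration)."""
    qmin, qmax = 0, 2 ** n_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point

# A synthetic activation tensor whose values are all strictly negative
# (a one-sided, highly dynamic range) pushes the zero-point far above the
# uint8 limit of 255 -- the "zero-point overflow" problem noted in the abstract.
acts = np.random.uniform(-260.0, -200.0, size=10_000)
scale, zp = asymmetric_quant_params(acts, n_bits=8)
print(f"scale={scale:.4f}, zero_point={zp}")  # zero_point >> 255 here
```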