Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision tasks. However, these models are memory-consuming and computation-intensive, making their deployment and efficient inference on edge devices challenging. Model quantization is a promising approach to reduce model complexity. Prior works have explored tailored quantization algorithms for ViTs but retained floating-point (FP) scaling factors, which not only incur non-negligible re-quantization overhead but also prevent the quantized models from performing efficient integer-only inference. In this paper, we propose H-ViT, a dedicated post-training quantization scheme (e.g., symmetric uniform quantization and layer-wise quantization for both weights and part of the activations) to effectively quantize ViTs with fewer Power-of-Two (PoT) scaling factors, thus minimizing the re-quantization overhead and memory consumption. In addition, observing serious inter-channel variation in LayerNorm inputs and outputs, we propose Power-of-Two quantization (PTQ), a systematic method for reducing the performance degradation without hyper-parameters. Extensive experiments on multiple vision tasks with different model variants show that H-ViT offers comparable (or even slightly higher) INT8 quantization performance with PoT scaling factors compared to its counterpart with floating-point scaling factors. For instance, we reach 78.43% top-1 accuracy with DeiT-S on ImageNet, and 51.6 box AP and 44.8 mask AP with Cascade Mask R-CNN (Swin-B) on COCO.
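To illustrate why PoT scaling factors reduce re-quantization overhead, the following is a minimal sketch of generic symmetric uniform INT8 quantization with a power-of-two scale. It is not the H-ViT algorithm itself. The function names (`pot_exponent`, `quantize`, `requantize`) and the choice of the smallest PoT scale that avoids clipping are assumptions made for illustration. The point it shows is that when every scale is 2^e, re-scaling an integer accumulator needs only a bit shift instead of a floating-point multiply.

```python
import math

QMAX = 127  # symmetric signed INT8 range is [-127, 127]

def pot_exponent(max_abs):
    """Exponent e of the smallest power-of-two scale 2**e that covers max_abs.

    Illustrative assumption: the scale is picked so no value is clipped.
    Requires max_abs > 0.
    """
    return math.ceil(math.log2(max_abs / QMAX))

def quantize(values, e):
    """Symmetric uniform quantization with scale 2**e (zero-point is 0)."""
    scale = 2.0 ** e
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]

def requantize(acc, e_in, e_out):
    """Move an integer accumulator from scale 2**e_in to scale 2**e_out.

    Because both scales are powers of two, the rescale is a shift:
    a rounding right shift when e_out > e_in, a left shift otherwise.
    """
    shift = e_out - e_in
    if shift > 0:
        q = (acc + (1 << (shift - 1))) >> shift  # add half, then floor
    else:
        q = acc << -shift
    return max(-QMAX, min(QMAX, q))

# Toy dot product: weights and activations quantized layer-wise.
weights = [0.5, -1.0, 0.25]
acts = [1.0, 2.0, -0.5]
e_w = pot_exponent(max(abs(w) for w in weights))
e_x = pot_exponent(max(abs(a) for a in acts))
q_w = quantize(weights, e_w)
q_x = quantize(acts, e_x)

# Integer-only accumulation; the accumulator carries scale 2**(e_w + e_x).
acc = sum(w * x for w, x in zip(q_w, q_x))

# Re-quantize the output to INT8 with its own PoT scale, using shifts only.
e_out = -6  # assumed output exponent for this toy example
q_out = requantize(acc, e_w + e_x, e_out)
print(q_out, q_out * 2.0 ** e_out)
```

With a floating-point scaling factor, the `requantize` step would instead need a multiplication by the ratio of input and output scales (or a fixed-point approximation of it), which is the overhead the abstract refers to.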
External IDs: dblp:journals/auinsy/LiuLDJZ25