P2-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer

Huihong Shi; Xin Cheng; Wendong Mao; Zhongfeng Wang

Published: 01 Jan 2024, Last Modified: 29 Sept 2024. IEEE Trans. Very Large Scale Integr. Syst. 2024. License: CC BY-SA 4.0

Abstract: Vision transformers (ViTs) have excelled in computer vision (CV) tasks but are memory-consuming and computation-intensive, which challenges their deployment on resource-constrained devices. To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors, which incur nonnegligible requantization overhead, limiting ViTs’ hardware efficiency and motivating more hardware-friendly solutions. To this end, we propose P2-ViT, the first power-of-two (PoT) post-training quantization (PTQ) and acceleration framework for fully quantized ViTs. Specifically, for quantization, we explore a dedicated quantization scheme that effectively quantizes ViTs with PoT scaling factors, thus minimizing the requantization overhead. Furthermore, we propose coarse-to-fine automatic mixed-precision quantization to enable better accuracy-efficiency tradeoffs. In terms of hardware, we develop a dedicated chunk-based accelerator featuring multiple tailored subprocessors that individually handle ViTs’ different types of operations, alleviating reconfiguration overhead. In addition, we design a tailored row-stationary dataflow to seize the pipeline processing opportunity introduced by our PoT scaling factors, thereby enhancing throughput. Extensive experiments consistently validate P2-ViT’s effectiveness. In particular, we offer comparable or even superior quantization performance with PoT scaling factors compared with the counterpart using floating-point scaling factors. Moreover, we achieve up to $10.1\times$ speedup and $36.8\times$ energy saving over GPU’s Turing Tensor Cores, and up to $1.84\times$ higher computation utilization efficiency than SOTA quantization-based ViT accelerators. Code is available at https://github.com/shihuihong214/P2-ViT.
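
To illustrate why PoT scaling factors reduce requantization overhead, the following is a minimal sketch and not the authors' implementation. It assumes symmetric quantization with scales of the form 2^(-k), and the function names and exponent parameters (k_x, k_w, k_y) are hypothetical. Under that assumption, rescaling an integer accumulator into the output's integer range becomes a rounding bit shift, whereas floating-point scales require a real-valued multiplication.

```python
import math


def clamp(value, bits=8):
    """Saturate to the signed integer range of the given bit width."""
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(qmin, min(qmax, value))


def requantize_float(acc, s_x, s_w, s_y, bits=8):
    """Reference path: rescale an integer accumulator with floating-point
    scales (input s_x, weight s_w, output s_y), rounding half up."""
    return clamp(math.floor(acc * s_x * s_w / s_y + 0.5), bits)


def requantize_pot(acc, k_x, k_w, k_y, bits=8):
    """PoT path (assumed form): every scale is 2**-k, so the combined
    rescaling factor is 2**-(k_x + k_w - k_y), i.e. a bit shift."""
    shift = k_x + k_w - k_y
    if shift > 0:
        # Add half of the divisor before shifting to round half up.
        result = (acc + (1 << (shift - 1))) >> shift
    else:
        result = acc << -shift
    return clamp(result, bits)


if __name__ == "__main__":
    # Hypothetical exponents: s_x = 2**-4, s_w = 2**-5, s_y = 2**-3.
    k_x, k_w, k_y = 4, 5, 3
    for acc in (1000, -1000, 96, -96, 100000):
        shifted = requantize_pot(acc, k_x, k_w, k_y)
        reference = requantize_float(acc, 2.0 ** -k_x, 2.0 ** -k_w, 2.0 ** -k_y)
        print(acc, shifted, reference)
```

Both paths agree on these inputs because powers of two are exactly representable in floating point. The shift-based path uses only integer additions and shifts, which is the kind of saving the abstract attributes to PoT scaling factors.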
