Keywords: Quantization, ViT, Efficient, Accelerator
Abstract: Vision Transformer (ViT) has achieved significant success in computer vision, among which EfficientViT is widely adopted for its lightweight design.
However, EfficientViT remains difficult to deploy on edge devices such as FPGAs due to efficiency and accuracy concerns.
First, from the software perspective, existing quantization approaches fail to consider the inter-channel distribution relationship, which causes significant performance degradation under lower-bit settings.
Second, from the hardware perspective, current DSP-packing methods struggle to support the diverse kernel sizes and strides of the convolutions used in EfficientViT, resulting in redundant computation cycles or bit-width overflow.
Moreover, due to the mismatch in data layouts between convolution and linear attention, existing solutions require substantial memory resources for data reordering, which often stalls the pipeline.
In this paper, we propose a Quantization and Streamline Co-Design (QuS) framework for lower-bit EfficientViT deployment on FPGA.
It comprises three main components: an adaptive distribution-aware quantization strategy that provides effective quantization, a multi-computing-in-once packing strategy that improves DSP-packing efficiency, and a low-buffer streamline scheme for linear attention that eliminates pipeline stalls caused by mismatched layouts.
Experimental results show that our QuS framework achieves over 2200 FPS on EfficientViT, a $3.6\times$ speedup over the Jetson AGX Orin, and up to a $24\%$ accuracy improvement under 4-bit quantization.
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 3865