Abstract: Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being deployed in many real-world applications. Weight pruning and token pruning are well-known methods for reducing ViT model complexity. However, naively combining the two results in irregular computation patterns, leading to accuracy drops and difficulties in hardware acceleration, which limits the net complexity reduction such a combination can offer. To address these challenges, we propose a comprehensive algorithm-hardware codesign for accelerating ViT on FPGA through simultaneous pruning: combining static weight pruning with dynamic token pruning. On the algorithm side, we systematically combine a hardware-aware structured block-pruning method for pruning model parameters with a dynamic token pruning method for removing unimportant token vectors, and we design a novel training algorithm to reduce the accuracy drop caused by such simultaneous pruning. On the hardware side, we develop a novel accelerator for executing the pruned model. The proposed design employs multi-level parallelism with a load-balancing strategy to efficiently handle the irregular computation patterns introduced by the two pruning approaches, along with an efficient hardware mechanism for performing token pruning on the fly. We apply our codesign approach to the widely used DeiT-Small model and implement the proposed accelerator on a state-of-the-art FPGA. Evaluation results show that the proposed algorithm reduces computation complexity by up to 3.4× with an accuracy drop of ≈3% and a model compression ratio of up to 1.6×. Compared with state-of-the-art implementations on CPU, GPU, and FPGA, our codesign on FPGA achieves an average latency reduction of 12.8×, 3.2×, and 0.7–2.1×, respectively.
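As a rough illustration only (not the paper's exact algorithm), dynamic token pruning is commonly realized by ranking patch tokens by the attention they receive from the [CLS] token and keeping the top-k at inference time. The sketch below assumes this [CLS]-attention importance proxy and a hypothetical `keep_ratio` parameter; the shapes match DeiT-Small (196 patch tokens plus [CLS], embedding dimension 384).

```python
import numpy as np

def prune_tokens(tokens, cls_attn, keep_ratio=0.7):
    """Keep the top-k patch tokens ranked by the attention they receive
    from the [CLS] token (a common importance proxy; illustrative only).

    tokens:   (N, D) array, row 0 is the [CLS] token
    cls_attn: (N,)   array, attention weights from [CLS] to each token
    """
    n_patch = tokens.shape[0] - 1             # patch tokens exclude [CLS]
    k = max(1, int(n_patch * keep_ratio))     # number of tokens to keep
    # Rank patch tokens (indices 1..N-1) by importance; take the top-k,
    # then restore their original order to preserve positional structure.
    order = np.argsort(cls_attn[1:])[::-1][:k] + 1
    keep = np.concatenate(([0], np.sort(order)))  # always keep [CLS]
    return tokens[keep]

rng = np.random.default_rng(0)
x = rng.standard_normal((197, 384))   # DeiT-Small: 196 patches + [CLS]
attn = rng.random(197)
y = prune_tokens(x, attn, keep_ratio=0.7)
print(y.shape)   # (138, 384): 1 + floor(196 * 0.7) = 138 tokens remain
```

Because the number of surviving tokens varies per input, the downstream matrix multiplications have data-dependent shapes, which is exactly the irregularity the accelerator's load-balancing strategy must absorb.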