Abstract: The transformer has extended its success from the language domain to the vision domain. Because of their stacked self-attention and cross-attention blocks, accelerating the deployment of vision transformers on GPU hardware is challenging and rarely studied. This paper designs a compression scheme that makes maximal use of GPU-friendly 2:4 fine-grained structured sparsity and quantization. Specifically, an original large model with dense weight parameters is first pruned into a sparse one by 2:4 structured pruning, which exploits the GPU's acceleration of the 2:4 structured sparse pattern with the FP16 data type. The floating-point sparse model is then quantized into a fixed-point one by sparse-distillation-aware quantization-aware training, which exploits the extra speedup the GPU provides for 2:4 sparse computation with integer tensors. A mixed-strategy knowledge distillation is used during the pruning and quantization process. The proposed compression scheme is flexible enough to support both supervised and unsupervised learning styles. Experimental results show that the proposed GPUSQ-ViT scheme achieves state-of-the-art compression, reducing vision transformer models by $\mathbf{6.4}-\mathbf{12.7}\times$ in model size and $\mathbf{30.3}-\mathbf{62}\times$ in FLOPs with negligible accuracy degradation on the ImageNet classification, COCO detection, and ADE20K segmentation benchmarks. Moreover, GPUSQ-ViT improves actual deployment performance by $\mathbf{1.39}-\mathbf{1.79}\times$ in latency and $\mathbf{3.22}-\mathbf{3.43}\times$ in throughput on the A100 GPU, and by $\mathbf{1.57}-\mathbf{1.69}\times$ in latency and $\mathbf{2.11}-\mathbf{2.51}\times$ in throughput on AGX Orin.
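
To make the two compression stages concrete, the following is a minimal sketch of the 2:4 pattern followed by integer quantization. It assumes magnitude-based selection (keep the two largest-magnitude weights in each group of four) and symmetric per-tensor INT8 scaling; the abstract does not specify either choice. The function names `prune_2_to_4` and `quantize_int8` are hypothetical, and the sketch omits the knowledge distillation and quantization-aware training that the scheme relies on.

```python
def prune_2_to_4(row):
    """Zero the two smallest-magnitude values in every group of four.

    Assumption: magnitude-based selection, used here only for illustration.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes; ties go to the earlier index.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization; returns (integers, scale).

    Assumption: one scale for the whole tensor, range clamped to [-127, 127].
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


# Toy weight row: two groups of four FP values.
weights = [0.9, -0.1, 0.05, -1.27, 0.2, 0.3, -0.4, 0.1]
sparse = prune_2_to_4(weights)            # stage 1: dense -> 2:4 sparse
q_weights, scale = quantize_int8(sparse)  # stage 2: floating point -> INT8
print(sparse)
print(q_weights, scale)
```

Every group of four in `sparse` holds at most two nonzero values, which is the layout the abstract describes as GPU-friendly. Quantization maps zeros to zeros, so the 2:4 pattern is preserved in the integer tensor.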