Abstract: Vision Transformers (ViTs) have emerged as powerful models in computer vision, delivering superior performance across various vision tasks. However, their high computational complexity remains a significant barrier to deployment in real-world scenarios. Motivated by the observations that not all tokens contribute equally to the final prediction and that fewer tokens incur lower computational cost, reducing redundant tokens has become a prevailing paradigm for accelerating vision transformers. However, we argue that it is suboptimal to reduce only inattentive redundancy via token pruning or only duplicative redundancy via token merging. To this end, we propose a novel acceleration framework, token Pruning & Pooling Transformers (PPT), which adaptively tackles these two types of redundancy in different layers. By heuristically integrating both token pruning and token pooling into ViTs without introducing additional trainable parameters, PPT effectively reduces model complexity while maintaining predictive accuracy. For example, PPT reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset.
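To make the two redundancy-reduction primitives named in the abstract concrete, the sketch below illustrates attention-based token pruning and similarity-based token pooling in PyTorch. This is a minimal illustration, not the authors' released PPT implementation: the function names, the `keep_ratio` and `num_merges` parameters, the use of [CLS] attention as the importance score, and the greedy pairwise merging scheme are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the two primitives the abstract names:
# attention-based token pruning and similarity-based token pooling.
import torch
import torch.nn.functional as F


def prune_tokens(x: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.7):
    """Drop inattentive tokens: keep the top-k patch tokens by [CLS] attention score.

    x        : (B, N, D) patch-token features (excluding the [CLS] token)
    cls_attn : (B, N)    attention weight each patch token receives from [CLS]
    """
    B, N, D = x.shape
    k = max(1, int(N * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                      # (B, k) most attentive tokens
    return x.gather(1, idx.unsqueeze(-1).expand(B, k, D))      # (B, k, D)


def pool_tokens(x: torch.Tensor, num_merges: int = 16):
    """Merge duplicative tokens: average the most similar token pairs.

    A greedy cosine-similarity matching between two alternating token subsets,
    in the spirit of token-merging approaches; details differ from the paper.
    """
    a, b = x[:, ::2], x[:, 1::2]                               # split tokens into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(1, 2)  # (B, Na, Nb)
    best_sim, best_b = sim.max(dim=-1)                         # best partner in b for each a-token
    merge_a = best_sim.topk(num_merges, dim=1).indices         # a-tokens chosen to be merged

    out = []
    for i in range(x.size(0)):                                 # per-sample merge (clarity over speed)
        a_i, b_i = a[i], b[i].clone()
        keep_a = torch.ones(a_i.size(0), dtype=torch.bool)
        for j in merge_a[i]:
            partner = best_b[i, j]
            b_i[partner] = 0.5 * (b_i[partner] + a_i[j])       # average the merged pair into set b
            keep_a[j] = False
        out.append(torch.cat([a_i[keep_a], b_i], dim=0))
    return torch.stack(out)                                    # (B, N - num_merges, D)


if __name__ == "__main__":
    x = torch.randn(2, 196, 384)            # DeiT-S-like patch tokens
    cls_attn = torch.rand(2, 196)
    print(prune_tokens(x, cls_attn).shape)  # torch.Size([2, 137, 384])
    print(pool_tokens(x).shape)             # torch.Size([2, 180, 384])
```

In this reading, pruning discards tokens that receive little attention (inattentive redundancy), while pooling merges tokens with near-duplicate features (duplicative redundancy); the paper's contribution is deciding adaptively, per layer, how to combine the two without extra trainable parameters.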
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We restructured the literature survey into thematic subsections and updated it with recent works. The preliminaries were revised to include multi-head self-attention (MHSA), with equations updated for the multi-head setting. We clarified the mathematical notation for token pruning and pooling, including the role of the scale $s$. A new section was added on adapting our method to multi-stage hierarchical architectures (e.g., Swin, PVT). Additionally, we included a discussion of extending the approach to dense prediction tasks and its feasibility as future work.
Assigned Action Editor: ~Wei_Liu3
Submission Number: 3025