Token Selection is a Simple Booster for Vision Transformers

Published: 01 Jan 2023 · Last Modified: 05 Mar 2025 · IEEE Trans. Pattern Anal. Mach. Intell. 2023 · License: CC BY-SA 4.0
Abstract: Vision transformers (ViTs) have recently attained state-of-the-art results in visual recognition tasks. Their success is largely attributed to the self-attention component, which models the global dependencies among the image patches (tokens) and aggregates them into higher-level features. However, self-attention brings significant training difficulties to ViTs, and many recent works therefore develop new self-attention variants to alleviate this issue. In this article, instead of designing a more complicated self-attention mechanism, we explore simple approaches to fully unlock the potential of vanilla self-attention. We first study the token selection behavior of self-attention and find that it suffers from low diversity due to attention over-smoothing, which severely limits its effectiveness in learning discriminative token features. We then develop simple approaches to enhance the selectivity and diversity of self-attention in token selection. The resulting token selector can serve as a drop-in module for various ViT backbones and consistently boosts their performance. Notably, it enables ViTs to achieve 84.6% top-1 classification accuracy on ImageNet with only 25M parameters; when scaled up to 81M parameters, the accuracy further improves to 86.1%. In addition, we present comprehensive experiments demonstrating that the token selector can be applied to a variety of transformer-based models to boost their performance on image classification, semantic segmentation, and NLP tasks. Code is available at https://github.com/zhoudaquan/dvit_repo .
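The abstract does not spell out the internals of the token selector; for the exact design, see the linked repository. As a rough illustration of the two ideas it mentions, the sketch below shows (a) one common way to quantify attention over-smoothing, via the average pairwise cosine similarity of token features, and (b) a hypothetical drop-in attention block with a learnable per-head temperature that can sharpen (i.e., make more selective) the attention distribution. The names `token_cosine_similarity` and `AttentionWithTemperature` are assumptions for illustration only, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def token_cosine_similarity(x: torch.Tensor) -> torch.Tensor:
    """Average pairwise cosine similarity among tokens.

    x: (batch, num_tokens, dim). Values close to 1 indicate that token
    features have collapsed toward each other (an over-smoothing symptom).
    """
    x = F.normalize(x, dim=-1)                 # unit-norm token features
    sim = x @ x.transpose(-2, -1)              # (B, N, N) cosine similarities
    n = x.shape[1]
    off_diag = sim.sum(dim=(-2, -1)) - n       # drop the N self-similarity terms
    return off_diag / (n * (n - 1))            # mean over ordered token pairs


class AttentionWithTemperature(nn.Module):
    """Hypothetical drop-in multi-head self-attention with a learnable
    per-head temperature that controls how peaked (selective) attention is."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Initialised to 0 so the block starts at the usual 1/sqrt(d) scaling.
        self.log_tau = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, heads, N, head_dim)
        scale = self.head_dim ** -0.5 * torch.exp(self.log_tau).view(1, -1, 1, 1)
        attn = (q @ k.transpose(-2, -1)) * scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 197, 384)               # ViT-S-like token sequence
    block = AttentionWithTemperature(dim=384, num_heads=6)
    y = block(x)
    print("token similarity (input): ", token_cosine_similarity(x).mean().item())
    print("token similarity (output):", token_cosine_similarity(y).mean().item())
```

Tracking `token_cosine_similarity` across layers is one simple way to see whether a change to the attention design actually reduces feature collapse; the learnable temperature is just one of several plausible knobs for raising selectivity.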
