Efficient Visual Transformer by Information Bottleneck Inspired Token Merging

26 Sept 2024 (modified: 22 Jan 2025) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Visual Transformer, Token Merging, Information Bottleneck
TL;DR: We propose Information Bottleneck inspired Token Merging (IBTM), which performs token merging in a learnable manner inspired by the information bottleneck principle, yielding efficient vision transformers with competitive performance.
Abstract: Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different types of neural architectures, including those with convolutions, leading to various vision transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Information Bottleneck inspired Token Merging, or IBTM, which performs token merging in a learnable scheme. IBTM is motivated by the reduction of the Information Bottleneck (IB) loss, for which we derive a novel and separable variational upper bound. The architecture of the mask module in our IBTM blocks, which generates the token merging mask, is designed to reduce this upper bound. IBTM is compatible with many popular and compact transformer networks, such as MobileViT and EfficientViT, and it reduces the FLOPs and inference time of vision transformers while maintaining or even improving prediction accuracy. In the experiments, we replace all the transformer blocks in popular vision transformers, including MobileViT, EfficientViT, ViT, and Swin, with IBTM blocks, leading to IBTM networks with different backbones. Extensive results on image classification and object detection demonstrate that IBTM renders compact and efficient vision transformers with comparable or much better prediction accuracy than the original vision transformers. The code of IBTM is available at \url{https://anonymous.4open.science/r/IBTM_Transformers-053B/}.
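To make the idea of a learnable token-merging mask concrete, below is a minimal PyTorch sketch of soft, learnable token merging. It is an illustration only, not the authors' IBTM module: the class name `LearnableTokenMerging`, the layer `mask_proj`, and all token counts are hypothetical choices for this example, and the IB-specific design of the mask module described in the abstract is omitted.

```python
import torch
import torch.nn as nn

class LearnableTokenMerging(nn.Module):
    """Hypothetical sketch: merge N input tokens into M < N output tokens
    via a learned soft assignment mask. Not the authors' IBTM module; it
    only illustrates token merging in a learnable scheme."""

    def __init__(self, dim: int, num_out_tokens: int):
        super().__init__()
        # "Mask module": scores each input token against M output slots.
        self.mask_proj = nn.Linear(dim, num_out_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) input tokens.
        # Softmax over the token axis: each output slot's weights over
        # the N input tokens sum to 1.
        mask = self.mask_proj(x).softmax(dim=1)   # (B, N, M)
        merged = mask.transpose(1, 2) @ x          # (B, M, D) weighted averages
        return merged

# Usage: merge 196 tokens of width 192 down to 98 tokens.
block = LearnableTokenMerging(dim=192, num_out_tokens=98)
tokens = torch.randn(2, 196, 192)
print(block(tokens).shape)  # torch.Size([2, 98, 192])
```

Under this sketch, halving the token count from N to M reduces the cost of each subsequent self-attention layer from O(N^2) to O(M^2), which is consistent with the FLOPs and inference-time savings the abstract claims; in the paper, the mask module's architecture is additionally shaped by the derived variational upper bound on the IB loss.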
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5327