TransNeXt: Aggregating Diverse Attentions in One Vision Model

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Vision Transformer, Efficient Transformer, Self-Attention, Convolution, Visual Backbone, Image Classification, Object Detection, Image Segmentation
TL;DR: We introduce Aggregated Attention and Convolutional GLU to construct a new backbone, TransNeXt, which achieves state-of-the-art performance on various tasks, including image classification, object detection, and semantic segmentation.
Abstract: In the design of previous Vision Transformers (ViTs), different token mixers were often stacked alternately to balance the visual model's aggregation of global and local information, or to combine the characteristics of convolution with the attention mechanism. In this paper, we propose Aggregated Attention, a biomimetically designed token mixer that, in terms of spatial information aggregation, enables each token to apply fine-grained attention to its nearest-neighbor features and coarse-grained attention to global features. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which further diversifies the generation of affinity matrices beyond relying solely on the similarity between queries and keys. All of these improvements are achieved within a single attention layer, eliminating the need to alternately stack different token mixers. Additionally, we propose Convolutional GLU, a channel mixer that bridges the gap between the GLU and SE mechanisms and empowers each token to perform channel attention based on its nearest-neighbor image features, enhancing local modeling capability and model robustness. We combine Aggregated Attention and Convolutional GLU to create a new visual backbone called TransNeXt. Extensive experiments demonstrate that TransNeXt achieves state-of-the-art performance across multiple model sizes. At a resolution of $224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0\%, surpassing ConvNeXt-B with 69\% fewer parameters. Our TransNeXt-Base achieves an ImageNet accuracy of 86.2\% and an ImageNet-A accuracy of 61.6\% at a resolution of $384^2$, along with a COCO object detection mAP of 57.1 and an ADE20K semantic segmentation mIoU of 54.7.
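To make the Convolutional GLU idea in the abstract concrete, the following is a minimal PyTorch sketch of a gated channel mixer whose gate is conditioned on each token's local neighborhood via a depthwise convolution, blending the GLU and SE ideas. It is an illustrative reading of the description above, not the authors' reference implementation; the class name, hidden dimension, 3x3 kernel size, and activation choice are assumptions.

```python
import torch
import torch.nn as nn


class ConvGLUSketch(nn.Module):
    """Sketch of a Convolutional GLU channel mixer (hyperparameters are assumptions).

    A linear layer produces two branches: a value branch and a gating branch.
    The gate passes through a 3x3 depthwise convolution, so the SE-like
    channel attention applied to each token depends on its nearest-neighbor
    image features rather than on a globally pooled descriptor.
    """

    def __init__(self, dim: int, hidden_dim: int, act=nn.GELU):
        super().__init__()
        self.fc_in = nn.Linear(dim, 2 * hidden_dim)          # value + gate branches
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)  # local context for the gate
        self.act = act()
        self.fc_out = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, N, C) token sequence with N = H * W
        v, g = self.fc_in(x).chunk(2, dim=-1)
        B, N, C = g.shape
        g = g.transpose(1, 2).reshape(B, C, H, W)            # tokens -> feature map
        g = self.dwconv(g).flatten(2).transpose(1, 2)         # neighborhood-aware gate
        return self.fc_out(v * self.act(g))                   # gated channel mixing


if __name__ == "__main__":
    mixer = ConvGLUSketch(dim=64, hidden_dim=128)
    tokens = torch.randn(2, 14 * 14, 64)
    print(mixer(tokens, H=14, W=14).shape)  # torch.Size([2, 196, 64])
```

In this sketch the gate plays the role of channel attention: because it is computed from a 3x3 neighborhood rather than a global pooled vector, each token is modulated by its nearby image features, which is how the abstract characterizes the bridge between GLU and SE.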
Supplementary Material: zip
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4422