A Close Look at Token Mixer: From Attention to Convolution

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Convolution, Attention, Visual Representation
TL;DR: We take a close look at two classical token mixers, convolution and attention. A detailed comparison and visual analysis motivate us to present a novel fully convolutional vision transformer, which achieves promising performance on several benchmarks.
Abstract: There is an increasingly intensive debate about the effectiveness of ConvNets and Transformers in the vision field. Originating from the language processing community, Transformers show great promise for many vision tasks thanks to their insightful architecture design and attention mechanism. Nevertheless, ConvNets soon struck back, surpassing Transformer variants on mainstream vision tasks. In this paper, we do not engage in this debate; instead, we look into the details of attention and convolution. By examining the self-attention responses in Transformers, we empirically find that (1) Vision Transformers exhibit query-irrelevant behavior in deep layers, where the attention maps present nearly consistent global contexts regardless of the query patch position (and are also head-irrelevant); this phenomenon suggests that a global context may hide behind the self-attention mechanism. (2) The attention maps are intrinsically sparse; introducing knowledge from ConvNets largely smooths the attention and improves performance. Motivated by these observations, we generalize the self-attention formulation to abstract the query-irrelevant global context directly and further integrate this global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (FCViT), consists purely of convolutional layers and firmly inherits the merits of both the attention mechanism and convolutions, including the dynamic property, weight sharing, and short- and long-range feature modeling. Experimental results demonstrate the effectiveness of FCViT. With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than the previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also show promising transferability to downstream tasks such as object detection, instance segmentation, and semantic segmentation. Code and pretrained models are available at: https://anonymous.4open.science/r/FCViT-pytorch.
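To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a token mixer that replaces query-dependent self-attention with a query-irrelevant global context and fuses it with depthwise convolution for local mixing. This is only an illustration under a simplified reading of the abstract: the module name GlobalContextConvMixer and all layer choices are hypothetical and are not the authors' exact FCViT design.

```python
# Minimal sketch: a purely convolutional token mixer that combines a
# query-irrelevant (input-dependent) global context with depthwise
# convolution for short-range mixing. Hypothetical illustration, not
# the authors' exact FCViT block.
import torch
import torch.nn as nn


class GlobalContextConvMixer(nn.Module):
    """Token mixer: dynamic global context + depthwise local convolution."""

    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        # Local, short-range mixing with a depthwise convolution (weight sharing).
        self.local_mix = nn.Conv2d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        # Query-irrelevant global context: globally pooled features passed
        # through 1x1 convolutions (dynamic: depends on the input, not on a query).
        self.context_proj = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim, 1),
            nn.GELU(),
            nn.Conv2d(dim, dim, 1),
        )
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map.
        local = self.local_mix(x)          # short-range features
        global_ctx = self.context_proj(x)  # (B, C, 1, 1), broadcast over H and W
        return self.norm(local + global_ctx)  # fuse long- and short-range signals


if __name__ == "__main__":
    block = GlobalContextConvMixer(dim=64)
    out = block(torch.randn(2, 64, 56, 56))
    print(out.shape)  # torch.Size([2, 64, 56, 56])
```

The pooled 1x1 context is broadcast to every spatial position, mirroring the abstract's observation that deep-layer attention maps are nearly identical for every query patch, so a single shared global context can stand in for per-query attention.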
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
