ReMixer: Object-aware Mixing Layer for Vision Transformers and Mixers

Published: 25 Mar 2022, Last Modified: 05 May 2023, ICLR2022 OSC Poster, Readers: Everyone
Keywords: vision transformers, patch-based models, object-centric learning
TL;DR: We propose an object-aware mixing layer for patch-based models.
Abstract: Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have shown impressive results on various visual recognition tasks, outperforming classic convolutional networks. While the initial patch-based models treated all patches equally, recent studies show that incorporating inductive biases such as spatiality benefits the learned representations. However, most prior work has focused solely on the position of patches, overlooking the scene structure of images. This paper aims to further guide the interaction between patches using object information. Specifically, we propose ReMixer, which reweights the patch mixing layers based on patch-wise object labels extracted from pretrained saliency or classification models. We apply ReMixer to various patch-based models that use different patch mixing layers (ViT, MLP-Mixer, and ConvMixer) and find that it consistently improves the classification accuracy and background robustness of the baseline models.
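
As a rough illustration of what reweighting a patch mixing layer by object labels could look like, the sketch below scales attention-style mixing weights by a similarity mask computed from patch-wise labels, then renormalizes each row. The abstract does not give the formula, so the exponential mask, the `kappa` scale, the renormalization step, and both function names are assumptions made for illustration, not necessarily the method used in the paper.

```python
import math


def object_similarity_mask(labels, kappa=1.0):
    """Hypothetical mask: M[i][j] = exp(-kappa * |labels[i] - labels[j]|).

    Patches with the same object label get weight 1; the weight decays
    toward 0 as the labels differ. This form is an assumption.
    """
    return [[math.exp(-kappa * abs(a - b)) for b in labels] for a in labels]


def reweight_mixing(weights, labels, kappa=1.0):
    """Scale non-negative mixing weights by the mask, then renormalize rows."""
    mask = object_similarity_mask(labels, kappa)
    reweighted = []
    for w_row, m_row in zip(weights, mask):
        scaled = [w * m for w, m in zip(w_row, m_row)]
        total = sum(scaled)
        reweighted.append([s / total for s in scaled])
    return reweighted


# Toy input: 4 patches; the first two lie on the object (saliency 1.0),
# the last two on the background (saliency 0.0).
labels = [1.0, 1.0, 0.0, 0.0]
uniform = [[0.25] * 4 for _ in range(4)]  # every patch attends equally
mixed = reweight_mixing(uniform, labels, kappa=2.0)
```

In this toy case each object patch ends up mixing more strongly with the other object patch than with the background patches, while every row still sums to one. Row renormalization only makes sense for non-negative, attention-style weights; an MLP-Mixer or ConvMixer layer would need a different treatment.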