GTA: Guided Transfer of Spatial Attention from Self-supervised Models

Published: 31 Jul 2023, Last Modified: 02 Aug 2023, ICCV 2023 Workshop VIPriors Withdrawn Submission
Keywords: Vision Transformer, transfer learning, fine-grained visual classification, spatial attention, self-supervised learning
TL;DR: GTA improves transfer learning performance for ViT by using spatial attention from a well-trained source model.
Abstract: Recently, self-supervised learning has enabled the pre-training of vision transformers (ViTs) on vast amounts of unlabeled data to obtain rich representations. Transferring such well-trained representations can lead to better performance and faster convergence than training from scratch. However, even when good representations are transferred, a model can easily overfit a limited training dataset and lose the characteristics of the transferred representations. This phenomenon is more severe in ViTs, which have low inductive bias. Through experimental analysis of attention maps in ViT, we observe that the rich representations deteriorate when the model is trained on a small dataset. Motivated by this finding, we propose a novel and simple regularization method for ViT called guided transfer of spatial attention (GTA). Our method regularizes the self-attention maps between the source and target models. Through this explicit regularization, the target model can fully exploit the transferred knowledge related to object localization. Our experiments show that GTA consistently improves accuracy across five benchmark datasets, especially when the amount of training data is small. To the best of our knowledge, no previous study has sought to improve transfer learning performance by specifically considering the ViT architecture.
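The abstract only sketches the mechanism, but the core idea, penalizing the divergence between the fine-tuned target model's self-attention maps and those of a frozen source model, can be illustrated with a minimal PyTorch sketch under stated assumptions. The `return_attention` flag, the KL-divergence form of the penalty, and the weighting factor `lam` are hypothetical illustration choices, not details confirmed by this page.

```python
import torch
import torch.nn.functional as F

def gta_loss(target_attn, source_attn, eps=1e-8):
    # KL divergence between the target model's attention distribution
    # and the frozen source model's, averaged over the batch.
    # Both tensors: (batch, heads, tokens, tokens), softmax-normalized.
    # (The exact distance used by GTA is an assumption here.)
    return F.kl_div(target_attn.clamp_min(eps).log(), source_attn,
                    reduction="batchmean")

def training_step(target_model, source_model, images, labels, lam=1.0):
    # `return_attention=True` is a hypothetical flag that makes the
    # ViT also return its per-layer self-attention maps.
    logits, tgt_attn = target_model(images, return_attention=True)
    with torch.no_grad():            # source model stays frozen
        _, src_attn = source_model(images, return_attention=True)
    ce = F.cross_entropy(logits, labels)
    # Average the attention-matching penalty over the matched layers.
    reg = sum(gta_loss(t, s) for t, s in zip(tgt_attn, src_attn))
    reg = reg / len(tgt_attn)
    return ce + lam * reg            # lam balances the two terms
```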
Supplementary Material: zip
Submission Number: 15