PASS: Patch-Aware Self-Supervision for Vision Transformer

29 Sept 2021 (modified: 13 Feb 2023), ICLR 2022 Conference Withdrawn Submission
Keywords: Self-supervised learning, Vision Transformer, patch-level representations
Abstract: Recent self-supervised representation learning methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to further improve their performance by exploiting the architectural advantages of the underlying neural network, since current state-of-the-art visual pretext tasks for self-supervised learning are architecture-agnostic and thus do not enjoy this benefit. In particular, we focus on Vision Transformers (ViTs), which have recently gained much attention as a better architectural choice, often outperforming convolutional networks on various visual tasks. The unique characteristic of ViTs is that they take a sequence of disjoint patches from an input image and process patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined Patch-Aware Self-Supervision (PASS), for learning better patch-level representations. Specifically, we enforce invariance between each patch and its neighbors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, training ViTs with PASS produces more semantically meaningful patch-wise attention maps in an unsupervised manner, which is particularly beneficial for downstream tasks of the dense prediction type. Despite the simplicity of our scheme, we demonstrate that it significantly improves the performance of existing self-supervised learning methods on various visual tasks, including object detection and semantic segmentation.
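To illustrate the stated idea of treating neighboring patches as positives, below is a minimal sketch of a patch-neighbor invariance loss. It assumes a ViT backbone that outputs per-patch embeddings of shape (batch, num_patches, dim); the function name, the 4-connected neighborhood, and the plain cosine-similarity objective are illustrative assumptions, not the authors' actual PASS implementation, which may differ (e.g., it may operate across augmented views or use a momentum encoder).

```python
# Hypothetical sketch of a patch-level invariance term in the spirit of PASS.
import torch
import torch.nn.functional as F


def neighbor_invariance_loss(patch_emb: torch.Tensor, grid_size: int) -> torch.Tensor:
    """Pull each patch embedding toward its spatial neighbors.

    patch_emb: (B, N, D) patch embeddings from a ViT, with N = grid_size ** 2.
    Returns the negative mean cosine similarity between each patch and its
    4-connected neighbors in the patch grid (lower is better).
    """
    B, N, D = patch_emb.shape
    z = F.normalize(patch_emb, dim=-1).view(B, grid_size, grid_size, D)

    losses = []
    # Horizontal neighbors: each patch vs. the patch to its right.
    losses.append(-(z[:, :, :-1] * z[:, :, 1:]).sum(-1).mean())
    # Vertical neighbors: each patch vs. the patch below it.
    losses.append(-(z[:, :-1, :] * z[:, 1:, :]).sum(-1).mean())
    return torch.stack(losses).mean()


# Usage sketch: add this term to an existing image-level self-supervised
# objective (e.g., one computed on the [CLS] token), which is how an
# architecture-agnostic method could be augmented with a patch-aware term.
if __name__ == "__main__":
    dummy = torch.randn(2, 14 * 14, 768)  # e.g., ViT-B/16 on a 224x224 input
    print(neighbor_invariance_loss(dummy, grid_size=14).item())
```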
One-sentence Summary: We propose Patch-Aware Self-Supervision (PASS) for learning better patch-level representations in Vision Transformers.
Supplementary Material: zip
