Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations

Published: 04 Mar 2023, Last Modified: 16 May 2023. ME-FoMo 2023 Spotlight.
Keywords: Vision Transformer, ViT, Self Supervised Learning, Joint Embedding Learning, Reconstruction Based Learning, Contrastive Modelling, Masked Image Modelling, Representation Similarity
TL;DR: We study the differences between joint embedding (MoCo, DINO) and reconstruction-based (MAE) self-supervised learning of vision transformers (ViT), and show how fine-tuning changes representations.
Abstract: Joint-embedding learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of the learned representations. Our analysis reveals that reconstruction-based features are significantly dissimilar to joint-embedding features, and that models trained with similar objectives learn similar features even across architectures. These differences arise early in the network and are driven primarily by attention and normalization layers. We find that joint-embedding features yield better linear-probe transfer for classification because the two objectives induce different distributions of information and invariances in the representation. These differences also explain opposite trends in transfer performance on downstream tasks that require spatial specificity in the features. Finally, we examine how fine-tuning changes reconstruction-based representations to enable better transfer, showing that it re-organizes the information to be more similar to that of pre-trained joint-embedding models.
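The kind of representation comparison described in the abstract is commonly carried out with a similarity metric such as linear CKA (Kornblith et al., 2019). The sketch below is illustrative only and is not the paper's code: it uses random placeholder tensors in place of real per-layer MAE and DINO token features, and the paper's exact metric and evaluation setup may differ.

```python
# Minimal sketch: linear CKA between two feature matrices, a standard way to
# compare layer representations across networks. The tensors below are random
# placeholders standing in for ViT features (e.g., [CLS] tokens) extracted
# from an MAE-pretrained and a DINO-pretrained model on the same images.
import torch


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between feature matrices of shape (n_samples, dim)."""
    # Center each feature dimension.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    # Linear-kernel formulation: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    numerator = ((y.T @ x) ** 2).sum()
    denominator = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return numerator / denominator


if __name__ == "__main__":
    n_images, dim = 512, 768  # e.g., ViT-B feature dimension
    feats_mae = torch.randn(n_images, dim)   # placeholder for MAE features
    feats_dino = torch.randn(n_images, dim)  # placeholder for DINO features
    print(f"CKA(MAE, DINO) = {linear_cka(feats_mae, feats_dino):.3f}")
```

With real activations, computing this score layer by layer gives a similarity profile across depth, which is how one would see the early-layer divergence between the two objective families that the abstract describes.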