Keywords: self-supervised learning, contrastive learning, representation learning, theoretical understanding
TL;DR: We propose a latent variable model that unifies several self-supervised learning methods (e.g. SimCLR) and use it to learn representations that narrow the gap between generative and discriminative self-supervised learning.
Abstract: Self-supervised representation learning is a powerful paradigm that leverages the relationship between semantically similar data, such as augmentations, extracts of an image or sound clip, or multiple views/modalities. Recent methods, e.g. SimCLR, CLIP and DINO, have made significant strides, yielding representations that achieve state-of-the-art results on multiple downstream tasks. A number of self-supervised discriminative approaches have been proposed, e.g. instance discrimination, latent clustering and contrastive methods.
Though these methods are often intuitive, a comprehensive theoretical understanding of their underlying mechanisms, or of *what* they learn, remains elusive.
Meanwhile, generative approaches, such as variational autoencoders (VAEs), fit a specific latent variable model and have principled appeal, but lag significantly in terms of performance. We present a theoretical analysis of self-supervised discriminative methods and a graphical model that reflects the assumptions they implicitly make and unifies these methods. We show that fitting this model under an ELBO objective improves representations over previous VAE methods on several common benchmarks, narrowing the gap to discriminative methods, and can also preserve information lost by discriminative approaches. This work brings new theoretical insight to modern machine learning practice.
Submission Number: 88