A Probabilistic Model behind Self-Supervised Learning

TMLR Paper 2786 Authors

01 Jun 2024 (modified: 23 Jul 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels. A common task is to classify augmentations or different modalities of the data, which share semantic _content_ (e.g. an object in an image) but differ in _style_ (e.g. the object's location). Many approaches to self-supervised learning have been proposed, e.g. SimCLR, CLIP and VICReg, which have recently gained much attention for representations that achieve downstream performance comparable to supervised learning. However, a theoretical understanding of the mechanism behind self-supervised methods remains elusive. Addressing this, we present a generative latent variable model for self-supervised learning and show that several families of discriminative SSL methods, including contrastive methods, induce a comparable distribution over representations, providing a unifying theoretical framework for these methods. The proposed model also justifies connections drawn to mutual information and the use of a "projection head". Learning representations by fitting the model generatively (termed SimVAE) improves performance over discriminative and other VAE-based methods on simple image benchmarks and significantly narrows the gap between generative and discriminative representation learning in more complex settings. Importantly, as our analysis predicts, SimVAE outperforms discriminative self-supervised methods on tasks where style information is required, taking an important step toward understanding self-supervised methods and achieving task-agnostic representations.
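To make the abstract's central idea concrete, below is a minimal PyTorch sketch of a VAE-style latent variable model in which augmented views of the same sample are encouraged to share a common (content) latent. The architecture, loss weights, class name `SimVAESketch`, and the specific way the shared-content prior is approximated (tying posterior means of related views) are illustrative assumptions for this summary, not the authors' SimVAE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimVAESketch(nn.Module):
    """Hypothetical sketch: a simple VAE encoder/decoder applied per view."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def encode(self, x):
        h = self.enc(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def forward(self, views):
        # `views`: list of augmented views of the same underlying content.
        mus, logvars, recons = [], [], []
        for x in views:
            mu, logvar = self.encode(x)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation
            recons.append(self.dec(z))
            mus.append(mu)
            logvars.append(logvar)
        return mus, logvars, recons

def generative_ssl_loss(views, mus, logvars, recons, tie_weight=1.0):
    # Standard ELBO terms per view: reconstruction + KL to a unit Gaussian.
    loss = 0.0
    for x, mu, logvar, xr in zip(views, mus, logvars, recons):
        loss = loss + F.mse_loss(xr, x, reduction='sum')
        loss = loss - 0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Assumed stand-in for the shared-content prior: pull the posterior means
    # of views of the same sample toward their average.
    mu_bar = torch.stack(mus).mean(dim=0)
    for mu in mus:
        loss = loss + tie_weight * F.mse_loss(mu, mu_bar, reduction='sum')
    return loss
```

In use, `views` would be, e.g., two augmentations of the same image batch, and the tied latents play the role of the content representation evaluated downstream; the reconstruction term is what lets style information survive in the representation, in line with the abstract's claim.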
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Changes, highlighted in red, are in response to official reviews.
Assigned Action Editor: ~Sanghyuk_Chun1
Submission Number: 2786