Self-Supervised Learning with the Matching Gap

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: optimal transport, self-supervised learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: A new loss for SSL that leverages optimal matchings to learn representations; easy to compute with the Sinkhorn algorithm and, thanks to Danskin's theorem, differentiable without having to backpropagate through Sinkhorn iterations.
Abstract: Contrastive learning (CL) is a fundamental paradigm in self-supervised learning. CL methods rely on a loss that pulls the features of different views of the same image closer together, while pushing apart those drawn from different images. Such a loss favors invariance: feature representations of perturbed versions of the same image should collapse to the same vector, while remaining far enough from those of any other image. Although intuitive, CL leaves room for trivial solutions, and has a documented propensity to collapse representations of very different images. This is often mitigated by using a very large variety of augmentations. In this work, we address this tension by introducing a different loss, the matching gap. Given a set of $n$ images transformed in two different ways, the matching gap is the difference between the mean cost (e.g. a squared distance), in representation space, of the $n$ paired images, and the optimal matching cost obtained by running an optimal matching solver across these two families of $n$ images. The matching gap naturally mitigates the problem of data augmentation invariance, since it can be zero without requiring features from the same image to collapse. We implement the matching gap using the Sinkhorn algorithm and show that it can be easily differentiated using Danskin’s theorem. In practice, we show that we can learn competitive features, even without extensive data augmentations: using only cropping and flipping, we achieve 74.2% top-1 accuracy with a ViT-B/16 on ImageNet-1k, to be compared to 72.9% for I-JEPA (Assran et al., 2023).
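
The sketch below illustrates the loss described in the abstract; it is not the authors' code. It assumes a squared Euclidean cost, uniform marginals, and a hand-rolled entropic (log-domain) Sinkhorn solver; the function names and the hyperparameters `epsilon` and `num_iters` are hypothetical choices for illustration.

```python
# Minimal sketch of a matching-gap loss (assumed details: squared Euclidean cost,
# uniform marginals, entropic Sinkhorn with log-domain updates). Per Danskin's
# theorem, the optimal coupling is treated as a constant when differentiating,
# so no gradients are propagated through the Sinkhorn iterations.
import jax
import jax.numpy as jnp


def squared_cost(x, y):
    """Pairwise squared Euclidean distances between two batches of features."""
    return jnp.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)


def sinkhorn_coupling(cost, epsilon=0.05, num_iters=100):
    """Entropic OT coupling between two uniform n-point measures."""
    n = cost.shape[0]
    log_mu = -jnp.log(n) * jnp.ones(n)   # uniform marginals, in log space
    f = jnp.zeros(n)                     # dual potentials
    g = jnp.zeros(n)
    for _ in range(num_iters):
        f = epsilon * (log_mu - jax.scipy.special.logsumexp(
            (g[None, :] - cost) / epsilon, axis=1))
        g = epsilon * (log_mu - jax.scipy.special.logsumexp(
            (f[:, None] - cost) / epsilon, axis=0))
    # Recover the (approximately) optimal coupling P* from the dual potentials.
    return jnp.exp((f[:, None] + g[None, :] - cost) / epsilon)


def matching_gap(z1, z2, epsilon=0.05, num_iters=100):
    """Mean cost of the n paired views minus the optimal matching cost."""
    cost = squared_cost(z1, z2)
    paired = jnp.mean(jnp.diagonal(cost))        # cost of the n "true" pairs
    coupling = jax.lax.stop_gradient(            # Danskin: no backprop through Sinkhorn
        sinkhorn_coupling(cost, epsilon, num_iters))
    optimal = jnp.sum(coupling * cost)           # <P*, C>; gradients flow only via C
    return paired - optimal
```

The stop-gradient reflects the envelope argument invoked in the abstract: the gradient of the (entropic) optimal matching cost with respect to the cost matrix is the optimal coupling itself, so freezing the coupling and differentiating only through the cost yields the correct gradient without unrolling the solver.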
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8379