Escaping the Big Data Paradigm in Self-Supervised Representation Learning

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Representation Learning, self-supervised learning, data efficiency, computer vision, SCOTT, MIM-JEPA, Joint-Embedding Predictive Architecture, Masked Image Modeling
TL;DR: We introduce SCOTT and MIM-JEPA, enabling Vision Transformers to be trained from scratch on small datasets—significantly outperforming supervised methods and challenging the big data paradigm in self-supervised learning.
Abstract:

The reliance on large-scale datasets and extensive computational resources has become a major barrier to advancing representation learning from images, particularly in domains where data is scarce or expensive to obtain. In this paper, we address the critical question: Can we escape the big data paradigm in self-supervised representation learning from images? We introduce SCOTT (Sparse Convolutional Tokenizer for Transformers), a simple tokenization architecture that injects convolutional inductive biases into Vision Transformers (ViTs), improving their efficacy in small-scale data regimes while remaining compatible with Masked Image Modeling (MIM) tasks. Alongside SCOTT, we propose MIM-JEPA, a Joint-Embedding Predictive Architecture within a MIM framework that operates in latent representation space to capture more semantic features. Our approach enables ViTs to be trained from scratch on datasets orders of magnitude smaller than traditionally required, without relying on massive external datasets for pretraining. We validate our method on three small, high-resolution, fine-grained datasets: Oxford Flowers-102, Oxford IIIT Pets-37, and ImageNet-100. Despite the challenges of limited data and high intra-class similarity, our frozen SCOTT models pretrained with MIM-JEPA significantly outperform fully supervised methods and achieve results competitive with state-of-the-art approaches that rely on large-scale pretraining, complex image augmentations, and larger models. By demonstrating that robust off-the-shelf representations can be learned with limited data, compute, and model size, our work paves the way for computer vision applications in resource-constrained environments such as medical imaging and robotics. Our findings challenge the prevailing notion that vast amounts of data are indispensable for effective representation learning, offering a new pathway toward more accessible and inclusive advances in the field.
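Since this page gives only a high-level description, the following PyTorch sketch is an illustrative guess at the two components, not the authors' implementation: a SCOTT-like convolutional tokenizer that zeroes masked regions so they cannot leak into visible tokens, and a MIM-JEPA-style objective that predicts an EMA teacher's latent representations at masked positions (the general JEPA recipe). All names (ConvTokenizer, mim_jepa_loss, ema_update), layer counts, and hyperparameters here are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvTokenizer(nn.Module):
    # Hypothetical SCOTT-style stem: four stride-2 3x3 convs give /16
    # downsampling, matching one standard ViT patch, while adding the
    # locality and translation-equivariance biases a linear patch
    # embedding lacks. Masked locations are zeroed before every conv so
    # masked content cannot leak into visible tokens: a dense
    # approximation of sparse convolution, which would skip masked
    # sites entirely.
    def __init__(self, in_ch=3, dim=192):
        super().__init__()
        chans, layers, c = [dim // 4, dim // 2, dim, dim], [], in_ch
        for out_c in chans:
            layers.append(nn.Conv2d(c, out_c, kernel_size=3, stride=2, padding=1))
            c = out_c
        self.convs = nn.ModuleList(layers)

    def forward(self, x, mask=None):
        # x: (B, 3, H, W); mask: (B, H/16, W/16) bool, True = masked token
        for conv in self.convs:
            if mask is not None:
                m = F.interpolate(mask[:, None].float(), size=x.shape[-2:])
                x = x * (1.0 - m)  # zero masked regions before each conv
            x = F.gelu(conv(x))
        return x.flatten(2).transpose(1, 2)  # (B, N, dim) token sequence

@torch.no_grad()
def ema_update(teacher, student, tau=0.996):
    # Teacher weights track the student as an exponential moving average.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1.0 - tau)

def mim_jepa_loss(student, teacher, predictor, images, mask):
    # JEPA-style masked image modeling: the loss lives in latent space
    # rather than pixel space, pushing the encoder toward semantic features.
    with torch.no_grad():
        targets = teacher(images)             # (B, N, D) latents of the full image
    context = student(images, mask=mask)      # encoder sees only visible content
    preds = predictor(context)                # predicted latents at every position
    m = mask.flatten(1)                       # (B, N) boolean
    return F.mse_loss(preds[m], targets[m])   # penalize only masked positions

# Shape check for the tokenizer alone (224x224 input, ~75% of tokens masked):
#   tok = ConvTokenizer()
#   x = torch.randn(2, 3, 224, 224)
#   mask = torch.rand(2, 14, 14) < 0.75
#   tokens = tok(x, mask)   # (2, 196, 192)

The zero-before-conv trick stands in for true sparse convolution here; it keeps the sketch short but still illustrates why a convolutional stem can stay MIM-compatible, which is the property the abstract attributes to SCOTT.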

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9629