CG-SSL: Concept-Guided Self-Supervised Learning

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Representation Learning, Curriculum Learning, Self-supervised Learning, Vision Foundation Models, Vision Transformers
TL;DR: We introduce CG-SSL, a concept-guided self-supervised learning framework that aligns meaningful image regions across views, achieving state-of-the-art performance on dense prediction tasks.
Abstract: Humans understand visual scenes by first capturing a global impression and then refining this understanding into distinct, object-like components. Inspired by this process, we introduce \textbf{C}oncept-\textbf{G}uided \textbf{S}elf-\textbf{S}upervised \textbf{L}earning (CG-SSL), a novel framework that brings structure and interpretability to representation learning through a curriculum of three training phases: (1) global scene encoding, (2) discovery of visual concepts via tokenised cross-attention, and (3) alignment of these concepts across views. Unlike traditional SSL methods, which simply enforce similarity between multiple augmented views of the same image, CG-SSL accounts for the fact that these views may highlight different parts of an object or scene. To address this, our method establishes explicit correspondences between views and aligns the representations of meaningful image regions. At its core, CG-SSL augments standard SSL with a lightweight decoder that learns and refines concept tokens via cross-attention with patch features. The concept tokens are trained using masked concept distillation and a feature-space reconstruction objective. A final alignment stage enforces view consistency by geometrically matching concept regions under heavy augmentation, enabling more compact, robust, and disentangled representations of scene regions. Across multiple backbone sizes, CG-SSL achieves state-of-the-art results on image segmentation benchmarks using $k$-NN and linear probes, substantially outperforming prior methods and approaching, or even surpassing, the performance of leading SSL models trained on over $100\times$ more data. Code and pretrained models will be released.
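The core architectural addition described above is a lightweight decoder in which learnable concept tokens query patch features via cross-attention. The sketch below is a minimal, illustrative reconstruction of that idea, not the authors' released code: the class name `ConceptDecoder`, the defaults `dim=768`, `num_concepts=8`, and the single-block design are all assumptions for clarity.

```python
# Minimal sketch (assumptions noted above) of a concept-token
# cross-attention decoder over ViT patch features.
import torch
import torch.nn as nn


class ConceptDecoder(nn.Module):
    """Learns a small set of concept tokens that attend to patch features."""

    def __init__(self, dim: int = 768, num_concepts: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable concept tokens, shared across images (assumption).
        self.concepts = nn.Parameter(torch.randn(1, num_concepts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, patch_feats: torch.Tensor):
        # patch_feats: (B, N_patches, dim) from the SSL backbone.
        B = patch_feats.size(0)
        q = self.norm_q(self.concepts.expand(B, -1, -1))
        kv = self.norm_kv(patch_feats)
        # Concept tokens query the patch features; the attention map gives
        # soft concept-to-region assignments, the kind of correspondence
        # the cross-view alignment stage could operate on.
        out, attn = self.cross_attn(q, kv, kv, need_weights=True)
        return out + self.mlp(out), attn
```

The returned attention map has shape `(B, num_concepts, N_patches)`, so each concept token induces a soft spatial region per view; matching these regions across augmented views is one plausible way to realize the geometric concept alignment the abstract describes.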
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 12830