Consistent Region-Informed Self-supervised Pretraining

18 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Deep Learning, Self-supervised learning (SSL), Universal visual representations, Dense prediction
TL;DR: CRISP is a self-supervised framework that enforces region-level consistency across views, yielding semantically rich and spatially precise representations for both dense prediction and global tasks.
Abstract: Dense prediction tasks such as semantic segmentation require representations that capture both global semantics and local structure. Most self-supervised learning methods prioritise image-level invariance, producing strong features for classification but offering limited guidance for tasks that depend on spatial coherence. In parallel, several approaches have been proposed specifically for dense prediction, but their improvements in local fidelity often come at the cost of weaker global transfer. We present CRISP (Consistent Region-Informed Self-Supervised Pretraining), a framework that enhances patch-level learning with explicit region-level alignment. CRISP discovers coherent regions in a reference image, projects them to augmented views via geometric correspondences, and aggregates their patch features into concept tokens with a mask-guided module. By enforcing consistency at the region, patch, and global levels, CRISP learns representations that are both semantically strong and spatially coherent. Pretraining on ImageNet-1K shows that CRISP achieves substantial gains on dense prediction benchmarks while maintaining strong performance on global benchmarks. These results establish region-level consistency as a critical ingredient for advancing universal visual representations.
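The mask-guided aggregation and region-level consistency described in the abstract can be illustrated with a minimal sketch: masked average pooling turns per-patch features into one concept token per region, and a cosine loss aligns matched tokens across two views. All function names, the pooling choice, and the loss form here are illustrative assumptions based on the abstract, not the paper's actual implementation.

```python
import numpy as np

def concept_tokens(patch_feats, region_masks, eps=1e-8):
    """Masked average pooling (illustrative, not the paper's module).

    patch_feats:  (P, D) patch embeddings from a backbone.
    region_masks: (R, P) binary masks; mask[r, p] = 1 if patch p is in region r.
    Returns (R, D) concept tokens, one per discovered region.
    """
    weights = region_masks / (region_masks.sum(axis=1, keepdims=True) + eps)
    return weights @ patch_feats  # (R, D)

def region_consistency_loss(tokens_a, tokens_b, eps=1e-8):
    """Negative mean cosine similarity between matched region tokens of two views
    (a simple stand-in for the region-level consistency objective)."""
    a = tokens_a / (np.linalg.norm(tokens_a, axis=1, keepdims=True) + eps)
    b = tokens_b / (np.linalg.norm(tokens_b, axis=1, keepdims=True) + eps)
    return -np.mean(np.sum(a * b, axis=1))

# Toy example: 4 patches with 3-dim features, grouped into 2 regions.
feats = np.array([[1., 0., 0.],
                  [1., 0., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.]])
masks = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1]], dtype=float)
tokens = concept_tokens(feats, masks)  # shape (2, 3)
loss = region_consistency_loss(tokens, tokens)  # identical views -> approx -1
```

In a full pipeline, the two token sets would come from geometrically corresponding regions in two augmented views, with this term combined with patch- and image-level losses.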
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11630