Does MIM cheat? Exploring Semantic Invariance in Self-Supervised ViT Representations

ICLR 2026 Conference Submission 13192 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Self-Supervised Learning, Representation Learning, Computer Vision
TL;DR: We analyze embeddings of state-of-the-art ViTs to uncover semantically invariant features common to all MIM-based models, and propose a new model-agnostic projection for post-hoc denoising of representations.
Abstract: In this work, we analyze patch-level embeddings and show that MIM objectives bias representations toward non-semantic cues, limiting their effectiveness in downstream inference. To probe this effect, we introduce a model-agnostic counterfactual score that quantifies *semantic invariance* by comparing principal-component responses to real inputs and to noise, directly characterizing the tradeoff between semantic information and structural noise in ViT embeddings. Building on this measure, we propose Semantic Orthogonal Projection (SOaP), a post-hoc method that uses simple Gram–Schmidt orthogonalization to suppress these invariant components in patch representations. Our experiments show that SOaP consistently improves performance on multiple downstream tasks across state-of-the-art MIM-based models.
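The abstract describes two pieces: a counterfactual score that flags principal components responding to noise as strongly as to real inputs, and a Gram–Schmidt projection (SOaP) that removes those components post hoc. Below is a minimal NumPy sketch of one plausible instantiation; the function names, the noise-to-real response ratio used as the score, and the threshold are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def principal_components(patch_embeddings, k):
    """Top-k principal directions of a (num_patches, dim) embedding matrix."""
    centered = patch_embeddings - patch_embeddings.mean(axis=0, keepdims=True)
    # Right singular vectors are the principal directions in embedding space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]  # (k, dim)

def invariance_score(real_embeds, noise_embeds, k=32):
    """Counterfactual score (assumed form): a component is 'semantically
    invariant' if it responds about as strongly to pure-noise inputs as to
    real images. Returns (scores, pcs), one score in [0, 1] per component."""
    pcs = principal_components(real_embeds, k)            # (k, dim)
    real_resp = np.abs(real_embeds @ pcs.T).mean(axis=0)  # (k,)
    noise_resp = np.abs(noise_embeds @ pcs.T).mean(axis=0)
    return np.minimum(noise_resp / (real_resp + 1e-8), 1.0), pcs

def soap_project(embeds, pcs, scores, threshold=0.9):
    """SOaP-style projection: Gram-Schmidt-orthogonalize the flagged
    directions, then subtract each patch embedding's projection onto them."""
    basis = []
    for direction in pcs[scores >= threshold]:
        for b in basis:  # orthogonalize against directions already kept
            direction = direction - (direction @ b) * b
        norm = np.linalg.norm(direction)
        if norm > 1e-8:
            basis.append(direction / norm)
    out = embeds.copy()
    for b in basis:
        out -= np.outer(out @ b, b)  # remove the invariant axis
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for frozen-ViT patch embeddings of real and pure-noise images.
    real_embeds = rng.normal(size=(1024, 768))
    noise_embeds = rng.normal(size=(1024, 768))
    scores, pcs = invariance_score(real_embeds, noise_embeds, k=32)
    clean = soap_project(real_embeds, pcs, scores, threshold=0.9)
    print(scores.round(2), clean.shape)
```

In practice the two embedding matrices would come from running the same frozen ViT on real images and on noise images; the random stand-ins above only exercise the shapes.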
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 13192