Keywords: Soft Equivariance, Self-Supervised Learning, Invariant Representation, Vision Transformer
Abstract: A central principle in self-supervised learning (SSL) is to learn data representations that are invariant to semantic-preserving transformations; for example, image representations should remain unchanged under augmentations such as cropping or color jitter. While effective for classification, such invariance can suppress transformation-relevant information that is valuable for other tasks. To address this, recent works explore equivariant representation learning, which encourages representations to retain information about the applied transformations. However, existing approaches have yet to demonstrate scalability in large-scale pre-training settings, e.g., on ImageNet. We conjecture that enforcing invariance and equivariance on the same layer is inherently difficult and, if handled naively, may even hinder learning. To overcome this, we propose a simple yet scalable method that decouples the two objectives: it learns invariant representations via standard SSL while softly regularizing intermediate features with an equivariance loss. Our approach requires neither transformation labels nor transformation-prediction objectives; instead, it operates directly with group actions applied to the intermediate feature maps. We show that this soft equivariance regularization significantly improves the generalization of vision transformers (ViTs) pre-trained on ImageNet-1k, leading to stronger downstream classification accuracy on ImageNet and its variants, covering both natural distribution shifts and a broad range of common corruptions and perturbations (ImageNet-C and ImageNet-P). Our code is available at https://anonymous.4open.science/r/erl-B5CE.
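The abstract describes softly regularizing intermediate features with an equivariance loss driven by group actions on the feature maps. Below is a minimal, hedged PyTorch sketch of what such a regularizer could look like; the toy encoder, the choice of 90-degree rotation as the group action, and names such as `soft_equivariance_loss`, `block_index`, and `lambda_eq` are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPatchEncoder(nn.Module):
    """Toy stand-in for a ViT: patch embedding followed by token-wise residual blocks.
    Intermediate patch tokens are exposed so a group action can be applied to them."""

    def __init__(self, patch=16, dim=64, depth=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(depth)
        )

    def forward(self, x, return_block=None):
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, D) patch tokens
        for i, blk in enumerate(self.blocks):
            tokens = tokens + blk(tokens)
            if return_block is not None and i == return_block:
                return tokens  # intermediate features f_l(x)
        return tokens


def tokens_to_grid(tokens):
    # (B, N, D) patch tokens -> (B, D, H, W) spatial feature map (square grid assumed).
    B, N, D = tokens.shape
    h = int(N ** 0.5)
    return tokens.transpose(1, 2).reshape(B, D, h, h)


def grid_to_tokens(fmap):
    # (B, D, H, W) -> (B, N, D)
    B, D, H, W = fmap.shape
    return fmap.reshape(B, D, H * W).transpose(1, 2)


def soft_equivariance_loss(encoder, images, block_index=1, k=1):
    """Penalize || f_l(g.x) - g.f_l(x) ||^2 with g a rotation by k*90 degrees,
    applied both to the input image and to the intermediate feature grid.
    No transformation labels or prediction heads are involved."""
    feats_x = encoder(images, return_block=block_index)                                  # f_l(x)
    feats_gx = encoder(torch.rot90(images, k, dims=(2, 3)), return_block=block_index)    # f_l(g.x)
    g_feats_x = grid_to_tokens(torch.rot90(tokens_to_grid(feats_x), k, dims=(2, 3)))     # g.f_l(x)
    return F.mse_loss(feats_gx, g_feats_x)


# Example usage: add as a soft penalty next to a standard SSL invariance objective.
enc = ToyPatchEncoder()
imgs = torch.randn(4, 3, 224, 224)
loss_eq = soft_equivariance_loss(enc, imgs)
# total_loss = ssl_invariance_loss + lambda_eq * loss_eq   # lambda_eq: hypothetical weight
```

The key design point this sketch tries to mirror is the decoupling: the final representation stays under the usual invariance objective, while only an intermediate layer is softly pushed toward equivariance under the group action.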
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15312