Keywords: representation learning, self-supervised learning
Abstract: Inspired by Darcet et al. (2024), where extra tokens (or registers) are introduced to offset artifacts in feature maps caused by high-norm tokens, this paper presses further and asks a more challenging question: Can we find a suitable regularization term such that the extra tokens evolve into disentangled representations, capable of attending to finer details of objects (e.g., parts)? We propose XTRA, an intuitive yet powerful framework that augments Vision Transformers with dedicated ``factor tokens'' and enforces disentanglement via a novel Minimum Volume Constraint (MVC). A multi-stage aggregation process further confines these factor tokens to semantically pure components, especially when the number of hyperparameters is large. On ImageNet-1K, XTRA boosts KNN accuracy by 5.8\% and linear-probe accuracy by 2.3\% over leading self-supervised learning (SSL) baselines, outperforming even models trained on larger datasets.
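The abstract does not give the exact form of the Minimum Volume Constraint; a common choice for minimum-volume regularization (e.g., in minimum-volume matrix factorization) penalizes the log-determinant of the Gram matrix of the factor representations, shrinking the volume they span. The sketch below is purely illustrative under that assumption; the function name, the `(K, D)` layout of the factor-token matrix, and the `eps` jitter term are all hypothetical, not taken from the paper.

```python
import numpy as np

def min_volume_constraint(factor_tokens: np.ndarray, eps: float = 1e-4) -> float:
    """Illustrative minimum-volume penalty (assumed log-det form, not the paper's exact MVC).

    factor_tokens: (K, D) array holding K factor-token embeddings of dimension D.
    Returns the log-determinant of the (jittered) Gram matrix; minimizing it
    shrinks the volume spanned by the factor tokens.
    """
    K = factor_tokens.shape[0]
    gram = factor_tokens @ factor_tokens.T + eps * np.eye(K)  # (K, K), jitter keeps it PD
    _, logdet = np.linalg.slogdet(gram)  # numerically stable log-determinant
    return logdet
```

As a sanity check, scaling the factor tokens down shrinks the spanned volume and thus lowers this penalty.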
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 14442