The Extra Token Matters: Disentangled Representation Learning with Vision Transformers

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: representation learning, self-supervised learning
Abstract: Inspired by ViT-Register, where extra tokens (or registers) are introduced to offset the artifacts in feature maps caused by high-norm tokens, this paper presses further and asks a more challenging question: can we find a suitable regularization term such that the extra tokens evolve into disentangled representations, capable of attending to finer details of objects (e.g., parts)? Successfully addressing this challenge can pave the way to controllable generation tasks. We propose XTRA, an intuitive yet powerful framework that augments Vision Transformers with dedicated "factor tokens" and enforces disentanglement via a novel Minimum Volume Constraint (MVC). A multi-stage aggregation process, inspired by GroupViT, further refines these factor tokens into semantically pure components, preventing the token collapse that often occurs when training with MVC alone. On ImageNet-1K, XTRA achieves superior disentanglement (an 8.4× improvement in SEPIN@1 over DINOv2) while simultaneously improving representation quality: KNN accuracy improves by 5.8% and linear-probe accuracy by 2.3%.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 14442