The Extra Token Matters: Disentangled Representation Learning with Vision Transformers

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: representation learning, self-supervised learning
Abstract: Inspired by ViT-Register, where extra tokens (or registers) are introduced to offset the artifacts in feature maps caused by high-norm tokens, this paper presses further and asks a more challenging question: can we find a suitable regularization term such that the extra tokens evolve into disentangled representations, capable of attending to finer details of objects (e.g., parts)? Successfully addressing this challenge can pave the way to controllable generation tasks. We propose XTRA, an intuitive yet powerful framework that augments Vision Transformers with dedicated "factor tokens" and enforces disentanglement via a novel Minimum Volume Constraint (MVC). A multi-stage aggregation process, inspired by GroupViT, further refines these factor tokens into semantically pure components, preventing the token collapse that often occurs when training with MVC alone. On ImageNet-1K, XTRA achieves superior disentanglement (an 8.4× improvement in SEPIN@1 over DINOv2) while simultaneously improving representation quality: KNN accuracy improves by 5.8% and linear-probe accuracy by 2.3%.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 14442