Learning Sparse Visual Representations via Spatial-Semantic Factorization

Published: 24 Apr 2026, Last Modified: 01 Jun 2026VisCon 2026 OralEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Self-supervised Learning, Representation Learning
Abstract: Self-supervised learning in vision is split between methods that learn strong semantics through invariance and methods that preserve spatial detail through reconstruction. We argue that this tension is largely induced by the dense grid representation itself. We propose STELLAR, which factorizes an image representation into sparse semantic tokens and their spatial assignment map. This separation lets the semantic tokens remain view-invariant while the localization matrix absorbs spatial equivariance, enabling both semantic alignment and image reconstruction within one latent space. STELLAR learns the factorized representation with low-rank reconstruction, online concept clustering, and optimal-transport alignment across views. With only 16 tokens, STELLAR achieves strong reconstruction and semantic quality simultaneously, including 2.60 FID and 79.10% ImageNet linear accuracy. The results show that sparse factorized latents can bridge discriminative and generative visual representation learning.
Submission Number: 33
Loading