Keywords: Self-supervised Learning, Representation Learning
Abstract: Self-supervised vision encoders have become critical components of modern machine learning systems. Despite remarkable advances in image understanding, generation, and multimodal alignment, the underlying representation of visual features has remained largely unchanged, constrained by historical architectures and benchmarks. This reliance on dense feature grids introduces redundancy and limits the integration of understanding and generation. We propose a novel framework that represents images with a small number of sparse tokens in the form of a low-rank matrix factorization. While mathematically simple, this formulation effectively disentangles semantic and spatial information. We demonstrate that vision-only self-supervised learning under this framework yields sparse token representations that simultaneously support high-quality image understanding, detailed pixel-level reconstruction, and fine-grained semantic understanding. Together, these results highlight sparse tokens as a promising alternative to dense grids for efficient and versatile visual representation learning.
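To make the abstract's core idea concrete, the following is a minimal illustrative sketch (not the paper's method) of how a dense feature grid can be approximated by a low-rank factorization that separates a small set of token vectors (semantic content) from per-location coefficients (spatial layout). All shapes, the use of truncated SVD, and the variable names here are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sketch, NOT the paper's implementation: a dense feature
# grid F of shape (H*W, D) is approximated as F ≈ A @ T, where
#   T : (K, D)   holds K << H*W token vectors (semantic content),
#   A : (H*W, K) holds per-location coefficients (spatial layout).
rng = np.random.default_rng(0)
H, W, D, K = 16, 16, 64, 8           # grid size, feature dim, token count

F = rng.standard_normal((H * W, D))  # stand-in for a dense encoder output

# Truncated SVD yields the best rank-K approximation in Frobenius norm.
U, S, Vt = np.linalg.svd(F, full_matrices=False)
A = U[:, :K] * S[:K]                 # (H*W, K) spatial coefficients
T = Vt[:K]                           # (K, D)   sparse-token matrix

F_hat = A @ T                        # rank-K reconstruction of the grid
rel_err = np.linalg.norm(F - F_hat) / np.linalg.norm(F)
print(F_hat.shape, rel_err < 1.0)
```

In this toy setup the image is summarized by K = 8 tokens instead of 256 grid cells; the learned factorization in the paper would replace the SVD with a trained encoder.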
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10670