Keywords: Self-supervised Learning, Representation Learning
Abstract: Self-supervised vision encoders have become critical components of modern machine learning systems. Despite remarkable advances in image understanding, generation, and multimodal alignment, the underlying representation of visual features has remained largely unchanged, constrained by historical architectures and benchmarks. This reliance on dense feature grids introduces redundancy and limits the integration of understanding and generation. We propose a novel framework that represents images with a small number of sparse tokens in the form of a low-rank matrix factorization. While mathematically simple, this formulation effectively disentangles semantic and spatial information. We demonstrate that vision-only self-supervised learning under this framework yields sparse token representations that simultaneously support high-quality image understanding, detailed pixel-level reconstruction, and fine-grained semantic understanding. Together, these results highlight sparse tokens as a promising alternative to dense grids for efficient and versatile visual representation learning.
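To make the abstract's core idea concrete, the following is a minimal illustrative sketch (not the paper's method) of how a dense feature grid can be approximated by a low-rank factorization that separates a small set of token vectors (semantic content) from per-location coefficients (spatial layout). All shapes, the use of truncated SVD, and the variable names here are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sketch, NOT the paper's implementation: a dense feature
# grid F of shape (H*W, D) is approximated as F ≈ A @ T, where
#   T : (K, D)   holds K << H*W token vectors (semantic content),
#   A : (H*W, K) holds per-location coefficients (spatial layout).
rng = np.random.default_rng(0)
H, W, D, K = 16, 16, 64, 8           # grid size, feature dim, token count

F = rng.standard_normal((H * W, D))  # stand-in for a dense encoder output

# Truncated SVD yields the best rank-K approximation in Frobenius norm.
U, S, Vt = np.linalg.svd(F, full_matrices=False)
A = U[:, :K] * S[:K]                 # (H*W, K) spatial coefficients
T = Vt[:K]                           # (K, D)   sparse-token matrix

F_hat = A @ T                        # rank-K reconstruction of the grid
rel_err = np.linalg.norm(F - F_hat) / np.linalg.norm(F)
print(F_hat.shape, rel_err < 1.0)
```

In this toy setup the image is summarized by K = 8 tokens instead of 256 grid cells; the learned factorization in the paper would replace the SVD with a trained encoder.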
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10670