ADATOK: ADAPTIVE VISUAL REPRESENTATION WITH QUALITY-PRESERVING DYNAMIC TOKENS

ICLR 2026 Conference Submission 16452 Authors

Published: 19 Sept 2025 (modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: Image Representation, Vision-Language Models
TL;DR: Leveraging information bottleneck theory, this paper proposes AdaTok, a framework that autonomously determines the optimal number of visual tokens for each image, enhancing representational efficiency.
Abstract: Image tokenization, a cornerstone of modern visual representation, faces a fundamental dilemma posed by content diversity. A fixed number of tokens is inherently suboptimal: it causes computational redundancy for simple images and risks information loss for complex ones. While variable-length methods offer a potential solution, they are typically empirical and heuristic, lacking a theoretical mechanism for adaptation. To address this dilemma, we propose \textbf{AdaTok}, an adaptive visual representation framework that flexibly matches token budgets to diverse representational needs. Specifically, it incorporates an elastic encoder capable of encoding an image into an arbitrary number of tokens. Building on this, we design a novel token selection strategy: guided by the information bottleneck principle, it enables the model to learn a policy that maximizes representational information under a minimal token budget. This allows AdaTok to autonomously find a sufficient yet compact token set for each image. Extensive experiments demonstrate that this elastic, sample-level tokenization yields superior performance in both image reconstruction and generation. By preserving essential details while minimizing redundancy, AdaTok not only improves efficiency but also aligns more naturally with the variable-length structure of language, paving the way for more unified and efficient vision-language models (VLMs).
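As a minimal sketch of this kind of objective, assuming the standard information-bottleneck Lagrangian with the expected token count standing in for the compression (rate) term; the symbols below (policy $\pi$, trade-off weight $\beta$, target $Y$) are illustrative assumptions, not the paper's exact formulation:

% Illustrative IB-style objective (an assumption, not the paper's stated loss).
% x is an input image, T(x) the token set chosen by selection policy \pi,
% and Y the downstream target (e.g., the reconstruction or generation output).
\begin{equation*}
  \max_{\pi} \; I\big(T; Y\big) \;-\; \beta \, \mathbb{E}_{x}\big[\, |T(x)| \,\big],
  \qquad \beta > 0,
\end{equation*}
% The first term rewards representational information about the target;
% the second penalizes the expected number of selected tokens (the budget).

Here the classical compression term $I(X;T)$ is replaced by the expected token count, a common simplification when each token carries bounded capacity; $\beta$ then controls where each image lands on the information-budget trade-off.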
Primary Area: generative models
Submission Number: 16452