DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Foundation Models

ICLR 2026 Conference Submission 16297 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Tokenization, Vision Foundation Models
TL;DR: We introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART) module that adaptively partitions the image by combining learnable region scores with piecewise differentiable quantile operations.
Abstract: The content-agnostic, fixed-grid tokenizers used by standard large-scale vision models such as the Vision Transformer (ViT) and Vision Mamba (Vim) are a fundamental performance bottleneck, forcing a trade-off between capturing fine-grained detail and incurring redundant computation. To resolve this dilemma, we introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, allocating higher token density to information-rich regions. This unlocks a more efficient scaling paradigm: a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) at nearly double the inference speed by capturing high-resolution detail only in key regions. The benefits of adaptive tokenization also carry over to dense prediction and spatiotemporal video tasks. We argue that by resolving the tokenizer bottleneck at its source, adaptive tokenization is a key component for building the next generation of more efficient and capable foundation models for multimodal AI, robotics, and content generation.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16297