AdaptiVision: A Flexible and Efficient Vision Transformer for Adaptive Token Pruning

ICLR 2026 Conference Submission 18834 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision Transformer, Token Pruning, Adaptive Clustering, Adaptive Visual Representation
Abstract: Transformer-based architectures have recently demonstrated remarkable performance in various vision tasks by capturing global contextual relationships through self-attention. However, this success comes at a high computational cost: the self-attention mechanism scales quadratically with the number of visual tokens, limiting its scalability to high-resolution inputs and real-time applications. Although several recent efforts have aimed to reduce this complexity via token pruning or condensation, these methods often rely on heuristic importance scores or non-differentiable selection strategies, which can lead to suboptimal performance and limit generalization across tasks. To address these limitations, we propose AdaptiVision, a flexible and efficient Vision Transformer architecture designed to dynamically adapt the token set throughout the network. At the core of AdaptiVision is a differentiable token condensation module based on clustering, which groups semantically similar tokens and allows the model to retain only the most informative and representative ones while discarding redundancies. To guide this condensation process, we introduce a semantic guidance mechanism that incorporates auxiliary semantic signals (such as saliency or label-based cues) to preserve task-relevant structures during token reduction. Furthermore, we design auxiliary consistency and stability objectives that promote coherent token clustering across layers and inputs, enabling better generalization and robustness without sacrificing performance. We conduct extensive experiments across multiple challenging benchmarks to validate the effectiveness of our model. Notably, on the ImageNet-1K dataset, our proposed AdaptiVision achieves the highest Top-1 accuracy (79.8%) among comparable vision transformers while substantially reducing the number of parameters and FLOPs, demonstrating a superior accuracy-efficiency trade-off.
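
To make the abstract's central idea concrete, the following is a minimal sketch of a differentiable, clustering-based token condensation step in PyTorch. It is not the authors' implementation: the module name `SoftTokenCondensation`, the learned cluster queries, the temperature, and the soft-assignment pooling are all illustrative assumptions consistent with the description above, where soft assignment keeps the token reduction end-to-end differentiable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftTokenCondensation(nn.Module):
    """Hypothetical sketch of differentiable token condensation.

    Input tokens are softly assigned to a smaller set of learned
    cluster queries; each output token is the assignment-weighted
    average of the input tokens, so gradients flow through the
    reduction (unlike hard, non-differentiable token selection).
    """

    def __init__(self, dim: int, num_clusters: int, temperature: float = 1.0):
        super().__init__()
        # Learned cluster centers (an assumption; the paper's actual
        # clustering mechanism may differ).
        self.cluster_queries = nn.Parameter(torch.randn(num_clusters, dim))
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        # Similarity between every token and every cluster query.
        logits = torch.einsum("bnd,kd->bnk", x, self.cluster_queries)
        # Soft (differentiable) token-to-cluster assignment.
        assign = F.softmax(logits / self.temperature, dim=-1)  # (B, N, K)
        # Normalize over tokens so each condensed token is a weighted mean.
        weights = assign / (assign.sum(dim=1, keepdim=True) + 1e-6)
        condensed = torch.einsum("bnk,bnd->bkd", weights, x)   # (B, K, dim)
        return condensed


# Usage: condense 196 patch tokens down to 64 representative tokens.
tokens = torch.randn(2, 196, 384)
module = SoftTokenCondensation(dim=384, num_clusters=64)
print(module(tokens).shape)  # torch.Size([2, 64, 384])
```

Because the assignment is a softmax rather than a hard argmax, the condensation ratio can be tuned per layer, and auxiliary objectives (such as the consistency and stability losses the abstract mentions) can be applied directly to the assignment matrix.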
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 18834