DTR: Towards optimal token compression with data-driven token ranking for efficient vision-language model inference

ICLR 2026 Conference Submission 10510 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Token compression, data-driven token ranking, vision-language model, inference acceleration
TL;DR: Toward optimal VLM token compression with a data-driven method instead of existing model-driven methods (e.g., attention scores).
Abstract: Token compression is crucial for vision-language model (VLM) inference due to the tremendous computational complexity of VLMs. Although substantial work has explored model-driven methods that mine importance rankings among tokens for compression (e.g., ranking by attention scores or matrix ranks), these methods are all constrained by one-sided handcrafted information and thus trapped in local optima. To exploit comprehensive information toward a global optimum, we present a Data-driven Token Ranking (DTR) framework, which trains a plug-and-play token-ranking model on self-gathered token-ranking data for VLM token compression at runtime. Specifically, we first propose a dataset construction method that efficiently gathers importance rankings of tokens from original VLM datasets. We then present a training method that builds a token-ranking model to predict a ranked list of token importance from input vision and text tokens. Finally, the ranking model can be plugged into the VLM to filter tokens down to a user-defined token count at runtime for acceleration. Extensive experimental results across 8 mainstream benchmarks show that DTR achieves state-of-the-art token compression performance compared with 8 strong baselines. Moreover, a comprehensive analysis shows that DTR, and data-driven methods more broadly, possess tremendous potential: DTR can comprehensively outperform the vanilla VLM with far fewer tokens.
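The abstract describes a three-step pipeline: gather token-ranking data, train a ranking model, and filter tokens at runtime under a user-defined budget. Below is a minimal sketch of the runtime step only, assuming a learned scorer that ranks vision tokens conditioned on a pooled text representation and keeps the top-scoring ones. Everything here (the names TokenRanker, compress_tokens, and keep, and the MLP scorer architecture) is a hypothetical illustration; the abstract does not specify the authors' actual model.

```python
import torch
import torch.nn as nn


class TokenRanker(nn.Module):
    """Hypothetical plug-and-play token-ranking head (illustrative only).

    Scores each vision token conditioned on a pooled text representation;
    the paper's real architecture is not given in the abstract.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, 1),
        )

    def forward(self, vision_tokens: torch.Tensor,
                text_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (B, Nv, D); text_tokens: (B, Nt, D)
        text_ctx = text_tokens.mean(dim=1, keepdim=True)            # (B, 1, D)
        text_ctx = text_ctx.expand(-1, vision_tokens.size(1), -1)   # (B, Nv, D)
        scores = self.scorer(torch.cat([vision_tokens, text_ctx], dim=-1))
        return scores.squeeze(-1)                                   # (B, Nv)


def compress_tokens(ranker: TokenRanker,
                    vision_tokens: torch.Tensor,
                    text_tokens: torch.Tensor,
                    keep: int) -> torch.Tensor:
    """Keep the `keep` highest-ranked vision tokens (user-defined budget)."""
    scores = ranker(vision_tokens, text_tokens)                     # (B, Nv)
    idx = scores.topk(keep, dim=1).indices                          # (B, keep)
    idx = idx.sort(dim=1).values         # restore original token order
    batch = torch.arange(vision_tokens.size(0)).unsqueeze(-1)       # (B, 1)
    return vision_tokens[batch, idx]                                # (B, keep, D)
```

Under this reading, the compressed sequence would replace the full vision-token sequence before the language-model layers, with keep playing the role of the user-defined token count mentioned in the abstract.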
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 10510