All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs

15 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Token Pruning, Vision-Language Models
Abstract: Vision Large Language Models (VLLMs) usually incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper identifies a key observation: in deeper layers (_e.g._, beyond the 20th), existing training-free pruning methods _perform no better than random pruning_. We hypothesize that this degradation is caused by **"vanishing token information"**, where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we formally quantify a token's information content by measuring the perturbation to the model's output probability upon its removal. Using this metric, our layer-wise analysis of visual-token information reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), than for more general tasks like Visual Question Answering (VQA); (3) The horizon is also strongly correlated with model capacity, as stronger VLLMs (_e.g._, Qwen2.5-VL) make more effective use of deep visual tokens than weaker models (_e.g._, LLaVA-1.5). Based on these findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods across various models and benchmarks, with improvements of up to 6.7% on LLaVA-1.5-7B. Combining DART with random pruning achieves state-of-the-art results, maintaining 93.9% of Qwen2.5-VL-7B performance while pruning 50% of visual tokens.
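The removal-based information metric described in the abstract can be illustrated with a minimal sketch. The code below is an illustration under stated assumptions rather than the authors' implementation: it measures a visual token's information as the KL divergence between the model's next-token distributions with and without that token, and the names `model`, `visual_embeds`, and the keyword interface are hypothetical placeholders for a concrete VLLM's API.

```python
# Illustrative sketch only; the paper's exact formulation may differ.
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_information(model, input_ids, visual_embeds, token_idx):
    """Estimate the information content of visual token `token_idx` as the
    perturbation of the output distribution when that token is removed.
    `model` is assumed to accept `input_ids` and `visual_embeds` and return
    an object with a `.logits` tensor of shape [batch, seq_len, vocab]."""
    # Forward pass with all visual tokens kept.
    logits_full = model(input_ids=input_ids,
                        visual_embeds=visual_embeds).logits[:, -1]

    # Forward pass with the single visual token removed.
    keep = [i for i in range(visual_embeds.shape[1]) if i != token_idx]
    logits_pruned = model(input_ids=input_ids,
                          visual_embeds=visual_embeds[:, keep]).logits[:, -1]

    # Perturbation measured as KL divergence between the two next-token
    # distributions (one of several reasonable choices of divergence).
    log_p_full = F.log_softmax(logits_full, dim=-1)
    log_p_pruned = F.log_softmax(logits_pruned, dim=-1)
    return F.kl_div(log_p_pruned, log_p_full,
                    log_target=True, reduction="batchmean").item()
```

Sweeping this measurement over all visual tokens at each layer would, under these assumptions, expose the flattening profile the abstract calls the "information horizon": once per-token values become nearly uniform, selecting tokens by any saliency criterion should do no better than sampling them at random.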
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5964