SPLIT-VLM: Salience-Guided Partitioning towards Local Coverage for Importance-Aware Token Dropping in Vision-Language Models
Keywords: Vision-Language Model, Token Dropping, Multimodal Reasoning
Abstract: Large-scale vision–language models (VLMs) excel at multimodal reasoning, yet efficiency collapses when vision tokens (often orders of magnitude more numerous than text tokens) dominate compute and memory. Prior token-reduction strategies typically trade off salience (which is prone to position bias and incurs extra computation) against diversity (which can under-cover salient regions and is sensitive to hyperparameters). We present SPLIT, a theoretically grounded framework that jointly preserves salience and diversity while aggressively eliminating redundancy. SPLIT (i) estimates token importance via temporal shifts of hidden states across layers, eschewing attention scores and their biases; (ii) assigns adaptive region-level budgets to guarantee localized coverage; and (iii) selects tokens using a diversity score that prioritizes distinctive, non-redundant representations. Our analysis shows that adaptive budgeting yields tighter coverage guarantees than uniform allocation, and that our selection rule maintains diversity without costly tuning. Empirically, SPLIT consistently outperforms state-of-the-art methods on image and video understanding benchmarks. On image understanding with LLaVA-1.5-7B, SPLIT preserves over 99% accuracy with 192 vision tokens and about 92.8% with only 64 tokens, demonstrating robust performance under severe token budgets. These results indicate that SPLIT delivers scalable, attention-score-free token reduction that makes multimodal reasoning substantially more efficient without sacrificing accuracy.
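The three steps in the abstract (attention-free importance, adaptive region budgets, diversity-aware selection) can be illustrated with a minimal sketch. All names, shapes, and the greedy MMR-style selection rule below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def split_token_drop(h_prev, h_curr, grid, budget, lam=0.5):
    """Hedged sketch of a SPLIT-style token-dropping pipeline.

    h_prev, h_curr: (N, d) hidden states of N vision tokens at two layers.
    grid: (N,) region id per token (e.g. a spatial partition of the image).
    budget: total number of vision tokens to keep.
    lam: assumed trade-off between salience and redundancy penalty.
    """
    # (i) attention-free importance: magnitude of the hidden-state shift
    importance = np.linalg.norm(h_curr - h_prev, axis=1)

    # (ii) adaptive region budgets proportional to regional salience mass,
    # with at least one token per region to guarantee local coverage
    regions = np.unique(grid)
    mass = np.array([importance[grid == r].sum() for r in regions])
    alloc = np.maximum(1, np.floor(budget * mass / mass.sum()).astype(int))

    keep = []
    for r, k in zip(regions, alloc):
        idx = np.flatnonzero(grid == r)
        # (iii) greedy selection: salient yet non-redundant within the region
        feats = h_curr[idx]
        feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
        chosen = [int(np.argmax(importance[idx]))]
        while len(chosen) < min(k, len(idx)):
            sim = feats @ feats[chosen].T          # cosine sim to kept tokens
            score = lam * importance[idx] - (1 - lam) * sim.max(axis=1)
            score[chosen] = -np.inf                # never re-pick a kept token
            chosen.append(int(np.argmax(score)))
        keep.extend(idx[chosen].tolist())
    return sorted(keep)
```

Under this sketch, every region retains at least one token regardless of its salience mass, which is the intuition behind the tighter coverage guarantee of adaptive budgeting over uniform allocation.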
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15400