What You Say is What You See: Anchoring Visual Token Pruning on Textual Essentials for Efficient LVLM Inference

ACL ARR 2026 January Submission 1184 Authors

28 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LVLMs, Token Pruning, Efficient Inference
Abstract: Processing lengthy sequences of visual tokens incurs substantial computational overhead, presenting a critical bottleneck for Large Vision-Language Models (LVLMs). Existing token pruning methods face a fundamental dilemma: text-agnostic approaches ignore user instructions, while attention-based techniques suffer from text-visual semantic misalignment, where cross-attention maps fail to reliably localize query-relevant regions. To overcome these limitations, we introduce TextScythe, a plug-and-play framework that reframes compression as instruction distillation. Our core insight is to first distill the user instruction into a minimal set of vision-critical text tokens using a novel Entropy-Ratio (ER) metric, which quantifies the specificity and salience of cross-modal semantic correspondence. These distilled tokens then serve as precise anchors to select semantically relevant visual patches, after which a diversity-preserving mechanism supplements representative background tokens to maintain global context. This "understand-then-prune" paradigm ensures accurate alignment with user intent while effectively suppressing visual noise. Extensive experiments on 12 image and video benchmarks demonstrate that TextScythe achieves highly efficient compression, retaining 96.6% of the original accuracy while pruning up to 88.9% of visual tokens for LLaVA-1.5. The framework shows robust generalization across diverse VLM architectures and high-resolution settings, offering a practical acceleration solution without any modification to the transformer internals.
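
The abstract describes the understand-then-prune pipeline only at a high level. The sketch below is a minimal PyTorch reading of that description, not the paper's actual method: the similarity-based cross-attention, the entropy-ratio scoring formula, the farthest-point background sampling, and all names and hyperparameters (entropy_ratio_scores, prune_visual_tokens, n_anchor_text, n_keep_vis, n_background) are illustrative assumptions.

import torch
import torch.nn.functional as F

def entropy_ratio_scores(text_feats, vis_feats, eps=1e-8):
    """Score each text token by how specifically and strongly it attends to
    the visual tokens: a peaked (low-entropy) attention pattern with a high
    maximum suggests a vision-critical word. Illustrative stand-in only."""
    attn = F.softmax(text_feats @ vis_feats.T / vis_feats.shape[-1] ** 0.5, dim=-1)
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)              # (T,)
    max_entropy = torch.log(torch.tensor(float(vis_feats.shape[0])))
    salience = attn.max(dim=-1).values                              # (T,)
    return salience / (entropy / max_entropy + eps), attn

def prune_visual_tokens(text_feats, vis_feats, n_anchor_text=4,
                        n_keep_vis=64, n_background=8):
    """Understand-then-prune sketch: distill anchor text tokens, keep the
    visual patches they point at, then top up with diverse background tokens."""
    er, attn = entropy_ratio_scores(text_feats, vis_feats)
    anchors = er.topk(min(n_anchor_text, er.numel())).indices       # distilled text tokens
    relevance = attn[anchors].max(dim=0).values                     # best anchor attention per patch
    kept = set(relevance.topk(min(n_keep_vis, relevance.numel())).indices.tolist())

    # Diversity-preserving supplement: greedy farthest-point sampling over the
    # remaining visual features so some global/background context survives.
    feats = F.normalize(vis_feats, dim=-1)
    for _ in range(n_background):
        rest = [i for i in range(vis_feats.shape[0]) if i not in kept]
        if not rest:
            break
        sims = feats[rest] @ feats[sorted(kept)].T                  # (rest, kept)
        kept.add(rest[int(sims.max(dim=-1).values.argmin())])       # least similar to kept set
    return sorted(kept)

# Toy usage: 16 text tokens, 576 visual patches (a LLaVA-1.5-style grid).
text_feats, vis_feats = torch.randn(16, 256), torch.randn(576, 256)
print(len(prune_visual_tokens(text_feats, vis_feats)))              # 72 of 576 tokens kept

Whatever the exact scoring rule, a procedure of this shape operates purely on the token sequence before decoding, which is consistent with the abstract's claim that no modification to the transformer internals is required.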
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LVLMs, Token Pruning, Efficient Inference
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 1184