Keywords: Model Compression, Large Language Models, Structured Pruning
TL;DR: We propose a training-free and model-agnostic token pruning framework for vision–language models that delivers up to 1.4× faster inference with <1% accuracy loss.
Abstract: Transformer-based vision–language models (VLMs) have achieved state-of-the-art performance across a wide range of multimodal tasks, yet their high inference cost remains a major obstacle to scalability. We address the fundamental challenge of efficiently identifying the most informative visual tokens in VLMs—a key bottleneck for large-batch and long-sequence inference. Existing methods often rely on exhaustive or heuristic search strategies that become prohibitively slow or memory-intensive at deployment scale. We introduce Global-Local Diversity Selection (GLDS), a training-free, model-agnostic framework that performs computationally efficient token selection while explicitly balancing local importance with global coverage. To further enhance representational quality under aggressive pruning, GLDS incorporates a determinantal point process (DPP)–based diversity mechanism, ensuring that the retained subset captures both spatially and semantically diverse regions. This leads to consistent improvements across batch sizes and sequence lengths. GLDS accelerates both the prefill and decoding stages, achieving up to 1.75× speedup in prefill and 1.40× in decoding, while scaling to inference regimes that overwhelm conventional approaches. On image understanding benchmarks, it maintains performance with less than 1% absolute accuracy loss. To our knowledge, this is the first principled and scalable token-selection strategy to achieve a favorable efficiency–accuracy trade-off in VLMs, paving the way for practical deployment of accelerated multimodal transformers.
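Since the abstract only outlines the DPP-based diversity mechanism, the following minimal sketch illustrates what a greedy, DPP-style token selection balancing per-token importance with pairwise diversity could look like. The function name `greedy_dpp_select`, the cosine-similarity kernel, and the quality-scaled L-kernel are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def greedy_dpp_select(features, scores, k):
    """Greedily pick k tokens trading off importance against redundancy
    via a DPP-style log-determinant objective (illustrative sketch).

    features: (N, d) token embeddings
    scores:   (N,)  per-token importance scores
    returns:  indices of the k selected tokens
    """
    # Cosine-similarity kernel scaled by per-token quality (standard DPP L-kernel form).
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    S = f @ f.T
    q = scores / (scores.max() + 1e-8)
    L = q[:, None] * S * q[None, :]
    L = L + 1e-6 * np.eye(len(q))  # jitter for numerical stability

    selected, remaining = [], list(range(len(q)))
    for _ in range(min(k, len(q))):
        best_j, best_gain = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            # Marginal gain of adding token j: log det of the enlarged submatrix.
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        remaining.remove(best_j)
    return np.array(selected)

# Hypothetical usage: keep 64 of N visual tokens using attention-derived scores.
# kept = greedy_dpp_select(vision_tokens, attn_scores, k=64)
```

This naive greedy maximization of the log-determinant favors tokens that are both individually important (high score) and dissimilar to those already kept, which is one common way to realize the importance-plus-diversity trade-off the abstract describes.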
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19452