Keywords: Model Compression, Large Language Models, Structured Pruning
TL;DR: We propose a training-free and model-agnostic token pruning framework for vision–language models that delivers up to 1.4× faster inference with <1% accuracy loss.
Abstract: Transformer-based vision–language models (VLMs) have achieved state-of-the-art performance across a wide range of multimodal tasks, yet their high inference cost remains a major obstacle to scalability. We address the fundamental challenge of efficiently identifying the most informative visual tokens in VLMs—a key bottleneck for large-batch and long-sequence inference. Existing methods often rely on exhaustive or heuristic search strategies that become prohibitively slow or memory-intensive at deployment scale. We introduce Global-Local Diversity Selection (GLDS), a training-free, model-agnostic framework that performs computationally efficient token selection while explicitly balancing local importance with global coverage. To further enhance representational quality under aggressive pruning, GLDS incorporates a determinantal point process (DPP)–based diversity mechanism, ensuring that the retained subset captures both spatially and semantically diverse regions. This leads to consistent improvements across batch sizes and sequence lengths. GLDS accelerates both the prefill and decoding stages, achieving up to 1.75× speedup in prefill and 1.40× in decoding, while scaling to inference regimes that overwhelm conventional approaches. On image understanding benchmarks, it maintains performance with less than 1% absolute accuracy loss. To our knowledge, this is the first principled and scalable token-selection strategy to achieve a favorable efficiency–accuracy trade-off in VLMs, paving the way for practical deployment of accelerated multimodal transformers.
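Since the abstract only outlines the DPP-based diversity mechanism, the following minimal sketch illustrates what a greedy, DPP-style token selection balancing per-token importance with pairwise diversity could look like. The function name `greedy_dpp_select`, the cosine-similarity kernel, and the quality-scaled L-kernel are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def greedy_dpp_select(features, scores, k):
    """Greedily pick k tokens trading off importance against redundancy
    via a DPP-style log-determinant objective (illustrative sketch).

    features: (N, d) token embeddings
    scores:   (N,)  per-token importance scores
    returns:  indices of the k selected tokens
    """
    # Cosine-similarity kernel scaled by per-token quality (standard DPP L-kernel form).
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    S = f @ f.T
    q = scores / (scores.max() + 1e-8)
    L = q[:, None] * S * q[None, :]
    L = L + 1e-6 * np.eye(len(q))  # jitter for numerical stability

    selected, remaining = [], list(range(len(q)))
    for _ in range(min(k, len(q))):
        best_j, best_gain = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            # Marginal gain of adding token j: log det of the enlarged submatrix.
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        remaining.remove(best_j)
    return np.array(selected)

# Hypothetical usage: keep 64 of N visual tokens using attention-derived scores.
# kept = greedy_dpp_select(vision_tokens, attn_scores, k=64)
```

This naive greedy maximization of the log-determinant favors tokens that are both individually important (high score) and dissimilar to those already kept, which is one common way to realize the importance-plus-diversity trade-off the abstract describes.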
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19452