Keywords: Vision-Language Model, Visual Token Compression, Inference Acceleration
TL;DR: TokenNMS is a training-free, two-stage VLM pruning framework. It uses feature-space NMS to resolve Top-K's spatial bias and redundancy, effectively decoupling spatial diversity from semantic alignment.
Abstract: Vision-Language Models (VLMs) incur substantial inference overhead from the large number of visual tokens they process. Existing Top-$K$ pruning methods reduce this cost but suffer from severe spatial bias, information redundancy, and loss of crucial context. To address this, we propose TokenNMS, a training-free two-stage framework that reframes token reduction as deterministic feature-space Non-Maximum Suppression (NMS). TokenNMS combines query-agnostic spatial pruning with query-aware semantic filtering, enforcing similarity constraints to penalize semantic overlap. Extensive experiments demonstrate that our approach preserves spatially diverse representations while accelerating inference across diverse VLMs.
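To make the idea of feature-space NMS concrete, the following is a minimal sketch of greedy suppression over token features. It is not the authors' implementation: the function name `feature_nms`, the cosine-similarity criterion, the `sim_threshold` value, and the source of `scores` are all assumptions made for illustration. The abstract only states that the method is deterministic, works in feature space, and enforces similarity constraints.

```python
import numpy as np

def feature_nms(features, scores, keep, sim_threshold=0.8):
    """Greedy NMS in feature space (illustrative sketch, hypothetical API).

    features: (N, d) array of visual token features.
    scores:   (N,) array of importance scores (their source is an assumption).
    keep:     maximum number of tokens to retain.
    Returns the indices of the retained tokens, highest score first.
    """
    # Normalise rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    # Visit tokens from highest to lowest score; a stable sort keeps ties deterministic.
    order = np.argsort(-scores, kind="stable")

    kept = []
    for idx in order:
        if len(kept) >= keep:
            break
        # Suppress a token that is too similar to any already-kept token.
        if kept and np.max(unit[kept] @ unit[idx]) >= sim_threshold:
            continue
        kept.append(int(idx))
    return kept

# Two near-duplicate high-scoring tokens and two near-duplicate low-scoring ones.
feats = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
scores = np.array([0.9, 0.8, 0.3, 0.2])
print(feature_nms(feats, scores, keep=2))  # [0, 2]
```

In this toy example, plain Top-K with a budget of two would select tokens 0 and 1, which are nearly identical, whereas the suppression step skips token 1 and retains the dissimilar token 2. Under the same assumptions, a two-stage pipeline could call such a routine twice with different scores, once query-agnostic and once query-aware.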
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 44