Entropy-Select: Training-Free Local Entropy Token Compression for Video LLMs

12 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Video-language models; Vision-language models; Token selection; Entropy-guided pruning; Training-free; Video understanding
TL;DR: EntropySelect is a training-free token selector that uses local neighborhood entropy to identify semantically rich video tokens. It works with black-box VLMs using only exported features and enables faster inference through intelligent token pruning.
Abstract: Video Language Models (VLMs) emit visual tokens that grow linearly with video length while attention scales quadratically, making inference sensitive to token count and latency variance. In practice, an effective token compressor should be training-free to plug into strong off-the-shelf VLMs, architecture-agnostic to avoid model-specific hooks, and deliver selection runtime that is predictable regardless of the target retention ratio, properties that many prior methods lack. We introduce EntropySelect, a training-free, architecture-agnostic framework that ranks tokens by local neighborhood entropy, an information-theoretic measure of unpredictability relative to nearby tokens. We compute temperature-scaled similarity distributions within fixed spatial windows to obtain normalized entropy, fuse it with gradient-based structural saliency, enforce coverage via grid quotas, and allocate per-frame budgets by saliency. Because windows and grids are fixed and selection reduces to a single global sort, EntropySelect runs in O(N log N) with latency effectively decoupled from the retention ratio. Across MVBench, ActivityNet-QA, VideoMME, and EgoSchema, EntropySelect matches or exceeds the uncompressed baseline at moderate retention and degrades gracefully under aggressive compression, yielding predictable latency and plug-and-play deployment without any model modification or internal-state access. Notably, we observe “enhancement under compression” on several benchmarks: for example, 35% and 50% retention surpass the full-token baseline on benchmarks such as MVBench and EgoSchema, suggesting that removing low-entropy redundancy can improve downstream accuracy. These results establish local entropy as a principled criterion for video token selection and a practical tool for scalable VLM inference. Code will be released upon acceptance.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4486
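
Illustrative sketch (not the authors' released code): the abstract describes scoring each token by the normalized entropy of a temperature-scaled similarity distribution over a fixed spatial window, then selecting tokens with a single global sort. The NumPy example below is one possible reading of that description. The function names `local_entropy` and `select_tokens`, the cosine similarity, the window `radius`, the temperature `tau`, and the choice to exclude the centre token are all assumptions made for illustration. The gradient-based saliency fusion, grid quotas, and per-frame budgets mentioned in the abstract are omitted.

```python
import numpy as np


def local_entropy(feats, radius=1, tau=0.1):
    """Normalized neighborhood entropy for one frame.

    feats: (H, W, D) array of exported token features (assumed layout).
    Returns an (H, W) array with values in [0, 1].
    """
    H, W, D = feats.shape
    # Unit-normalize so dot products are cosine similarities (an assumption).
    unit = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    ent = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # Fixed spatial window, clipped at the frame border.
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            window = unit[i0:i1, j0:j1].reshape(-1, D)
            # Drop the centre token so only true neighbors remain.
            centre = (i - i0) * (j1 - j0) + (j - j0)
            neigh = np.delete(window, centre, axis=0)
            if len(neigh) < 2:
                continue  # entropy is undefined with fewer than 2 neighbors
            # Temperature-scaled softmax over similarities to neighbors.
            logits = neigh @ unit[i, j] / tau
            logits = logits - logits.max()
            p = np.exp(logits)
            p = p / p.sum()
            # Shannon entropy, divided by log(K) to land in [0, 1].
            h = -(p * np.log(np.clip(p, 1e-12, None))).sum()
            ent[i, j] = h / np.log(len(neigh))
    return ent


def select_tokens(video_feats, keep_ratio=0.35, radius=1, tau=0.1):
    """Keep the highest-entropy tokens across all frames.

    video_feats: (T, H, W, D) array. Returns sorted flat indices into the
    T*H*W token sequence. Scoring is linear in the number of tokens for a
    fixed window; the global sort contributes the O(N log N) term.
    """
    scores = np.stack(
        [local_entropy(f, radius, tau) for f in video_feats]
    ).ravel()
    k = max(1, int(round(keep_ratio * scores.size)))
    keep = np.argsort(-scores, kind="stable")[:k]
    return np.sort(keep)  # restore temporal/spatial order for the VLM
```

In this sketch the retention ratio only changes how many indices are sliced off after the sort, which is consistent with the abstract's claim that selection latency is decoupled from the retention ratio.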