SURGE: Surprise-Guided Token Reduction for Efficient Video Understanding with VLMs

ICLR 2026 Conference Submission 593 Authors

01 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Efficient Video Understanding, Vision-Language Models, Token Pruning, Redundancy Reduction, Predictive Coding
Abstract: Videos contain rich information but also high redundancy, as consecutive frames often share similar backgrounds and predictable motion. Current vision-language models (VLMs) cannot exploit this redundancy and therefore perform a substantial amount of superfluous computation, processing thousands of patch tokens even when little new information is present. What is missing is an on-the-fly, model-agnostic signal of temporal predictability that decides whether tokens carry unpredictable information meriting computation. We propose SURGE, a training-free and backbone-agnostic method that measures surprise in token space. Surprise scores are defined as the prediction error of each token given its recent history; high-surprise tokens are retained, while predictable ones are pruned. Aggregating scores over time produces a surprise curve that highlights key events, which can be further refined with CLIP-based query relevance to form a compact spatio-temporal mask. Experiments on multiple video understanding benchmarks show that SURGE reduces tokens by up to 7$\times$ and prefill cost by 86–98\%, while maintaining accuracy within $\pm$1 point of full-token baselines. By aligning computation with novelty, SURGE enables video VLMs to handle long contexts efficiently and without retraining.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 593
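
For readers who want the mechanics, below is a minimal PyTorch sketch of the pipeline the abstract describes: per-token surprise as prediction error against recent history, optional blending with CLIP-based query relevance, and top-k pruning to form the spatio-temporal mask. The mean-over-history predictor, the `history`, `alpha`, and `keep_ratio` values, and all function names are assumptions for illustration; the paper's actual predictor and scoring details are not specified in this abstract.

```python
# Illustrative sketch only: the predictor, parameters, and function names
# are assumptions, not the paper's actual implementation.
import torch
import torch.nn.functional as F

def surprise_scores(tokens: torch.Tensor, history: int = 4) -> torch.Tensor:
    """tokens: (T, N, D) patch embeddings for T frames, N tokens per frame.
    Returns (T, N) surprise scores: the prediction error of each token
    given its recent history at the same spatial position."""
    T, N, _ = tokens.shape
    scores = torch.zeros(T, N, device=tokens.device)
    for t in range(1, T):
        past = tokens[max(0, t - history):t]   # (h, N, D) recent history
        predicted = past.mean(dim=0)           # naive predictor: running mean
        scores[t] = (tokens[t] - predicted).norm(dim=-1)  # residual = surprise
    if T > 1:
        scores[0] = scores[1:].max()  # first frame has no history: fully novel
    return scores

def refine_with_query(scores, tokens, text_feat, alpha=0.5):
    """Blend surprise with query relevance (cosine similarity between each
    patch token and a text embedding of the query). Assumes patch tokens
    and text_feat live in a shared CLIP-style embedding space."""
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-6)
    rel = F.cosine_similarity(tokens, text_feat[None, None, :], dim=-1)
    return alpha * s + (1 - alpha) * rel

def prune(tokens, scores, keep_ratio=0.15):
    """Keep the top `keep_ratio` fraction of tokens by score; the rest
    are dropped before the LLM prefill."""
    T, N, D = tokens.shape
    k = max(1, int(keep_ratio * T * N))
    idx = scores.flatten().topk(k).indices
    return tokens.reshape(-1, D)[idx], idx

if __name__ == "__main__":
    toks = torch.randn(16, 256, 768)  # 16 frames, 256 patches, dim 768
    kept, idx = prune(toks, surprise_scores(toks), keep_ratio=0.15)
    print(kept.shape)                 # ~15% of the 4096 tokens survive
```

Any cheap temporal predictor (the previous frame's token, an exponential moving average, and so on) could stand in for the running mean here; the core idea is only that tokens whose prediction residual is small are predictable and can be dropped.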