Keywords: Dense Video Understanding Task, Dense Information Video Evaluation Benchmark, Gated Residual Tokenization
TL;DR: We introduce DIVE, the first benchmark for the novel task of dense video understanding, and GRT, a method for efficiently processing high-FPS videos with reduced token overhead.
Abstract: High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and evaluation benchmarks rely predominantly on low-frame-rate sampling, such as uniform sampling or frame selection, which discards dense temporal information. This compromise is made primarily to avoid the high cost of tokenizing every frame: frame-level tokenization incurs redundant computation, and the token count grows linearly with video length. The trade-off stems from engineering constraints in existing video understanding systems built around keyframe-based processing. Yet for tasks such as lecture or educational video comprehension, where information is distributed across nearly every frame, this compromise becomes a major limitation: these tasks require frame-by-frame reasoning and fine-grained temporal alignment, which current approaches cannot support, discouraging progress on high-frame-rate datasets and models. To address this gap, we introduce the novel task of Dense Video Understanding, which aims to enable video comprehension at high frame rates. Our goal is to reduce the tokenization time of high-FPS videos and to minimize the token overhead incurred by dense frame sampling. The lack of dense modeling also affects current benchmarks, whose question-answer pairs are typically designed around slowly changing content and are therefore insufficient for evaluating fine-grained temporal understanding. To this end, we propose DIVE (Dense Information Video Evaluation), the first benchmark specifically tailored to dense video understanding. To overcome the inefficiency of frame-wise tokenization, we propose Gated Residual Tokenization (GRT), a two-stage token acceleration and reduction framework that operates both during and after tokenization, addressing inefficiency at the inter-tokenization and intra-tokenization levels, respectively. First, Motion-Compensated Inter-Gated Tokenization applies pixel-level motion estimation and a gating mechanism during tokenization to identify and skip static regions, encoding only the moving patches; this yields sub-linear growth in both tokenization time and token count. Second, Semantic-Scene Intra-Tokenization Merging performs content-level token merging across static regions within a scene, further reducing redundancy while preserving dynamic semantic content. Extensive experiments on the DIVE benchmark show that our method not only outperforms larger VLLM baselines but also improves consistently as FPS increases. These results underscore the importance of preserving dense temporal information and demonstrate that GRT enables scalable, efficient high-FPS video understanding.
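To make the two-stage pipeline concrete, here is a minimal Python sketch of the idea the abstract describes: a motion gate that re-tokenizes only changed patches (stage 1), followed by similarity-based merging of near-duplicate static tokens (stage 2). All names and constants (`PATCH`, `MOTION_THRESH`, `MERGE_THRESH`, `tokenize_patch`) are hypothetical placeholders; the paper's actual method uses motion-compensated estimation and a learned gating mechanism, not this naive pixel-difference heuristic.

```python
# Minimal sketch of gated residual tokenization, under assumed placeholder
# thresholds and a stand-in tokenizer. Illustration only, not the authors' code.
import numpy as np

PATCH = 16           # assumed patch size in pixels
MOTION_THRESH = 2.0  # assumed mean |pixel diff| above which a patch counts as moving
MERGE_THRESH = 0.95  # assumed cosine similarity above which static tokens are merged


def to_patches(frame):
    """Split an HxWxC frame into a (gh, gw) grid of PATCH x PATCH patches."""
    H, W, C = frame.shape
    gh, gw = H // PATCH, W // PATCH
    return frame[: gh * PATCH, : gw * PATCH].reshape(gh, PATCH, gw, PATCH, C).swapaxes(1, 2)


def tokenize_patch(patch):
    """Stand-in for a real visual tokenizer (e.g. a ViT patch embedding)."""
    return patch.astype(np.float32).reshape(-1)


def gated_tokenize(frames):
    """Stage 1 (inter-gated): re-tokenize only patches that moved since the
    previous frame; static patches reuse their cached token, so tokenization
    work tracks motion content rather than frame rate."""
    prev, grids = None, []
    for frame in frames:
        patches = to_patches(frame)
        gh, gw = patches.shape[:2]
        grid = np.empty((gh, gw), dtype=object)
        for i in range(gh):
            for j in range(gw):
                moved = prev is None or np.abs(
                    patches[i, j].astype(np.float32) - prev[i, j].astype(np.float32)
                ).mean() > MOTION_THRESH
                grid[i, j] = tokenize_patch(patches[i, j]) if moved else grids[-1][i, j]
        grids.append(grid)
        prev = patches
    return grids


def merge_static_tokens(grids):
    """Stage 2 (intra-merging): per spatial location, keep a token only when it
    differs semantically from the last kept token there, collapsing duplicates."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    gh, gw = grids[0].shape
    last = [[None] * gw for _ in range(gh)]
    kept = []
    for grid in grids:
        for i in range(gh):
            for j in range(gw):
                tok = grid[i, j]
                if last[i][j] is None or cos(tok, last[i][j]) < MERGE_THRESH:
                    kept.append(tok)
                    last[i][j] = tok
    return kept


# Toy demo: a gray clip with one small moving block. Dense tokenization would
# produce 8 frames x 16 patches = 128 tokens; the gated + merged stream keeps fewer.
rng = np.random.default_rng(0)
frames = [np.full((64, 64, 3), 40, dtype=np.uint8) for _ in range(8)]
for t, f in enumerate(frames):
    f[:16, t * 4 : t * 4 + 16] = rng.integers(0, 256, (16, 16, 3), dtype=np.uint8)
print(len(merge_static_tokens(gated_tokenize(frames))), "tokens kept vs 128 dense")
```

In this toy, only patches overlapping the moving block are ever re-encoded or retained, so both tokenizer calls and kept tokens grow with the amount of motion rather than with FPS, which is the sub-linear scaling behavior the abstract claims for GRT.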
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7552