ZoomV: Temporal Zoom-in for Efficient Long Video Understanding

ICLR 2026 Conference Submission 3376 Authors

09 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Long video understanding, Zoom-in search, Large video language model
Abstract: Long video understanding poses a fundamental challenge for large video-language models (LVLMs) due to the overwhelming number of frames and the risk of losing essential context through naive downsampling. Inspired by the way humans watch videos on mobile phones, constantly zooming in on frames of interest, we propose $\textbf{ZoomV}$, a query-aware temporal zoom-in framework designed for efficient and accurate long video understanding. Specifically, ZoomV operates in three stages: (1) Temporal interest grounding: guided by the query, ZoomV retrieves relevant events and their associated temporal windows as candidates. (2) Event interest spotlighting: within the pool of candidate windows, each window is scored via the model's own self-reflection and filtered accordingly, with higher-confidence windows treated as more representative. (3) Compact representation: the selected events are encoded and temporally downsampled to preserve critical semantics while significantly reducing redundancy. Extensive experiments demonstrate that ZoomV substantially outperforms prior $\textbf{video-agent–style}$ approaches. On temporal grounding, ZoomV unlocks the latent temporal-grounding capability of LVLMs, achieving an $\textbf{11.8\%}$ $\textbf{mIoU}$ gain on Charades-STA. Remarkably, ZoomV further boosts accuracy on LVBench by $\textbf{9.7\%}$, underscoring its effectiveness on long-video benchmarks.
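To make the three-stage pipeline described above concrete, the following is a minimal illustrative sketch in Python. All names here (the `lvlm` object and its methods `ground_temporal_interests`, `score_window_by_self_reflection`, `encode_and_downsample`, `generate_answer`, and the `top_k` parameter) are hypothetical placeholders assumed for illustration, not the authors' actual implementation or API.

```python
# Hypothetical sketch of the ZoomV three-stage zoom-in pipeline from the abstract.
# Every identifier below is an illustrative placeholder, not the paper's real code.

from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    start: float           # window start time in seconds
    end: float             # window end time in seconds
    confidence: float = 0.0


def zoomv_answer(video_frames: List, query: str, lvlm, top_k: int = 3) -> str:
    """Answer a query about a long video by zooming in on relevant events."""
    # Stage 1: Temporal interest grounding -- guided by the query, the LVLM
    # proposes candidate temporal windows likely to contain relevant events.
    candidates: List[Window] = lvlm.ground_temporal_interests(video_frames, query)

    # Stage 2: Event interest spotlighting -- each candidate window is scored
    # via the model's own self-reflection; higher-confidence windows are kept.
    for w in candidates:
        w.confidence = lvlm.score_window_by_self_reflection(video_frames, query, w)
    spotlighted = sorted(candidates, key=lambda w: w.confidence, reverse=True)[:top_k]

    # Stage 3: Compact representation -- selected events are encoded and
    # temporally downsampled to preserve key semantics while cutting redundancy.
    compact_tokens = [lvlm.encode_and_downsample(video_frames, w) for w in spotlighted]

    # Final answer is generated from the compact, query-aware representation.
    return lvlm.generate_answer(query, compact_tokens)
```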
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3376