Zoom-Zero: Coarse-to-Fine Video Understanding with Token-Selective Optimization

TMLR Paper9791 Authors

16 Jun 2026 (modified: 19 Jun 2026) | Under review for TMLR | CC BY 4.0
Abstract: Grounded video question answering (GVQA) aims to localize the relevant temporal segments in a video and generate an accurate answer to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limitations of GRPO on the GVQA task with two distinct contributions beyond prior work: (i) frame saliency self-verification, which validates the fidelity of temporal grounding predictions via fine-grained visual checks on the grounded frames; and (ii) token-selective credit assignment, which attributes credit to the tokens responsible for temporal localization or answer generation, mitigating GRPO’s difficulty in handling multi-faceted reward signals. Zoom-Zero improves temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also improving average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in at inference time benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks. Our code will be made publicly available.
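The token-selective credit assignment named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the function names, the three role labels, the zero advantage for "other" tokens, and the use of a plain GRPO-style group normalization are all assumptions made here for illustration. The idea shown is only that each reward (grounding, answer) is normalized within a group of rollouts and then applied to the tokens tagged as responsible for that facet, rather than being summed into one scalar shared by every token.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one group of rollouts:
    (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_selective_advantages(token_roles, grounding_rewards, answer_rewards):
    """Assign each token the advantage of the reward it is responsible for.

    token_roles: one list per rollout, each entry "ground", "answer", or "other"
                 (hypothetical labels; how tokens are tagged is not specified
                 in the abstract).
    grounding_rewards, answer_rewards: one scalar per rollout.
    """
    adv_ground = group_relative_advantages(grounding_rewards)
    adv_answer = group_relative_advantages(answer_rewards)

    per_token = []
    for roles, g, a in zip(token_roles, adv_ground, adv_answer):
        row = []
        for role in roles:
            if role == "ground":
                row.append(g)
            elif role == "answer":
                row.append(a)
            else:
                # Assumption: tokens tied to neither facet receive no credit.
                row.append(0.0)
        per_token.append(row)
    return per_token


if __name__ == "__main__":
    roles = [["other", "ground", "answer"], ["other", "ground", "answer"]]
    # Rollout 0 grounds well but answers wrong; rollout 1 is the reverse.
    advantages = token_selective_advantages(roles, [1.0, 0.0], [0.0, 1.0])
    for row in advantages:
        print(row)
```

In this toy group, a single summed reward would be identical for both rollouts and give zero advantage everywhere, whereas the per-facet assignment rewards the grounding tokens of the first rollout and the answer tokens of the second.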
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Xuming_Hu1
Submission Number: 9791