CACR: Reinforcing Temporal Answer Grounding in Videos via Candidate-Aware Causal Reasoning

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Temporal Answer Grounding in Instructional Video
TL;DR: Temporal Answer Grounding in Instructional Video
Abstract: The growing need for direct answer retrieval from videos underscores the importance of Temporal Answer Grounding in Videos (TAGV)—the task of localizing the specific video segment that answers a natural language query. TAGV remains challenging, as it requires understanding semantically complex questions and handling the extreme length disparity between untrimmed videos and short answer segments. Current methods often underperform due to sensitivity to redundant content or limited visual reasoning ability. To overcome these issues, we introduce a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-trained (VLP) model to efficiently generate K candidate segments, then applies a temporal logic reasoning module strengthened by a rejection reward mechanism and optimized through Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on four benchmarks show that our method achieves state-of-the-art performance in mean Intersection-over-Union (mIoU), offering a new direction for reasoning-based retrieval in long videos. We also publish our code at: https://github.com/anonymous1118-10/opencode-CACR
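The abstract reports results in mean Intersection-over-Union (mIoU). As an illustration only (not the authors' implementation), temporal IoU between a predicted answer segment and a ground-truth segment, and its mean over a dataset, are conventionally computed as in this minimal sketch:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, e.g. in seconds.

    Intersection is the overlap length; union is the combined span.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """mIoU: average temporal IoU over all query-video pairs."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, giving an IoU of 1/3.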
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4843