CACR: Reinforcing Temporal Answer Grounding in Videos via Candidate-Aware Causal Reasoning

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Temporal Answer Grounding in Instructional Video
TL;DR: Temporal Answer Grounding in Instructional Video
Abstract: The growing need for direct answer retrieval from videos underscores the importance of Temporal Answer Grounding in Videos (TAGV)—the task of localizing the specific video segment that answers a natural language query. TAGV remains challenging, as it requires understanding semantically complex questions and handling the extreme length disparity between untrimmed videos and short answer segments. Current methods often underperform due to sensitivity to redundant content or limited visual reasoning ability. To overcome these issues, we introduce a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-trained (VLP) model to efficiently generate K candidate segments, then applies a temporal logic reasoning module strengthened by a rejection reward mechanism and optimized through Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on four benchmarks show that our method achieves state-of-the-art performance in mean Intersection-over-Union (mIoU), offering a new direction for reasoning-based retrieval in long videos. We also publish our code at: https://github.com/anonymous1118-10/opencode-CACR
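The abstract reports results in mean Intersection-over-Union (mIoU). As an illustration only (not the authors' implementation), temporal IoU between a predicted answer segment and a ground-truth segment, and its mean over a dataset, are conventionally computed as in this minimal sketch:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, e.g. in seconds.

    Intersection is the overlap length; union is the combined span.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """mIoU: average temporal IoU over all query-video pairs."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, giving an IoU of 1/3.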
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4843