Keywords: Video Temporal Grounding, Causal Inference, Vision-Language Understanding
TL;DR: We propose a causal framework for video temporal grounding that mitigates confounding biases and improves robustness to linguistic variations and irrelevant queries.
Abstract: Video Temporal Grounding (VTG) aims to localize relevant segments in untrimmed videos based on natural language queries and has seen notable progress in recent years.
However, most existing methods suffer from two critical limitations.
First, they are prone to learning superficial co-occurrence patterns induced by dataset biases, such as associating specific objects or phrases with certain events, which ultimately degrades their semantic understanding.
Second, they typically assume that relevant segments always exist in the video, an assumption misaligned with real-world scenarios where queried content may be absent.
Fortunately, causal inference offers a natural remedy for both issues: disentangling dataset-induced biases and enabling counterfactual reasoning about query relevance.
To this end, we propose CausalVTG, a novel framework that explicitly integrates causal reasoning into VTG.
Specifically, we introduce a causality-aware disentangled encoder (CADE) based on front-door adjustment to mitigate confounding biases in the visual and textual modalities.
To capture events at multiple temporal granularities, we design a multi-scale temporal perception module (MSTP) that reconstructs query-conditioned video features at multiple resolutions.
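For reference, the front-door adjustment underlying CADE takes the standard textbook form below; mapping X to the raw multimodal features, M to the disentangled mediator representation, and Y to the grounding prediction is our illustrative reading, not notation taken from the paper.

```latex
% Front-door adjustment (textbook form).
% Illustrative reading: X = raw visual/textual features,
% M = disentangled mediator representation, Y = grounding prediction.
P\big(Y \mid \mathrm{do}(X = x)\big)
  = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid m, x')\, P(x')
```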
Additionally, a counterfactual contrastive learning objective is employed to help the model discern whether a query is truly grounded in a video.
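The abstract leaves the counterfactual contrastive objective unspecified; the snippet below is a minimal PyTorch sketch under the assumption of an InfoNCE-style loss in which counterfactual (query-irrelevant) videos serve as negatives. The function name, tensor shapes, and temperature value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def counterfactual_contrastive_loss(q, v_pos, v_neg, tau=0.07):
    """InfoNCE-style sketch: pull query embeddings toward videos that
    actually contain the described moment (positives) and push them away
    from counterfactual videos where the query is not grounded (negatives).

    q:     (B, D) query embeddings
    v_pos: (B, D) embeddings of videos that ground the query
    v_neg: (B, K, D) embeddings of counterfactual / irrelevant videos
    """
    q = F.normalize(q, dim=-1)
    v_pos = F.normalize(v_pos, dim=-1)
    v_neg = F.normalize(v_neg, dim=-1)

    pos = (q * v_pos).sum(-1, keepdim=True) / tau      # (B, 1)
    neg = torch.einsum('bd,bkd->bk', q, v_neg) / tau   # (B, K)
    logits = torch.cat([pos, neg], dim=1)              # positive at index 0
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```

In practice, the counterfactual negatives could be constructed cheaply from mismatched query-video pairs within a batch, though the paper may use a different sampling strategy.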
Extensive experiments on five widely used benchmarks show that CausalVTG outperforms state-of-the-art methods, achieving higher localization precision under stricter IoU thresholds and identifying more accurately whether a query is actually grounded in the video.
These results highlight both the effectiveness and the generalizability of the proposed CausalVTG.
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 16006