Abstract: Video Temporal Grounding (VTG) localizes moments in untrimmed videos using natural language queries. Most VTG datasets focus on short videos, and existing approaches excel at short-term cross-modal matching but struggle with long VTG, where complex events require long-range temporal reasoning. Existing approaches typically output timestamp predictions without intermediate steps, limiting effective reasoning, whereas humans solve this task step by step. To address this, we propose a long VTG framework, StepVTG, with multimodal visual and speech inputs, leveraging Large Language Models (LLMs) for step-by-step reasoning. Specifically, we transform task descriptions, speech, and visual inputs into text prompts. To enhance temporal reasoning, we introduce the Boundary-Perceptive Prompting strategy, which includes: i) a multiscale denoising Chain-of-Thought (CoT) combining global and local semantics with noise filtering, ii) validity principles to ensure LLMs generate reasonable, parsable predictions, and iii) one-shot In-Context Learning (ICL) to improve reasoning via imitation. For evaluation, we establish MM-LVTG, a new long VTG benchmark with multimodal inputs, and demonstrate through extensive experiments that StepVTG achieves state-of-the-art performance. It offers explainable reasoning steps for its predictions and shows potential for facilitating video understanding with off-the-shelf LLMs.
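To make the described prompting pipeline concrete, the sketch below shows one plausible way to assemble a step-by-step grounding prompt from visual captions and a speech transcript, and to parse a validity-constrained answer. This is not the authors' code: the function names (`build_prompt`, `parse_answer`), the answer format, and the specific validity rules are illustrative assumptions.

```python
# Minimal sketch of a StepVTG-style prompting pipeline (illustrative, not the paper's implementation).
# Assumes an external `call_llm(prompt) -> str` function and pre-extracted
# per-window visual captions and ASR transcripts with timestamps.
import re
from typing import List, Tuple

VALIDITY_RULES = (
    "After listing your reasoning steps, answer on a single line as "
    "'ANSWER: <start_sec>-<end_sec>', with 0 <= start < end <= the video duration."
)

def build_prompt(query: str,
                 duration: float,
                 captions: List[Tuple[float, float, str]],  # (start, end, caption)
                 speech: List[Tuple[float, float, str]],    # (start, end, transcript)
                 example: str) -> str:
    """Assemble a text prompt from the task description, a one-shot example,
    multimodal evidence (visual captions + speech), and validity rules."""
    visual = "\n".join(f"[{s:.0f}-{e:.0f}s] {c}" for s, e, c in captions)
    audio = "\n".join(f'[{s:.0f}-{e:.0f}s] "{t}"' for s, e, t in speech)
    return (
        f"Task: localize the moment matching the query in a {duration:.0f}s video.\n"
        f"Example:\n{example}\n\n"
        f"Visual captions:\n{visual}\n\nSpeech:\n{audio}\n\n"
        f"Query: {query}\n"
        "Think step by step: first select coarse segments that may be relevant, "
        "then filter out noisy segments, then refine to exact boundaries.\n"
        f"{VALIDITY_RULES}"
    )

def parse_answer(response: str, duration: float) -> Tuple[float, float]:
    """Parse 'ANSWER: <start>-<end>' and clamp the prediction to a valid range."""
    m = re.search(r"ANSWER:\s*([\d.]+)\s*-\s*([\d.]+)", response)
    if not m:
        return 0.0, duration  # fall back to the whole video if unparsable
    start, end = sorted(float(x) for x in m.groups())
    return max(0.0, start), min(duration, end)
```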
External IDs: dblp:conf/icmcs/ChenWCFSJZ25