Abstract: In this paper, we introduce a novel training-free framework for Video Temporal Grounding (VTG) that combines pre-trained Visual Language Models (VLMs) and Large Language Models (LLMs). Existing methods often struggle to capture the semantics of natural language queries and to identify the dynamic transitions at event boundaries. To address these challenges, our approach uses VLMs to generate detailed contextual descriptions of video content, which serve as richer prompts that help LLMs understand and reason about the temporal relations between events. Furthermore, we introduce an adaptive event boundary refinement strategy that better covers all phases of an event. Our framework demonstrates superior performance in zero-shot settings on several benchmark datasets, including Charades-STA and ActivityNet Captions, and exhibits remarkable robustness in out-of-distribution (OOD) scenarios.