Keywords: Video Understanding, MLLM
TL;DR: We introduce a new problem, Task-oriented Temporal Grounding (ToTG), and propose TimeScope, a novel framework built upon progressive reasoning.
Abstract: Identifying key moments in long videos is essential for downstream understanding and reasoning tasks. In this paper, we introduce a new problem, Task-oriented Temporal Grounding (\textbf{ToTG}), which aims to localize time intervals containing the necessary information based on a task’s natural description. Alongside the problem definition, we present \textbf{ToTG-Bench}, a comprehensive benchmark for evaluating performance on ToTG. ToTG is particularly challenging for traditional approaches due to their limited generalizability and difficulty in handling long videos. To address these challenges, we propose \textbf{TimeScope}, a novel framework built upon progressive reasoning. TimeScope first identifies a coarse-grained temporal scope in the long video that likely contains the key moments, and then refines this scope through fine-grained moment partitioning. Additionally, we curate a high-quality dataset, namely \textbf{ToTG-Pile}, to enhance TimeScope’s ability to perform progressive temporal grounding effectively. Extensive experiments demonstrate that TimeScope consistently outperforms both existing temporal-grounding methods and popular MLLMs across various settings, highlighting its effectiveness in addressing this challenging new problem.
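To make the coarse-to-fine idea concrete, here is a minimal sketch of a two-stage progressive grounding loop. This is not TimeScope's actual implementation: the `score_window` relevance scorer (standing in for an MLLM query over a video segment) and the window widths are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Interval:
    start: float  # seconds
    end: float    # seconds

def progressive_grounding(
    video_duration: float,
    score_window: Callable[[Interval], float],  # hypothetical relevance scorer (e.g., an MLLM query)
    coarse_window: float = 60.0,                # assumed coarse scope granularity, in seconds
    fine_window: float = 5.0,                   # assumed fine moment granularity, in seconds
) -> Interval:
    """Two-stage grounding: pick the best coarse scope, then refine within it."""

    def best_window(span: Interval, width: float) -> Interval:
        # Slide a fixed-width window over `span` with 50% overlap and keep the top scorer.
        step = width / 2
        candidates: List[Interval] = []
        t = span.start
        while t < span.end:
            candidates.append(Interval(t, min(t + width, span.end)))
            t += step
        return max(candidates, key=score_window)

    # Stage 1: coarse-grained temporal scope over the full video.
    scope = best_window(Interval(0.0, video_duration), coarse_window)
    # Stage 2: fine-grained moment partitioning within the coarse scope.
    return best_window(scope, fine_window)

if __name__ == "__main__":
    # Toy scorer whose relevance peaks around t = 130 s.
    def toy_score(iv: Interval) -> float:
        mid = (iv.start + iv.end) / 2
        return -abs(mid - 130.0)

    moment = progressive_grounding(video_duration=600.0, score_window=toy_score)
    print(f"predicted moment: [{moment.start:.1f}s, {moment.end:.1f}s]")
```

The point of the two stages is cost: the expensive scorer is applied densely only inside the small coarse scope, rather than at fine granularity across the entire long video.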
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12659