Keywords: video understanding, video temporal grounding
Abstract: Video temporal grounding aims to identify the start and end timestamps of a target event in videos of varying duration. Most existing methods are trained mainly on short videos through supervised fine-tuning. They struggle with long videos, which exhibit diverse data distributions and thus require reasoning over semantic cues.
To address this challenge, we propose VTG-Reasoner, a reinforcement fine-tuning framework that enhances the model's reasoning ability for long video temporal grounding. Instead of directly supervising model outputs, VTG-Reasoner explores multiple temporal grounding predictions from video contexts through an explicit reasoning process. These explored predictions are then scored by our proposed IoU and Intersection Compactness reward, which is used to optimize the model.
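For concreteness, a minimal sketch of such a reward is given below, assuming standard temporal IoU over (start, end) intervals. The abstract does not define the Intersection Compactness term, so the compactness ratio and the weighting coefficient `alpha` here are purely illustrative assumptions, not the paper's exact formulation.

```python
def temporal_iou(pred, gt):
    """Standard temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def compactness(pred, gt):
    """Illustrative 'Intersection Compactness': fraction of the prediction
    covered by the ground truth. Assumption only; the paper's exact
    definition is not given in the abstract."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    length = pred[1] - pred[0]
    return inter / length if length > 0 else 0.0

def grounding_reward(pred, gt, alpha=0.5):
    """Combined reward for one explored prediction; alpha is a
    hypothetical weighting coefficient."""
    return (1 - alpha) * temporal_iou(pred, gt) + alpha * compactness(pred, gt)

# Example: an explored prediction (12s, 20s) against ground truth (10s, 18s).
print(grounding_reward((12.0, 20.0), (10.0, 18.0)))
```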
To further improve reasoning performance, we replace absolute timestamps with relative frame numbers, providing a unified temporal representation for videos of varying duration. Quantitative results demonstrate that VTG-Reasoner achieves superior zero-shot performance on four long video temporal grounding benchmarks, outperforming SFT-based models trained with 20$\times$ the amount of data.
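As a rough sketch of this unified representation, the mapping below converts an absolute timestamp to an index over a fixed number of uniformly sampled frames, so that videos of any duration share the same index range. The frame count and the uniform-sampling assumption are ours; the abstract does not specify them.

```python
def to_relative_frame(t_seconds, duration, num_frames):
    """Map an absolute timestamp to an index over num_frames uniformly
    sampled frames, yielding a duration-independent representation."""
    frac = min(max(t_seconds / duration, 0.0), 1.0)
    return round(frac * (num_frames - 1))

def to_timestamp(frame_idx, duration, num_frames):
    """Inverse mapping used to decode a predicted frame index back to seconds."""
    return frame_idx / (num_frames - 1) * duration

# Example: second 90 of a 10-minute video with 64 sampled frames -> index 9.
print(to_relative_frame(90.0, 600.0, 64))
```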
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7009