Keywords: video understanding, video temporal grounding
Abstract: Video temporal grounding aims to identify the start and end timestamps of a target event in videos of varying duration. Most existing methods are trained mainly on short videos through supervised fine-tuning. They struggle with long videos, which exhibit diverse data distributions and thus require reasoning over semantic cues.
To address this challenge, we propose VTG-Reasoner, a reinforcement fine-tuning framework that enhances the model's reasoning ability for long video temporal grounding. Instead of directly supervising model outputs, VTG-Reasoner explores multiple temporal grounding predictions from video contexts through an explicit reasoning process. These explored predictions are then scored by our proposed IoU and Intersection Compactness reward, which is used to optimize the model.
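For concreteness, a minimal sketch of such a reward is given below, assuming standard temporal IoU over (start, end) intervals. The abstract does not define the Intersection Compactness term, so the compactness ratio and the weighting coefficient `alpha` here are purely illustrative assumptions, not the paper's exact formulation.

```python
def temporal_iou(pred, gt):
    """Standard temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def compactness(pred, gt):
    """Illustrative 'Intersection Compactness': fraction of the prediction
    covered by the ground truth. Assumption only; the paper's exact
    definition is not given in the abstract."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    length = pred[1] - pred[0]
    return inter / length if length > 0 else 0.0

def grounding_reward(pred, gt, alpha=0.5):
    """Combined reward for one explored prediction; alpha is a
    hypothetical weighting coefficient."""
    return (1 - alpha) * temporal_iou(pred, gt) + alpha * compactness(pred, gt)

# Example: an explored prediction (12s, 20s) against ground truth (10s, 18s).
print(grounding_reward((12.0, 20.0), (10.0, 18.0)))
```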
To further improve reasoning performance, we replace absolute timestamps with relative frame numbers, providing a unified temporal representation for videos of varying duration. Quantitative results demonstrate that VTG-Reasoner achieves superior zero-shot performance on four long video temporal grounding benchmarks, outperforming SFT-based models trained with 20$\times$ the amount of data.
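As a rough sketch of this unified representation, the mapping below converts an absolute timestamp to an index over a fixed number of uniformly sampled frames, so that videos of any duration share the same index range. The frame count and the uniform-sampling assumption are ours; the abstract does not specify them.

```python
def to_relative_frame(t_seconds, duration, num_frames):
    """Map an absolute timestamp to an index over num_frames uniformly
    sampled frames, yielding a duration-independent representation."""
    frac = min(max(t_seconds / duration, 0.0), 1.0)
    return round(frac * (num_frames - 1))

def to_timestamp(frame_idx, duration, num_frames):
    """Inverse mapping used to decode a predicted frame index back to seconds."""
    return frame_idx / (num_frames - 1) * duration

# Example: second 90 of a 10-minute video with 64 sampled frames -> index 9.
print(to_relative_frame(90.0, 600.0, 64))
```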
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7009