TAR-TVG: Enhancing LVLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding

Chaohong Guo; Xun Mo; Yongwei Nie; Xuemiao Xu; Chao Xu; Fei Yu; Chengjiang Long

18 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0

Keywords: Reinforcement Learning, Temporal Video Grounding

TL;DR: TAR-TVG is a reinforcement learning framework that improves Temporal Video Grounding by introducing timestamp anchors to guide reasoning, boosting both accuracy and interpretability.

Abstract: Temporal video grounding aims to localize the video segments relevant to a given query. Large Vision-Language Models (LVLMs) can address this task by taking a video and a query as input and outputting the corresponding time span. Recently, some methods have fine-tuned LVLMs with reinforcement learning (RL), encouraging them to generate reasoning traces for better interpretability. These methods also prompt the model to include `<timestamp></timestamp>` tags in the reasoning process to strengthen the connection between the reasoning and the final output. However, such prompts only implicitly guide the model to output timestamp tags, often leading to missing, incorrectly formatted, or irrelevant tags. To address this issue, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG). By designing reinforcement learning reward functions, we explicitly enforce the inclusion of timestamp tags as anchors within the reasoning traces, providing explicit format control and accuracy validation based on soft IoU. Furthermore, when multiple timestamp anchors appear, the reward function is designed to ensure that the accuracy of these anchors improves progressively, mimicking the human thought process of refining from coarse to fine. These additional constraints on timestamp anchors encourage the model to better understand the task of temporal video grounding, thereby improving its grounding performance. For training, we first run an RL stage purely for data collection. The collected samples are then used for supervised fine-tuning (SFT) of a fresh base model, and we finally apply RL fine-tuning to the SFT-initialized model. Experiments show that our model achieves state-of-the-art performance while producing verifiable reasoning chains with progressively refined temporal estimations.
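
As a rough illustration of the anchor reward described in the abstract, the following Python sketch scores a reasoning trace on three signals: whether well-formed `<timestamp>` anchors are present, how accurate they are on average, and whether their accuracy is non-decreasing from one anchor to the next. This is a minimal sketch under stated assumptions, not the authors' implementation: the `start to end` tag format, the function names `interval_iou` and `anchor_reward`, and the unweighted reward components are all hypothetical, and plain temporal IoU stands in for the soft IoU the abstract mentions, whose exact definition is not given here.

```python
import re

# Assumed anchor format: <timestamp>START to END</timestamp>, in seconds.
# The actual tag contents used by TAR-TVG are not specified in the abstract.
TAG = re.compile(
    r"<timestamp>\s*(\d+(?:\.\d+)?)\s*to\s*(\d+(?:\.\d+)?)\s*</timestamp>"
)


def interval_iou(pred, gt):
    """Plain temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def anchor_reward(reasoning, gt):
    """Score the timestamp anchors found in a reasoning trace.

    Returns a dict with three hypothetical components:
      format      - 1.0 if at least one well-formed anchor is present
      accuracy    - mean IoU of the anchors against the ground truth
      progressive - 1.0 if anchor IoUs never decrease (coarse to fine)
    """
    anchors = [
        (float(s), float(e))
        for s, e in TAG.findall(reasoning)
        if float(s) < float(e)  # ignore anchors with an empty or reversed span
    ]
    if not anchors:
        return {"format": 0.0, "accuracy": 0.0, "progressive": 0.0}

    ious = [interval_iou(a, gt) for a in anchors]
    # A single anchor trivially satisfies the progressive check.
    refined = all(later >= earlier for earlier, later in zip(ious, ious[1:]))
    return {
        "format": 1.0,
        "accuracy": sum(ious) / len(ious),
        "progressive": 1.0 if refined else 0.0,
    }


trace = (
    "The person enters around <timestamp>0 to 20</timestamp>, "
    "and sits down during <timestamp>5 to 15</timestamp>."
)
print(anchor_reward(trace, (5.0, 15.0)))
```

In this example the first anchor is a coarse guess and the second matches the ground truth, so the progressive component is satisfied. How the components would be weighted and combined with the final-answer reward is a design choice the abstract leaves open.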

Primary Area: foundation or frontier models, including LLMs

Submission Number: 10093
