Keywords: large vision-language models, temporal video grounding, reinforcement learning, post-training
TL;DR: We present Time-R1, a reinforcement learning-based post-training framework that achieves state-of-the-art temporal video grounding performance among large vision-language models.
Abstract: Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their ability to generalize remains limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL).
Specifically, our contributions span three key directions:
(1) Time-R1: we introduce a reasoning-guided post-training framework, trained via RL with verifiable rewards, to enhance the capabilities of LVLMs on the TVG task; a minimal sketch of such a reward follows the abstract.
(2) TimeRFT: we curate an RL-friendly dataset and explore post-training strategies on it that train the model to progressively comprehend more difficult samples, leading to better generalization.
(3) TVGBench: we carefully construct a small yet comprehensive and balanced benchmark for LVLM evaluation, sourced from publicly available benchmarks.
Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using significantly less training data than prior LVLM approaches, while improving its general video understanding capabilities.
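As an illustration of the verifiable reward in contribution (1), here is a minimal sketch, assuming a rule-based reward that combines a format check with the temporal IoU between the predicted and ground-truth segments; the function names and reward weighting are hypothetical and may differ from the paper's exact implementation.

```python
# Minimal sketch of a verifiable reward for temporal video grounding.
# Assumption: the reward sums a format check and temporal IoU against the
# ground-truth segment; names and the 0.5 weight here are hypothetical.

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two time segments, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def verifiable_reward(pred: tuple, gt: tuple, has_valid_format: bool) -> float:
    """Rule-based reward: verifiable because it is computed directly from the
    annotation, with no learned reward model."""
    format_reward = 0.5 if has_valid_format else 0.0
    return format_reward + temporal_iou(pred, gt)
```

Because the reward is computed deterministically from the annotation, it can drive policy-gradient post-training without the reward hacking risks of a learned reward model.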
Project Page: https://xuboshen.github.io/Time-R1/.
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 12224