Keywords: video understanding, temporal grounding, multimodal large language model (MLLM)
TL;DR: We present LongVTG-R1, the first RL-based framework for long-video temporal grounding, which leverages novel regularization, reward design, and dataset construction to achieve state-of-the-art performance and generalize to QA tasks.
Abstract: We present the first Reinforcement Learning (RL)-based framework that equips Multimodal Large Language Models (MLLMs) with long-video temporal grounding skills, and demonstrate that this approach also generalizes to improve performance on general question-answering (QA) tasks. Unlike dominant supervised fine-tuning (SFT) methods, RL enables models to acquire temporal grounding abilities without risking catastrophic forgetting of their core understanding capabilities. However, adopting RL for long-video temporal grounding reveals a challenge in balancing exploitation of pre-trained knowledge with exploration of new localization skills. To address this, we propose Token-aware KL Regularization, which selectively relaxes the KL-divergence regularization on timestamp-related tokens to guide exploration. Moreover, effective optimization requires a learning signal that alleviates the sparsity of key events in long videos, for which we introduce a denser reward, the Center Distance Reward (CenDist). To further mitigate grounding ambiguity between language queries and visually similar content, and to facilitate effective RL training, we propose an automatic data construction method and build a small but high-quality dataset, SceneTG. Our resulting model, QwenLongTG, achieves substantial improvements among efficiently fine-tuned MLLMs on three long-video temporal grounding benchmarks, and even approaches the performance of densely pre-trained or continually trained models. Beyond temporal grounding, we further verify its generalization to long-video QA: under a "Ground-then-Answer" strategy, QwenLongTG consistently enhances downstream QA performance, serving as an effective first-stage grounding module.
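To illustrate why a center-distance reward is denser than a standard IoU reward, the following sketch compares the two. The exact CenDist formulation is not given in this abstract, so the normalized linear form below (`cendist_reward`, `iou_reward`, and the segment values) is a hypothetical illustration, not the paper's definition:

```python
def iou_reward(pred, gt):
    """Standard temporal IoU reward: sparse, zero whenever the
    predicted and ground-truth segments do not overlap at all."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def cendist_reward(pred, gt, video_duration):
    """Illustrative center-distance reward: decays linearly as the
    predicted segment's center drifts from the ground-truth center,
    so it remains informative even with zero overlap."""
    c_pred = 0.5 * (pred[0] + pred[1])
    c_gt = 0.5 * (gt[0] + gt[1])
    return 1.0 - abs(c_pred - c_gt) / video_duration

# Disjoint segments in a 10-minute video: IoU gives no learning signal,
# while the center-distance reward still indicates how far off we are.
pred, gt = (10.0, 20.0), (100.0, 110.0)
print(iou_reward(pred, gt))             # -> 0.0
print(cendist_reward(pred, gt, 600.0))  # -> 0.85
```

This density matters in long videos, where key events are sparse and a randomly explored segment almost never overlaps the target, leaving an IoU-only reward at zero for most rollouts.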
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15694