Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding
Keywords: Online Video Understanding, Multimodal Large Language Model
Abstract: The rapid growth of online video platforms has resulted in vast amounts of streaming and surveillance content, creating an urgent demand for real-time video understanding.
Unlike offline tasks, online video understanding emphasizes proactive responsiveness, where models must detect when sufficient evidence has appeared in the stream to answer a given question (\emph{trigger}) and respond immediately.
However, existing studies leave such capabilities largely underexplored.
To bridge this gap, we introduce TV-Online (Trigger Video-Online), a large-scale dataset with $50K$ videos, $200K$ questions, and $500K$ time-stamped answers.
TV-Online covers progressively complex trigger-based tasks, ranging from basic temporal grounding to asynchronous scheduling and multi-trigger reasoning.
These tasks motivate an agent-like modeling paradigm in which the system continuously processes streaming inputs and decides at each step whether to respond or remain silent.
We instantiate this paradigm with a streaming-oriented model that employs protocol-level tagging and structured state management to regulate frame-by-frame decisions, ensuring precise response timing and consistent handling of asynchronous, multi-turn triggers.
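The respond-or-remain-silent decision loop described above can be sketched minimally as follows. This is an illustrative assumption of how protocol-level tagging and state management might work, not the paper's actual implementation: the tag names (`<RESPOND>`, `<SILENT>`), the scalar evidence score per frame, and the fixed threshold are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class StreamState:
    """Structured state carried across frames for one pending question."""
    question: str
    evidence: list = field(default_factory=list)
    answered: bool = False

def step(state: StreamState, frame_score: float, frame_id: int,
         threshold: float = 0.8) -> str:
    """Process one frame: accumulate evidence, then emit a protocol tag.

    frame_score is a hypothetical per-frame evidence score for the question.
    """
    state.evidence.append((frame_id, frame_score))
    if not state.answered and frame_score >= threshold:
        state.answered = True                # trigger fired: answer once
        return f"<RESPOND t={frame_id}>"
    return "<SILENT>"                        # keep watching the stream

state = StreamState(question="When does the car stop?")
outputs = [step(state, s, t) for t, s in enumerate([0.1, 0.3, 0.9, 0.95])]
```

In this sketch, only the first frame whose evidence crosses the threshold yields a response tag; the `answered` flag keeps later frames silent, which is one simple way to keep multi-trigger, multi-turn interactions consistent.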
To endow the model with such capabilities, we adopt a progressive training strategy that leverages difficulty annotations in TV-Online and reinforcement objectives to shape responsiveness, coverage, and coherence across evolving interactions.
Finally, we introduce a unified evaluation metric that integrates semantic, temporal, and coverage dimensions to holistically assess online video understanding.
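One way such a unified score could combine the three dimensions is sketched below. The geometric-mean combination, the exponential temporal decay, and the tolerance parameter are assumptions made for exposition; the paper's actual metric may differ.

```python
import math

def temporal_score(pred_t: float, gt_t: float, tolerance: float = 5.0) -> float:
    """Decays with the gap between predicted and ground-truth trigger times."""
    return math.exp(-abs(pred_t - gt_t) / tolerance)

def unified_score(semantic: float, pred_t: float, gt_t: float,
                  coverage: float) -> float:
    """Geometric mean of the three dimensions, so a failure in any one
    (wrong answer, late trigger, missed triggers) drags the score down."""
    t = temporal_score(pred_t, gt_t)
    return (max(semantic, 0.0) * t * max(coverage, 0.0)) ** (1.0 / 3.0)

# A response that is semantically strong but two seconds late:
s = unified_score(semantic=0.9, pred_t=12.0, gt_t=10.0, coverage=1.0)
```

A multiplicative combination like this rewards models that are jointly accurate, timely, and complete, rather than letting one strong dimension mask a weak one.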
Extensive experiments demonstrate that TV-Online, together with the proposed model, training strategy, and metric, provides a comprehensive benchmark for advancing trigger-oriented online video understanding toward practical real-time video intelligence.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10948