Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding
Keywords: Online Video Understanding, Multimodal Large Language Model
Abstract: The rapid growth of online video platforms has resulted in vast amounts of streaming and surveillance content, creating an urgent demand for real-time video understanding.
Unlike offline tasks, online video understanding emphasizes proactive responsiveness, where models must detect when sufficient evidence has appeared in the stream to answer a given question (\emph{trigger}) and respond immediately.
However, existing studies leave such capabilities largely underexplored.
To bridge this gap, we introduce TV-Online (Trigger Video-Online), a large-scale dataset with $50K$ videos, $200K$ questions, and $500K$ time-stamped answers.
TV-Online covers progressively complex trigger-based tasks, ranging from basic temporal grounding to asynchronous scheduling and multi-trigger reasoning.
These tasks motivate an agent-like modeling paradigm in which the system continuously processes streaming inputs and decides at each step whether to respond or remain silent.
We instantiate this paradigm with a streaming-oriented model that employs protocol-level tagging and structured state management to regulate frame-by-frame decisions, ensuring precise response timing and consistent handling of asynchronous, multi-turn triggers.
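The respond-or-remain-silent decision loop described above can be sketched minimally as follows. This is an illustrative assumption of how protocol-level tagging and state management might work, not the paper's actual implementation: the tag names (`<RESPOND>`, `<SILENT>`), the scalar evidence score per frame, and the fixed threshold are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class StreamState:
    """Structured state carried across frames for one pending question."""
    question: str
    evidence: list = field(default_factory=list)
    answered: bool = False

def step(state: StreamState, frame_score: float, frame_id: int,
         threshold: float = 0.8) -> str:
    """Process one frame: accumulate evidence, then emit a protocol tag.

    frame_score is a hypothetical per-frame evidence score for the question.
    """
    state.evidence.append((frame_id, frame_score))
    if not state.answered and frame_score >= threshold:
        state.answered = True                # trigger fired: answer once
        return f"<RESPOND t={frame_id}>"
    return "<SILENT>"                        # keep watching the stream

state = StreamState(question="When does the car stop?")
outputs = [step(state, s, t) for t, s in enumerate([0.1, 0.3, 0.9, 0.95])]
```

In this sketch, only the first frame whose evidence crosses the threshold yields a response tag; the `answered` flag keeps later frames silent, which is one simple way to keep multi-trigger, multi-turn interactions consistent.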
To endow the model with such capabilities, we adopt a progressive training strategy that leverages difficulty annotations in TV-Online and reinforcement objectives to shape responsiveness, coverage, and coherence across evolving interactions.
Finally, we introduce a unified evaluation metric that integrates semantic, temporal, and coverage dimensions to holistically assess online video understanding.
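One way such a unified score could combine the three dimensions is sketched below. The geometric-mean combination, the exponential temporal decay, and the tolerance parameter are assumptions made for exposition; the paper's actual metric may differ.

```python
import math

def temporal_score(pred_t: float, gt_t: float, tolerance: float = 5.0) -> float:
    """Decays with the gap between predicted and ground-truth trigger times."""
    return math.exp(-abs(pred_t - gt_t) / tolerance)

def unified_score(semantic: float, pred_t: float, gt_t: float,
                  coverage: float) -> float:
    """Geometric mean of the three dimensions, so a failure in any one
    (wrong answer, late trigger, missed triggers) drags the score down."""
    t = temporal_score(pred_t, gt_t)
    return (max(semantic, 0.0) * t * max(coverage, 0.0)) ** (1.0 / 3.0)

# A response that is semantically strong but two seconds late:
s = unified_score(semantic=0.9, pred_t=12.0, gt_t=10.0, coverage=1.0)
```

A multiplicative combination like this rewards models that are jointly accurate, timely, and complete, rather than letting one strong dimension mask a weak one.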
Extensive experiments demonstrate that TV-Online, together with the proposed model, training strategy, and metric, provides a comprehensive benchmark for advancing trigger-oriented online video understanding toward practical real-time video intelligence.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10948