QueryStream: Advancing Streaming Video Understanding with Query-Aware Pruning and Proactive Response
Keywords: Multimodal Large Language Model, Online Video Understanding, Streaming Video Understanding
Abstract: The increasing demand for real-time interaction in online video scenarios necessitates a new class of efficient streaming video understanding models. However, existing approaches often rely on a query-agnostic "change-is-important" assumption, which conflates visual dynamics with semantic relevance, leading to computational redundancy and mistimed responses. To address this, we propose QueryStream, a novel framework that integrates query-awareness into the core of video processing and response scheduling. QueryStream features two synergistic components: (1) Query-Aware Differential Pruning (QDP), a policy that filters the token stream by jointly assessing semantic relevance to the query and temporal novelty against a dynamically smoothed history; and (2) Relevance-Triggered Active Response (RTAR), a dual-gated mechanism that schedules responses based on both high query relevance and significant information density. As a lightweight, training-free module, QueryStream achieves state-of-the-art performance on benchmarks such as StreamingBench and OVO-Bench under moderate pruning, and matches full-token baselines while pruning over 70% of visual tokens. Notably, our pruning mechanism generalizes to offline tasks, where it serves as a context-denoising module that benefits long-form video understanding. This work not only reveals the vast semantic redundancy in video streams relative to user intent but also establishes a promising, intent-driven direction for efficient and robust online video understanding. Code is available at: https://github.com/Zhangkr2003/QueryStream.
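To make the two components concrete, here is a minimal, illustrative sketch of the abstract's ideas: per-token scores combine relevance to the query with novelty against an exponentially smoothed history (QDP), and a response is triggered only when both a relevance gate and an information-density gate pass (RTAR). All class/parameter names, the EMA smoothing, the multiplicative score, and every threshold are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def _cos(a, b):
    # Cosine similarity with a small epsilon for numerical safety.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class QueryAwareStream:
    """Toy sketch of query-aware pruning + dual-gated triggering.

    Hypothetical names and thresholds; not the released QueryStream code.
    """

    def __init__(self, query_emb, alpha=0.9, keep_thresh=0.2,
                 rel_gate=0.5, density_gate=0.4):
        self.q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
        self.alpha = alpha              # EMA weight for the smoothed history
        self.history = None             # dynamically smoothed history embedding
        self.keep_thresh = keep_thresh  # QDP pruning threshold
        self.rel_gate = rel_gate        # RTAR gate 1: query relevance
        self.density_gate = density_gate  # RTAR gate 2: information density

    def step(self, frame_tokens):
        """frame_tokens: (N, D) visual token embeddings for one frame.

        Returns (kept_tokens, trigger_response).
        """
        rel = np.array([_cos(t, self.q) for t in frame_tokens])
        if self.history is None:
            nov = np.ones(len(frame_tokens))  # everything is novel at t=0
        else:
            nov = np.array([1.0 - _cos(t, self.history) for t in frame_tokens])

        # QDP: a token survives only if it is both relevant AND novel.
        keep = (rel * nov) >= self.keep_thresh
        kept = frame_tokens[keep]

        # Update the smoothed history with the current frame's mean embedding.
        frame_mean = frame_tokens.mean(axis=0)
        self.history = (frame_mean if self.history is None
                        else self.alpha * self.history + (1 - self.alpha) * frame_mean)

        # RTAR: respond only when relevance is high AND enough tokens survived.
        density = keep.mean()
        mean_rel = rel[keep].mean() if keep.any() else 0.0
        trigger = (mean_rel >= self.rel_gate) and (density >= self.density_gate)
        return kept, bool(trigger)
```

On a near-static, query-aligned stream this sketch keeps the first frame's tokens (high relevance, maximal novelty) and prunes the near-duplicate frame that follows, which is the behavior the abstract's "over 70% of visual tokens" claim alludes to.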
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15489