Keywords: streaming, multimodal, video language
Abstract: Robotics, autonomous driving, augmented reality, and many embodied computer vision applications must quickly react to user-defined events unfolding in real time. We address this setting by proposing a novel task for multimodal video understanding---Streaming Detection of Queried Event Start (SDQES).
The goal of SDQES is to identify the beginning of a complex event as described by a natural language query, with high accuracy and low latency.
We introduce a new benchmark based on the Ego4D dataset, as well as new task-specific metrics to study streaming multimodal detection of diverse events in an egocentric video setting.
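To make the streaming setting concrete, below is a minimal sketch of an online detection loop. The names `encode_frame` (a hypothetical causal per-frame encoder), `query_embedding`, and `threshold` are illustrative assumptions, not the benchmark's actual interface or the paper's model; the sketch only shows the frame-by-frame, low-latency nature of SDQES.

```python
import torch

def streaming_sdqes(frames, query_embedding, encode_frame, threshold=0.5):
    """Process frames one at a time and return the first index at which the
    queried event is predicted to start, or None if it is never detected.

    `encode_frame` stands in for any causal encoder that maps the current
    frame plus a running temporal state to an embedding and a new state.
    """
    state = None  # running temporal state carried across frames
    for t, frame in enumerate(frames):
        emb, state = encode_frame(frame, state)  # online, causal update
        score = torch.cosine_similarity(emb, query_embedding, dim=-1)
        if score.item() > threshold:
            return t  # predicted start time of the queried event
    return None
```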
Inspired by parameter-efficient fine-tuning methods from NLP and video understanding, we propose adapter-based baselines that enable image-to-video transfer learning and efficient online video modeling.
We evaluate three vision-language backbones and three adapter architectures in both short-clip and untrimmed video settings.
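As an illustration of the adapter idea (not the paper's specific architecture), the sketch below shows a hypothetical bottleneck adapter with a causal depthwise temporal convolution that could be inserted between frozen image-encoder layers, so only the small adapter is trained while frames are processed online.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Bottleneck adapter with a causal depthwise temporal convolution,
    inserted between frozen image-encoder layers; only the adapter trains."""

    def __init__(self, dim: int, bottleneck: int = 64, kernel_size: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # depthwise conv over time; extra left/right padding is trimmed below
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_size,
                                  padding=kernel_size - 1, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):            # x: (batch, time, dim) frame tokens
        h = self.act(self.down(x))   # project to bottleneck width
        h = h.transpose(1, 2)        # (batch, bottleneck, time) for conv
        h = self.temporal(h)[..., :x.shape[1]]  # keep only causal outputs
        h = h.transpose(1, 2)
        return x + self.up(h)        # residual: frozen backbone + adapter
```

Trimming the convolution output to the original length makes each timestep depend only on past frames, which is what keeps such an adapter compatible with streaming inference.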
Supplementary Material: zip
Submission Number: 194