Keywords: video object detection, video object tracking, human-in-the-loop, continuity
TL;DR: We present a novel framework for video object observation, which unifies video object detection and video object tracking with human intervention.
Abstract: Video understanding requires both detecting objects in individual frames and maintaining object identities across time. However, conventional methods separate detection and tracking, leading to failures under long-term occlusion, abrupt appearance changes, and the emergence of novel objects. These challenges are particularly severe in dynamic, open-world environments, where objects frequently disappear, reappear, or evolve in appearance; once identity is lost, automated systems rarely recover. Consequently, a formulation that incorporates human intervention is required to ensure reliable and adaptive continuity. To this end, we introduce Video Object Observation (VOO), a new task unifying detection, tracking, and human intervention, thereby shifting the focus from frame-level recognition to consistent sequence-level observation. To realize this task, we propose VOOV (Video Object Observer with human-interVention), the first framework explicitly designed for VOO. VOOV integrates three complementary memory modules (Originate, Sequential, and Long Term) that jointly encode semantic identity and temporal context, while an Orbital Deformable Attention mechanism models object motion probabilistically. Sparse human intervention, including initialization, bounding box correction, and target switching, is systematically incorporated into memory, enabling online adaptation without retraining. Experiments on multiple benchmarks demonstrate that VOOV achieves state-of-the-art performance, providing robust, real-time observation across diverse and challenging scenarios.
Primary Area: learning theory
Submission Number: 17685