Observe Anything: Human-Intervened Video Understanding with Adaptive Orbital Memory

Seo-Yeon Choi; Kyungsu Lee

19 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: video object detection, video object tracking, human-in-the-loop, continuity

TL;DR: We present a novel framework for Video Object Observation, which unifies video object detection and video object tracking with human intervention.

Abstract: Video understanding requires both detecting objects in individual frames and maintaining object identities over time. However, conventional methods separate detection and tracking, leading to failures under long-term occlusion, abrupt appearance changes, and the emergence of novel objects. These challenges are particularly severe in dynamic and open-world environments, where objects frequently disappear, reappear, or evolve in appearance; once identity is lost, automated systems rarely recover. Consequently, a formulation that incorporates human intervention is required to ensure reliable and adaptive continuity. To address this, we introduce Video Object Observation (VOO), a new task that unifies detection, tracking, and human intervention, thereby shifting the focus from frame-level recognition to consistent sequence-level observation. To realize this task, we propose VOOV (Video Object Observer with human-interVention), the first framework explicitly designed for VOO. VOOV integrates three complementary memory modules (Originate, Sequential, and Long Term) that jointly encode semantic identity and temporal context, while an Orbital Deformable Attention mechanism models object motion probabilistically. Sparse human intervention, including initialization, bounding box correction, and target switching, is systematically incorporated into memory, enabling online adaptation without retraining. Experiments on multiple benchmarks demonstrate that VOOV achieves state-of-the-art performance, providing robust, real-time observation across diverse and challenging scenarios.

Primary Area: learning theory

Submission Number: 17685
