Online 3D Instance Segmentation at Task-Oriented Granularity with Unposed Monocular Video

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: 3D Instance Segmentation, Online 3D Scene Segmentation, 3D Task-oriented Segmentation, Embodied AI
TL;DR: A novel online approach for task-oriented 3D instance segmentation from unposed monocular video
Abstract: We present a real-time, task-oriented 3D instance segmentation framework for unposed monocular video, enabling embodied agents to task-adaptively perceive and interact with objects in open-world scenes. Unlike most previous bottom-up segmentation paradigms, which segment before recognizing, we adopt a task-oriented segmentation approach. Specifically, objects are decoupled within each frame using an open-vocabulary detector combined with a prompt-based 2D segmentation model, while the underlying 3D geometry of the scene is simultaneously reconstructed by a modern dense SLAM system. Guided by the SLAM-derived pose graph, we selectively associate multi-view masks and reuse the dense correspondences provided by the SLAM system, incrementally converting them into geometric association scores with minimal additional computation. By incorporating semantic similarity and mutual exclusivity metrics, we design a priority-ordered mask clustering algorithm for efficient online multi-view mask matching and merging. Evaluations on open-vocabulary 3D instance segmentation benchmarks show that our method effectively mitigates the performance degradation that existing approaches suffer when using dense SLAM reconstructions instead of depth-sensor point clouds. On the Replica dataset, using only unposed images, it even achieves results comparable to methods leveraging ground-truth depth and poses. Code will be released upon acceptance of the paper.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10870