UNIT3D: Unified instance-relative transformer for indoor 3D object detection and segmentation

Published: 05 Apr 2026, Last Modified: 06 May 2026 | Pattern Recognition | Everyone | CC BY 4.0
Abstract: In this paper, we present UNIT3D, the first fully unified 3D scene understanding framework, which bridges the gap between detection and segmentation by integrating object detection and semantic, instance, and panoptic segmentation within a single model. We first examine the limits of current designs by adding task-specific heads to strong baselines, and find that simple multi-task extensions perform poorly on the added tasks and even degrade the original ones. This reveals a fundamental architectural conflict between the geometric precision required for detection and the grouping adaptivity needed for segmentation. To address this, we propose the Unified Instance-relative Transformer, which replaces task-specific components with a shared, conflict-free query interaction mechanism. We discard the restrictive mask attention used in prior work and introduce an instance-relative position encoding that retains the benefits of soft mask attention while restoring the rigorous spatial cues necessary for box regression. Our decoder incorporates Spatial Aware Self Attention, which encodes instance centers for scene-level context, and Vertex Guided Cross Attention, which encodes instance vertices for fine-grained details. This global-to-local design allows UNIT3D to output masks and bounding boxes directly, in a fully end-to-end manner. Crucially, its strict one-to-one matching strategy eliminates the reliance on Non-Maximum Suppression (NMS), avoiding heuristic post-processing errors and ensuring robust detection even in crowded 3D scenes. Experiments on ScanNet, ScanNet200, and S3DIS demonstrate that UNIT3D achieves state-of-the-art results in 3D object detection while remaining competitive on segmentation tasks, indicating that cross-paradigm unification can facilitate mutual enhancement across tasks. Code is available at github.com/liuxinrun/unit3d.
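As a rough illustration of why one-to-one matching can make NMS unnecessary, the sketch below assigns each ground-truth box to exactly one predicted query by minimizing a total cost. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `l1_cost` and `one_to_one_match`, the (cx, cy, cz, w, h, d) box layout, the plain L1 cost, and the brute-force search. The actual matching cost and solver used by UNIT3D may differ.

```python
from itertools import permutations


def l1_cost(pred_box, gt_box):
    """L1 distance between two boxes given as (cx, cy, cz, w, h, d).

    Hypothetical cost for illustration; not the paper's matching cost.
    """
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def one_to_one_match(pred_boxes, gt_boxes):
    """Return sorted (query_index, gt_index) pairs with minimal total cost.

    Each ground-truth box gets exactly one query and each query is used
    at most once. Brute force over permutations, so only suitable for a
    handful of boxes; practical systems typically use the Hungarian
    algorithm instead. Assumes len(pred_boxes) >= len(gt_boxes).
    """
    best_pairs, best_cost = [], float("inf")
    for query_ids in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(
            l1_cost(pred_boxes[q], gt_boxes[g]) for g, q in enumerate(query_ids)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(q, g) for g, q in enumerate(query_ids)]
    return sorted(best_pairs)


# Three predicted queries, two ground-truth objects (made-up numbers).
# Queries 0 and 2 are near-duplicates around the first object.
preds = [(0, 0, 0, 2, 2, 2), (10, 10, 0, 4, 4, 4), (1, 0, 0, 2, 2, 2)]
gts = [(1, 0, 0, 2, 2, 2), (10, 10, 0, 4, 4, 2)]

matches = one_to_one_match(preds, gts)
print(matches)  # [(1, 1), (2, 0)]
```

In this toy case query 0 stays unmatched even though it nearly overlaps the first object. Under one-to-one supervision such a query would be trained toward "no object", so duplicate predictions are discouraged during training rather than filtered afterwards by NMS.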