Abstract: Reconstructing class-agnostic segmented 3D scenes is challenging, as both high-quality 3D reconstruction and object-level annotation demand significant hardware and human resources. We therefore propose SceneSAM, an efficient, weakly supervised 3D method that reconstructs and re-renders in-the-wild room-scale scenes with class-agnostic instance masks from a single, unaligned video stream. We use a hierarchical grid-based implicit field as the 3D representation and rely on the Segment Anything Model (SAM) for class-agnostic instance annotations. Our method trains an order of magnitude faster than previous state-of-the-art methods while preserving highly detailed segmentation masks and without relying on any closed-vocabulary model. For consistent mask supervision across independent video frames, we also introduce a novel self-consistent video segmentation algorithm based on 3D-grounded instance proposals. Finally, our approach is agnostic to video registration: it can be used both with and without camera poses, saving a significant amount of additional computation by replacing the industry-standard COLMAP optimization at minimal loss in reconstruction quality. We evaluate our method on both synthetic and real-world datasets and show that efficient and robust scene reconstruction is possible in both the color and instance domains within reasonable time constraints.