Proto-SaGa: Prototype-based 3D Scene Segmentation with Semantic-aware Gaussian Grouping

19 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: 3D Scene Understanding, 3D Scene Segmentation, Multi-view Segmentation, 3D Gaussian Splatting
Abstract: Segment Anything Models (SAM), trained on large amounts of ground-truth labels, have achieved strong performance in 2D scene segmentation. In contrast, accurate 3D scene segmentation remains challenging, since annotating consistent segmentation masks across multiple views is highly labor-intensive. To address this, many approaches use the inconsistent masks predicted by SAM as pseudo labels. They typically build on 3D Gaussian splatting (3DGS) to simultaneously synthesize and segment novel views of a 3D scene. In particular, several 3DGS-based methods focus on associating the inconsistent masks across training views and then training a classifier on the associated masks. These methods, however, have two limitations: (1) the association process considers only the location of each 3D Gaussian in the scene, and (2) training a classifier on the associated masks is prone to overfitting to their incorrect labels. In this paper, we introduce Proto-SaGa, a novel 3DGS-based framework that addresses both limitations. First, we present a semantic-aware mask association strategy that exploits both the location and the high-level semantics of each Gaussian to improve the consistency of the associated masks. Second, we propose a novel inference scheme that alleviates the influence of possibly incorrect labels within the associated masks: we obtain a set of prototypes by averaging features over the consistent masks, and use them as a classifier at test time without further training. Extensive experiments on Replica, LERF-Mask, ScanNet, and Mip-NeRF 360 demonstrate the effectiveness of our approach. We will make our code publicly available upon acceptance.
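For intuition only, the prototype-as-classifier idea mentioned in the abstract can be sketched as follows. This is not the authors' implementation: the function names, the per-Gaussian feature array, and the use of cosine similarity are illustrative assumptions; the paper only states that prototypes are obtained by averaging features over the consistent masks and then used as a test-time classifier.

```python
import numpy as np

def build_prototypes(features, labels):
    """Average the features assigned to each mask ID into one prototype per class.

    features: (N, D) array of per-Gaussian (or per-pixel) features.
    labels:   (N,) array of associated mask IDs (hypothetical input format).
    """
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # L2-normalize so that classification reduces to cosine similarity.
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    return classes, protos

def classify(features, classes, protos):
    """Assign each feature to its nearest prototype under cosine similarity;
    no classifier training is involved."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    return classes[np.argmax(feats @ protos.T, axis=1)]
```

A toy usage: features clustered around two directions are recovered by their own prototypes, illustrating why averaging can dampen a minority of mislabeled samples instead of overfitting to them.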
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15496