Keywords: 3D Reconstruction, 3D Scene Understanding, Open-Vocabulary Segmentation, Gaussian Splatting, Attention
TL;DR: GALA is an efficient open-vocabulary 3D scene understanding framework that enhances 3D Gaussian Splatting with instance learning and language-aligned features for unified 2D/3D semantic segmentation.
Abstract: 3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend this instance field to a generalized language feature field, we introduce a core contribution of GALA: a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries, and it reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance in both 2D and 3D.
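Illustrative sketch (not the authors' implementation): the abstract describes a cross-attention module with two learnable codebooks that lifts per-Gaussian instance features to language-aligned features. The snippet below shows one plausible reading of that idea, assuming the two codebooks act as attention keys and values. The function name `cross_attend`, the codebook roles, and all dimensions are hypothetical choices for illustration; the codebooks here are random rather than learned.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(instance_feats, key_codebook, value_codebook):
    """Lift low-dimensional instance features to language-sized features.

    Assumption: each instance feature is a query, `key_codebook` holds keys
    in the instance-feature space, and `value_codebook` holds the matching
    high-dimensional semantic embeddings.
    """
    d = len(key_codebook[0])
    out_dim = len(value_codebook[0])
    outputs = []
    for q in instance_feats:
        # Scaled dot-product score of the query against every codebook key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k in key_codebook
        ]
        weights = softmax(scores)
        # Output is a convex combination of the value-codebook entries.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, value_codebook))
            for c in range(out_dim)
        ])
    return outputs


# Hypothetical sizes: 8 codebook entries, 4-D instance features, 16-D outputs.
rng = random.Random(0)
num_entries, inst_dim, lang_dim = 8, 4, 16
key_codebook = [[rng.gauss(0, 1) for _ in range(inst_dim)] for _ in range(num_entries)]
value_codebook = [[rng.gauss(0, 1) for _ in range(lang_dim)] for _ in range(num_entries)]
gaussian_feats = [[rng.gauss(0, 1) for _ in range(inst_dim)] for _ in range(5)]

lang_feats = cross_attend(gaussian_feats, key_codebook, value_codebook)
```

Under this reading, each Gaussian stores only the low-dimensional instance feature, while the high-dimensional embeddings live in the shared codebooks. Gaussians with identical instance features also receive identical outputs, which is consistent with the intra-instance similarity the abstract mentions.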
Supplementary Material: pdf
Submission Number: 64