GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting

Elena Alegret; Kunyi Li; Sen Wang; Siyun Liang; Michael Niemeyer; Stefano Gasperini; Nassir Navab; Federico Tombari

Published: 05 Nov 2025, Last Modified: 30 Jan 2026, Venue: 3DV 2026 Poster, License: CC BY 4.0

Keywords: 3D Reconstruction, 3D Scene Understanding, Open-Vocabulary Segmentation, Gaussian Splatting, Attention

TL;DR: GALA is an efficient open-vocabulary 3D scene understanding framework that enhances 3D Gaussian Splatting with instance learning and language-aligned features for unified 2D/3D semantic segmentation.

Abstract: 3D scene reconstruction and understanding have gained increasing popularity, yet existing methods struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend this to generalized language feature fields, we introduce a core contribution of GALA: a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It also reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance in both 2D and 3D.
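The abstract does not specify how the cross-attention module is wired, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that low-dimensional per-Gaussian instance features act as queries, that one codebook supplies attention keys and the other supplies language-aligned values, and that open-vocabulary queries use cosine similarity against a text embedding. All names (`codebook_attention`, `query_relevance`, `key_codebook`, `value_codebook`), the dimensions, and the random stand-in data are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def codebook_attention(instance_feats, key_codebook, value_codebook):
    """Map (N, d) instance features to (N, D) language-aligned features.

    Assumption: the two codebooks play the roles of keys (K, d) and
    values (K, D) in scaled dot-product cross-attention.
    Returns the attended features and the (N, K) attention weights.
    """
    d = instance_feats.shape[1]
    scores = instance_feats @ key_codebook.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ value_codebook, weights


def query_relevance(lang_feats, text_embedding):
    """Cosine similarity between each (D,) feature row and a text embedding."""
    a = lang_feats / np.linalg.norm(lang_feats, axis=1, keepdims=True)
    b = text_embedding / np.linalg.norm(text_embedding)
    return a @ b


# Random stand-ins for learned quantities (hypothetical sizes).
rng = np.random.default_rng(0)
N, d, K, D = 1000, 16, 64, 512
instance_feats = rng.normal(size=(N, d))   # per-Gaussian instance features
key_codebook = rng.normal(size=(K, d))     # first learnable codebook
value_codebook = rng.normal(size=(K, D))   # second learnable codebook

lang_feats, weights = codebook_attention(instance_feats, key_codebook, value_codebook)

# A real system would obtain this from a text encoder; here it is random.
text_embedding = rng.normal(size=D)
relevance = query_relevance(lang_feats, text_embedding)

# Stored parameters: compact features plus codebooks vs. dense per-Gaussian features.
stored_compact = N * d + K * (d + D)
stored_dense = N * D
```

Under these assumptions, Gaussians with similar instance features attend to the same codebook entries and therefore receive similar language features, which is one way to read the claim about intra-instance similarity. The final two lines illustrate the memory argument: only the low-dimensional features and the shared codebooks are stored, rather than a high-dimensional vector per Gaussian.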

Supplementary Material: pdf

Submission Number: 64
