Tackling View-Dependent Semantics in 3D Language Gaussian Splatting

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · License: CC BY 4.0
TL;DR: This paper presents LaGa, a language-driven open-vocabulary 3D scene understanding method built upon 3D Gaussian splatting, designed to effectively handle the view dependency of 3D semantics.
Abstract: Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm to language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints—a phenomenon we term **view-dependent semantics**. To address this challenge, we propose **LaGa** (**La**nguage **Ga**ussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. It then constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of **+18.7\% mIoU** over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa.
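The aggregation step described in the abstract lends itself to a short sketch. The snippet below is a minimal illustration, not the authors' implementation: the function names, the choice of KMeans for clustering, and the coherence-based reweighting heuristic are all assumptions made for exposition. It shows how multi-view semantic descriptors of a single decomposed object could be clustered, reweighted into a view-aggregated representation, and scored against a text query.

```python
# Minimal sketch of view-aggregated semantic representations (illustrative
# assumptions throughout; this is not the LaGa codebase).
import numpy as np
from sklearn.cluster import KMeans


def view_aggregated_descriptors(view_feats: np.ndarray, n_clusters: int = 4):
    """Cluster one object's multi-view descriptors and weight each cluster.

    view_feats: (n_views, d) array of L2-normalized semantic features
                (e.g., CLIP embeddings), one per viewpoint of the object.
    Returns (centers, weights): normalized cluster centroids and weights.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(view_feats)
    centers = km.cluster_centers_
    centers /= np.linalg.norm(centers, axis=1, keepdims=True)

    # Assumed reweighting heuristic: clusters supported by more views and
    # with higher intra-cluster cosine coherence receive larger weights.
    weights = np.zeros(n_clusters)
    for k in range(n_clusters):
        members = view_feats[km.labels_ == k]
        if len(members) == 0:
            continue
        coherence = float((members @ centers[k]).mean())
        weights[k] = (len(members) / len(view_feats)) * max(coherence, 0.0)
    weights /= weights.sum() + 1e-8
    return centers, weights


def relevance(text_feat: np.ndarray, centers: np.ndarray, weights: np.ndarray):
    """Score an object against an L2-normalized text query embedding."""
    return float((weights * (centers @ text_feat)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(32, 512))           # stand-in multi-view features
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    query = rng.normal(size=512)                 # stand-in text embedding
    query /= np.linalg.norm(query)
    centers, weights = view_aggregated_descriptors(feats)
    print(relevance(query, centers, weights))
```

Under this reading, clusters capture the distinct semantics an object exhibits across viewpoints, and the reweighting keeps the representative ones from being drowned out by views where the object is ambiguous.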
Lay Summary: When computers try to understand 3D scenes from multi-view images, they often face a key challenge: objects can look very different from different viewpoints. For example, a book may be easy to recognize from one angle but hard to identify from another. Many existing methods ignore this and simply project 2D semantic information from a single view onto the 3D scene. We introduce LaGa, a method that improves 3D understanding by recognizing that an object’s meaning can change with the viewpoint. LaGa first segments the 3D scene into a set of 3D objects, then collects semantic information from multiple views of each object to build a shared understanding across viewpoints. This helps computers better interpret complex scenes, especially when using natural language. Tested on a challenging benchmark, LaGa significantly outperforms previous approaches. It offers a step forward for applications like augmented reality, robotics, and virtual environments, where accurate 3D understanding is essential.
Link To Code: https://github.com/SJTU-DeepVisionLab/LaGa
Primary Area: Applications->Computer Vision
Keywords: 3D Gaussian splatting; open-vocabulary scene understanding; 3D scene understanding
Submission Number: 112