Keywords: Language 3D Gaussian Splatting, 3DGS Compression, Scene Understanding, Open-Vocabulary Querying
TL;DR: A unified framework for training and compressing language 3DGS models, producing highly compact and semantically accurate representations.
Abstract: Language 3D Gaussian Splatting (3DGS) has shown promising advances in open-vocabulary 3D scene understanding by incorporating semantic features from pretrained vision-language models into Gaussians to encode the semantic information of a scene. However, language-embedded 3DGS suffers from high computational and storage costs due to the massive number of Gaussians and the extra high-dimensional semantic attributes, which hinder its practical application. Existing compression methods reduce 3DGS model redundancy primarily through pruning or quantization, and a straightforward solution is to apply them sequentially to obtain a highly compressed language-embedded 3DGS model. However, none of these approaches is designed for compressing language 3DGS: the rich semantic features are ignored during the compression stages, leading to severe semantic information loss and significantly degraded scene understanding performance. Furthermore, the disjoint nature of the pruning and quantization stages degrades rendering quality. To address these issues, we propose CoLaSplat, a unified compression framework for compact language 3DGS. CoLaSplat formulates semantic learning, sparsification, and vector quantization as a single optimization problem, constrained by the number of Gaussian primitives and a vector quantization objective, seamlessly integrating the compression procedure into training while incorporating language embeddings. To solve this unified problem, we develop an efficient primal-dual optimization scheme that solves the associated subproblems and updates the variables separately, progressively compacting the model while preserving semantic and RGB rendering fidelity. Moreover, we theoretically analyze the convergence and stability of the proposed framework. Extensive experiments on 3D semantic segmentation and object localization demonstrate that CoLaSplat brings substantial efficiency gains while maintaining high task performance. Specifically, CoLaSplat achieves up to $15\times$ model size reduction, $147\times$ faster inference, and $6.7\times$ lower memory usage.
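The abstract does not spell out the formulation, but the scheme it describes (a single constrained objective solved by alternating primal updates and dual ascent) admits a standard instantiation. Below is a minimal, hypothetical sketch assuming a relaxed pruning mask $m$, a VQ codebook $C$ with nearest-code assignment $q(\cdot)$, and a Lagrange multiplier $\mu$ for the primitive budget $K$; all symbols and weights ($\lambda$, $\beta$, $\rho$, $\eta$) are illustrative and not taken from the paper.

```latex
% Hypothetical sketch of a unified objective (not taken from the paper):
% G  -- Gaussian geometry/appearance parameters
% F  -- per-Gaussian semantic (language) embeddings f_i
% C  -- VQ codebook {c_k}, with q(i) = index of the nearest code to f_i
% m  -- soft pruning mask over the N primitives, K = primitive budget
\[
\begin{aligned}
\min_{G,\,F,\,C,\,m}\quad
  & \mathcal{L}_{\mathrm{rgb}}(G, m)
    \;+\; \lambda\, \mathcal{L}_{\mathrm{sem}}(F, m)
    \;+\; \beta \sum_{i=1}^{N} \bigl\| f_i - c_{q(i)} \bigr\|_2^2 \\
\text{s.t.}\quad
  & \sum_{i=1}^{N} m_i \le K, \qquad m \in [0,1]^N .
\end{aligned}
\]
% One primal-dual iteration under this relaxation
% (eta, rho = step sizes; mu = dual variable for the budget constraint):
\[
\begin{aligned}
(G, F, m) &\leftarrow (G, F, m) - \eta\, \nabla \Bigl[
      \mathcal{L}_{\mathrm{rgb}} + \lambda\, \mathcal{L}_{\mathrm{sem}}
      + \beta \textstyle\sum_i \| f_i - c_{q(i)} \|_2^2
      + \mu \bigl( \textstyle\sum_i m_i - K \bigr) \Bigr]
      && \text{(primal step)} \\
c_k &\leftarrow \operatorname{mean}\bigl\{ f_i : q(i) = k \bigr\}
      && \text{(codebook refit, k-means style)} \\
\mu &\leftarrow \Bigl[ \mu + \rho \bigl( \textstyle\sum_i m_i - K \bigr) \Bigr]_{+}
      && \text{(dual ascent on the budget)}
\end{aligned}
\]
```

Under this reading, sparsification, quantization, and semantic learning share one loss, so the pruning mask and codebook are updated with the semantic term in view at every step rather than applied post hoc, which is the property the abstract credits for avoiding semantic information loss.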
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8120