Keywords: language, splatting
Abstract: To enable AI agents to interact seamlessly with humans and 3D environments, they must accurately perceive 3D spaces and align language with spatial representations. While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting (3DGS), these approaches rely on computationally intensive offline pre-processing of language features for each input image, limiting adaptability to new environments. In this work, we introduce Online Language Splatting, the first framework enabling near real-time, open-vocabulary language mapping within a 3DGS-SLAM system without requiring pre-generated language features. The key challenge lies in efficiently fusing high-dimensional language features into 3D representations while balancing computation speed, memory usage, rendering quality, and open-vocabulary capability. To this end, we design: (1) a high-resolution CLIP embedding module capable of generating detailed language feature maps in 18 ms per frame, and (2) a two-stage online auto-encoder that compresses 768-dimensional CLIP features to 15 dimensions while preserving open-vocabulary capabilities. Experiments show that our online method not only surpasses state-of-the-art offline methods in accuracy but also achieves a more than 40× efficiency boost, demonstrating its potential for dynamic and interactive AI applications.
Submission Number: 3