FineSplat: Fine-Grained 3D Open-Vocabulary Language Gaussian Splatting

15 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: Fine-Grained 3D Scene Understanding, Language Gaussian Splatting
Abstract: Existing open-vocabulary scene understanding methods are largely limited to coarse-grained understanding at the object-category level, which leaves them unable to handle fine-grained queries. In this paper, we introduce the challenging task of fine-grained open-vocabulary scene understanding and propose a novel fine-grained 3D language Gaussian splatting framework, FineSplat. Unlike prior methods that rely on vision-language alignment models such as CLIP, FineSplat models the feature field solely from textual captions, transforming the cross-modal feature matching problem into a retrieval process between queries and captions. Specifically, we design the Fine-Grained Caption Generation (FGCG) strategy to obtain captions containing multi-dimensional fine-grained attributes. We then introduce the Fine-Grained Feature Field Modeling (FGFFM) strategy, which encodes the generated fine-grained captions into object-level semantic features that subsequently supervise the training of the 3D Gaussian representations. Furthermore, we construct Fine-OVS, a benchmark to support research on and evaluation of the fine-grained open-vocabulary scene understanding task. Extensive experiments on Fine-OVS demonstrate that FineSplat significantly outperforms existing state-of-the-art methods.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5633