Abstract: With rapid advances in computer vision, building 3D language field models that support open-ended language queries in 3D has recently received increasing attention. This article introduces OSH-Splat, which constructs a 3D language field that allows accurate and efficient open-ended language queries in 3D space. First, we use the Segment Anything Model (SAM) to extract hierarchical semantic information at three levels: subpart, part, and whole. This resolves target ambiguity and yields pixel-aligned CLIP embeddings. Then, to reduce memory consumption, we employ a scene-specific encoder-decoder pair. In the second stage of training, the semantic features are learned as features of the 3D Gaussian splatting representation, which extends the 3D language field to support semantic queries. Furthermore, we propose the optimizable semantic hyperplane (OSH), a query strategy for our 3D language feature Gaussians that replaces the fixed empirical thresholds used by traditional methods and achieves better accuracy and robustness in 3D semantic segmentation tasks. For each text query, the OSH is iteratively optimized with the help of a referring expression segmentation model to localize the target region accurately. Extensive experiments show that our approach outperforms state-of-the-art methods.
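The contrast between a fixed similarity threshold and a per-query optimized hyperplane can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the synthetic unit-norm features, the logistic loss with plain gradient descent, and the boolean `reference_mask`, which merely stands in for the output of a referring expression segmentation model.

```python
import numpy as np


def optimize_hyperplane(features, text_embedding, reference_mask,
                        init_threshold=0.5, lr=0.5, steps=200):
    """Fit a hyperplane (w, b) so that features @ w + b > 0 matches the mask.

    Hypothetical helper: starts from the fixed-threshold rule
    (w = text embedding, b = -threshold) and refines it by gradient
    descent on a logistic loss. The loss choice is an assumption.
    """
    w = text_embedding.astype(np.float64).copy()
    b = -float(init_threshold)
    y = reference_mask.astype(np.float64)
    n = len(y)
    for _ in range(steps):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - y                      # d(loss)/d(logits)
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def query(features, w, b):
    """Select the features lying on the positive side of the hyperplane."""
    return features @ w + b > 0.0


# Synthetic stand-in data: 200 "target" features near the text embedding
# and 200 random "background" features, all unit-normalized.
rng = np.random.default_rng(0)
dim = 16
text = rng.normal(size=dim)
text /= np.linalg.norm(text)
target = text + 0.1 * rng.normal(size=(200, dim))
background = rng.normal(size=(200, dim))
features = np.vstack([target, background])
features /= np.linalg.norm(features, axis=1, keepdims=True)
reference_mask = np.arange(400) < 200         # stand-in for a model's mask

# Baseline: fixed empirical threshold on cosine similarity.
fixed_selection = features @ text > 0.5
fixed_accuracy = (fixed_selection == reference_mask).mean()

# Optimized hyperplane for this query.
w, b = optimize_hyperplane(features, text, reference_mask)
accuracy = (query(features, w, b) == reference_mask).mean()
print(f"fixed threshold: {fixed_accuracy:.3f}, optimized hyperplane: {accuracy:.3f}")
```

In this sketch the hyperplane is initialized to reproduce the fixed-threshold rule, so the optimization only adjusts the decision boundary for the given query rather than replacing the similarity measure.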
External IDs:dblp:journals/vc/XuJTM25