Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Distilled radiance fields, Robotics, Geometry-grounded visual semantics, Gaussian Splatting, NeRFs
TL;DR: We explore the geometry-grounded semantic features in distilled radiance fields and find that although these features provide finer geometric detail, they do not outperform purely visual semantic features.
Abstract: Pretrained semantics from large vision models have enabled major advances in open-vocabulary robot policies, e.g., in manipulation and navigation. However, a striking lack of consensus on the performance and effects of fine-tuning these vision encoders remains a significant challenge. For example, some papers claim that (task-specific) pretrained encoders outperform general-purpose semantic encoders (e.g., DINO), or that fine-tuning vision encoders improves performance, while others claim the exact opposite. In this work, we seek to resolve these long-standing divisions through a principled examination of pretrained semantics from vision encoders in robotics. We hypothesize that the inconsistencies in prior work arise from a fundamental lack of insight into the feature content of these vision encoders. Hence, we undertake a systematic study of pretrained semantics in distilled fields to uncover their salient components, with the goal of identifying a framework that explains previously contradictory claims. Specifically, we ask: *what do the semantic features of robotics vision encoders contain?* — and consider both visual-semantic encoders (like DINO) and geometry-grounded encoders (like MUSt3R/VGGT). Notably, we find that these encoders attend to different features in their image inputs. While visual-semantic encoders prioritize object- and part-level semantic decomposition, geometry-grounded encoders may discard this information in favor of structural components, such as edges and corners. This observation is consistent with catastrophic forgetting of core semantic information, which worsens with increased fine-tuning. We validate these findings in two major robotics problems, semantic object localization and radiance field inversion, using distilled fields as a testbed. The results align with the internal content of the semantic features of these encoders, highlighting the strong explainability afforded by internal probes.
For semantics-focused radiance field inversion, we propose SPINE, a novel framework that performs coarse inversion using distilled semantics followed by fine inversion via photometric optimization, without requiring an initial guess, and demonstrate its superior performance compared to competitive alternatives. Further, our results suggest that geometry grounding could offer benefits if catastrophic forgetting is controlled.
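The coarse-to-fine inversion idea described above can be illustrated with a minimal sketch. This is not the paper's actual SPINE implementation; all function names, the candidate-pose search, and the finite-difference refinement below are illustrative assumptions. The coarse stage scores candidate poses by the similarity of distilled semantic features against the target view (so no initial pose guess is needed), and the fine stage locally minimizes a photometric loss.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two feature vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def invert(field, target_sem, target_rgb, candidates, steps=50, lr=0.1):
    """Hypothetical coarse-to-fine radiance-field inversion sketch.

    field: dict with "sem" and "rgb" render callables (stand-ins for
           rendering distilled semantic features / RGB at a given pose).
    """
    # Coarse stage: pick the candidate pose whose rendered semantic
    # features best match the target view (no initial guess required).
    pose = max(candidates, key=lambda p: cosine(field["sem"](np.asarray(p)), target_sem))
    pose = np.array(pose, dtype=float)

    # Fine stage: photometric refinement by finite-difference gradient
    # descent on the squared RGB error (a gradient-based renderer would
    # be used in practice).
    for _ in range(steps):
        loss0 = np.sum((field["rgb"](pose) - target_rgb) ** 2)
        grad = np.zeros_like(pose)
        for i in range(pose.size):
            dp = np.zeros_like(pose)
            dp[i] = 1e-3
            grad[i] = (np.sum((field["rgb"](pose + dp) - target_rgb) ** 2) - loss0) / 1e-3
        pose -= lr * grad
    return pose
```

With a toy field where both renderers simply return the pose, the coarse stage selects the nearest candidate and the fine stage converges to the target pose, mirroring the two-stage structure described in the abstract.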
Supplementary Material: pdf
Primary Area: applications to robotics, autonomy, planning
Submission Number: 18317