Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Distilled radiance fields, Robotics, Geometry-grounded visual semantics, Gaussian Splatting, NeRFs
TL;DR: We explore the geometry-grounded semantic features in distilled radiance fields and find that although these features provide finer geometric detail, they do not outperform purely visual semantic features.
Abstract: Pretrained semantics from large vision models have enabled major advances in open-vocabulary robot policies, e.g., in manipulation and navigation. However, a striking lack of consensus on the performance and effects of fine-tuning these vision encoders remains a significant challenge. For example, some papers claim that (task-specific) pretrained encoders outperform general-purpose semantic encoders (e.g., DINO), or that fine-tuning vision encoders improves performance, while others claim the exact opposite. In this work, we seek to resolve these long-standing divisions through a principled examination of pretrained semantics from vision encoders in robotics. We hypothesize that the inconsistencies in prior work arise from a fundamental lack of insight into the feature content of these vision encoders. Hence, we undertake a systematic study of pretrained semantics in distilled fields to uncover their salient components, with the goal of identifying a framework that explains previously contradictory claims. Specifically, we ask: *what do the semantic features of robotics vision encoders contain?* — and consider both visual-semantic encoders (like DINO) and geometry-grounded encoders (like MUSt3R/VGGT). Notably, we find that these encoders attend to different features in their image inputs. While visual-semantic encoders prioritize object- and part-level semantic decomposition, geometry-grounded encoders may discard this information in favor of structural components, such as edges and corners. This observation is consistent with catastrophic forgetting of core semantic information, which worsens with increased fine-tuning. We validate these findings in two major robotics problems, semantic object localization and radiance field inversion, using distilled fields as a testbed. The results align with the internal content of the semantic features of these encoders, highlighting the strong explainability afforded by internal probes.
For semantics-focused radiance field inversion, we propose SPINE, a novel framework that performs coarse inversion using distilled semantics followed by fine inversion via photometric optimization, without requiring an initial guess, and demonstrate its superior performance compared to competitive alternatives. Further, our results suggest that geometry grounding could offer benefits if catastrophic forgetting is controlled.
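The coarse-to-fine inversion idea described above can be illustrated with a minimal sketch. This is not the paper's actual SPINE implementation; all function names, the candidate-pose search, and the finite-difference refinement below are illustrative assumptions. The coarse stage scores candidate poses by the similarity of distilled semantic features against the target view (so no initial pose guess is needed), and the fine stage locally minimizes a photometric loss.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two feature vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def invert(field, target_sem, target_rgb, candidates, steps=50, lr=0.1):
    """Hypothetical coarse-to-fine radiance-field inversion sketch.

    field: dict with "sem" and "rgb" render callables (stand-ins for
           rendering distilled semantic features / RGB at a given pose).
    """
    # Coarse stage: pick the candidate pose whose rendered semantic
    # features best match the target view (no initial guess required).
    pose = max(candidates, key=lambda p: cosine(field["sem"](np.asarray(p)), target_sem))
    pose = np.array(pose, dtype=float)

    # Fine stage: photometric refinement by finite-difference gradient
    # descent on the squared RGB error (a gradient-based renderer would
    # be used in practice).
    for _ in range(steps):
        loss0 = np.sum((field["rgb"](pose) - target_rgb) ** 2)
        grad = np.zeros_like(pose)
        for i in range(pose.size):
            dp = np.zeros_like(pose)
            dp[i] = 1e-3
            grad[i] = (np.sum((field["rgb"](pose + dp) - target_rgb) ** 2) - loss0) / 1e-3
        pose -= lr * grad
    return pose
```

With a toy field where both renderers simply return the pose, the coarse stage selects the nearest candidate and the fine stage converges to the target pose, mirroring the two-stage structure described in the abstract.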
Supplementary Material: pdf
Primary Area: applications to robotics, autonomy, planning
Submission Number: 18317