VL-N3RD-Bench: Benchmarking Vision-Language Navigation with 3D Gaussian Splatting Reconstruction for Deployment

Published: 13 May 2026, Last Modified: 13 May 2026 · ICRA 2026: From Data to Decisions Poster · CC BY 4.0
Keywords: 3D Gaussian Splatting, Vision-Language Navigation
TL;DR: This paper benchmarks 3D Gaussian Splatting for vision-language navigation, showing that perceptual fidelity helps indoor performance, but geometry and traversability are critical for sim-to-real transfer, especially outdoors.
Abstract: Recent advances in large language models (LLMs) have expanded robotic capabilities by bridging perception and action. Vision-language navigation (VLN) enables embodied agents to follow high-level semantic instructions in complex, unseen environments. Fine-tuning vision-language-action (VLA) models requires photorealistic training data, as insufficient visual fidelity can exacerbate sim-to-real gaps and degrade action execution. However, achieving such realism in simulation is challenging, and training is often conducted on synthetic views or egocentric video datasets. 3D Gaussian Splatting (GS) has recently emerged as a promising solution for high-fidelity 3D scene reconstruction, offering a scalable alternative for generating training environments. Nevertheless, the impact of GS reconstruction quality on downstream VLN performance has not been systematically investigated. In this work, we benchmark state-of-the-art GS pipelines on public and customized datasets, evaluating reconstruction quality via standard image-based metrics and computational efficiency. Finally, we use VL-N3RD-Bench to evaluate a pretrained VLA model on a Unitree A1 robot across simulated and physical environments. Our results demonstrate that higher reconstruction quality can align with stronger indoor VLN performance, but this relationship is scene-dependent, and strong image-based metrics alone do not guarantee better outdoor navigation, where geometry and traversability become more important.
Submission Number: 12