Keywords: Embodied Navigation, Vision–Language Navigation, Object–Goal Navigation, Robustness, Trustworthiness
TL;DR: NavTrust is a unified benchmark for evaluating embodied navigation agents under real-world corruptions. We introduce corruptions for RGB, depth, and language instructions, and explore four mitigation strategies to improve robustness.
Abstract: Embodied navigation remains challenging due to cluttered layouts, complex semantics, and language-conditioned instructions. Recent progress in complex indoor domains requires robots to interpret cluttered scenes, reason over long-horizon visual memories, and follow natural language instructions. Broadly, embodied navigation falls into two major categories: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, under realistic scenarios and evaluates their impact on navigation performance.
To the best of our knowledge, NavTrust is the first benchmark to expose embodied navigation agents to diverse RGB-depth corruptions and instruction variations in a unified framework. Our extensive evaluation of six state-of-the-art approaches reveals substantial success-rate degradation under realistic corruptions, highlighting critical robustness gaps and providing a roadmap toward more trustworthy embodied navigation systems. As part of this roadmap, we systematically evaluate four distinct strategies for enhancing robustness: data augmentation, teacher-student knowledge distillation, a safeguard LLM, and lightweight adapter tuning. Our experiments offer a practical path toward developing more resilient embodied agents.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21346