ERNav: A Unified, Realistic Benchmark for Embodied AI with Exploration, Representation, and Navigation
Keywords: 3D Scene Understanding, Vision-Language, Vision-and-Language Navigation
Abstract: Current embodied AI benchmarks typically focus only on the final stage of the embodied process, such as following instructions or answering scene-related questions. These evaluations often unrealistically assume access to perfect perception data of the environment and overlook the earlier stages of exploration and representation construction, which are indispensable for real-world deployment. In addition, these benchmarks are often restricted to smaller-scale, room-level environments and short, object-centric instructions, failing to capture the complexity of larger buildings where agents must operate across multiple rooms and floors while reasoning over long instructions tied to global layouts. To address these gaps, we introduce ERNav, the first unified benchmark for embodied AI that integrates Exploration, Representation, and Navigation into an end-to-end task pipeline. In ERNav, agents must actively explore the environment, construct global representations from noisy RGB-D observations, and then localize targets directly from natural language instructions that often require reasoning over entire buildings. This unified formulation differs from existing benchmarks by aligning all stages of the embodied pipeline and scaling evaluation to realistic building-level settings, creating a challenging and practical testbed for embodied AI. We also propose 3D-LangNav as a strong baseline. As a divide-and-conquer framework, it employs a dual-sighted exploration strategy to collect diverse observations and construct high-quality 3D representations, followed by language grounding and spatial reasoning via a fine-tuned large language model (LLM). Extensive experiments show that ERNav poses significant new challenges for existing methods, while 3D-LangNav achieves strong performance, reaching more than twice the success rate (SR) of state-of-the-art 3D-MLLMs.
Moreover, by structuring the task into three progressively harder, sequentially dependent subtasks as a whole pipeline, ERNav enables systematic analysis of how each stage contributes to overall performance, providing clear directions for future research.
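The three sequentially dependent stages described above can be sketched as a minimal pipeline skeleton. This is a hypothetical illustration only: the class and method names (`Agent`, `explore`, `represent`, `navigate`) are not from the benchmark's actual API, and each stage body is a trivial stand-in for the real exploration, mapping, and grounding components.

```python
# Hypothetical sketch of the Exploration -> Representation -> Navigation
# pipeline; all names are illustrative, not the benchmark's actual API.
from dataclasses import dataclass, field


@dataclass
class Agent:
    observations: list = field(default_factory=list)
    scene_map: dict = field(default_factory=dict)

    def explore(self, env):
        # Stage 1: actively collect (noisy) RGB-D observations
        # while traversing the environment.
        for frame in env:
            self.observations.append(frame)

    def represent(self):
        # Stage 2: fuse the collected observations into a global
        # scene representation (here: a trivial index -> frame map).
        for idx, frame in enumerate(self.observations):
            self.scene_map[idx] = frame

    def navigate(self, instruction):
        # Stage 3: ground the language instruction in the global
        # representation and return a target (here: keyword match).
        for idx, frame in self.scene_map.items():
            if instruction.lower() in str(frame).lower():
                return idx
        return None
```

Because each stage consumes the previous stage's output, an error in exploration or representation propagates downstream, which is what makes the staged evaluation above informative.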
Supplementary Material:  zip
Primary Area: datasets and benchmarks
Submission Number: 19726