CAPNAV: TOWARDS ROBUST INDOOR NAVIGATION WITH DESCRIPTION-FIRST MAPS

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision-Language Navigation, Multi-View Captioning, Attribute-Guided Retrieval, LLM, Clustering
TL;DR: CapNav unifies attribute-aware grounding and perception to navigate to precisely described objects, outperforming category-only methods under ambiguous, multi-object scenes.
Abstract: Humans naturally form mental maps of their surroundings: they picture what a destination looks like, relate it to nearby objects, and implicitly plan a route before moving. We seek a similar capability for embodied agents: given a free-form description such as "go to the white sofa with curved edges", the agent should pick the correct 3D instance among many lookalikes and navigate to it safely. We propose CapNav, a description-first navigation framework that builds an instance-centric 3D map from RGB-D streams and uses natural-language object descriptions as the primary interface for goal selection. CapNav maintains a dense semantic voxel map for global geometry and, in parallel, constructs persistent 3D object tracks by aggregating Detic-based open-vocabulary detections and LSeg features over time. For each stabilized track, a small set of views is captioned with a vision-language model and embedded with BGE-M3, yielding a caption-enriched representation that links language, semantics, and 3D pose. At test time, free-form instructions are encoded in the same text space, matched against object captions to select a target instance, and then converted into a metric-space waypoint followed by A* planning. CapNav shows consistent improvements over category-only and map-based baselines (ZSON, LM-Nav, CoW, VLMaps) on multi-object navigation tasks, and its instance-level captions make retrieval decisions transparent and easy to interpret.
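The retrieval step described in the abstract (encode the instruction and the per-instance captions in a shared text space, then select the best-matching 3D track) can be sketched as follows. This is a hypothetical illustration, not the authors' code: a toy bag-of-words embedding stands in for the BGE-M3 encoder so the example runs without model weights, and the `tracks` data and `select_instance` helper are invented for the example.

```python
# Hedged sketch of caption-based instance retrieval, assuming:
#  - embed() is a stand-in for a real text encoder such as BGE-M3,
#  - each object track carries a VLM-generated caption and a metric pose.
import math
from collections import Counter


def embed(text):
    # Toy bag-of-words embedding; a real system would call BGE-M3 here.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_instance(instruction, tracks):
    """Pick the object track whose caption best matches the instruction."""
    query = embed(instruction)
    return max(tracks, key=lambda t: cosine(query, embed(t["caption"])))


# Hypothetical caption-enriched tracks (caption + 2D metric pose).
tracks = [
    {"caption": "a grey armchair near the window", "pose": (4.0, 1.5)},
    {"caption": "a white sofa with curved edges", "pose": (2.0, 3.0)},
    {"caption": "a white rectangular sofa", "pose": (6.0, 0.5)},
]

goal = select_instance("go to the white sofa with curved edges", tracks)
print(goal["pose"])  # the selected instance's pose becomes the A* waypoint
```

Note how attribute words ("curved edges") disambiguate between the two white sofas, which a category-only matcher could not do; the selected pose would then seed the metric-space waypoint for A* planning.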
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11914