Keywords: embodied AI, semantic navigation, multi-object navigation, vision-language, language understanding, grounding
TL;DR: richer attribute-aware and spatially-aware language understanding in semantic navigation via an improved map representation
Abstract: Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language-focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open-vocabulary multi-object navigation dataset with natural language goal descriptions (e.g., 'go to the red short candle on the table') and corresponding fine-grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These labels enable systematic evaluation of language understanding. To evaluate in this setting, we extend the multi-object navigation task to Language-guided Multi-Object Navigation (LaMoN), where the agent must find a sequence of goals specified in language. Furthermore, we propose the Multi-Layered Feature Map (MLFM), a novel method that builds a queryable, multi-layered semantic map from pretrained vision-language features and proves effective for reasoning over fine-grained attributes and spatial relations in goal descriptions. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.
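
To make the map idea concrete, below is a minimal Python/NumPy sketch of a queryable, multi-layered semantic map: per-point vision-language features are binned into a top-down grid with a few height layers, and a goal is localized by cosine similarity against a text embedding of its description. This is an illustrative assumption of how such a map could work, not the paper's MLFM implementation; the grid and layer sizes, the height_to_layer binning, and the encoder interface are all hypothetical.

    # Sketch of a queryable, multi-layered semantic map (illustrative only).
    # Assumes a frozen vision-language encoder yields unit-norm d-dimensional
    # features per observed 3D point, and that goal text is embedded into the
    # same space so cells can be ranked by cosine similarity.

    import numpy as np

    D = 512                      # feature dimension (e.g., a CLIP-style encoder)
    GRID = (64, 64)              # top-down map resolution (cells)
    LAYERS = 3                   # height layers, e.g., floor / mid / high

    feat_sum = np.zeros((LAYERS, *GRID, D), dtype=np.float32)
    count = np.zeros((LAYERS, *GRID), dtype=np.int64)

    def height_to_layer(z: float) -> int:
        """Bin a point height (meters) into a coarse layer index (assumed cutoffs)."""
        return 0 if z < 0.5 else (1 if z < 1.5 else 2)

    def integrate(points_xyz: np.ndarray, feats: np.ndarray) -> None:
        """Accumulate per-point vision-language features into the layered grid."""
        for (x, y, z), f in zip(points_xyz, feats):
            i, j = int(x) % GRID[0], int(y) % GRID[1]  # toy world-to-cell mapping
            k = height_to_layer(z)
            feat_sum[k, i, j] += f
            count[k, i, j] += 1

    def query(text_embedding: np.ndarray) -> tuple[int, int, int]:
        """Return (layer, row, col) of the cell best matching the goal text."""
        mean = feat_sum / np.maximum(count[..., None], 1)
        norms = np.linalg.norm(mean, axis=-1, keepdims=True)
        sim = (mean / np.maximum(norms, 1e-8)) @ text_embedding
        sim[count == 0] = -np.inf                      # ignore unobserved cells
        return np.unravel_index(np.argmax(sim), sim.shape)

    # Toy usage with random stand-ins for real encoder outputs:
    rng = np.random.default_rng(0)
    pts = rng.uniform(0, 64, size=(100, 3))
    fts = rng.normal(size=(100, D)).astype(np.float32)
    fts /= np.linalg.norm(fts, axis=1, keepdims=True)
    integrate(pts, fts)
    goal = rng.normal(size=D).astype(np.float32)
    goal /= np.linalg.norm(goal)
    print(query(goal))  # best-matching (layer, row, col)

Separating features by height layer is one plausible way such a map could disambiguate spatial relations like "on the table" versus "on the floor", since the same open-vocabulary text query can be scored against each layer independently.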
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5285