Keywords: embodied AI, semantic navigation, multi-object navigation, vision-language, language understanding, grounding
TL;DR: richer attribute-aware and spatially-aware language understanding in semantic navigation via an improved map representation
Abstract: Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language-focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open-vocabulary multi-object navigation dataset with natural language goal descriptions (e.g., 'go to the red short candle on the table') and corresponding fine-grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These labels enable systematic evaluation of language understanding. To evaluate in this setting, we extend the multi-object navigation task to Language-guided Multi-Object Navigation (LaMoN), where the agent must find a sequence of goals specified in language. Furthermore, we propose the Multi-Layered Feature Map (MLFM), a novel method that builds a queryable, multi-layered semantic map from pretrained vision-language features and proves effective for reasoning over fine-grained attributes and spatial relations in goal descriptions. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.
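
To make the map idea concrete, below is a minimal Python/NumPy sketch of a queryable, multi-layered semantic map: per-point vision-language features are binned into a top-down grid with a few height layers, and a goal is localized by cosine similarity against a text embedding of its description. This is an illustrative assumption of how such a map could work, not the paper's MLFM implementation; the grid and layer sizes, the height_to_layer binning, and the encoder interface are all hypothetical.

    # Sketch of a queryable, multi-layered semantic map (illustrative only).
    # Assumes a frozen vision-language encoder yields unit-norm d-dimensional
    # features per observed 3D point, and that goal text is embedded into the
    # same space so cells can be ranked by cosine similarity.

    import numpy as np

    D = 512                      # feature dimension (e.g., a CLIP-style encoder)
    GRID = (64, 64)              # top-down map resolution (cells)
    LAYERS = 3                   # height layers, e.g., floor / mid / high

    feat_sum = np.zeros((LAYERS, *GRID, D), dtype=np.float32)
    count = np.zeros((LAYERS, *GRID), dtype=np.int64)

    def height_to_layer(z: float) -> int:
        """Bin a point height (meters) into a coarse layer index (assumed cutoffs)."""
        return 0 if z < 0.5 else (1 if z < 1.5 else 2)

    def integrate(points_xyz: np.ndarray, feats: np.ndarray) -> None:
        """Accumulate per-point vision-language features into the layered grid."""
        for (x, y, z), f in zip(points_xyz, feats):
            i, j = int(x) % GRID[0], int(y) % GRID[1]  # toy world-to-cell mapping
            k = height_to_layer(z)
            feat_sum[k, i, j] += f
            count[k, i, j] += 1

    def query(text_embedding: np.ndarray) -> tuple[int, int, int]:
        """Return (layer, row, col) of the cell best matching the goal text."""
        mean = feat_sum / np.maximum(count[..., None], 1)
        norms = np.linalg.norm(mean, axis=-1, keepdims=True)
        sim = (mean / np.maximum(norms, 1e-8)) @ text_embedding
        sim[count == 0] = -np.inf                      # ignore unobserved cells
        return np.unravel_index(np.argmax(sim), sim.shape)

    # Toy usage with random stand-ins for real encoder outputs:
    rng = np.random.default_rng(0)
    pts = rng.uniform(0, 64, size=(100, 3))
    fts = rng.normal(size=(100, D)).astype(np.float32)
    fts /= np.linalg.norm(fts, axis=1, keepdims=True)
    integrate(pts, fts)
    goal = rng.normal(size=D).astype(np.float32)
    goal /= np.linalg.norm(goal)
    print(query(goal))  # best-matching (layer, row, col)

Separating features by height layer is one plausible way such a map could disambiguate spatial relations like "on the table" versus "on the floor", since the same open-vocabulary text query can be scored against each layer independently.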
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5285