Common sense and Semantic-Guided Navigation via Language in Embodied Environments

Dian Yu; Chandra Khatri; Alexandros Papangelis; Mahdi Namazifar; Andrea Madotto; Huaixiu Zheng; Gokhan Tur

25 Sept 2019 (modified: 05 May 2023); ICLR 2020 Conference Withdrawn Submission; Readers: Everyone

Abstract: One key element that differentiates humans from artificial agents in performing various tasks is that humans have access to common sense and semantic understanding, learnt from past experiences. In this work, we evaluate whether common sense and semantic understanding benefit an artificial agent when completing a room navigation task, in which we ask the agent to navigate to a target room (e.g. "go to the kitchen") in a realistic 3D environment. We leverage semantic information and patterns observed during training to build the common sense that guides the agent to the target. We encourage semantic understanding within the agent by introducing grounding as an auxiliary task. We train and evaluate the agent in three settings: (i) imitation learning using expert trajectories, (ii) reinforcement learning using Proximal Policy Optimization, and (iii) self-supervised imitation learning for fine-tuning the agent on unseen environments using auxiliary tasks. In our experiments, we observe that common sense helps the agent in long-term planning, while semantic understanding helps in short-term and local planning (such as guiding the agent on when to stop). When the two are combined, the agent generalizes better. Further, incorporating common sense and semantic understanding leads to a 40% improvement in task success and a 112% improvement in success per length (SPL) over the baseline during imitation learning. Moreover, initial evidence suggests that the cross-modal embeddings learnt during training capture structural and positional patterns of the environment, implying that the agent inherently learns a map of the environment. It also suggests that navigation in multi-modal tasks leads to better semantic understanding.
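
The abstract reports SPL without defining it. As a rough illustration only, the sketch below assumes SPL refers to the metric commonly used in embodied navigation, "Success weighted by Path Length": each successful episode is credited with the ratio of the shortest-path length to the length of the path the agent actually took, and the credits are averaged over all episodes. The function name `spl` and its argument names are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Average of success * shortest / max(taken, shortest) over episodes.

    Assumes the standard "Success weighted by Path Length" definition;
    the paper may compute its metric differently.
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if len(shortest_lengths) != n or len(path_lengths) != n:
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if not success:
            continue  # failed episodes contribute zero
        denom = max(taken, shortest)
        # An episode that starts at the goal (both lengths zero) counts fully.
        total += 1.0 if denom == 0 else shortest / denom
    return total / n


# Three episodes: an optimal success, a success with a 2x detour, a failure.
score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])
print(score)
```

Under this definition, an agent improves its score both by succeeding more often and by taking shorter paths when it succeeds, so a gain in SPL can exceed the gain in raw task success.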
