Graph-Based Environment Representation for Vision-and-Language Navigation in Continuous Environments
Abstract: The Vision-and-Language Navigation in Continuous Environments (VLN-CE) task requires an agent to follow language instructions in a realistic environment. Understanding the environment is crucial, yet current methods represent it in a relatively simple and direct way, without modeling the interplay between language instructions and visual context. We therefore propose a novel environment representation. First, we construct an Environment Representation Graph (ERG) via object detection to express the environment at the semantic level. Next, relational representations between objects, and between objects and the agent, in the ERG are learned through a Graph Convolutional Network (GCN), yielding a continuous ERG expression. We then combine the ERG expression with object label embeddings to obtain the environment representation. Finally, we propose a new cross-modal attention navigation framework that incorporates our environment representation, together with a specialized loss function for ERG training. Experimental results demonstrate that our approach achieves competitive performance on VLN-CE tasks.
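The abstract does not give the paper's exact GCN formulation, but the relational learning step it describes can be illustrated with a standard GCN propagation rule applied to a small object graph. The graph, feature dimensions, and weights below are toy assumptions, not the authors' implementation:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One standard GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt    # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)      # ReLU activation

# Toy example: 3 detected objects with 4-dim features;
# edges connect object 0-1 and object 1-2 (hypothetical relations).
H = np.random.randn(3, 4)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
W = np.random.randn(4, 8)   # learned projection to an 8-dim relation space

H_next = gcn_layer(H, A, W)
print(H_next.shape)  # (3, 8): one updated relational embedding per object
```

In the pipeline the abstract sketches, such per-object outputs would then be concatenated or fused with object label embeddings before entering the cross-modal attention module.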