Abstract: We consider the problem of Vision-and-Language Navigation (VLN). The majority of current methods for VLN are trained end-to-end using either unstructured memory, such as an LSTM, or cross-modal attention over the agent's egocentric observations. In contrast to these works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations. In this work, we propose a cross-modal map learning model for vision-and-language navigation that first learns to predict top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints. In both cases, the prediction is informed by the language through cross-modal attention mechanisms. We experimentally test the basic hypothesis that language-driven navigation can be solved given a map, and then show competitive results on the full VLN-CE benchmark.
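To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of cross-modal attention in which egocentric map features attend to instruction token embeddings, followed by a simple waypoint prediction head. All module names, dimensions, and the fixed number of waypoints are illustrative assumptions.

```python
# Sketch only: map cells query the language tokens; a small head regresses waypoints.
import torch
import torch.nn as nn


class CrossModalMapAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_waypoints=10, map_size=24):
        super().__init__()
        self.map_size = map_size
        self.n_waypoints = n_waypoints
        # Map cells act as queries; language tokens act as keys/values.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Predict (x, y) offsets for a fixed number of waypoints from the fused map.
        self.waypoint_head = nn.Sequential(
            nn.Linear(d_model * map_size * map_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_waypoints * 2),
        )

    def forward(self, map_feats, lang_feats, lang_mask=None):
        # map_feats:  (B, C, H, W) egocentric map features
        # lang_feats: (B, L, C)    encoded instruction tokens
        B, C, H, W = map_feats.shape
        queries = map_feats.flatten(2).transpose(1, 2)          # (B, H*W, C)
        fused, _ = self.attn(queries, lang_feats, lang_feats,
                             key_padding_mask=lang_mask)        # (B, H*W, C)
        fused = self.norm(fused + queries)
        waypoints = self.waypoint_head(fused.flatten(1))        # (B, n_waypoints*2)
        return waypoints.view(B, self.n_waypoints, 2)


if __name__ == "__main__":
    model = CrossModalMapAttention()
    map_feats = torch.randn(2, 256, 24, 24)    # dummy egocentric map features
    lang_feats = torch.randn(2, 40, 256)       # dummy instruction embeddings
    print(model(map_feats, lang_feats).shape)  # torch.Size([2, 10, 2])
```

The same attention pattern could equally be applied to the semantic-map prediction stage described in the abstract, with a per-cell classification head in place of the waypoint regressor.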