Learning to navigate by distilling visual information and natural language instructions


Nov 03, 2017 (modified: Dec 12, 2017) ICLR 2018 Conference Blind Submission readers: everyone
  • Abstract: In this work, we focus on the problem of an agent learning to navigate to a target object in a 2D grid environment. The agent receives visual information through raw pixels and a natural language instruction specifying the task to be achieved. We propose a simple, attention-based architecture for grounding natural language instructions in our environment. Our model has no prior knowledge of either the visual or the textual modality and is end-to-end trainable. We develop an attention mechanism for multimodal fusion of visual and textual modalities. Our experimental results show that our attention mechanism outperforms existing multimodal fusion mechanisms previously proposed for this task. Through visualization of the attention weights, we demonstrate that our model learns to correlate attributes of the object referred to in the instruction with visual representations, and we also show that the learned textual representations are semantically meaningful, as they follow vector arithmetic. We further show that our model generalizes effectively to unseen scenarios and exhibits zero-shot generalization capabilities. To simulate the challenges described above, we introduce a new 2D environment in which an agent jointly learns the visual and textual modalities.
  • TL;DR: Attention-based architecture for language grounding via reinforcement learning in a new customizable 2D grid environment
  • Keywords: Deep reinforcement learning, Computer Vision, Multi-modal fusion, Language Grounding
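
The attention-based multimodal fusion described in the abstract can be sketched roughly as follows. This is a minimal illustration under assumed details, not the authors' implementation: the dot-product scoring, the concatenation of visual and attended text features, and all names and dimensions here are assumptions for exposition.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_and_fuse(visual_feat, word_embs):
    """Fuse a visual feature vector with instruction word embeddings.

    The visual feature acts as the query: each word embedding is scored
    by dot product, the scores are softmax-normalized into attention
    weights, and the word embeddings are averaged under those weights.
    The fused representation is the concatenation of the visual feature
    and the attended text summary. (Scoring function and fusion by
    concatenation are illustrative assumptions.)
    """
    scores = [sum(v * w for v, w in zip(visual_feat, emb)) for emb in word_embs]
    weights = softmax(scores)
    dim = len(word_embs[0])
    attended = [sum(wt * emb[d] for wt, emb in zip(weights, word_embs))
                for d in range(dim)]
    return weights, visual_feat + attended

# Toy example: a 4-dim visual feature and embeddings for a three-word
# instruction such as "go to red" (all vectors are made up).
visual = [1.0, 0.0, 0.5, 0.0]
words = [
    [0.1, 0.9, 0.0, 0.0],  # "go"
    [0.0, 0.8, 0.1, 0.1],  # "to"
    [0.9, 0.0, 0.6, 0.0],  # "red" -- most aligned with the visual feature
]
weights, fused = attend_and_fuse(visual, words)
```

In this toy run the attention weights concentrate on the word embedding most aligned with the visual feature, which mirrors the abstract's claim that attention correlates instruction attributes with visual representations.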