Abstract: We address the interactive instruction following task which requires an agent to navigate through an environment, interact with objects, and complete long-horizon tasks, following natural language instructions with egocentric vision. To successfully achieve a goal in the interactive instruction following task, the agent should infer a sequence of actions and object interactions. When performing actions, a small field of view often limits the agent’s understanding of an environment, leading to poor performance. Here, we propose to exploit surrounding views by additional observations from navigable directions to enlarge the field of view of the agent.
0 Replies
Loading