LEBP--Language Expectation & Binding Policy: A Two-Stream Framework for Embodied Vision-and-Language Interaction Task Learning Agents

Abstract: People have long wanted an embodied agent that
can perform a task by understanding a language instruction.
They also want to monitor the agent and confirm that it
understands commands the way they intended. However, how to build
such an embodied agent remains unclear. This problem can now be
explored with the Vision-and-Language Interaction
benchmark ALFRED, which requires an agent to perform
complicated daily household tasks following natural language
instructions in unseen scenes. In this paper, we propose LEBP
(Language Expectation with Binding Policy Module) to tackle
ALFRED. LEBP is a two-stream process. 1) A language expectation
module first generates an expectation describing how to perform
the task, based on its understanding of the language instruction.
The expectation consists of a sequence of sub-steps for the task
(e.g., "Pick an apple"). It lets people inspect and check how the
instruction was understood before the agent takes any actual
action, in case the task might go wrong. 2) A binding policy
module then binds the sub-steps in the expectation to actual
actions in the specific scenario. Actual actions include
navigation and object manipulation. Experimental results suggest
that our approach achieves performance comparable to currently
published SOTA methods and avoids a large performance decay from
seen scenarios to unseen scenarios.
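
The two-stream flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `SubStep`, `language_expectation`, `binding_policy`, `run`, and the `approve` callback are hypothetical, the comma-splitting parser stands in for a learned language model, and the scene is reduced to a dictionary mapping object names to locations. It only shows the control flow: produce an inspectable expectation first, then bind each sub-step to scene-specific actions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SubStep:
    """One sub-step of an expectation, e.g. verb='pick', target='apple'."""
    verb: str
    target: str


def language_expectation(instruction: str) -> List[SubStep]:
    """Stream 1 (toy stand-in): turn an instruction into a list of sub-steps.

    Assumption: clauses are comma-separated, the first word is the verb and
    the last word is the target object. A real module would be learned.
    """
    steps = []
    for clause in instruction.split(","):
        words = clause.strip().split()
        if len(words) >= 2:
            steps.append(SubStep(verb=words[0].lower(), target=words[-1].lower()))
    return steps


def binding_policy(step: SubStep, scene: Dict[str, str]) -> List[str]:
    """Stream 2 (toy stand-in): bind one sub-step to actions in this scene.

    Assumption: `scene` maps object names to locations. An unknown object
    yields an exploration action instead of navigation and manipulation.
    """
    location = scene.get(step.target)
    if location is None:
        return [f"Explore(find={step.target})"]
    return [f"Navigate(to={location})", f"{step.verb.capitalize()}({step.target})"]


def run(instruction: str, scene: Dict[str, str],
        approve: Callable[[List[SubStep]], bool]) -> List[str]:
    """Generate the expectation, let a person check it, then bind it to actions."""
    expectation = language_expectation(instruction)
    if not approve(expectation):
        return []  # the expectation was rejected before any action was taken
    actions: List[str] = []
    for step in expectation:
        actions.extend(binding_policy(step, scene))
    return actions
```

Calling `run("Pick an apple, Open the fridge", scene, approve)` with a scene that lists both objects returns a navigation action followed by a manipulation action for each sub-step. The `approve` hook is where a person would check the expectation before execution.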