Keywords: reinforcement learning, imitation learning, language grounding, instruction following
TL;DR: We proposed an approach to train an agent to follow instructions using only examples of the respective goal-states.
Abstract: Training agents to follow instructions requires some way of rewarding them for behavior which accomplishes the intent of the instruction. For non-trivial instructions, which may be either underspecified or contain some ambiguity, it can be difficult or impossible to specify a reward function or obtain relatable expert trajectories for the agent to imitate. For these scenarios, we introduce a method which requires only pairs on instructions and examples of positive goal states, from which we can jointly learn a model of the instruction-conditional reward and a policy which executes instructions. Two sets of experiments in a gridworld compare the effectiveness of our method to that of RL when a reward function can be specified, and the application of our method when no reward function is defined. We furthermore evaluate the generalization of our approach to unseen instructions, and to scenarios where environment dynamics change outside of training, requiring fine-tuning of the policy ``in the wild''.