TRUNCATED HORIZON POLICY SEARCH: DEEP COMBINATION OF REINFORCEMENT AND IMITATION

Anonymous

Nov 07, 2017 (modified: Nov 07, 2017) ICLR 2018 Conference Blind Submission readers: everyone
  • Abstract: The combination of Reinforcement Learning and Imitation Learning brings the best of both: we can quickly learn by imitating near-optimal oracles that achieve good performance on the task, while also exploring and exploiting to improve on what we have learned. In this paper, we propose a novel way to combine imitation and reinforcement via the idea of reward shaping using such an oracle. We theoretically study the effect of a near-optimal cost-to-go oracle on the planning horizon and demonstrate that the cost-to-go oracle shortens the learner's planning horizon as a function of its accuracy. When the oracle is sub-optimal, to ensure we find a policy that can outperform the oracle, we propose Truncated HORizon Policy Search (THOR), a method that focuses on searching for policies that maximize the total reshaped reward over a finite planning horizon (a sketch of this reshaping idea follows the keywords below). We experimentally demonstrate that a gradient-based implementation of THOR can achieve superior performance compared to RL and IL baselines even when the oracle is sub-optimal.
  • TL;DR: Combining Imitation Learning and Reinforcement Learning to learn to outperform the expert
  • Keywords: Imitation Learning, Reinforcement Learning
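
A minimal, illustrative sketch of the reshaping idea described in the abstract (not the authors' implementation): reshape per-step rewards with an oracle value estimate used as a potential function, then score each step over a truncated k-step horizon. The function names, the toy trajectory, and the use of an oracle value estimate (rather than a cost-to-go) are assumptions made for illustration only.

```python
import numpy as np

def reshape_rewards(rewards, oracle_values, gamma):
    """Potential-based reward shaping with an oracle value estimate.

    rewards:       r_0 ... r_{T-1} collected along one trajectory
    oracle_values: oracle estimates V(s_0) ... V(s_T), one per visited state
    Returns reshaped rewards r'_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(oracle_values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def truncated_returns(reshaped, gamma, k):
    """Discounted sum of reshaped rewards over a truncated k-step horizon."""
    T = len(reshaped)
    returns = np.zeros(T)
    for t in range(T):
        horizon = min(k, T - t)
        discounts = gamma ** np.arange(horizon)
        returns[t] = np.dot(discounts, reshaped[t:t + horizon])
    return returns

if __name__ == "__main__":
    # Toy 5-step trajectory with an imperfect (sub-optimal) oracle value estimate.
    gamma, k = 0.99, 3
    rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
    oracle_values = [1.8, 1.9, 1.0, 0.9, 1.0, 0.0]  # V(s_0) ... V(s_5)
    shaped = reshape_rewards(rewards, oracle_values, gamma)
    # These k-step scores could serve as advantage-like signals in a
    # gradient-based policy update restricted to the truncated horizon.
    print(truncated_returns(shaped, gamma, k))
```

With an accurate oracle, the reshaped rewards already encode most of the long-horizon credit assignment, which is why a short horizon k can suffice; with a sub-optimal oracle, optimizing over k > 1 steps lets the learner improve beyond what pure imitation of the oracle would give.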
