TRUNCATED HORIZON POLICY SEARCH: DEEP COMBINATION OF REINFORCEMENT AND IMITATION
Nov 07, 2017 (modified: Nov 07, 2017) · ICLR 2018 Conference Blind Submission
Abstract: Combining Reinforcement Learning and Imitation Learning brings the best of both: we can quickly learn by imitating near-optimal oracles that achieve good performance on the task, while also exploring and exploiting to improve on what we have learned. In this paper, we propose a novel way to combine imitation and reinforcement via the idea of reward shaping using such an oracle. We theoretically study the effect of a near-optimal cost-to-go oracle on the planning horizon and demonstrate that the cost-to-go oracle shortens the learner's planning horizon as a function of its accuracy. When the oracle is sub-optimal, to ensure we find a policy that can outperform the oracle, we propose Truncated HORizon Policy Search (THOR), a method that searches for policies that maximize the total reshaped reward over a finite planning horizon. We experimentally demonstrate that a gradient-based implementation of THOR can achieve superior performance compared to RL and IL baselines even when the oracle is sub-optimal.
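The core ideas in the abstract (potential-based reward shaping with a cost-to-go oracle, and maximizing the reshaped reward over a truncated horizon) can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the use of the oracle value estimates as a shaping potential, and the discounting details are assumptions for illustration only:

```python
import numpy as np

def shaped_rewards(rewards, oracle_values, gamma=0.99):
    """Potential-based reward shaping with an (approximate) cost-to-go
    oracle V-hat: r'(s_t) = r_t + gamma * V-hat(s_{t+1}) - V-hat(s_t).
    `oracle_values` holds V-hat along a trajectory s_0 .. s_T
    (one more entry than `rewards`)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(oracle_values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def truncated_horizon_return(rewards, oracle_values, k, gamma=0.99):
    """Total discounted reshaped reward over a truncated planning
    horizon of k steps (illustrative of THOR's truncated objective:
    a policy search would maximize this quantity)."""
    r_shaped = shaped_rewards(rewards, oracle_values, gamma)
    horizon = min(k, len(r_shaped))
    discounts = gamma ** np.arange(horizon)
    return float(np.sum(discounts * r_shaped[:horizon]))
```

Intuition for the telescoping effect: if the oracle were exact, the shaped rewards already encode the long-term cost-to-go, so a short (even one-step) horizon suffices; the less accurate the oracle, the longer the horizon k the learner needs, which matches the paper's claim that the oracle shortens the planning horizon as a function of its accuracy.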
TL;DR: Combining Imitation Learning and Reinforcement Learning to learn to outperform the expert.