Improving Policy Gradient by Exploring Under-appreciated Rewards
Ofir Nachum, Mohammad Norouzi, Dale Schuurmans
Nov 04, 2016 (modified: Mar 03, 2017) | ICLR 2017 conference submission | readers: everyone
Abstract: This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high-dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of under-appreciated reward regions. An action sequence is considered under-appreciated if its log-probability under the current policy under-estimates its resulting reward. The proposed exploration strategy is easy to implement, requiring only small modifications to the standard REINFORCE algorithm. We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. We find that our approach reduces hyper-parameter sensitivity and demonstrates significant improvements over baseline methods. Notably, the approach is able to solve a benchmark multi-digit addition task. To our knowledge, this is the first time that a pure RL method has solved addition using only reward feedback.
TL;DR: We present a novel form of policy gradient for model-free reinforcement learning with improved exploration properties.
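
The abstract's notion of an "under-appreciated" action sequence can be illustrated with a small sketch. This is not taken from the abstract itself: the temperature `tau`, the score `reward / tau - log_prob`, and the softmax normalization over a batch of samples are all assumptions made here for illustration, and the function names are hypothetical. The full paper should be consulted for the exact update rule.

```python
import math

def underappreciation_scores(rewards, log_probs, tau=1.0):
    """Score each sampled action sequence by how much its (scaled) reward
    exceeds its log-probability under the current policy.
    `tau` is an assumed temperature, not specified in the abstract."""
    return [r / tau - lp for r, lp in zip(rewards, log_probs)]

def exploration_weights(rewards, log_probs, tau=1.0):
    """Turn the scores into weights that sum to 1 via a numerically
    stable softmax, so under-appreciated samples get larger weight."""
    scores = underappreciation_scores(rewards, log_probs, tau)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three sampled sequences: two earn reward 1, one earns reward 0.
# The second sequence is rewarded but very unlikely under the policy.
rewards = [1.0, 1.0, 0.0]
log_probs = [math.log(0.5), math.log(0.01), math.log(0.49)]
weights = exploration_weights(rewards, log_probs)
most_underappreciated = weights.index(max(weights))
```

In this toy batch the second sequence receives the largest weight, since it earns the same reward as the first while being far less probable under the policy. Under the assumptions above, such weights could rescale the per-sample terms of a REINFORCE-style gradient estimate.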