Abstract: We propose to extend existing deep reinforcement learning (Deep RL) algorithms by allowing them to additionally choose sequences of actions as part of their policy. This modification forces the network to anticipate the reward of action sequences, which, as we show, improves exploration and leads to better convergence. Our method squeezes more gradients out of the same number of episodes, thereby achieving higher scores and faster convergence. The proposal is simple, flexible, and can be easily incorporated into any Deep RL framework. We demonstrate its effectiveness by consistently outperforming the state-of-the-art GA3C algorithm on popular Atari games.
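As a rough illustration of the action-space augmentation described in the abstract, the following is a minimal sketch, not the authors' implementation: it assumes a discrete Atari-style action set, a hypothetical maximum sequence length `SEQ_LEN`, and a Gym-style `env.step` interface; the helper names are illustrative.

```python
# Sketch: augment a discrete action space with fixed-length action sequences.
# Names (build_augmented_actions, SEQ_LEN) are illustrative assumptions.
from itertools import product

PRIMITIVE_ACTIONS = [0, 1, 2, 3]   # e.g. action ids exposed by the emulator
SEQ_LEN = 2                        # hypothetical maximum sequence length

def build_augmented_actions(actions, max_len):
    """Return single actions plus all sequences up to length max_len."""
    augmented = [(a,) for a in actions]
    for length in range(2, max_len + 1):
        augmented.extend(product(actions, repeat=length))
    return augmented

AUGMENTED_ACTIONS = build_augmented_actions(PRIMITIVE_ACTIONS, SEQ_LEN)
# The policy head now outputs len(AUGMENTED_ACTIONS) logits instead of
# len(PRIMITIVE_ACTIONS); choosing index i commits the agent to the whole
# sequence AUGMENTED_ACTIONS[i].

def execute_sequence(env, sequence):
    """Play out one augmented action, keeping per-step transitions so each
    primitive step can still contribute a training target (more gradients
    from the same number of episodes)."""
    transitions = []
    for a in sequence:
        obs, reward, done, info = env.step(a)  # Gym-style step (assumption)
        transitions.append((a, reward, obs, done))
        if done:
            break
    return transitions
```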
Keywords: Reinforcement Learning, A3C, Actor Critic
TL;DR: We propose to augment the action space in the A3C algorithm and extract more gradients from the episodes played, thereby achieving higher scores and faster convergence than the state-of-the-art GA3C.