Open Peer Review. Open Publishing. Open Access. Open Discussion. Open Directory. Open Recommendations. Open API. Open Source.
Maximum a Posteriori Policy Optimisation
Nov 07, 2017 (modified: Nov 07, 2017)ICLR 2018 Conference Blind Submissionreaders: everyoneShow Bibtex
Abstract:We introduce a new algorithm for reinforcement learning called Maximum a-posteriori Policy Optimisation (MPO) based on coordinate ascent on a relative-entropy objective. We show that several existing methods can directly be related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state-of-the-art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, premature convergence and robustness to hyperparameter settings.
Keywords:Reinforcement Learning, Variational Inference, Control
Enter your feedback below and we'll get back to you as soon as possible.