Maximum a Posteriori Policy Optimisation

15 Feb 2018, 21:29 (modified: 10 Feb 2022, 11:31)
ICLR 2018 Conference Blind Submission
Readers: Everyone
Keywords: Reinforcement Learning, Variational Inference, Control
Abstract: We introduce a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO), based on coordinate ascent on a relative-entropy objective. We show that several existing methods can be directly related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state of the art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, avoidance of premature convergence, and robustness to hyperparameter settings.
Code: 2 community implementations (Papers with Code)