On-Policy Trust Region Policy Optimisation with Replay Buffers

27 Sept 2018 (modified: 22 Oct 2023) · ICLR 2019 Conference Blind Submission
Abstract: Building upon the recent success of deep reinforcement learning methods, we investigate the possibility of improving on-policy reinforcement learning by reusing data from several consecutive policies. On-policy methods bring many benefits, such as the ability to evaluate each resulting policy; however, they usually discard all information about previous policies. In this work, we propose an adaptation of the replay buffer concept, borrowed from the off-policy learning setting, to on-policy algorithms. To achieve this, the proposed algorithm generalises the Q-, value and advantage functions to data from multiple policies. The method uses trust region optimisation while avoiding some common problems of algorithms such as TRPO or ACKTR: it replaces the trust region selection heuristics with hyperparameters and uses a trainable covariance matrix instead of a fixed one. In many cases, the method improves results not only compared to state-of-the-art trust region on-policy learning algorithms such as ACKTR and TRPO, but also with respect to their off-policy counterpart DDPG.
Keywords: reinforcement learning, on-policy learning, trust region policy optimisation, replay buffer
TL;DR: We investigate the theoretical and practical evidence of on-policy reinforcement learning improvement by reusing the data from several consecutive policies.
Code: [dkangin/baselines](https://github.com/dkangin/baselines) + [1 community implementation](https://paperswithcode.com/paper/?openreview=B1MB5oRqtQ)
Data: [MuJoCo](https://paperswithcode.com/dataset/mujoco)
Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/arxiv:1901.06212/code)
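
The sketch below is a minimal illustration, not the paper's implementation, of the idea described in the abstract: transitions from several consecutive policies are kept in a replay buffer and reweighted by importance sampling for the current policy, using a diagonal Gaussian policy whose covariance is trainable rather than fixed. The class names `PolicySnapshot` and `ConsecutivePolicyBuffer` are hypothetical; the released code in dkangin/baselines differs.

```python
# Minimal sketch of on-policy learning with data from consecutive policies.
# Assumed/illustrative names: PolicySnapshot, ConsecutivePolicyBuffer.
import numpy as np
from collections import deque


class PolicySnapshot:
    """Diagonal Gaussian policy with a trainable (log) covariance."""

    def __init__(self, mean_weights, log_std):
        self.mean_weights = mean_weights  # maps state -> action mean
        self.log_std = log_std            # trainable, not fixed

    def log_prob(self, state, action):
        # Log-density of a diagonal Gaussian N(mean, diag(exp(log_std)^2)).
        mean = state @ self.mean_weights
        std = np.exp(self.log_std)
        return -0.5 * np.sum(((action - mean) / std) ** 2
                             + 2 * self.log_std + np.log(2 * np.pi))


class ConsecutivePolicyBuffer:
    """Replay buffer holding rollouts from the last `n_policies` policies."""

    def __init__(self, n_policies=3):
        self.rollouts = deque(maxlen=n_policies)  # oldest policy is dropped

    def add_rollout(self, policy, transitions):
        # transitions: list of (state, action, reward) tuples collected
        # under `policy` (the behaviour policy for this rollout).
        self.rollouts.append((policy, transitions))

    def importance_weights(self, current_policy):
        # Reweight samples from older policies so they can be used in an
        # on-policy-style update for the current policy.
        weights = []
        for behaviour_policy, transitions in self.rollouts:
            for state, action, _ in transitions:
                log_w = (current_policy.log_prob(state, action)
                         - behaviour_policy.log_prob(state, action))
                weights.append(np.exp(log_w))
        return np.array(weights)
```

In this toy setting, each policy update would draw from all rollouts in the buffer and scale per-sample terms by the corresponding importance weight, which is one plausible way to "reuse the data from several consecutive policies" while keeping the update on-policy in spirit.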