Divergence-Augmented Policy Optimization

Qing Wang, Yingru Li, Jiechao Xiong, Tong Zhang

06 Sept 2019 (modified: 05 May 2023) · NeurIPS 2019
Abstract: In deep reinforcement learning, policy optimization methods need to deal with issues such as function approximation and the reuse of off-policy data. Standard policy gradient methods do not handle off-policy data well, leading to premature convergence and instability. This paper introduces a method to stabilize policy optimization when off-policy data are reused. The idea is to include a Bregman divergence between the behavior policy that generated the data and the current policy, in order to control the degree of off-policyness. Experiments on Atari games show that in the data-scarce scenario, where the reuse of off-policy data becomes necessary, our method achieves better performance than other state-of-the-art deep reinforcement learning algorithms.
Code Link: https://github.com/lns/dapo
CMT Num: 3280
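
The abstract describes augmenting an off-policy policy-gradient objective with a Bregman divergence between the behavior policy and the current policy. Below is a minimal, hypothetical sketch of that idea in PyTorch, using a sample-based KL divergence (one instance of a Bregman divergence) as the penalty; the function name, loss form, and the weight `beta` are illustrative assumptions and not taken from the paper or the linked repository.

```python
# Hypothetical sketch: importance-weighted policy-gradient surrogate plus a
# KL penalty (a particular Bregman divergence) between the behavior policy
# that generated the data and the current policy. Names and the exact loss
# form are assumptions for illustration only.
import torch


def divergence_augmented_loss(logp_current, logp_behavior, advantages, beta=1.0):
    """Surrogate loss on a batch of off-policy transitions.

    logp_current:  log pi_theta(a|s) under the current policy (requires grad)
    logp_behavior: log mu(a|s) under the behavior policy (treated as constant)
    advantages:    advantage estimates for the sampled actions
    beta:          weight of the divergence penalty controlling off-policyness
    """
    # Importance ratio between the current and behavior policies.
    ratio = torch.exp(logp_current - logp_behavior.detach())

    # Off-policy policy-gradient term (importance-weighted advantage).
    pg_term = -(ratio * advantages.detach()).mean()

    # Monte Carlo estimate of KL(mu || pi_theta) from samples drawn under mu.
    kl_term = (logp_behavior.detach() - logp_current).mean()

    return pg_term + beta * kl_term
```

In this sketch, larger values of `beta` keep the updated policy closer to the data-generating policy, while `beta = 0` recovers a plain importance-weighted policy-gradient update; the paper's actual objective and divergence choice are detailed in the linked code.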