Gradient Information Matters in Policy Optimization by Back-propagating through Model

Chongchong Li; Yue Wang; Wei Chen; Yuting Liu; Zhi-Ming Ma; Tie-Yan Liu

Gradient Information Matters in Policy Optimization by Back-propagating through Model

Chongchong Li, Yue Wang, Wei Chen, Yuting Liu, Zhi-Ming Ma, Tie-Yan Liu

Published: 28 Jan 2022, Last Modified: 13 Feb 2023ICLR 2022 PosterReaders: Everyone

Keywords: Model-based RL, Policy Optimization

Abstract: Model-based reinforcement learning provides an efficient mechanism to find the optimal policy by interacting with the learned environment. In addition to treating the learned environment like a black-box simulator, a more effective way to use the model is to exploit its differentiability. Such methods require the gradient information of the learned environment model when calculating the policy gradient. However, since the error of gradient is not considered in the model learning phase, there is no guarantee for the model's accuracy. To address this problem, we first analyze the convergence rate for the policy optimization methods when the policy gradient is calculated using the learned environment model. The theoretical results show that the model gradient error matters in the policy optimization phrase. Then we propose a two-model-based learning method to control the prediction error and the gradient error. We separate the different roles of these two models at the model learning phase and coordinate them at the policy optimization phase. After proposing the method, we introduce the directional derivative projection policy optimization (DDPPO) algorithm as a practical implementation to find the optimal policy. Finally, we empirically demonstrate the proposed algorithm has better sample efficiency when achieving a comparable or better performance on benchmark continuous control tasks.

One-sentence Summary: Considering the gradient information in the model learning is crucial for the model-based policy optimization according to our theoritical results. Motivated by such conclusion, we design a novel DDPPO algorithm that can achieve the SOTA performance.

27 Replies

Loading