- Keywords: Model-based RL, Policy Optimization
- Abstract: Model-based reinforcement learning provides an efficient mechanism for finding the optimal policy by interacting with the learned environment. Beyond treating the learned environment as a black-box simulator, a more effective way to use the model is to exploit its differentiability. Such methods require the gradient of the learned environment model when calculating the policy gradient. However, since the gradient error is not considered in the model learning phase, there is no guarantee of the model's accuracy. To address this problem, we first analyze the convergence rate of policy optimization methods when the policy gradient is calculated using the learned environment model. The theoretical results show that the model gradient error matters in the policy optimization phase. We then propose a two-model-based learning method that controls both the prediction error and the gradient error. We separate the different roles of these two models in the model learning phase and coordinate them in the policy optimization phase. Building on this method, we introduce the directional derivative projection policy optimization (DDPPO) algorithm as a practical implementation for finding the optimal policy. Finally, we empirically verify the effectiveness of the proposed algorithm and achieve state-of-the-art sample efficiency on benchmark continuous control tasks.
- One-sentence Summary: Our theoretical results show that considering gradient information in model learning is crucial for model-based policy optimization; motivated by this conclusion, we design a novel DDPPO algorithm that achieves SOTA performance.