Abstract: In “When to Trust Your Model: Model-Based Policy Optimization,” Janner et al. establish a theoretical framework that encourages model usage in reinforcement learning and propose an algorithm that surpasses state-of-the-art model-based and model-free algorithms. Specifically, model-based policy optimization (MBPO) uses short model rollouts branched from real data to reduce the compounding errors of inaccurate models and to decouple the model horizon from the task horizon, allowing it to achieve higher performance, faster convergence, and better generalization in complex environments. Our report studies the replicability of the algorithm, examining whether it achieves the performance the paper claims and whether any unstated implementation assumptions benefit generalization. We reproduce the graphs comparing the learning curves of MBPO with those of various state-of-the-art model-based and model-free algorithms in MuJoCo environments, and our results confirm those of the original paper. We also identify additional specifications and limitations of the algorithm, both theoretical and empirical.
NeurIPS Paper Id: https://openreview.net/forum?id=BJg8cHBxUS&noteId=SkxReJ3JuH