- Abstract: Conventional reinforcement learning rarely considers how the physical variations in the environment (eg. mass, drag, etc.) affect the policy learned by the agent. In this paper, we explore how changes in the environment affect policy generalization. We observe experimentally that, for each task we considered, there exists an optimal environment setting that results in the most robust policy that generalizes well to future environments. We propose a novel method to exploit this observation to develop robust actor policies, by automatically developing a sampling curriculum over environment settings to use in training. Ours is a model-free approach and experiments demonstrate that the performance of our method is on par with the best policies found by an exhaustive grid search, while bearing a significantly lower computational cost.
- Keywords: Reinforcement Learning, Policy Robustness, Policy generalization, Automated Curriculum
- TL;DR: By formulating the learning curriculum as a bandit problem, we present a principled approach to motivating policy robustness in continuous controls tasks.