Keywords: Curriculum learning, dynamics, reinforcement learning, sparse rewards
TL;DR: Learning a schedule for how much to assist an RL agent by modifying the physics of the world it operates in can speed up convergence, enable learning with sparse rewards, and improve performance.
Abstract: Humans often make the dynamics of a task easier (e.g. using training wheels on a bicycle or a large, high-volume surfboard) when first learning a skill, before tackling the full task with more difficult dynamics (riding a bike without training wheels, surfing a smaller board). This can be thought of as a form of curriculum learning. However, this is not the paradigm currently used for training agents with reinforcement learning (RL). In many cases, agents are thrown into the final environment and must learn a policy from scratch under the final dynamics. While previous work on curriculum learning for deep RL has sought to address this problem by changing the tasks agents solve or the starting position of the agent, no work has derived a curriculum by modifying the dynamics of the final environment. Here, we study using assist (simplifying task dynamics) to accelerate and improve the learning process for RL agents. First, we modify the physics of the LunarLander-v2 and FetchReach-v1 environments so that the amount of assist is controlled by a single parameter $\alpha$, which scales how strongly the agent is nudged, and hence assisted, towards a known end goal during training. We then show that schedules for assist can be learned automatically with a population-based training approach, resulting in faster agent convergence on the evaluation environment without any assist and better performance across continuous control tasks using a state-of-the-art policy gradient algorithm (proximal policy optimization). We show that our method also extends to off-policy methods such as Deep Deterministic Policy Gradients. Furthermore, we show that for tasks with sparse rewards, assist is critical to agent learning, as it allows exploration of high-reward areas and the use of algorithms that fail to learn the task without assist.
We also find that population-based tuning stabilizes policy gradient training without tuning any additional hyperparameters.
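
As a rough illustration of the single-parameter assist described in the abstract, the sketch below applies an $\alpha$-scaled nudge towards a known goal in a toy point-mass world. It is not the modified LunarLander-v2 or FetchReach-v1 physics: the function names `assisted_step` and `linear_assist_schedule`, the additive form of the nudge, and the linear decay are all assumptions made for illustration. In particular, the fixed decay is only a stand-in for a schedule learned with population-based training.

```python
import numpy as np


def assisted_step(position, action, goal, alpha):
    """One step of a toy point-mass world with an assist term.

    Hypothetical dynamics: the agent moves by `action`, and the world adds
    a nudge of alpha * (goal - position). alpha = 0 means no assist.
    """
    position = np.asarray(position, dtype=float)
    action = np.asarray(action, dtype=float)
    goal = np.asarray(goal, dtype=float)
    nudge = alpha * (goal - position)  # assist scales with distance to goal
    return position + action + nudge


def linear_assist_schedule(step, total_steps, alpha_start=0.5):
    """Hand-written linear decay of alpha from alpha_start to 0.

    Assumption: a fixed placeholder, not a learned schedule.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_start * (1.0 - frac)


if __name__ == "__main__":
    pos, goal = np.zeros(2), np.ones(2)
    for t in range(5):
        alpha = linear_assist_schedule(t, total_steps=4)
        pos = assisted_step(pos, action=np.zeros(2), goal=goal, alpha=alpha)
        print(t, round(alpha, 3), pos)
```

With $\alpha = 0$ the step reduces to the unassisted dynamics, which corresponds to the evaluation setting without any assist.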