Keywords: reinforcement learning, planning, variational inference, curriculum learning, waypoints, subgoals
Abstract: Goal-conditioned reinforcement learning (RL) has shown great success recently at solving a wide range of tasks(e.g., navigation, robotic manipulation). However, learning to reach distant goals remains a central challenge to the field, and the task is particularly hard without any expert demonstrations and reward shaping. In this paper, we propose to solve the distant goal-reaching task by using search at training time to generate a curriculum of intermediate states. Specifically, we introduce the algorithm Classifier Planning (C-Planning) by framing the learning of the goal-conditioned policies as variational inference. C-Planningnaturally follows expectation maximization (EM): the E step corresponds to planning an optimal sequence of waypoints using graph search, while the M step corresponds to learning a goal-conditioned policy to reach those waypoints. One essential difficulty of designing such an algorithm is accurately modeling the distribution over waypoints to sample from. In C-Planning, we propose to sample the waypoints using contrastive methods to learn a value function. Unlike prior methods that combine goal-conditioned RL with graph search, ours performs search only during training and not testing, significantly decreasing the compute costs of deploying the learned policy. Empirically, we demonstrate that our method not only improves the sample efficiency of prior methods but also successfully solves temporally-extended navigation and manipulation tasks, where prior goal-conditioned RL methods (including those based on graph search) fail to solve.
Supplementary Material: zip
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/arxiv:2110.12080/code)