Abstract: An informative and dense reward function is an important requirement for efficiently solving goal-reaching tasks. While the natural reward for such tasks is a binary signal indicating success or failure, providing only a binary reward makes learning very challenging given the sparsity of the feedback. Hence, introducing dense rewards helps to provide smooth gradients. However, such functions are not readily available, and constructing them is difficult: it often requires substantial time and domain-specific knowledge, and can unintentionally create spurious local minima. We propose a method that learns neural all-pairs shortest paths, used as a distance function to learn a policy for goal-reaching tasks, requiring zero domain-specific knowledge. In particular, our approach combines a self-supervised signal from the temporal distance between state pairs within an episode with a metric-based regularizer that leverages the triangle inequality to extract additional connectivity information from state triples. This dynamical distance can be used either as a cost function or reshaped as a reward and, unlike previous work, is fully self-supervised, compatible with off-policy learning, and robust to local minima.
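The two training signals named in the abstract can be sketched as simple loss terms: a self-supervised regression of the predicted distance onto the temporal separation of a state pair, and a hinge penalty on triangle-inequality violations over state triples. This is a minimal illustrative sketch, not the paper's implementation; the function names and the hinge form of the regularizer are assumptions.

```python
import numpy as np

def temporal_distance_loss(d_pred, dt):
    """Self-supervised signal (sketch): the predicted distance for a state
    pair sampled from an episode should match their temporal separation
    |t_j - t_i|. Here a simple mean-squared-error regression is assumed."""
    return np.mean((d_pred - dt) ** 2)

def triangle_inequality_penalty(d_ac, d_ab, d_bc):
    """Metric-based regularizer (sketch): for a state triple (a, b, c),
    a valid distance must satisfy d(a,c) <= d(a,b) + d(b,c). A hinge
    penalty on violations is one plausible choice."""
    return np.mean(np.maximum(0.0, d_ac - (d_ab + d_bc)))

# Toy predicted distances for illustration only.
d_pred = np.array([1.0, 2.5, 4.0])
dt = np.array([1.0, 3.0, 4.0])
print(temporal_distance_loss(d_pred, dt))

d_ab = np.array([1.0, 2.0])
d_bc = np.array([2.0, 1.0])
d_ac = np.array([4.0, 2.5])  # the first triple violates the inequality
print(triangle_inequality_penalty(d_ac, d_ab, d_bc))
```

In a full method, both terms would be computed on batches of distances produced by a neural network and combined into one training objective; the sketch only shows the shape of each loss.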
Supplementary Material: zip