Keywords: Reinforcement Learning, Optimal Transport, Evolution Strategies
TL;DR: We seek the right measure of similarity between two policies acting on the same underlying MDP, and devise algorithms that leverage this information for reinforcement learning.
Abstract: We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by utilizing the dual formulation of the WD, we can learn score functions over trajectories that can in turn be used to guide policy optimization towards desired behaviors (or away from undesired ones). Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in a variety of challenging environments. We also provide an open-source demo.
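To make the smoothed-WD regularizer concrete, the sketch below shows one way such a term could be computed: an entropy-regularized Wasserstein distance between two batches of behavioral embeddings, evaluated with log-domain Sinkhorn iterations so gradients can flow through it. This is an illustrative sketch only, not the authors' implementation from the linked repository; the function name `smoothed_wasserstein`, the squared-Euclidean cost, and the uniform marginals are assumptions.

```python
# Illustrative sketch only (assumptions: PyTorch, squared-Euclidean cost,
# uniform weights over embeddings); not the implementation from the repo.
import math
import torch

def smoothed_wasserstein(x, y, eps=0.1, n_iters=50):
    """Entropy-smoothed WD between point clouds x (n, d) and y (m, d).

    Runs log-domain Sinkhorn iterations; the result is differentiable
    in x and y, so it can serve as a trainable regularizer.
    """
    cost = torch.cdist(x, y, p=2) ** 2                       # pairwise transport costs
    n, m = cost.shape
    log_a = torch.full((n,), -math.log(n), device=x.device)  # uniform source marginal (log)
    log_b = torch.full((m,), -math.log(m), device=x.device)  # uniform target marginal (log)
    f = torch.zeros(n, device=x.device)                      # dual potentials
    g = torch.zeros(m, device=x.device)
    for _ in range(n_iters):
        # Log-domain Sinkhorn updates for numerical stability.
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_a[:, None], dim=0)
    # Recover the (approximate) transport plan and its transport cost.
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps
                     + log_a[:, None] + log_b[None, :])
    return torch.sum(plan * cost)
```

In a behavior-guided setup, `x` and `y` might hold embeddings of trajectories from the current policy and a reference policy; the returned scalar could then be added to (or subtracted from) the policy loss to push optimization away from (or towards) the reference behavior.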
Code: https://github.com/behaviorguidedRL/BGRL