Abstract: We address the problem of learning optimal behavior from sub-optimal datasets for goal-conditioned offline reinforcement learning under sparse rewards, invertible actions, and deterministic transitions. To do so, we propose the use of metric learning to approximate the optimal value function, instead of classic Temporal-Difference solutions that employ the Bellman operator for their value updates. This representation choice allows us to avoid the out-of-distribution issue caused by the \emph{max} operator of the critic update in the offline setting, without imposing any conservative or behavioral constraints on the value function. We introduce distance monotonicity, a property that allows representations to recover optimality, and propose an optimization objective that induces it. We use the proposed value function to guide the learning of a policy in an actor-critic fashion, a method we name MetricRL. Experimentally, we show that our method estimates optimal behaviors from severely sub-optimal offline datasets without suffering from out-of-distribution estimation errors. We demonstrate that MetricRL consistently outperforms prior state-of-the-art goal-conditioned RL methods in learning optimal policies from sub-optimal offline datasets.
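To make the core idea concrete, a minimal sketch of a metric-based goal-conditioned value function is given below, assuming the value is parameterized as the negative distance between learned embeddings of the state and the goal, i.e. $V(s, g) = -d(\phi(s), \phi(g))$. The class name, network sizes, and the choice of Euclidean distance are illustrative assumptions, not the paper's actual architecture or training objective.

```python
import torch
import torch.nn as nn

class MetricValue(nn.Module):
    """Hypothetical goal-conditioned value sketch: V(s, g) = -||phi(s) - phi(g)||."""

    def __init__(self, obs_dim: int, embed_dim: int = 64):
        super().__init__()
        # Shared encoder mapping observations (states and goals) into a metric space.
        self.phi = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # Value is the negative distance between embeddings: no Bellman backup
        # and no max over actions, so no out-of-distribution action queries.
        return -torch.norm(self.phi(state) - self.phi(goal), dim=-1)
```

Under this parameterization, larger values correspond to states closer to the goal in the learned metric, and such a critic could guide an actor without querying values of out-of-distribution actions.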
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Matteo_Papini1
Submission Number: 3547