Goal-Conditioned Reinforcement Learning from Sub-Optimal Data on Metric Spaces

TMLR Paper 6215 Authors

15 Oct 2025 (modified: 03 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: We study the problem of learning optimal behavior from sub-optimal datasets in goal-conditioned offline reinforcement learning under sparse rewards, invertible actions, and deterministic transitions. To mitigate the effects of \emph{distribution shift}, we propose MetricRL, a method that combines metric learning for value function approximation with weighted imitation learning for policy estimation. MetricRL avoids conservative or behavior-cloning constraints, enabling effective learning even in severely sub-optimal regimes. We introduce distance monotonicity as a key property linking metric representations to optimality and design an objective that explicitly promotes it. Empirically, MetricRL consistently outperforms prior state-of-the-art goal-conditioned RL methods in recovering near-optimal behavior from sub-optimal offline data.
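To make the abstract's two ingredients concrete, below is a minimal, hypothetical sketch of (i) a goal-conditioned value parameterized as a negative distance between learned embeddings and (ii) a weighted imitation (advantage-weighted) policy loss that upweights dataset actions whose successors move closer to the goal. The names (`MetricEncoder`, `metric_value`, `awr_policy_loss`), the network sizes, and the exponential weighting are illustrative assumptions, not the paper's actual objective or architecture.

```python
# Hypothetical sketch of a metric-based value and weighted imitation loss.
# All names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class MetricEncoder(nn.Module):
    """Maps states (and goals, assumed to live in the state space) to an
    embedding whose Euclidean distances play the role of goal distances."""

    def __init__(self, state_dim: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def metric_value(enc: MetricEncoder, s: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Value as negative embedding distance: closer to the goal => higher value."""
    return -torch.norm(enc(s) - enc(g), dim=-1)


def awr_policy_loss(logp_a: torch.Tensor,
                    v_s: torch.Tensor,
                    v_next: torch.Tensor,
                    beta: float = 1.0) -> torch.Tensor:
    """Weighted imitation: upweight dataset actions whose successor state
    has higher value (i.e., is closer to the goal under the learned metric)."""
    adv = (v_next - v_s).detach()                    # one-step improvement
    weights = torch.exp(adv / beta).clamp(max=20.0)  # exponential weighting, clipped
    return -(weights * logp_a).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    enc = MetricEncoder(state_dim=4)
    s, s_next, g = torch.randn(8, 4), torch.randn(8, 4), torch.randn(8, 4)
    logp_a = torch.randn(8)  # log-probs of dataset actions under the current policy
    v_s, v_next = metric_value(enc, s, g), metric_value(enc, s_next, g)
    print(awr_policy_loss(logp_a, v_s, v_next).item())
```

In this reading, distance monotonicity would require the learned embedding distances to increase with the true number of steps to the goal, so that the weighting above favors actions that genuinely make progress; how the paper's objective enforces this is not specified in the abstract.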
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=vUpCxVeWJN
Changes Since Last Submission: Changed Definition 3.1 from an inequality to a strict inequality. Corrected the proof of Theorem 3.2 in Appendix A.3. Added an ablation study in Subsection 4.2. Minor text and figure revisions.
Assigned Action Editor: ~Matteo_Papini1
Submission Number: 6215