Conservative Offline Goal-Conditioned Implicit V-Learning

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose Conservative Goal-Conditioned Implicit Value Learning (CGCIVL), which mitigates value overestimation in cross-trajectory sampling by penalizing unconnected state-goal pairs.
Abstract: Offline goal-conditioned reinforcement learning (GCRL) learns a goal-conditioned value function from pre-collected datasets to train policies for diverse goals. Hindsight experience replay addresses sparse rewards by treating intermediate states as goals, but it fails on goal-stitching tasks, where reaching a goal requires stitching together different trajectories. Cross-trajectory sampling, which pairs states and goals from different trajectories, is a potential solution; however, we demonstrate that this direct approach degrades performance in goal-conditioned tasks because it overestimates the values of unconnected state-goal pairs. To address this, we propose Conservative Goal-Conditioned Implicit Value Learning (CGCIVL), a novel algorithm that introduces a penalty term on value estimates for unconnected state-goal pairs and leverages a quasimetric framework to accurately estimate values for connected pairs. Evaluations on OGBench, a benchmark for offline GCRL, demonstrate that CGCIVL consistently surpasses state-of-the-art methods across diverse tasks.
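For illustration, the sketch below shows one way a conservative penalty of this kind could be combined with an IQL-style expectile loss for a goal-conditioned value function V(s, g). This is a minimal sketch inferred from the abstract, not the paper's exact objective: the `connected_mask` input, the zero lower bound for unconnected pairs, and the `penalty_weight` coefficient are illustrative assumptions, and the quasimetric parameterization of V is omitted.

```python
import torch
import torch.nn.functional as F

def conservative_gc_value_loss(v_s_g, v_next_g, reward, connected_mask,
                               gamma=0.99, expectile=0.7, penalty_weight=1.0):
    """Illustrative sketch (not the authors' exact objective).

    v_s_g:          V(s, g) for sampled state-goal pairs, shape (B,)
    v_next_g:       V(s', g) for the successor states, shape (B,)
    reward:         sparse goal-conditioned reward, shape (B,)
    connected_mask: 1.0 if (s, g) lie on a common trajectory in the dataset,
                    0.0 for cross-trajectory (possibly unconnected) pairs
    """
    # One-step bootstrapped target; the successor value is treated as fixed.
    target = reward + gamma * v_next_g.detach()
    diff = target - v_s_g

    # IQL-style expectile regression, applied to connected pairs.
    weight = torch.where(diff > 0,
                         torch.full_like(diff, expectile),
                         torch.full_like(diff, 1.0 - expectile))
    td_loss = (weight * diff.pow(2) * connected_mask).mean()

    # Conservative penalty (assumed form): push positive value estimates of
    # unconnected pairs down so cross-trajectory sampling cannot inflate them.
    penalty = (F.relu(v_s_g) * (1.0 - connected_mask)).mean()

    return td_loss + penalty_weight * penalty


# Toy usage with random tensors.
if __name__ == "__main__":
    B = 32
    loss = conservative_gc_value_loss(
        v_s_g=torch.randn(B),
        v_next_g=torch.randn(B),
        reward=torch.zeros(B),
        connected_mask=torch.randint(0, 2, (B,)).float(),
    )
    print(loss.item())
```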
Lay Summary: Some tasks, known as "goal-stitching" tasks, involve situations where the starting point and the goal come from different recorded experiences; to solve them, agents need to sample states and goals from different trajectories. However, this naive cross-trajectory sampling often leads to inaccurate value estimates. In our study, we found that the values of unconnected state-goal pairs may be overestimated, where "unconnected" means that no path in the dataset links the state and the goal; this happens because intermediate steps or transitions are missing from the dataset. These overestimation errors can then spread to other state-goal pairs through bootstrapping, which occurs when an agent updates its value estimates based on other estimates. To address this, we introduce a penalty into the learning process that prevents the overestimation of values for unconnected state-goal pairs. Our work is the first to highlight the value-estimation issue caused by cross-trajectory sampling and offers a practical solution, helping agents learn to combine different skills, much as humans do, to tackle more complex tasks.
Primary Area: Reinforcement Learning->Batch/Offline
Keywords: Offline Goal-Conditioned Reinforcement Learning; Goal-Stitching Tasks; Value Function Overestimation
Submission Number: 4530