C2IQL: Constraint-Conditioned Implicit Q-learning for Safe Offline Reinforcement Learning

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: Safe offline reinforcement learning aims to develop policies that maximize cumulative rewards while satisfying safety constraints, without the need for risky online interaction. However, existing methods often struggle with the out-of-distribution (OOD) problem, leading to potentially unsafe and suboptimal policies. To address this issue, we first propose Constrained Implicit Q-learning (CIQL), a novel algorithm designed to avoid the OOD problem. In particular, CIQL extends the implicit update of reward value functions to constrained settings and then estimates cost value functions under the same implicit policy. Despite these advantages, further performance improvement of CIQL is hindered by inaccurate discounted approximations of the constraints. We therefore propose Constraint-Conditioned Implicit Q-learning (C2IQL). Building upon CIQL, C2IQL employs a cost reconstruction model to derive non-discounted cumulative costs from discounted values and incorporates a flexible, constraint-conditioned mechanism to accommodate dynamic safety constraints. Experimental results on DSRL benchmarks demonstrate the superiority of C2IQL over baseline methods in achieving higher rewards while satisfying safety constraints under different threshold conditions.
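For context, a minimal sketch of the expectile-regression updates at the core of implicit Q-learning, which the abstract describes CIQL as extending to constrained settings; the cost-critic form at the end is our assumption about that extension, not the paper's exact objective:

$$
L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[ L_2^{\tau}\!\big(Q_\theta(s,a) - V_\psi(s)\big) \right], \qquad L_2^{\tau}(u) = \big|\tau - \mathbb{1}(u<0)\big|\, u^2,
$$
$$
L_Q(\theta) = \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\!\left[ \big(r(s,a) + \gamma V_\psi(s') - Q_\theta(s,a)\big)^2 \right].
$$

Because both losses are evaluated only on state-action pairs drawn from the dataset $\mathcal{D}$, no OOD action is ever queried. Presumably, a cost critic $(Q_c, V_c)$ can be trained with the same expectile machinery by substituting the cost $c(s,a)$ for the reward, so that costs are evaluated under the same implicit policy.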
Lay Summary: In safe offline reinforcement learning (SORL), a key challenge is to maximize rewards while ensuring constraint satisfaction, all from a pre-collected dataset. This difficulty arises from the out-of-distribution (OOD) problem: the values of actions absent from the dataset can be wrongly estimated during the Bellman backup, leading to unsafe and suboptimal policies, especially in safety-sensitive applications. To address this, we developed Constraint-Conditioned Implicit Q-learning (C2IQL), which updates policies and value functions entirely within the dataset in constrained settings. This is achieved via expectile regression, without querying any action outside the dataset. Additionally, we introduce a cost reconstruction model alongside a constraint-conditioned mechanism to ensure accurate and dynamic adherence to safety constraints. This research advances the reliability and effectiveness of SORL by avoiding the OOD problem entirely. Our findings also highlight a critical gap between non-discounted cost constraints and the discounted value formulations used in RL, and the proposed cost reconstruction model addresses this gap (illustrated below).
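To illustrate the gap mentioned above: safety constraints are typically stated on the non-discounted cumulative cost, whereas critics learned by Bellman backups estimate a discounted quantity; the two generally differ. A simple illustration, with notation of our choosing:

$$
\text{constraint:}\quad \mathbb{E}\!\left[\sum_{t=0}^{T} c_t\right] \le \kappa, \qquad \text{learned critic:}\quad V_c^{\gamma}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} c_t \,\middle|\, s_0 = s\right].
$$

For example, a constant per-step cost $c$ over $T$ steps yields a true cumulative cost of $cT$ but a discounted value of $c\,(1-\gamma^{T})/(1-\gamma)$, so for $\gamma < 1$ the discounted critic underestimates the quantity the threshold actually bounds; the cost reconstruction model is intended to recover the non-discounted total from such discounted estimates.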
Primary Area: Reinforcement Learning->Batch/Offline
Keywords: safe offline reinforcement learning, constrained implicit Q-learning, cost reconstruction, constraint-conditioned ability
Submission Number: 4291