Q-Supervised Contrastive Representation: A State Decoupling Framework for Safe Offline Reinforcement Learning
TL;DR: To address the out-of-distribution (OOD) issue that arises during testing in safe offline RL, we propose the first framework that decouples global observations into reward- and cost-related representations through Q-supervised contrastive learning for decision-making.
Abstract: Safe offline reinforcement learning (RL), which aims to learn a safety-guaranteed policy without risky online interaction with the environment, has recently attracted growing attention for safety-critical scenarios. However, existing approaches encounter out-of-distribution problems during the testing phase, which can result in potentially unsafe outcomes. This issue arises from the infinite possible combinations of reward-related and cost-related states. In this work, we propose State Decoupling with Q-supervised Contrastive representation (SDQC), a novel framework that decouples global observations into reward- and cost-related representations for decision-making, thereby improving generalization to unfamiliar global observations. Compared with classical representation learning methods, which typically require model-based estimation (e.g., bisimulation), we theoretically prove that our Q-supervised method yields a coarser representation while preserving the optimal policy, resulting in improved generalization performance. Experiments on DSRL benchmark problems provide compelling evidence that SDQC surpasses baseline algorithms; most notably, it achieves almost zero violations in more than half of the tasks, whereas the state-of-the-art algorithm reaches the same level in only a quarter of the tasks. Furthermore, we demonstrate that SDQC exhibits superior generalization when confronted with unseen environments.
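To make the core idea of Q-supervised contrastive state decoupling concrete, the following is a minimal sketch, assuming a PyTorch setup in which positive pairs for the contrastive objective are states with similar Q-values. The names `DecoupledEncoder` and `q_supervised_contrastive_loss`, the threshold-based positive-pair definition, and all network sizes are illustrative assumptions and not the paper's actual implementation details.

```python
# Hypothetical sketch of Q-supervised contrastive state decoupling (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledEncoder(nn.Module):
    """Maps a global observation to separate reward- and cost-related embeddings."""

    def __init__(self, obs_dim: int, z_dim: int = 32):
        super().__init__()
        self.reward_enc = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.cost_enc = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))

    def forward(self, obs):
        return self.reward_enc(obs), self.cost_enc(obs)


def q_supervised_contrastive_loss(z, q_values, threshold=0.1, temperature=0.5):
    """InfoNCE-style loss where states with similar Q-values form positive pairs.

    z:        (B, z_dim) embeddings from one head (reward or cost).
    q_values: (B,) Q-estimates acting as the supervision signal.
    """
    z = F.normalize(z, dim=-1)
    sim = z @ z.T / temperature                                  # pairwise similarities

    # Positive mask: pairs whose Q-values lie within `threshold` of each other.
    q_diff = (q_values[:, None] - q_values[None, :]).abs()
    pos_mask = (q_diff < threshold).float()
    pos_mask.fill_diagonal_(0.0)                                 # exclude self-pairs

    logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # numerical stability
    self_mask = 1.0 - torch.eye(z.size(0), device=z.device)
    exp_logits = logits.exp() * self_mask
    log_prob = logits - (exp_logits.sum(dim=1, keepdim=True) + 1e-12).log()

    # Average log-probability over positive pairs (rows with no positives contribute zero).
    pos_counts = pos_mask.sum(dim=1).clamp(min=1.0)
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_counts
    return loss.mean()


# Usage sketch: supervise each head with its own critic's Q-estimates
# (q_reward, q_cost assumed to come from separately learned offline critics).
encoder = DecoupledEncoder(obs_dim=10)
obs = torch.randn(64, 10)
q_reward, q_cost = torch.randn(64), torch.randn(64)
z_reward, z_cost = encoder(obs)
total_loss = (q_supervised_contrastive_loss(z_reward, q_reward)
              + q_supervised_contrastive_loss(z_cost, q_cost))
```

In this sketch, the reward head is pulled together for states the reward critic values similarly and the cost head for states the cost critic values similarly, which is one plausible way to realize the reward/cost decoupling described in the abstract.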
Lay Summary: Safe offline reinforcement learning (RL) focuses on teaching systems to make safe decisions without risky trial-and-error in real-world environments. However, existing methods often struggle when faced with unfamiliar situations during testing, which can lead to unsafe outcomes. This happens because of the infinite possible combinations of reward-related and cost-related states. To address this, we propose a novel framework called State Decoupling with Q-supervised Contrastive representation (SDQC). Our approach decouples global observations into reward- and cost-related representations, making it easier for the system to handle new situations. Experiments show that SDQC outperforms other methods, achieving safety in more tasks and handling unseen environments better than existing algorithms.
Primary Area: Reinforcement Learning->Batch/Offline
Keywords: Safe Reinforcement Learning, Offline Reinforcement Learning, Representation Learning, Contrastive Learning, Self-Supervised Learning
Submission Number: 6478