Constraint-Aware Reward Relabeling for Offline Safe Reinforcement Learning

ICLR 2026 Conference Submission 22357 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Offline Safe Reinforcement Learning, Safe Reinforcement Learning, Offline Reinforcement Learning, Reward Shaping
TL;DR: We introduce a reward relabeling method that enables high-return, constraint-satisfying policies in offline safe RL.
Abstract: Offline safe reinforcement learning (OSRL) considers the problem of learning reward-maximizing policies under a pre-defined cost constraint from a fixed dataset. This paper proposes a simple and effective approach, referred to as Constraint-aware Reward (Re)Labeling (CARL), that can be wrapped around existing offline RL algorithms. CARL is an iterative approach that alternates between two steps on each sampled batch of data to enforce state-action-wise safety constraints. First, it updates the cost evaluation function using an off-policy evaluation procedure. Second, it updates the policy using relabeled rewards, assigning a large penalty to state-action pairs detected as unsafe based on the cost estimates. CARL is a minimalist approach: it introduces no additional hyperparameters and allows us to leverage strong off-the-shelf offline RL algorithms. Experimental results on the DSRL benchmark tasks demonstrate that CARL reliably enforces safety constraints under small cost budgets while achieving high rewards.
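The relabeling step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the batch layout, the `cost_q.predict` interface, and the `cost_threshold` and `penalty` values are hypothetical and are not taken from the paper (which states that CARL introduces no additional hyperparameters).

```python
import numpy as np

def carl_relabel(batch, cost_q, cost_threshold, penalty):
    """One CARL-style relabeling pass on a sampled batch (illustrative sketch).

    batch: dict with 'obs', 'act', and 'rew' arrays from the offline dataset.
    cost_q: a cost critic exposing predict(obs, act) -> estimated cost values;
            assumed to be updated separately by an off-policy evaluation step.
    """
    # Step 1 (performed elsewhere): update cost_q via off-policy evaluation.
    # Step 2: relabel rewards, penalizing state-action pairs flagged as unsafe.
    est_cost = cost_q.predict(batch["obs"], batch["act"])
    unsafe = est_cost > cost_threshold  # state-action-wise safety check (assumed form)
    relabeled_rew = np.where(unsafe, -penalty, batch["rew"])
    # The relabeled batch can then be fed to any off-the-shelf offline RL learner.
    return {**batch, "rew": relabeled_rew}
```

The intent of the sketch is only to show how relabeling can sit between the dataset and an unmodified offline RL algorithm; the actual detection rule and penalty used by CARL are specified in the paper.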
Primary Area: reinforcement learning
Submission Number: 22357