Balancing Safety and Return: Region-based Reward Penalty over Action Chunks For Offline Safe RL

ICLR 2026 Conference Submission 17200 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Offline Safe Reinforcement Learning, In-Sample Learning, Constrained Optimization
TL;DR: We propose the \textit{Region-Based Reward Penalty over Action Chunks} (R2PAC), a novel method that jointly trains the $h$-step optimal value function within the safe policy space.
Abstract: In-sample learning has emerged as a powerful paradigm that mitigates the Out-of-Distribution (OOD) issue, which leads to violations of safety constraints in offline safe reinforcement learning (OSRL). Existing approaches separately train reward and cost value functions, yielding \textit{suboptimal} policies within the safe policy space. To address this, we propose the \textit{Region-Based Reward Penalty over Action Chunks} (R2PAC), a novel method that trains the $h$-step optimal value function within the safe policy space. By penalizing reward signals over action chunks that may lead to unsafe transitions, our method: (1) integrates cost constraints into reward learning for constrained return maximization; (2) improves joint training stability by accelerating convergence with unbiased multi-step value estimation; (3) effectively avoids unsafe states through temporally consistent behaviors. Extensive experiments on the DSRL benchmark demonstrate that our method outperforms state-of-the-art algorithms, achieving the highest returns in 13 out of 17 tasks while keeping the normalized cost below a strict threshold in all tasks. The proposed method can be used as a drop-in replacement within existing offline RL pipelines.
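To make the core idea concrete, below is a minimal illustrative sketch of how a reward penalty over an $h$-step action chunk combined with a multi-step bootstrap target could look. The paper's actual penalty shape, region definition, and training objective are not specified in this abstract, so every name and constant here (`cost_threshold`, `penalty_scale`, `h`, the per-chunk cost sum) is a hypothetical placeholder rather than the authors' implementation.

```python
# Illustrative sketch only: all thresholds, scales, and the penalty form are
# assumptions; the paper's exact R2PAC objective may differ.
import numpy as np

def penalized_h_step_return(rewards, costs, values, h, gamma=0.99,
                            cost_threshold=1.0, penalty_scale=10.0):
    """Compute h-step return targets in which rewards inside a potentially
    unsafe region (chunk-level cumulative cost above a threshold) are penalized.

    rewards, costs : 1-D arrays of per-step reward / cost for one trajectory
    values         : bootstrap value estimates V(s_t) for the same trajectory
    """
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        end = min(t + h, T)                      # action chunk [t, end)
        chunk_cost = costs[t:end].sum()          # region-level safety signal
        # Penalize the chunk's rewards if it may lead to unsafe transitions.
        penalty = penalty_scale * max(0.0, chunk_cost - cost_threshold)
        g, discount = 0.0, 1.0
        for k in range(t, end):
            g += discount * (rewards[k] - penalty / (end - t))
            discount *= gamma
        if end < T:                              # multi-step bootstrap
            g += discount * values[end]
        targets[t] = g
    return targets
```

Under these assumptions, the resulting `targets` would serve as regression targets for a reward value function, so that cost information enters reward learning directly rather than through a separately trained cost critic.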
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 17200