Keywords: Offline Safe Reinforcement Learning, In-Sample Learning, Constrained Optimization
TL;DR: We propose the \textit{Region-Based Reward Penalty over Action Chunks} (R2PAC), a novel method that jointly trains the $h$-step optimal value function within the safe policy space.
Abstract: In-sample learning has emerged as a powerful paradigm that mitigates the Out-of-Distribution (OOD) issue, which leads to violations of safety constraints in offline safe reinforcement learning (OSRL). Existing approaches separately train reward and cost value functions, yielding \textit{suboptimal} policies within the safe policy space. To address this, we propose the \textit{Region-Based Reward Penalty over Action Chunks} (R2PAC), a novel method that trains the $h$-step optimal value function within the safe policy space. By penalizing reward signals over action chunks that may potentially lead to unsafe transitions, our method: (1) integrates cost constraints into reward learning for constrained return maximization; (2) improves joint training stability by accelerating convergence with unbiased multi-step value estimation; (3) effectively avoids unsafe states through temporally consistent behaviors. Extensive experiments on the DSRL benchmark demonstrate that our method outperforms state-of-the-art algorithms, achieving the highest returns in 13 out of 17 tasks while keeping the normalized cost below a strict threshold in all tasks. The proposed method can be used as a drop-in replacement within existing offline RL pipelines.
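The abstract does not spell out the penalized objective; as a rough illustrative sketch only (not the authors' formulation), the snippet below shows one way reward signals over an $h$-step action chunk could be penalized on transitions flagged as potentially unsafe before forming a multi-step return. The helper names (`penalized_chunk_rewards`, `h_step_return`), the per-step cost signals, the fixed penalty coefficient, and the cost threshold are all assumptions introduced for illustration.

\begin{verbatim}
import numpy as np

def penalized_chunk_rewards(rewards, costs, penalty=1.0, cost_threshold=0.0):
    # Hypothetical sketch: subtract a penalty from the reward of every
    # transition in an h-step action chunk whose cost exceeds a threshold,
    # so the reward value target already reflects the cost constraint.
    rewards = np.asarray(rewards, dtype=float)
    costs = np.asarray(costs, dtype=float)
    unsafe = costs > cost_threshold          # transitions flagged as potentially unsafe
    return rewards - penalty * unsafe        # penalize reward signals on unsafe steps

def h_step_return(rewards, costs, gamma=0.99, **kwargs):
    # Multi-step (h-step) discounted return computed on the penalized rewards.
    r = penalized_chunk_rewards(rewards, costs, **kwargs)
    discounts = gamma ** np.arange(len(r))
    return float(np.sum(discounts * r))
\end{verbatim}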
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 17200