Feedback-driven Behavioral Shaping for Safe Offline RL

ICLR 2026 Conference Withdrawn Submission · 18 Sept 2025 (modified: 17 Nov 2025) · CC BY 4.0
Keywords: Offline RL, Cost Inference, Return Decomposition, Offline Safe RL
TL;DR: We propose a framework for offline safe RL that converts sparse trajectory stop feedback into dense per-step costs via return decomposition, yielding safer policies with fewer violations while preserving reward.
Abstract: Learning safe policies in offline reinforcement learning (RL) requires access to a cost function, but dense annotations are rarely available. In practice, experts typically provide only sparse supervision by truncating trajectories at the first unsafe action, leaving a single terminal cost label. We frame this challenge as a credit assignment problem: the agent must determine which earlier actions contributed to the violation to learn safer behavior. To address this, we propose an approach that redistributes sparse stop-feedback into dense per-step costs using return decomposition, and then integrates these inferred costs into constrained offline RL. Across highway driving and a simulated continuous control task, our method achieves substantially lower violation rates compared to baselines, while preserving reward performance.
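Below is a minimal sketch of the return-decomposition step described in the abstract: a per-step cost model is trained so that its summed predictions over a trajectory match the single terminal stop-feedback label, yielding dense per-step costs that could then be passed to a constrained offline RL learner. The class name `CostDecomposer`, the network architecture, the non-negativity (Softplus) head, and the training loop are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code): redistribute a single terminal
# cost label over a trajectory by training a per-step cost model whose summed
# outputs must match the trajectory-level label (return decomposition for costs).
import torch
import torch.nn as nn


class CostDecomposer(nn.Module):
    """Predicts a non-negative per-step cost from (state, action) features."""

    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # assumption: costs are non-negative
        )

    def forward(self, obs, act):
        # obs: (T, obs_dim), act: (T, act_dim) -> per-step costs of shape (T,)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def train_decomposer(model, trajectories, epochs=50, lr=3e-4):
    """trajectories: list of (obs, act, terminal_cost) tuples, where
    terminal_cost is 1.0 if the trajectory was truncated at the first unsafe
    action and 0.0 otherwise (the sparse stop feedback)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, act, terminal_cost in trajectories:
            step_costs = model(obs, act)          # (T,) inferred per-step costs
            pred_total = step_costs.sum()         # decomposition constraint
            loss = (pred_total - terminal_cost) ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

After training, the model's per-step outputs serve as dense cost labels in place of the missing annotations, so any off-the-shelf constrained offline RL method can consume them; how the inferred costs enter that constrained objective is left abstract here.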
Primary Area: reinforcement learning
Submission Number: 11234