Keywords: Offline RL, Cost Inference, Return Decomposition, Offline Safe RL
TL;DR: We propose a framework for offline safe RL that converts sparse trajectory-stop feedback into dense per-step costs via return decomposition, yielding safer policies with fewer violations while preserving reward.
Abstract: Learning safe policies in offline reinforcement learning (RL) requires access to a cost function, but dense annotations are rarely available. In practice, experts typically provide only sparse supervision by truncating trajectories at the first unsafe action, leaving a single terminal cost label. We frame this challenge as a credit assignment problem: the agent must determine which earlier actions contributed to the violation in order to learn safer behavior. To address this, we propose an approach that redistributes sparse stop feedback into dense per-step costs using return decomposition, and then integrates these inferred costs into constrained offline RL. Across highway driving and a simulated continuous control task, our method achieves substantially lower violation rates than baselines while preserving reward performance.
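To make the core idea concrete, here is a minimal sketch of return-decomposition-based cost redistribution, assuming a PyTorch setup. All names (CostModel, decomposition_loss, trajectories) are illustrative placeholders, not the paper's actual implementation: a per-step cost model is fit so that its sum over each trajectory matches the sparse terminal cost label (1 if the trajectory was truncated at an unsafe action, 0 otherwise), and its per-step outputs then serve as dense costs for a downstream constrained offline RL algorithm.

```python
# Sketch only: fit per-step costs whose trajectory sum matches the sparse stop label.
import torch
import torch.nn as nn

class CostModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # keep per-step costs nonnegative
        )

    def forward(self, obs, act):
        # obs: (T, obs_dim), act: (T, act_dim) -> per-step costs of shape (T,)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def decomposition_loss(model, obs, act, traj_cost):
    """Squared error between the summed per-step costs and the trajectory label."""
    per_step = model(obs, act)                  # inferred dense costs, shape (T,)
    return (per_step.sum() - traj_cost) ** 2    # sum must match the sparse label

def train(model, trajectories, epochs=10, lr=1e-3):
    """trajectories: iterable of (obs, act, traj_cost) with traj_cost in {0.0, 1.0}."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, act, traj_cost in trajectories:
            loss = decomposition_loss(model, obs, act, traj_cost)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

After training, the model's per-step outputs can be attached to the offline dataset as dense cost labels for any constrained offline RL method; the specific constrained learner used in the paper is not reflected in this sketch.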
Primary Area: reinforcement learning
Submission Number: 11234