Keywords: Offline RL, Cost Inference, Return Decomposition, Offline Safe RL
TL;DR: We propose a framework for offline safe RL that converts sparse trajectory-stop feedback into dense per-step costs via return decomposition, yielding safer policies with fewer violations while preserving reward.
Abstract: Learning safe policies in offline reinforcement learning (RL) requires access to a cost function, but dense annotations are rarely available. In practice, experts typically provide only sparse supervision by truncating trajectories at the first unsafe action, leaving a single terminal cost label. We frame this challenge as a credit assignment problem: the agent must determine which earlier actions contributed to the violation in order to learn safer behavior. To address this, we propose an approach that redistributes sparse stop feedback into dense per-step costs using return decomposition, and then integrates these inferred costs into constrained offline RL. Across highway driving and a simulated continuous control task, our method achieves substantially lower violation rates than baselines while preserving reward performance.
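To make the core idea concrete, here is a minimal sketch of return-decomposition-based cost redistribution, assuming a PyTorch setup. All names (CostModel, decomposition_loss, trajectories) are illustrative placeholders, not the paper's actual implementation: a per-step cost model is fit so that its sum over each trajectory matches the sparse terminal cost label (1 if the trajectory was truncated at an unsafe action, 0 otherwise), and its per-step outputs then serve as dense costs for a downstream constrained offline RL algorithm.

```python
# Sketch only: fit per-step costs whose trajectory sum matches the sparse stop label.
import torch
import torch.nn as nn

class CostModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # keep per-step costs nonnegative
        )

    def forward(self, obs, act):
        # obs: (T, obs_dim), act: (T, act_dim) -> per-step costs of shape (T,)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def decomposition_loss(model, obs, act, traj_cost):
    """Squared error between the summed per-step costs and the trajectory label."""
    per_step = model(obs, act)                  # inferred dense costs, shape (T,)
    return (per_step.sum() - traj_cost) ** 2    # sum must match the sparse label

def train(model, trajectories, epochs=10, lr=1e-3):
    """trajectories: iterable of (obs, act, traj_cost) with traj_cost in {0.0, 1.0}."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, act, traj_cost in trajectories:
            loss = decomposition_loss(model, obs, act, traj_cost)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

After training, the model's per-step outputs can be attached to the offline dataset as dense cost labels for any constrained offline RL method; the specific constrained learner used in the paper is not reflected in this sketch.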
Primary Area: reinforcement learning
Submission Number: 11234