Keywords: Reinforcement Learning, Offline Reinforcement Learning, Safe Reinforcement Learning
TL;DR: We propose PROCO, an offline safe RL algorithm capable of identifying infeasible states and learning safe policies from datasets with few or no unsafe samples.
Abstract: Learning constraint-satisfying policies from offline data without risky online interaction is essential for safety-critical decision making.
Conventional approaches typically learn cost value functions from large numbers of unsafe samples in order to delineate the safety boundary and penalize potentially violating actions.
In many high-stakes settings, however, risky trial-and-error is unacceptable, resulting in offline data that contains few, if any, unsafe samples.
Under this data limitation, existing approaches tend to treat all samples as uniformly safe, overlooking the substantial presence of safe-but-infeasible states (states that currently satisfy the constraints but will inevitably violate them within a few future steps), which leads to failures at deployment.
To overcome this challenge, we present PROCO, a proactive offline safe RL framework tailored to datasets largely devoid of violations. PROCO first learns a dynamics model from the offline data to model the environment's transition dynamics. It then constructs a conservative cost signal by grounding natural-language descriptions of unsafe states in large language models (LLMs), yielding risk assessments even when violations are unobserved. Finally, PROCO performs model-based rollouts under this cost to synthesize diverse and informative counterfactual unsafe samples, which in turn enable reliable feasibility identification and support feasibility-guided policy learning.
Across diverse Safety-Gymnasium tasks, PROCO consistently reduces constraint violations and improves safety relative to conventional offline safe RL and behavior cloning baselines when training data contains only safe or minimally risky samples.
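The following is a minimal, runnable sketch of the pipeline the abstract describes, under toy assumptions: a 1-D environment, a least-squares dynamics model, and a stubbed cost labeler standing in for the LLM-grounded signal. All names, numbers, and the labeling heuristic are illustrative placeholders, not the authors' implementation.

```python
# Sketch of the abstract's three steps on toy data:
# (1) fit a dynamics model to offline (safe) transitions,
# (2) attach a conservative cost signal (stub for the LLM-grounded labeler),
# (3) roll out the model to synthesize counterfactual unsafe samples and
#     mark the safe-but-infeasible states that precede them.
import numpy as np

rng = np.random.default_rng(0)

# Offline dataset of safe transitions in a toy 1-D world: state drifts by action.
states = rng.uniform(-0.5, 0.5, size=(1000, 1))
actions = rng.uniform(-0.1, 0.1, size=(1000, 1))
next_states = states + actions + 0.01 * rng.standard_normal((1000, 1))

# (1) Dynamics model: least-squares fit of s' from [s, a].
X = np.hstack([states, actions])
W, *_ = np.linalg.lstsq(X, next_states, rcond=None)

def model_step(s, a):
    return np.hstack([s, a]) @ W

# (2) Conservative cost signal. Stub for the LLM-grounded labeler:
# here "unsafe" means |state| > 1, which never appears in the offline data.
def conservative_cost(s):
    return float(np.abs(s).max() > 1.0)

# (3) Model-based rollouts under an exploratory policy; states visited before
# a synthetic violation are flagged as candidate safe-but-infeasible states
# (a simplification for illustration).
def synthesize_unsafe(start_states, horizon=50):
    unsafe, infeasible = [], []
    for s in start_states:
        traj = [s]
        for _ in range(horizon):
            a = rng.uniform(-0.1, 0.1, size=1)  # exploratory action
            s = model_step(s, a)
            traj.append(s)
            if conservative_cost(s) > 0:
                unsafe.append(s)
                infeasible.extend(traj[:-1])
                break
    return unsafe, infeasible

unsafe, infeasible = synthesize_unsafe(states[:50])
print(len(unsafe), "synthetic unsafe states;",
      len(infeasible), "safe-but-infeasible precursors")
```

In the sketch, the synthesized unsafe and precursor states would then supply the negative examples needed for feasibility identification and feasibility-guided policy learning, which the offline data alone cannot provide.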
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10180