Keywords: Offline PbRL, Safety Alignment
Abstract: Offline preference-based reinforcement learning (PbRL) learns rewards and policies aligned with human preferences without extensive reward engineering or direct interaction with human annotators. However, ensuring safety remains a critical challenge across many domains and tasks. Prior work on safe RL from human feedback (RLHF) first learns reward and cost models from offline data and then uses constrained RL to optimize a safe policy; however, inaccuracies in the learned reward and cost models can impair performance when they are used with constrained RL methods. To address these challenges, (a) we introduce a framework that learns a policy from pairwise preferences over the agent's behavior in terms of rewards, together with binary labels indicating the safety of trajectory segments, without access to ground-truth rewards or costs; (b) we combine the preference-learning module with safety alignment in a constrained optimization problem, which we solve with a Lagrangian method that directly learns a reward-maximizing safe policy without explicitly learning reward and cost models, avoiding the need for constrained RL; (c) to evaluate our approach, we construct new datasets with synthetic human feedback, built upon a well-established offline safe RL benchmark. Empirically, our method learns safe policies with high rewards, outperforming baselines that use ground-truth rewards and costs, as well as state-of-the-art RLHF approaches.
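(For readability, one plausible reading of the constrained formulation sketched in the abstract is given below; all symbols here, including the preference-alignment objective $\mathcal{J}_{\text{pref}}$, the safety measure $\mathcal{C}_{\text{safe}}$, the budget $d$, and the multiplier $\lambda$, are introduced for illustration and are not notation taken from the paper.)

$$
\max_{\pi}\; \mathcal{J}_{\text{pref}}(\pi)
\quad \text{s.t.} \quad \mathcal{C}_{\text{safe}}(\pi) \le d,
\qquad\Longrightarrow\qquad
\min_{\lambda \ge 0}\;\max_{\pi}\;
\mathcal{L}(\pi, \lambda)
= \mathcal{J}_{\text{pref}}(\pi) - \lambda\bigl(\mathcal{C}_{\text{safe}}(\pi) - d\bigr),
$$

where $\mathcal{J}_{\text{pref}}$ would be built from the pairwise preference labels over trajectory segments (e.g., a Bradley-Terry-style likelihood) and $\mathcal{C}_{\text{safe}}$ would aggregate the binary safety labels, so that the Lagrangian can be optimized over the policy directly, without fitting explicit reward and cost models.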
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13817