Keywords: rlhf, safety, helpfulness, value alignment, high confidence guarantees
Abstract: Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness, which can lead to unacceptable actions in sensitive domains. To ensure reliable performance in such settings, we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence safety guarantees while maximizing helpfulness. Like previous methods, HC-RLHF explicitly decouples human preferences into helpfulness and harmlessness (safety), training a separate reward model for the former and a cost model for the latter. It then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function while ensuring that a specific upper-confidence bound on the cost constraint is satisfied. In the second step, the trained model undergoes a safety test that verifies whether its performance satisfies a separate upper-confidence bound on the cost constraint. We provide a theoretical analysis of HC-RLHF, including a proof that the probability it returns an unsafe solution is no greater than a user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa-3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high probability while also improving helpfulness and harmlessness compared to previous methods.
Submission Number: 62
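The following is a minimal sketch of the second step described in the abstract (the safety test), not the paper's actual implementation. It assumes the cost model assigns a scalar cost to each held-out response, and it stands in a Student-t upper confidence bound for whichever concentration bound the paper uses; the names `tau`, `delta`, and `safety_test` are hypothetical.

```python
"""Sketch: gate a candidate policy with an upper-confidence bound on mean cost.

Assumptions (not from the paper): scalar per-response costs from a cost model,
a user-chosen safety threshold `tau`, a confidence level `delta`, and a
one-sided Student-t bound in place of the paper's actual bound.
"""
import numpy as np
from scipy import stats


def cost_upper_confidence_bound(costs: np.ndarray, delta: float) -> float:
    """One-sided (1 - delta) upper confidence bound on the mean cost."""
    n = len(costs)
    mean = costs.mean()
    std_err = costs.std(ddof=1) / np.sqrt(n)
    return mean + std_err * stats.t.ppf(1.0 - delta, df=n - 1)


def safety_test(candidate_costs: np.ndarray, tau: float, delta: float) -> bool:
    """Step 2: return the candidate only if the cost UCB, estimated on
    held-out data, stays below the safety threshold."""
    return cost_upper_confidence_bound(candidate_costs, delta) <= tau


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical held-out cost-model scores for responses from a candidate
    # policy produced by step 1 (reward maximization under a cost UCB constraint).
    held_out_costs = rng.normal(loc=0.2, scale=0.1, size=500)
    tau, delta = 0.25, 0.05
    print("candidate is safe:", safety_test(held_out_costs, tau, delta))
```

Step 1 (optimizing the reward while enforcing a cost UCB constraint during training) is omitted here; the sketch only illustrates how a held-out confidence bound can decide whether a trained candidate is returned or rejected.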