Keywords: Reinforcement Learning from Human Feedback, High Confidence Policy Improvement, Imitation Learning & Inverse Reinforcement Learning, Reinforcement Learning
TL;DR: We propose an algorithm to perform high-confidence policy improvement in the reinforcement learning from human feedback setting.
Abstract: Reinforcement learning from human feedback (RLHF) aims to learn or fine-tune policies via human preference data when a ground-truth reward function is not known. However, conventional RLHF methods provide no performance guarantees and have an unacceptably high probability of returning poorly performing policies. We propose Policy Optimization and Safety Test for Policy Improvement (POSTPI), an algorithm that provides high-confidence policy performance guarantees without direct knowledge of the ground-truth reward function, given only a preference dataset. The user of the algorithm may select any initial policy $\pi_\text{init}$ and confidence level $1 - \delta$, and POSTPI will ensure that the probability it returns a policy with performance worse than $\pi_\text{init}$ under the unobserved ground-truth reward function is at most $\delta$. We present theoretical results as well as empirical results on the Safety Gymnasium suite demonstrating that POSTPI reliably provides the desired guarantee.
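The sketch below illustrates only the *contract* stated in the abstract: given an initial policy $\pi_\text{init}$, a preference dataset, and a confidence level $1 - \delta$, return a new policy only if a safety test supports improvement, and otherwise fall back to $\pi_\text{init}$. It is a hypothetical candidate-optimization / safety-test split in the spirit of high-confidence policy improvement, not the actual POSTPI procedure; in particular, how POSTPI performs its safety test from preference data alone (without ground-truth rewards) is not reproduced here, and all function names are illustrative assumptions.

```python
import numpy as np


def hoeffding_lower_bound(samples, delta):
    """One-sided Hoeffding lower confidence bound on the mean of samples in [0, 1]."""
    n = len(samples)
    return float(np.mean(samples)) - np.sqrt(np.log(1.0 / delta) / (2.0 * n))


def postpi_style_improve(pi_init, preference_data, delta,
                         optimize_candidate, estimate_returns):
    """Generic high-confidence improvement loop (illustrative only, not POSTPI itself).

    optimize_candidate: callable (pi_init, preference_data) -> candidate policy
    estimate_returns:   callable (policy) -> array of performance estimates in [0, 1]
    """
    # Phase 1: propose a candidate policy from the preference data.
    candidate = optimize_candidate(pi_init, preference_data)

    # Phase 2: safety test. Return the candidate only if its performance lower
    # bound dominates the baseline estimate for pi_init; otherwise keep pi_init,
    # so degradation below pi_init occurs with probability at most delta
    # (up to the validity of the performance estimates used here).
    candidate_lcb = hoeffding_lower_bound(estimate_returns(candidate), delta)
    baseline_mean = float(np.mean(estimate_returns(pi_init)))
    return candidate if candidate_lcb >= baseline_mean else pi_init
```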
Submission Number: 156