Keywords: Reinforcement Learning from Human Feedback, High Confidence Policy Improvement, Imitation Learning & Inverse Reinforcement Learning, Reinforcement Learning
TL;DR: We propose an algorithm to perform high-confidence policy improvement in the reinforcement learning from human feedback setting.
Abstract: Reinforcement learning from human feedback (RLHF) aims to learn or fine-tune policies via human preference data when a ground-truth reward function is not known. However, conventional RLHF methods provide no performance guarantees and have an unacceptably high probability of returning poorly performing policies. We propose Policy Optimization and Safety Test for Policy Improvement (POSTPI), an algorithm that provides high-confidence policy performance guarantees without direct knowledge of the ground-truth reward function, given only a preference dataset. The user of the algorithm may select any initial policy $\pi_\text{init}$ and confidence level $1 - \delta$, and POSTPI will ensure that the probability it returns a policy with performance worse than $\pi_\text{init}$ under the unobserved ground-truth reward function is at most $\delta$. We present theoretical results as well as empirical results on the Safety Gymnasium suite demonstrating that POSTPI reliably provides the desired guarantee.
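The sketch below illustrates only the *contract* stated in the abstract: given an initial policy $\pi_\text{init}$, a preference dataset, and a confidence level $1 - \delta$, return a new policy only if a safety test supports improvement, and otherwise fall back to $\pi_\text{init}$. It is a hypothetical candidate-optimization / safety-test split in the spirit of high-confidence policy improvement, not the actual POSTPI procedure; in particular, how POSTPI performs its safety test from preference data alone (without ground-truth rewards) is not reproduced here, and all function names are illustrative assumptions.

```python
import numpy as np


def hoeffding_lower_bound(samples, delta):
    """One-sided Hoeffding lower confidence bound on the mean of samples in [0, 1]."""
    n = len(samples)
    return float(np.mean(samples)) - np.sqrt(np.log(1.0 / delta) / (2.0 * n))


def postpi_style_improve(pi_init, preference_data, delta,
                         optimize_candidate, estimate_returns):
    """Generic high-confidence improvement loop (illustrative only, not POSTPI itself).

    optimize_candidate: callable (pi_init, preference_data) -> candidate policy
    estimate_returns:   callable (policy) -> array of performance estimates in [0, 1]
    """
    # Phase 1: propose a candidate policy from the preference data.
    candidate = optimize_candidate(pi_init, preference_data)

    # Phase 2: safety test. Return the candidate only if its performance lower
    # bound dominates the baseline estimate for pi_init; otherwise keep pi_init,
    # so degradation below pi_init occurs with probability at most delta
    # (up to the validity of the performance estimates used here).
    candidate_lcb = hoeffding_lower_bound(estimate_returns(candidate), delta)
    baseline_mean = float(np.mean(estimate_returns(pi_init)))
    return candidate if candidate_lcb >= baseline_mean else pi_init
```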
Submission Number: 156