PILAF: Optimal Human Preference Sampling for Reward Modeling

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · CC BY 4.0
TL;DR: We propose a novel sampling scheme for preference labeling that leads to better RLHF.
Abstract: As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values are inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response-sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, with optimality guarantees from both an optimization and a statistical perspective. The method is straightforward to implement and shows strong performance in iterative and online RLHF settings, where feedback curation is critical.
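The abstract does not spell out the sampling rule itself, so the sketch below illustrates one plausible reading of "policy-interpolated" sampling: candidate responses for preference labeling are drawn from log-linear interpolations of the current policy and a frozen reference policy. This is a minimal, hedged sketch, not the paper's algorithm; the function names, the interpolation coefficient `beta`, and the idea of drawing a pair that brackets the current policy are all illustrative assumptions.

```python
# Hypothetical sketch: sampling from a log-linear interpolation of the current
# policy and a reference policy, as one possible reading of PILAF's idea.
# `policy_logits`, `ref_logits`, and `beta` are illustrative, not from the paper.
import torch


def sample_interpolated(policy_logits: torch.Tensor,
                        ref_logits: torch.Tensor,
                        beta: float) -> torch.Tensor:
    """Sample one token id from a log-linear mix of two policies.

    Both logits tensors have shape (vocab_size,). beta = 0 recovers the
    current policy; beta = 1 recovers the reference policy; values outside
    [0, 1] extrapolate away from the reference (still valid after
    renormalization, since exponentials stay positive).
    """
    log_p = torch.log_softmax(policy_logits, dim=-1)   # current policy
    log_q = torch.log_softmax(ref_logits, dim=-1)      # reference policy
    mixed = (1.0 - beta) * log_p + beta * log_q        # geometric interpolation
    probs = torch.softmax(mixed, dim=-1)               # renormalize
    return torch.multinomial(probs, num_samples=1)


# Toy usage (single-token vocabulary for illustration): draw a response pair
# from two interpolated policies that bracket the current policy.
vocab_size = 8
p_logits = torch.randn(vocab_size)
q_logits = torch.randn(vocab_size)
y_a = sample_interpolated(p_logits, q_logits, beta=0.3)
y_b = sample_interpolated(p_logits, q_logits, beta=-0.3)
print(y_a.item(), y_b.item())
```

In a full RLHF pipeline this per-token rule would be applied autoregressively to generate whole responses, and the resulting pair would be sent for preference labeling; those details, and the actual choice of interpolation, are specified in the paper rather than here.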
Lay Summary: As artificial intelligence (AI) increasingly influences real-world applications, it becomes essential to ensure AI systems align with human values. Accurately capturing these values is challenging, however, because human preferences are not always easy to measure or clearly defined. A common approach, Reinforcement Learning from Human Feedback (RLHF), teaches AI what humans prefer by building models from their feedback, but this feedback is not always precise enough to guide the AI effectively. Our research introduces Policy-Interpolated Learning for Aligned Feedback (PILAF), a new response-sampling method designed to improve how AI learns from human preferences. PILAF helps ensure that AI behavior closely matches the true values humans have in mind, even when direct measurements of those values are unavailable. By tying the selection of responses sent for human feedback more closely to the learning objective, PILAF keeps the learning process consistently moving toward better alignment with actual human preferences. The approach is both theoretically sound and easy to implement, and it shows notable improvements in settings where efficient and accurate human feedback is essential, making AI systems safer and more reliable.
Primary Area: Deep Learning->Large Language Models
Keywords: Reinforcement Learning from Human Feedback, RLHF, Sampling Scheme, Preference Labeling, Optimization
Submission Number: 2262