Abstract: To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing models using reinforcement learning algorithms.
However, existing RLHF methods often misinterpret trajectories as if they were generated by an optimal policy, which causes inaccurate likelihood estimation and suboptimal learning. To address this, we propose Policy-labeled Preference Learning (PPL) within the Direct Preference Optimization (DPO) framework, which resolves this likelihood mismatch by modeling human preferences with regret, reflecting the efficiency of the executed policy. We further introduce a contrastive KL regularization term, derived from regret-based principles, to enhance sequential contrastive learning. Experiments in high-dimensional continuous control environments demonstrate that PPL significantly improves offline RLHF performance and is also effective in online settings.
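As a minimal sketch of what a regret-based preference model can look like (the notation here, trajectory segments \(\sigma\), optimal advantage \(A^{*}\), and discount \(\gamma\), is illustrative and not necessarily the paper's exact formulation), a Bradley-Terry style likelihood over accumulated regret takes the form:

\[
\text{regret}(\sigma) \;=\; -\sum_{t} \gamma^{t} A^{*}\!\big(s_{t}, a_{t}\big),
\qquad
P\!\left(\sigma^{+} \succ \sigma^{-}\right)
\;=\;
\frac{\exp\!\big(-\,\text{regret}(\sigma^{+})\big)}
     {\exp\!\big(-\,\text{regret}(\sigma^{+})\big) + \exp\!\big(-\,\text{regret}(\sigma^{-})\big)}.
\]

Under this assumed form, segments with lower accumulated regret (i.e., higher advantage under the optimal policy) receive higher preference probability, in contrast to reward-sum models that implicitly treat the labeled trajectories as optimal.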
Lay Summary: When we teach AI systems to act by learning from human feedback, current approaches often assume the examples people provide are perfectly optimal—and that can lead the AI to learn the wrong lessons.
We introduce policy-labeled preference learning (PPL), which treats each human example as coming from some real behavior and measures how much “regret” a person had—i.e., how they might wish they’d acted differently—instead of assuming perfection. We also add a new “contrastive” adjustment that sharpens the AI’s understanding of which choices are truly preferred.
In challenging simulated control tasks, PPL yields policies that perform much better both offline and when learning on the fly. This brings us closer to AI systems that reliably follow the goals we actually care about.
Primary Area: Reinforcement Learning->Deep RL
Keywords: Reinforcement Learning from Human Feedback, Direct Preference Optimization, Regret Minimization
Submission Number: 14278