Abstract: The alignment of Large Language Models (LLMs) with human preferences currently hinges on Reinforcement Learning from Human Feedback (RLHF). However, RL-based alignment methods often suffer from poor sample efficiency, slow and unstable convergence, and a tendency to learn unintended strategies, making it difficult to reach the intended alignment objective efficiently and stably. To address this challenge, we propose InfOES, a novel approach that controls the optimization direction of the policy model through Influence-based Online Experience Selection. We introduce a metric that quantifies the influence of individual experiences on a specific alignment objective in RLHF. Based on this metric, we develop a plug-and-play method that filters out experiences detrimental to alignment during the online RL process, thereby accelerating and stabilizing convergence toward the desired objective. Experimental results demonstrate that our method achieves superior alignment performance with fewer training experiences, offering a more effective and stable solution for aligning LLMs with human preferences.
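The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of filtering online experiences by an influence score before a policy update. Every name here (`Experience`, `influence_score`, `select_experiences`, `policy_update`, the threshold) is a hypothetical placeholder, not the paper's actual metric or algorithm.

```python
# Hypothetical sketch of influence-based online experience selection.
# Nothing here is from the paper: `influence_score` stands in for whatever
# metric InfOES actually defines, and `policy_update` is a placeholder for
# the underlying RLHF optimizer step (e.g., a PPO update).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experience:
    prompt: str
    response: str
    reward: float


def select_experiences(
    batch: List[Experience],
    influence_score: Callable[[Experience], float],
    threshold: float = 0.0,
) -> List[Experience]:
    """Keep only experiences whose estimated influence on the
    alignment objective exceeds the threshold (i.e., drop those
    judged detrimental to the objective)."""
    return [exp for exp in batch if influence_score(exp) > threshold]


def online_rl_step(batch, influence_score, policy_update):
    """One online RL iteration with plug-and-play experience filtering."""
    kept = select_experiences(batch, influence_score)
    if kept:  # skip the update if every experience was filtered out
        policy_update(kept)
    return kept


if __name__ == "__main__":
    # Toy usage with a dummy influence score that uses reward as a proxy.
    batch = [
        Experience("q1", "a1", reward=0.8),
        Experience("q2", "a2", reward=-0.3),
    ]
    kept = online_rl_step(
        batch,
        influence_score=lambda e: e.reward,  # placeholder metric
        policy_update=lambda exps: None,     # placeholder optimizer step
    )
    print(f"kept {len(kept)} of {len(batch)} experiences")
```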
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, data-efficient training, data influence
Contribution Types: Approaches for low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 7543