Abstract: The alignment of Large Language Models (LLMs) with human preferences currently hinges on Reinforcement Learning from Human Feedback (RLHF). However, RL-based alignment methods often suffer from poor sample efficiency, slow and unstable convergence, and a tendency to learn unintended strategies, making it difficult to reach the intended alignment objective efficiently and stably. To address this challenge, we propose InfOES, a novel approach that controls the optimization direction of the policy model through Influence-based Online Experience Selection. We introduce a metric that quantifies the influence of individual experiences on a specific alignment objective in RLHF. Based on this metric, we develop a plug-and-play method that filters out experiences detrimental to alignment during the online RL process, thereby accelerating and stabilizing convergence toward the desired objective. Experimental results demonstrate that our method achieves superior alignment performance with fewer training experiences, offering a more effective and stable solution for aligning LLMs with human preferences.
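The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of filtering online experiences by an influence score before a policy update. Every name here (`Experience`, `influence_score`, `select_experiences`, `policy_update`, the threshold) is a hypothetical placeholder, not the paper's actual metric or algorithm.

```python
# Hypothetical sketch of influence-based online experience selection.
# Nothing here is from the paper: `influence_score` stands in for whatever
# metric InfOES actually defines, and `policy_update` is a placeholder for
# the underlying RLHF optimizer step (e.g., a PPO update).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experience:
    prompt: str
    response: str
    reward: float


def select_experiences(
    batch: List[Experience],
    influence_score: Callable[[Experience], float],
    threshold: float = 0.0,
) -> List[Experience]:
    """Keep only experiences whose estimated influence on the
    alignment objective exceeds the threshold (i.e., drop those
    judged detrimental to the objective)."""
    return [exp for exp in batch if influence_score(exp) > threshold]


def online_rl_step(batch, influence_score, policy_update):
    """One online RL iteration with plug-and-play experience filtering."""
    kept = select_experiences(batch, influence_score)
    if kept:  # skip the update if every experience was filtered out
        policy_update(kept)
    return kept


if __name__ == "__main__":
    # Toy usage with a dummy influence score that uses reward as a proxy.
    batch = [
        Experience("q1", "a1", reward=0.8),
        Experience("q2", "a2", reward=-0.3),
    ]
    kept = online_rl_step(
        batch,
        influence_score=lambda e: e.reward,  # placeholder metric
        policy_update=lambda exps: None,     # placeholder optimizer step
    )
    print(f"kept {len(kept)} of {len(batch)} experiences")
```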
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, data-efficient training, data influence
Contribution Types: Approaches for low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 7543