RSPO: Reward-Driven Selective Penalization for Preference Alignment Optimization

ACL ARR 2025 May Submission 4849 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Preference optimization is a crucial research direction for aligning language models with human preferences. Direct Preference Optimization (DPO) has emerged as a prominent approach, replacing the Reinforcement Learning from Human Feedback (RLHF) paradigm with direct optimization of a preference reward objective. However, DPO treats all preference response pairs as equally important, regardless of their quality or complexity. This can lead to suboptimal generalization, especially when the training data are dominated by noisy or ambiguous preference pairs. In this paper, we propose a novel method called **R**eward-Driven **S**elective **P**enalization for Preference Alignment **O**ptimization (**RSPO**). To our knowledge, RSPO is the first method to dynamically categorize preference data based on implicit reward signals and apply selective weighting to the resulting categories. Moreover, RSPO introduces a Penalty Weighting Strategy that dynamically evaluates data quality and adjusts optimization weights during training, mitigating the challenges posed by noisy and complex preference signals and thereby improving alignment performance. Our experiments demonstrate that RSPO achieves strong performance on both the Mistral-base and Llama3-Instruct models, outperforming DPO by an average of 4.55% on AlpacaEval 2 and surpassing recent methods such as SimPO and WPO to achieve state-of-the-art (SOTA) performance. Our code is available at https://anonymous.4open.science/r/RSPO.
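
To make the abstract's description concrete, below is a minimal PyTorch sketch of how an implicit-reward-driven selective weighting could be layered on top of the DPO objective. The function name `rspo_style_loss`, the categorization rule (down-weighting pairs whose implicit reward margin is negative), and the `penalty_weight` hyperparameter are illustrative assumptions for this sketch, not the paper's exact Penalty Weighting Strategy.

```python
import torch
import torch.nn.functional as F


def rspo_style_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    beta: float = 0.1,
    penalty_weight: float = 0.5,          # hypothetical down-weight for "noisy" pairs
) -> torch.Tensor:
    """Weighted DPO-style loss where each pair's weight depends on its implicit rewards."""
    # DPO implicit rewards: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards

    # Categorize pairs by their implicit reward margin (assumed rule, not the
    # paper's exact criterion): pairs whose margin is negative are treated as
    # potentially noisy/ambiguous and receive a smaller optimization weight.
    weights = torch.where(
        margin.detach() < 0,
        torch.full_like(margin, penalty_weight),
        torch.ones_like(margin),
    )

    # Standard DPO log-sigmoid objective, selectively re-weighted per pair.
    per_pair_loss = -F.logsigmoid(margin)
    return (weights * per_pair_loss).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    b = 4
    loss = rspo_style_loss(
        policy_chosen_logps=torch.randn(b),
        policy_rejected_logps=torch.randn(b),
        ref_chosen_logps=torch.randn(b),
        ref_rejected_logps=torch.randn(b),
    )
    print(float(loss))
```

The weights are computed from detached implicit rewards so that the categorization itself does not receive gradients; the actual RSPO categorization and weighting scheme is defined in the paper.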
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: alignment, large language models, preference optimization
Languages Studied: English
Submission Number: 4849