Abstract: Optimizing policies based on human preferences is key to aligning language models with human intent.
This work focuses on reward modeling, a core component of reinforcement learning from human feedback (RLHF), and on offline preference optimization methods such as direct preference optimization (DPO).
Conventional approaches typically assume accurate annotations; however, real-world preference data often contains noise arising from human errors or biases, and this noise can be asymmetric.
We propose a principled framework for robust policy optimization under noisy preferences based on the view of reward modeling as a binary classification problem.
Specifically, we demonstrate that asymmetric preference noise can be effectively treated as symmetric noise under this framework.
This viewpoint allows us to leverage symmetric losses, well known for their robustness to label noise in classification, for reward modeling, yielding Symmetric Preference Optimization (SymPO), a novel offline preference optimization algorithm.
Theoretically, we prove that symmetric losses enable successful policy improvement even with noisy labels, as the resulting reward is rank-preserving—a property we identify as sufficient for policy improvement.
Empirical evaluations on a synthetic dataset and on real-world language model alignment tasks show that SymPO performs competitively with, or better than, existing robust methods in high-noise scenarios.
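As a rough illustration only (not the paper's exact SymPO objective), the sketch below shows a DPO-style implicit reward margin scored with the sigmoid loss, a standard example of a symmetric loss since sigmoid(-z) + sigmoid(z) = 1. The function name, argument names, and the beta value are hypothetical placeholders.

```python
import torch

def symmetric_preference_loss(policy_logps_w, policy_logps_l,
                              ref_logps_w, ref_logps_l, beta=0.1):
    """Illustrative sketch: a DPO-style margin scored with a symmetric loss.

    z is the implicit reward margin between the chosen (w) and rejected (l)
    responses. The sigmoid loss sigmoid(-z) satisfies
    loss(z) + loss(-z) = 1, the symmetry condition associated with
    robustness to label noise; standard DPO would use -log(sigmoid(z)) instead.
    """
    z = beta * ((policy_logps_w - ref_logps_w)
                - (policy_logps_l - ref_logps_l))
    return torch.sigmoid(-z).mean()
```

In this sketch, the inputs are per-example sequence log-probabilities under the trained policy and a frozen reference model; the precise objective and notation used by SymPO are defined in the paper itself.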
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Vimal_Thilak2
Submission Number: 7448