Abstract: Optimizing policies based on human preferences is key to aligning language models with human intent.
This work focuses on reward modeling, a core component of reinforcement learning from human feedback (RLHF), and on offline preference optimization methods such as direct preference optimization (DPO).
Conventional approaches typically assume accurate annotations; however, real-world preference data often contains noise arising from human errors or biases, and this noise can be asymmetric.
We propose a principled framework for robust policy optimization under noisy preferences based on the view of reward modeling as a binary classification problem.
Specifically, we demonstrate that asymmetric preference noise can be effectively treated as symmetric noise under this framework.
This viewpoint allows us to leverage symmetric losses, well known for their robustness to label noise in classification, for reward modeling, yielding Symmetric Preference Optimization (SymPO), a novel offline preference optimization algorithm.
Theoretically, we prove that symmetric losses enable successful policy improvement even with noisy labels, as the resulting reward is rank-preserving—a property we identify as sufficient for policy improvement.
Empirical evaluations on a synthetic dataset and on real-world language model alignment tasks show that SymPO performs competitively with, or better than, existing robust methods in high-noise scenarios.
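As a rough illustration only (not the paper's exact SymPO objective), the sketch below shows a DPO-style implicit reward margin scored with the sigmoid loss, a standard example of a symmetric loss since sigmoid(-z) + sigmoid(z) = 1. The function name, argument names, and the beta value are hypothetical placeholders.

```python
import torch

def symmetric_preference_loss(policy_logps_w, policy_logps_l,
                              ref_logps_w, ref_logps_l, beta=0.1):
    """Illustrative sketch: a DPO-style margin scored with a symmetric loss.

    z is the implicit reward margin between the chosen (w) and rejected (l)
    responses. The sigmoid loss sigmoid(-z) satisfies
    loss(z) + loss(-z) = 1, the symmetry condition associated with
    robustness to label noise; standard DPO would use -log(sigmoid(z)) instead.
    """
    z = beta * ((policy_logps_w - ref_logps_w)
                - (policy_logps_l - ref_logps_l))
    return torch.sigmoid(-z).mean()
```

In this sketch, the inputs are per-example sequence log-probabilities under the trained policy and a frozen reference model; the precise objective and notation used by SymPO are defined in the paper itself.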
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Vimal_Thilak2
Submission Number: 7448