Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-SA 4.0
Abstract: Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce \emph{preference embedding}, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF). Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences, where any BT reward model performs no better than random guessing. Furthermore, evaluations on downstream tasks such as AlpacaEval 2.0, after post-training language models with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.
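
To make the failure mode concrete: a BT reward model assigns each response a single scalar score, so it can never represent a cycle such as rock ≻ scissors ≻ paper ≻ rock. Below is a minimal, hypothetical NumPy sketch, not the paper's released implementation, assuming the preference probability is a sigmoid of a skew-symmetric bilinear score between two response embeddings; it shows how two-dimensional embeddings placed 120° apart reproduce the full cycle while requiring only one embedding call per response (linear, rather than quadratic, query complexity in the number of responses).

```python
# Minimal illustrative sketch (assumption: preference probability is a sigmoid
# of a skew-symmetric bilinear score between response embeddings).
# This illustrates the idea only; it is not the paper's released code.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 2-D skew-symmetric operator: score(v_i, v_j) = v_i^T R v_j = -score(v_j, v_i),
# so P(i beats j) + P(j beats i) = 1 by construction.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])

def preference_prob(v_i, v_j):
    """Probability that response i is preferred over response j."""
    return sigmoid(v_i @ R @ v_j)

# Embed three responses at 120-degree intervals on the unit circle; this
# realizes the cycle rock > scissors, scissors > paper, paper > rock.
angles = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
rock, paper, scissors = (np.array([np.cos(a), np.sin(a)]) for a in angles)

print(preference_prob(rock, scissors))   # ~0.70 (> 0.5): rock beats scissors
print(preference_prob(scissors, paper))  # ~0.70 (> 0.5): scissors beats paper
print(preference_prob(paper, rock))      # ~0.70 (> 0.5): paper beats rock
```

No assignment of scalar rewards can satisfy all three inequalities at once (it would require r(rock) > r(scissors) > r(paper) > r(rock)), which is why a BT reward model can do no better than random guessing on such cyclic preferences.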
Lay Summary: When we try to teach AI to understand human preferences, current methods often struggle, especially when our choices are complex or seem contradictory (like preferring rock over scissors, scissors over paper, but paper over rock). This makes it hard to build AI that truly aligns with our nuanced values. We developed a new approach called "preference embedding," where instead of assigning a simple score to each option, our "General Preference embedding Model" (GPM) maps options into a richer, multi-dimensional space. This allows the GPM to efficiently grasp these complex, even contradictory, preferences. We also use this deeper understanding to fine-tune AI language models through a process we call "General Preference Optimization" (GPO). Our GPM consistently outperforms older models, particularly in understanding contradictory preferences where older methods perform no better than random guessing. Using it with GPO improves AI performance on tasks requiring sensitive alignment with human values. This work paves the way for AI that better reflects our subtle judgments.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/general-preference/general-preference-model
Primary Area: Theory->Deep Learning
Keywords: reinforcement learning from human feedback, preference modeling
Submission Number: 4002