Probability-based Reward Value Combination Method for Multi-Objective Alignment

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: value alignment, RLHF
TL;DR: Proposes a Bradley-Terry-based reward method for multi-objective RLHF that reduces preference interference and scale issues, improving alignment performance.
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a fundamental approach for aligning large language models (LLMs) with human values. While alignment with a single preference has become relatively mature, current Multi-Objective RLHF (MORLHF) pipelines still face several challenges, such as interference among preference signals, scale inconsistencies, and high sensitivity to hyperparameters. These limitations hinder the scalability and stability of MORLHF. Taking the Bradley-Terry (BT) model as the mathematical foundation for reward modeling, we analyze how existing linear reward combination methods distort its preference probability structure and identify the root causes of signal interference across different preferences. To address these challenges, we propose an improved reward computation method that uses BT preference probabilities and comparison samples to construct a unified reward signal for multi-objective alignment. Our approach preserves the BT probabilistic structure, harmonizes the scale across diverse preferences, reduces signal interference, and enables more effective use of additional generated samples, leading to superior performance gains as the number of samples increases. Moreover, our method generalizes to various RLHF algorithms, including PPO and GRPO. Experimental results on safety alignment tasks show that our approach facilitates the training of LLMs aligned with diverse human preferences, achieving a stronger Pareto frontier than existing methods and yielding greater improvements as sample generation scales.
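The abstract does not spell out the exact combination rule, so the following is only a minimal sketch of the idea it describes, assuming per-objective scalar reward models and a weighted mean over BT preference probabilities as the aggregator. The function names, the averaging over comparison samples, and the weighting scheme are illustrative assumptions, not the paper's method.

```python
import numpy as np

def linear_combined_reward(rewards: np.ndarray, weights: np.ndarray) -> float:
    """Baseline linear scalarization: a weighted sum of raw per-objective
    rewards. This is the scheme the abstract argues distorts the BT
    probability structure and is sensitive to scale mismatches."""
    return float(weights @ rewards)

def bt_combined_reward(sample_rewards: np.ndarray,
                       comparison_rewards: np.ndarray,
                       weights: np.ndarray) -> float:
    """Probability-based combination (sketch): map each objective's reward
    to a Bradley-Terry preference probability against a set of comparison
    samples, P = sigmoid(r_i(y) - r_i(y')), then aggregate. Because every
    objective's signal becomes a probability in (0, 1), the scales are
    harmonized by construction.

    sample_rewards:     shape (n_objectives,), rewards of the candidate response
    comparison_rewards: shape (n_comparisons, n_objectives), rewards of
                        additionally generated comparison responses
    weights:            shape (n_objectives,), preference weights
    """
    # BT win probability of the candidate over each comparison sample,
    # per objective: sigmoid of the reward difference.
    diffs = sample_rewards[None, :] - comparison_rewards   # (n_comp, n_obj)
    probs = 1.0 / (1.0 + np.exp(-diffs))
    # Average over comparison samples (more samples give a lower-variance
    # estimate), then combine across objectives with the weights.
    per_objective = probs.mean(axis=0)                     # (n_obj,)
    return float(weights @ per_objective)

# Toy usage: 3 objectives (e.g. helpfulness, harmlessness, honesty),
# 4 extra comparison samples drawn from the same policy.
rng = np.random.default_rng(0)
candidate = rng.normal(size=3)
comparisons = rng.normal(size=(4, 3))
w = np.array([0.5, 0.3, 0.2])
print(linear_combined_reward(candidate, w))
print(bt_combined_reward(candidate, comparisons, w))
```

The contrast with the linear baseline is the point of the sketch: the sigmoid maps every objective onto a common (0, 1) scale before weighting, which is one way to read the abstract's claims about scale harmonization and improved use of additional generated samples.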
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 23409