Direct Reward Optimization: A Point-wise Alignment Approach

ICLR 2026 Conference Submission 25193 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Alignment Algorithms, Large Language Models, Bradley-Terry
Abstract: Direct Alignment Algorithms (DAAs) are widely used to align Large Language Models (LLMs) with human preferences. Most current DAAs rely on pairwise optimization objectives derived from variants of Direct Preference Optimization (DPO). However, these methods focus only on the pairwise differences between samples and cannot prevent the optimization from reducing the probabilities of preferred responses. To address this problem, we propose Direct Reward Optimization (DRO), an algorithm that uses an explicit reward model to optimize the policy by setting an exact probability target for each response. DRO decouples the target reward differential from the bias in the alignment objective and exploits relationships not only within but also across response pairs. Extensive experiments show that DRO outperforms existing methods while providing control over the policy's response probabilities.
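The abstract does not give the exact objective, so the following is only a minimal sketch of what a point-wise alignment loss with an explicit per-response probability target could look like. The function name pointwise_dro_loss, the squared-error form, and the beta and bias parameters are illustrative assumptions, not the paper's actual formulation.

import torch
import torch.nn.functional as F

def pointwise_dro_loss(policy_logps, ref_logps, rewards, beta=0.1, bias=0.0):
    """Illustrative point-wise alignment loss (assumed form, not the paper's objective).

    policy_logps: log pi_theta(y|x) for each response, shape (B,)
    ref_logps:    log pi_ref(y|x) for each response, shape (B,)
    rewards:      scalar reward r(x, y) from an explicit reward model, shape (B,)
    beta:         scale relating reward units to log-probability shifts (assumption)
    bias:         offset decoupling the overall reward level from the target (assumption)
    """
    # Each response gets its own log-probability target derived from its reward,
    # rather than only a pairwise margin between chosen and rejected responses.
    target_logps = ref_logps + (rewards - bias) / beta
    # A squared error pushes every response toward its target, so a preferred
    # response cannot silently lose probability mass as long as its target is met.
    return F.mse_loss(policy_logps, target_logps)

Unlike a DPO-style loss, which depends only on the difference between two responses' log-ratios, a point-wise target like the one above constrains each response's probability individually, which is the behavior the abstract describes as "control over the policy's response probabilities."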
Supplementary Material: zip
Primary Area: generative models
Submission Number: 25193