Reward Alignment Optimization: A Direct Point-wise Alignment Approach

ACL ARR 2026 January Submission 10704 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Model, Alignment, Bradley-Terry
Abstract: Direct Alignment Algorithms (DAAs) such as DPO simplify RLHF by optimizing policies directly from preference pairs. However, the Bradley–Terry probability-gap objective can induce likelihood displacement and, under weak KL constraints, may even reduce the probability of preferred responses, while the implicit rewards it learns can generalize poorly. We propose Reward Alignment Optimization (RAO), a point-wise direct alignment method that uses an explicit reward model to specify exact target generation probabilities and aligns the policy offline toward them. Our key insight is a theoretical principle we call "prefix consistency", which links the normalization terms of prompts that share a prefix. Leveraging this property, RAO decouples target reward differentials from bias terms, prevents decreases in preferred-response probability, and better exploits reward information both within and across prompts. Extensive experiments on multiple base LLMs show that RAO consistently outperforms existing DAAs while enabling controllable target probability distributions.
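To make the point-wise idea concrete, the following is a minimal PyTorch sketch of a regression-style loss of this flavor: the policy's sequence log-probability is pushed toward a target derived from an explicit reward model, per response rather than via a pairwise probability gap. The function name rao_style_pointwise_loss, the target construction log pi_ref + r/beta, and the dropped normalization constant are illustrative assumptions, not the paper's actual objective or its prefix-consistency derivation.

    import torch

    def rao_style_pointwise_loss(policy_logps, ref_logps, rewards, beta=0.1):
        """Point-wise alignment sketch: regress the policy's sequence
        log-probability toward a reward-derived target log-probability.

        policy_logps: log pi_theta(y|x) per (x, y) in the batch, shape (B,)
        ref_logps:    log pi_ref(y|x) under a frozen reference model, shape (B,)
        rewards:      explicit reward-model scores r(x, y), shape (B,)
        beta:         temperature trading off reward against the reference
        """
        # Hypothetical target: shift the reference log-prob by the scaled reward.
        # The per-prompt normalization constant is intractable; a prefix-
        # consistency-style argument would let prompts sharing a prefix share
        # it, so this sketch simply drops it (absorbed into the target).
        target_logps = ref_logps + rewards / beta
        # Squared error drives pi_theta(y|x) toward an exact target probability
        # for every response, preferred or dispreferred alike.
        return torch.mean((policy_logps - target_logps) ** 2)

    # Toy usage with random stand-ins for model outputs.
    policy_logps = torch.randn(8, requires_grad=True)
    loss = rao_style_pointwise_loss(policy_logps, torch.randn(8), torch.randn(8))
    loss.backward()

Because each response gets its own target rather than competing in a pairwise gap, a loss of this shape cannot be reduced by lowering the preferred response's probability, which is the failure mode the abstract attributes to Bradley–Terry objectives.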
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: optimization methods,generative models
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 10704