TL;DR: Use a parameter alpha to control the shape of the reward function in direct alignment training, improving alignment performance.
Abstract: Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged, in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Popular examples of DAAs include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon in which the probabilities of preferred responses are undesirably reduced. In this paper, we argue that, for DAAs, the shape of the reward function matters. We introduce \textbf{AlphaPO}, a new DAA method that leverages an $\alpha$-parameter to change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. Compared to SimPO, one of the best-performing DAAs, AlphaPO leads to about a 7\% to 10\% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B, while achieving a 15\% to 50\% relative improvement over DPO on the same models. The analysis and results presented highlight the importance of the reward shape and show how one can systematically change it to affect training dynamics and improve alignment performance.
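To make the idea of reward shaping concrete, the sketch below contrasts a SimPO-style length-normalized log reward with one possible α-parameterized reshaping that reduces to the log reward as α → 0. This is a minimal illustration under assumed choices (the transform, the helper names `log_reward` and `alpha_reward`, and the default β), not the paper's exact objective, which is defined in the full text.

```python
import torch
import torch.nn.functional as F

def log_reward(logp_sum: torch.Tensor, length: torch.Tensor, beta: float = 2.0) -> torch.Tensor:
    """SimPO-style reward: length-normalized sequence log-likelihood scaled by beta."""
    return beta * logp_sum / length

def alpha_reward(logp_sum: torch.Tensor, length: torch.Tensor,
                 beta: float = 2.0, alpha: float = 0.1) -> torch.Tensor:
    """Illustrative alpha-shaped reward (an assumed form, not the authors' implementation).

    Applies a Box-Cox-style transform to the length-normalized likelihood
    u = exp(logp_sum / length):
        f_alpha(u) = (1 - u**(-alpha)) / alpha
    which tends to log(u) as alpha -> 0 (recovering the standard log reward),
    while nonzero alpha bends the curve and changes gradient magnitudes in the
    low- and high-likelihood regimes.
    """
    avg_logp = logp_sum / length
    return beta * (1.0 - torch.exp(-alpha * avg_logp)) / alpha

# Example: pairwise preference margin between chosen and rejected responses
# (hypothetical log-probabilities and lengths), plugged into a DPO/SimPO-style loss.
chosen = alpha_reward(torch.tensor([-35.0]), torch.tensor([20.0]))
rejected = alpha_reward(torch.tensor([-60.0]), torch.tensor([20.0]))
loss = -F.logsigmoid(chosen - rejected)
```

Sweeping α in a sketch like this shows how the reward curvature, and hence the per-example gradient, can be tuned without changing the underlying pairwise-preference loss.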
Lay Summary: Large language models are often fine-tuned to follow human instructions using methods like Reinforcement Learning with Human Feedback (RLHF). Newer Direct Alignment Algorithms (DAAs), such as DPO and SimPO, skip the separate reward-modeling step and directly optimize for human preferences, but sometimes at the cost of reducing the model's likelihood of generating preferred responses, a problem known as likelihood displacement.
This paper introduces AlphaPO, a simple yet powerful tweak: it adds a tunable parameter α to reshape the reward function itself, allowing precise control over how aggressively the model shifts probability mass toward preferred outputs without overshooting or under-optimizing. By varying α, AlphaPO produces training trajectories that better balance margin improvement against maintaining high preferred-response probabilities, mitigating both over-optimization and catastrophic likelihood displacement.
In experiments on state-of-the-art 7–8 billion-parameter instruct models, AlphaPO boosts alignment performance by 7–10% relative to SimPO and by 15–50% relative to DPO, without producing longer or more verbose outputs. This highlights that the shape of the reward function is a crucial, previously underexplored knob for aligning LLMs with human values.
Primary Area: Deep Learning->Large Language Models
Keywords: llm, large language models, deep learning, alignment, preference tuning, post training, reward shaping
Submission Number: 13723