Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

Published: 18 Sept 2025, Last Modified: 29 Oct 2025
Venue: NeurIPS 2025 poster
License: CC BY 4.0
Keywords: direct preference optimization, human preference alignment, regularization
TL;DR: An importance-sampling-based method to mitigate over-optimization in Direct Alignment Algorithms for language model alignment
Abstract: Recently, Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. Surprisingly, although DAAs do not use a separate proxy reward model as RLHF does, their performance can still deteriorate over the course of training -- an over-optimization phenomenon also observed in RLHF, where the learning policy exploits inaccuracies of the reward model to achieve high proxy rewards. One attributed source of over-optimization in DAAs is the under-constrained nature of their offline optimization, which can gradually shift probability mass toward non-preferred responses that are not present in the preference dataset. This paper proposes a novel importance-sampling approach, called IS-DAAs, to mitigate the distribution-shift problem of offline DAAs. IS-DAAs multiply the DAA objective by an importance ratio that accounts for the reference policy distribution, and avoid the high variance associated with importance sampling by clipping the importance ratio to a maximum value. Our extensive experiments demonstrate that IS-DAAs effectively mitigate over-optimization, especially under low regularization strength, and achieve better performance than other methods designed to address this problem.
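To make the idea concrete, below is a minimal PyTorch-style sketch (not the authors' released implementation) of a DPO loss whose per-example term is reweighted by a clipped importance ratio involving the reference policy, as the abstract describes. The exact form of the ratio, the clip threshold c_max, and the function name is_dpo_loss are assumptions made for illustration only.

```python
# Illustrative sketch only: clipped importance-sampling reweighting of a
# DPO-style loss. The specific ratio definition below is an assumption,
# not necessarily the formulation used in the paper.

import torch
import torch.nn.functional as F

def is_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                beta=0.1, c_max=10.0):
    """DPO loss with a clipped importance weight per example.

    All *_logps are summed sequence log-probabilities, shape (batch,).
    """
    # Standard DPO logits: beta * (policy log-ratio minus reference log-ratio).
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    dpo_losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))

    # Assumed importance weight: ratio of reference-policy to current-policy
    # sequence likelihoods for the paired responses, clipped at c_max to
    # control variance (clipping is the mechanism named in the abstract;
    # the ratio itself is an assumption of this sketch).
    log_w = (ref_chosen_logps + ref_rejected_logps) - \
            (policy_chosen_logps + policy_rejected_logps)
    weights = torch.exp(log_w).clamp(max=c_max).detach()

    return (weights * dpo_losses).mean()
```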
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 24413