Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization

Published: 22 Jan 2025, Last Modified: 11 Feb 2025 · ICLR 2025 Spotlight · CC BY 4.0
Keywords: Reinforcement Learning Theory, Offline Reinforcement Learning, single-policy concentrability, pessimism, RLHF
TL;DR: We propose a new theoretical algorithm for offline alignment/RLHF, Chi-Squared Preference Optimization, which is simple---a one-line change to DPO---yet enjoys the strongest known provable guarantees.
Abstract: Language model alignment methods, such as reinforcement learning from human feedback (RLHF), have led to impressive advances in language model capabilities. However, existing techniques are limited by a widely observed phenomenon known as *overoptimization*, where the quality of the language model degrades over the course of the alignment process. Overoptimization occurs when a language model overfits to inaccuracies in an offline reward model (whether explicit or implicit) and drifts away from preferred responses covered by the data. To discourage such distribution shift, offline alignment methods typically employ KL-regularization, but this, as we show, is too weak to prevent degradation in performance. Can we, then, design an efficient algorithm that is provably robust to overoptimization? In this paper, we advance the theoretical understanding of sample-efficient offline alignment and introduce a new algorithm called $\chi^2$-Preference Optimization ($\chi$PO). $\chi$PO is a one-line change to Direct Preference Optimization (DPO; Rafailov et al., 2023) that modifies only the logarithmic link function in the DPO objective. Despite this minimal change, $\chi$PO implicitly implements the principle of *pessimism in the face of uncertainty* via regularization with the $\chi^2$-divergence---which quantifies uncertainty more effectively than KL-regularization---and provably alleviates overoptimization, achieving sample-complexity guarantees based on *single-policy concentrability*---the gold standard in offline reinforcement learning. This guarantee makes $\chi$PO the first simple, yet general-purpose offline alignment algorithm that is provably robust to overoptimization.
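The abstract describes $\chi$PO as a one-line change to DPO that replaces the logarithmic link function with one induced by $\chi^2$-regularization, but it does not spell out the modified link. The sketch below is a rough illustration only: it contrasts a standard DPO-style loss with a $\chi$PO-style variant under the *assumption* that the modified link takes the mixed form $\phi(z) = z + \log z$ (so the implicit regularizer combines $\chi^2$ and KL terms). The function names, signatures, and the exact link are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO-style loss: the log link applied to policy/reference ratios."""
    chosen_logratio = logp_chosen - ref_logp_chosen        # log(pi/pi_ref), preferred response
    rejected_logratio = logp_rejected - ref_logp_rejected  # log(pi/pi_ref), dispreferred response
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

def chi_po_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Hypothetical chi-PO-style loss (a sketch, not the paper's implementation):
    the log link phi(z) = log(z) is assumed to be replaced by phi(z) = z + log(z)."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    # Assumed one-line change: add the raw ratio exp(log-ratio) to the log-ratio.
    chosen_link = torch.exp(chosen_logratio) + chosen_logratio
    rejected_link = torch.exp(rejected_logratio) + rejected_logratio
    margin = beta * (chosen_link - rejected_link)
    return -F.logsigmoid(margin).mean()
```

Under this assumed link, the only difference from DPO is the extra `torch.exp(...)` term added to each log-ratio before taking the sigmoid cross-entropy, which is consistent with the paper's framing of $\chi$PO as a one-line modification; consult the paper for the exact objective.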
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11056