Distortion of AI Alignment Revisited: RLHF Is a Decent Utilitarian Aligner

Published: 02 Mar 2026, Last Modified: 17 Apr 2026
Venue: AFAA 2026 Poster
License: CC BY 4.0
Track: Main Papers Track (6 to 9 pages)
Keywords: AI alignment, social choice, distortion, RLHF, PPO
Abstract: While Reinforcement Learning from Human Feedback (RLHF) is the standard paradigm for aligning large language models with human preferences, its effectiveness in pluralistic settings has been called into question. Notably, recent work by Golz et al. (2025) demonstrated that the *distortion*, defined as the multiplicative gap between the average user utility of the RLHF policy and the optimal average utility, can scale exponentially with the Bradley-Terry temperature parameter $\beta$ when users have heterogeneous preferences. In this work, we present a fine-grained analysis of the distortion of RLHF with reward clipping and demonstrate that such exponential degradation is not a fundamental property of the algorithm, but rather a consequence of distribution mismatch between the distribution generating the preference data ($\mu$) and the KL reference policy ($\pi_{\mathrm{ref}}$). To this end, we establish tight upper and lower bounds on the distortion of RLHF across multiple regimes of the KL regularization strength. We show that in a representative regime, under the Bradley-Terry model, the distortion is $\tilde{\Theta}(\beta B)$, where $B$ is an upper bound on the log density ratio between $\mu$ and $\pi_{\mathrm{ref}}$. In particular, when there is no distribution mismatch (i.e., $\mu = \pi_{\mathrm{ref}}$), RLHF achieves the optimal distortion of $O(\beta)$ up to a constant factor. Our results suggest that, to approximately maximize average utility with RLHF, it is preferable to use on-policy preference data or to fine-tune the model on samples from $\mu$ before running RLHF.
Submission Number: 11
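For readers less familiar with the social-choice terminology, the following is a minimal sketch of the quantities named in the abstract, written in standard (assumed) notation rather than the paper's own: $u_i(y)$ denotes user $i$'s utility for output $y$, the averages $\mathbb{E}_i$ are over users, $\hat r$ is a reward model fit to pairwise preferences sampled from $\mu$, and $\lambda$ is the KL-regularization strength (labelled differently here so it does not collide with the Bradley-Terry temperature $\beta$). The exact definitions and regimes are those of the paper, not this sketch.

$$
\mathrm{dist}(\pi) \;=\; \frac{\max_{\pi'} \, \mathbb{E}_{i}\,\mathbb{E}_{y \sim \pi'}\!\left[u_i(y)\right]}{\mathbb{E}_{i}\,\mathbb{E}_{y \sim \pi}\!\left[u_i(y)\right]},
\qquad
\pi_{\mathrm{RLHF}} \;=\; \arg\max_{\pi} \; \mathbb{E}_{y \sim \pi}\!\left[\hat r(y)\right] \;-\; \lambda \, \mathrm{KL}\!\left(\pi \,\Vert\, \pi_{\mathrm{ref}}\right),
\qquad
\sup_{y} \, \log \frac{\mu(y)}{\pi_{\mathrm{ref}}(y)} \;\le\; B.
$$

Under this reading, a larger $B$ means the preference data is generated further from the reference policy, and $\mu = \pi_{\mathrm{ref}}$ is the on-policy case highlighted in the abstract's final sentence.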