Keywords: AI Alignment, Reinforcement Learning from Human Feedback, Pluralistic Alignment, Preference Learning, Social Choice Theory
TL;DR: We present the first positive result on aggregating the expected utility of a heterogeneous population in LLM alignment: a practical estimator with strong empirical performance and provable fast rates of convergence.
Abstract: Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a naïve probabilistic model to pairwise comparison data (say, over prompt-completion pairs) yields an inconsistent estimate of the population-average utility, a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, which is simple, provably consistent, and efficient: it replaces cross-entropy with a binary classification loss in the aggregation step. This modification recovers consistent ordinal alignment under mild assumptions, without requiring explicit modeling of user heterogeneity, and achieves the first polynomial finite-sample error bounds in this setting. Using standard benchmark experiments and a new empirical methodology for assessing the impact of heterogeneity, we find that the sign estimator substantially reduces preference distortion relative to standard RLHF: it cuts disagreement with true population preferences from 12% to 8% and reduces angular estimation error by nearly 35%. Our empirical setup leverages digital twins (LLMs calibrated to real-world US panelists) to simulate realistic population-level heterogeneity and obtain a ground-truth alignment target for evaluating different estimators.
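To make the aggregation-step change concrete, here is a minimal sketch contrasting the standard Bradley-Terry cross-entropy objective with a classification-style loss on the reward margin. Assumptions: a PyTorch reward model that scores the preferred and rejected completion of each pair, and a hinge surrogate standing in for the paper's binary classification loss, whose exact form is not specified on this page.

```python
import torch

def bt_cross_entropy_loss(r_pref: torch.Tensor, r_rej: torch.Tensor) -> torch.Tensor:
    """Standard RLHF aggregation: Bradley-Terry logistic log-loss on the
    reward margin. Per the abstract, this estimator is inconsistent for
    the population-average utility under heterogeneous raters."""
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

def sign_estimator_loss(r_pref: torch.Tensor, r_rej: torch.Tensor) -> torch.Tensor:
    """Illustrative stand-in for the sign estimator: treat each comparison
    as binary classification of the margin's sign, here via a hinge
    surrogate for the 0-1 loss (the surrogate choice is an assumption)."""
    margin = r_pref - r_rej
    return torch.clamp(1.0 - margin, min=0.0).mean()

# Toy usage: rewards the model assigns to preferred vs. rejected completions.
r_pref = torch.tensor([1.2, 0.3, 0.9])
r_rej = torch.tensor([0.4, 0.5, -0.1])
print(bt_cross_entropy_loss(r_pref, r_rej))  # cross-entropy objective
print(sign_estimator_loss(r_pref, r_rej))    # classification-loss objective
```

The intuition for the swap is that a classification loss depends only on which completion the rater preferred (the sign of the margin), not on a parametric model of preference probabilities, which is what makes it robust to unmodeled rater heterogeneity.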
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 23560