Keywords: AI Alignment, Population-Proportional Alignment, Social Choice Theory, Axiomatic Framework, Rank Aggregation, Pluralistic Alignment, Preference-based Reinforcement Learning, Reinforcement Learning from Human Feedback, Nash Learning from Human Feedback, Large Language Model
TL;DR: To address bias and manipulability issues in RLHF and NLHF, we propose a novel preference learning framework grounded in social choice theory that achieves proportional alignment with the true population distribution of evaluator preferences.
Abstract: Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of certain opinions or groups and susceptible to strategic manipulation. To address this issue, we develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. Grounded in social choice theory, our approach infers the feasible set of evaluator population distributions directly from pairwise comparison data. Using these estimates, the algorithm constructs a policy that satisfies foundational axioms from social choice theory, namely monotonicity and Pareto efficiency, as well as our newly introduced axioms of population-proportional alignment and population-bounded manipulability. Moreover, we propose a soft-max relaxation method that smoothly trades off population-proportional alignment against selection of the Condorcet winner (the option that beats all others in pairwise comparisons). Finally, we validate the effectiveness and scalability of our approach through experiments on both tabular recommendation tasks and large language model alignment.
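The minimal sketch below is not the paper's algorithm; it only illustrates, under an assumed Copeland-style scoring of pairwise comparison data, how a temperature-controlled soft-max can interpolate between concentrating on the Condorcet winner (when one exists) and spreading probability mass across options. The matrix `P`, the function name, and the temperature `tau` are hypothetical placeholders for illustration.

```python
import numpy as np

def softmax_policy_from_pairwise(P, tau=1.0):
    """Toy soft-max relaxation over Copeland scores (illustrative only).

    P is an (n, n) matrix where P[i, j] is the empirical fraction of
    pairwise comparisons in which option i was preferred to option j
    (P[i, j] + P[j, i] = 1; the diagonal is ignored).
    """
    # Copeland-style score: number of options that i beats by majority.
    wins = (P > 0.5).astype(float)
    np.fill_diagonal(wins, 0.0)
    copeland = wins.sum(axis=1)
    # Soft-max over scores: a small tau concentrates on the Copeland winner
    # (the Condorcet winner whenever one exists); a larger tau spreads
    # probability mass more evenly across options.
    logits = copeland / tau
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: 3 options, option 0 is the Condorcet winner.
P = np.array([[0.5, 0.6, 0.7],
              [0.4, 0.5, 0.8],
              [0.3, 0.2, 0.5]])
print(softmax_policy_from_pairwise(P, tau=0.1))  # ~[1, 0, 0]: Condorcet winner
print(softmax_policy_from_pairwise(P, tau=5.0))  # closer to uniform
```

In this stand-in, the temperature plays the role of the trade-off knob described in the abstract; the paper's actual construction additionally ties the policy to the inferred population distribution of evaluator preferences rather than to raw Copeland scores.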
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14366