Keywords: Social choice, reinforcement learning from human feedback, AI alignment
Abstract: Reinforcement learning with human preference feedback is the gold-standard approach for making current AI systems helpful, safe, and aligned with human values. Recent research has demonstrated a tight connection between the objective functions used for alignment with human preferences and voting rules from social choice theory that aggregate diverse preferences. This connection provides a principled way to study the advantages and disadvantages of a given alignment objective by analyzing the social-choice-theoretic properties of the corresponding voting rule. Prior work in this direction has focused on variants of standard alignment objective functions and connected them with well-known social choice rules such as the Borda count and the von Neumann winner. However, practical alignment algorithms typically regularize toward a reference policy in order to preserve the capabilities acquired during pre-training. Such regularization could distort the objective and hence change the social-choice-theoretic properties of the corresponding voting rule. To determine whether this occurs, we study the effect of regularization on the social choice rules corresponding to standard alignment methods, and find that for the alignment objective corresponding to the von Neumann winner, regularization strictly improves the social-choice-theoretic properties of the rule. At the same time, we prove that the standard RLHF objective, which corresponds to the Borda count rule, offers no such improvement and indeed has clear social-choice-theoretic drawbacks compared to the von Neumann winner. Taken together, our results provide a principled justification from social choice theory for using the von Neumann winner objective in practical alignment with human preferences.
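For readers unfamiliar with the two objectives the abstract contrasts, a minimal sketch of their standard formulations is given below. The notation (reward r, reference policy \pi_{\mathrm{ref}}, regularization strength \beta, preference probability P(y \succ y')) is assumed here for illustration and may differ from the paper's exact setup.

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}

% Standard KL-regularized RLHF objective (notation assumed, not from the paper):
% maximize expected reward while regularizing toward a reference policy \pi_ref
% with strength \beta.
\begin{equation*}
  \max_{\pi}\;
  \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\bigl[r(x,y)\bigr]
  \;-\;\beta\,\mathbb{E}_{x\sim\mathcal{D}}\bigl[\mathrm{KL}\bigl(\pi(\cdot\mid x)\,\Vert\,\pi_{\mathrm{ref}}(\cdot\mid x)\bigr)\bigr]
\end{equation*}

% von Neumann winner: a (possibly mixed) policy \pi^* whose responses are preferred
% to those of every alternative policy at least half the time, under pairwise
% preference probabilities P(y \succ y').
\begin{equation*}
  \mathbb{E}_{y\sim\pi^{*},\,y'\sim\pi}\bigl[P(y\succ y')\bigr]\;\ge\;\tfrac{1}{2}
  \qquad\text{for all policies } \pi.
\end{equation*}

\end{document}
```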
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20006