Keywords: AI alignment, reinforcement learning, democratic AI alignment, pluralistic AI alignment, computational social choice
Abstract: Currently, the most powerful AI systems are aligned with human values via reinforcement learning from human feedback (RLHF). Yet RLHF models human preferences as noisy samples from a single linear ordering of shared human values and is unable to accommodate democratic AI alignment. In particular, the standard approach fails to represent and reflect the diverse and conflicting perspectives of pluralistic human values. Recent research introduced the theoretically principled notion of quantile fairness for training a reinforcement learning policy in the presence of multiple, competing sets of values held by different agents. More recent work provided an algorithm that achieves quantile fairness in the tabular setting, assuming explicit access to the full set of states, actions, and transition probabilities of the Markov decision process (MDP). These methods require solving linear programs whose constraint sets scale with the number of states and actions, making it unclear how to translate them into practical training algorithms that can only take actions and observe individual transitions from the current state. In this paper, we design, and prove the correctness of, a new algorithm for quantile fairness that uses standard policy optimization as a black box, with no direct dependence on the number of states or actions. We further validate our theoretical results empirically, demonstrating that our algorithm achieves fairness guarantees competitive with prior work while being orders of magnitude more efficient in both computation and the number of samples required. Our algorithm opens a new avenue for provable fairness guarantees in any setting where standard policy optimization is possible.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19993
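The abstract describes an algorithm that attains quantile fairness by calling standard policy optimization as a black box. The paper's actual construction is not given here; the sketch below is only a minimal illustration, under the assumption that one can invoke a policy-optimization oracle on a reweighted scalar reward and estimate per-group returns from rollouts. The names policy_optimizer and evaluate_group_returns, the toy return model, and the multiplicative-weights reweighting toward worst-off groups are all hypothetical placeholders, not the paper's method.

# Illustrative sketch only -- NOT the paper's algorithm. It shows the general
# shape of protecting a quantile of per-group returns through repeated calls
# to a black-box policy optimizer on reweighted rewards. All names and the
# toy return model are hypothetical placeholders.
import numpy as np

K = 5        # number of agents/groups, each with its own reward signal
ALPHA = 0.2  # target quantile of per-group returns to protect
ROUNDS = 50  # meta-iterations of the outer reweighting loop
ETA = 0.5    # multiplicative-weights step size

def policy_optimizer(weights):
    # Stand-in for any standard policy-optimization routine (e.g., PPO)
    # trained on the scalarized reward sum_k weights[k] * r_k.
    # Here the "policy" is just the normalized weight vector itself.
    return weights / weights.sum()

def evaluate_group_returns(policy):
    # Stand-in for rollout-based estimates of each group's expected return
    # under `policy`; a toy model where attention to a group helps it.
    base = np.linspace(0.2, 1.0, K)   # hypothetical per-group difficulty
    return base * (0.5 + policy)

weights = np.ones(K) / K
for _ in range(ROUNDS):
    policy = policy_optimizer(weights)           # black-box call
    returns = evaluate_group_returns(policy)     # per-group expected returns
    # Upweight groups whose return falls below the current alpha-quantile so
    # that the next black-box call focuses on the worst-off groups.
    threshold = np.quantile(returns, ALPHA)
    weights = weights * np.exp(ETA * (returns < threshold))
    weights = weights / weights.sum()

final_returns = evaluate_group_returns(policy_optimizer(weights))
print("per-group returns:", np.round(final_returns, 3))
print("protected quantile:", round(float(np.quantile(final_returns, ALPHA)), 3))

In a real training pipeline, policy_optimizer would wrap an off-the-shelf RL trainer and evaluate_group_returns would use Monte-Carlo rollouts; only the outer loop, which never enumerates states or actions, is sketched here.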