Keywords: Human Feedback, Pluralistic reward model
TL;DR: We identify a weak structural assumption on the reward model that permits pluralistic, inconsistent rewards while still allowing provably efficient learning.
Abstract: Aligning AI systems with diverse human values requires learning from feedback that may be pluralistic and inconsistent: different individuals or groups can rank the same options differently depending on the context. In this paper, we provide theoretical foundations for this challenge in the framework of combinatorial bandits with semi-bandit feedback, where an online learner selects a size-$k$ set of items at each step and observes set-dependent, potentially inconsistent per-item rewards. We present a simple structural assumption -- pluralistic reward inconsistency with structural monotonicity (PRISM) -- that formalizes when learning remains tractable despite inconsistent preferences. PRISM allows for intransitive and contradictory feedback, yet subsumes many widely used preference models (e.g., multinomial logit and random utility models). Most importantly, we prove that under PRISM a simple UCB-based algorithm finds the optimal set and achieves $O\left(\min(\frac{k^3 n \log T}{\epsilon}, k^2\sqrt{n T \log T})\right)$ regret, nearly matching our $\Omega(\frac{n\log T}{\epsilon})$ lower bound. Our results demonstrate that provably efficient learning from pluralistic human feedback is possible under mild structural conditions, providing a theoretical basis for the practical success of simple algorithms in the presence of value pluralism.
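To make the setting concrete, the following is a minimal sketch of a generic top-$k$ UCB learner with semi-bandit feedback. It is not the paper's algorithm or analysis: the abstract does not specify the PRISM condition or the exact index, so the confidence radius, the function names `topk_ucb` and `toy_pull`, and the toy set-dependent reward model are all assumptions made for illustration.

```python
import math
import random


def topk_ucb(n, k, horizon, pull, seed=0):
    """Generic top-k UCB with semi-bandit feedback (illustrative sketch).

    `pull(chosen, rng)` is a hypothetical environment callback returning
    a dict that maps each chosen item to its observed per-item reward.
    """
    rng = random.Random(seed)
    counts = [0] * n      # number of observations per item
    means = [0.0] * n     # running empirical mean reward per item
    history = []
    for t in range(1, horizon + 1):
        # Unplayed items get an infinite index so each is tried at least once.
        index = [
            float("inf") if counts[i] == 0
            else means[i] + math.sqrt(1.5 * math.log(t) / counts[i])
            for i in range(n)
        ]
        # Select the k items with the largest upper confidence bounds.
        chosen = sorted(range(n), key=lambda i: index[i], reverse=True)[:k]
        rewards = pull(chosen, rng)
        for i, r in rewards.items():
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]
        history.append(chosen)
    return history, means, counts


def toy_pull(chosen, rng, base=(0.9, 0.8, 0.5, 0.4, 0.3, 0.2)):
    """Hypothetical set-dependent Bernoulli rewards (an assumption, not PRISM).

    An item's success probability drops slightly for each stronger item
    shown alongside it, so its reward depends on the selected set.
    """
    out = {}
    for i in chosen:
        stronger = sum(1 for j in chosen if base[j] > base[i])
        p = max(0.0, base[i] - 0.05 * stronger)
        out[i] = 1.0 if rng.random() < p else 0.0
    return out


if __name__ == "__main__":
    history, means, counts = topk_ucb(n=6, k=2, horizon=2000, pull=toy_pull)
    print("last selected set:", sorted(history[-1]))
    print("observation counts:", counts)
```

In this toy environment the per-item means shift with the selected set while the ordering of items is preserved, which is one simple way rewards can be set-dependent without defeating an index-based learner. Whether such an environment satisfies PRISM is not something the abstract alone determines.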
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 14