Stochastically Dominant Preference Optimization: Policy Improvement For All
Keywords: Reinforcement Learning from Human Feedback, Reinforcement Learning
Abstract: Reinforcement learning from human feedback (RLHF) optimizes policies based on users' rankings of output samples rather than on user-provided rewards, which are often hard to calibrate and reconcile. These methods typically assume that users' underlying utility functions are homogeneous and that their rankings differ only due to noise. When user preferences are heterogeneous, the majority's (or plurality's) utility is often prioritized at the expense of other users, and/or consensus outputs are promoted that can be unappealing to all users. Motivated by an unknown underlying social welfare function that balances users' competing preferences, we introduce stochastic dominance as a stricter guiding criterion for policy optimization that benefits all users. Our approach, stochastically dominant preference optimization (SDPO), avoids explicit reward function estimation while providing broad social welfare and individual performance improvement guarantees for users with diverse preferences. We demonstrate the empirical benefits of this approach when learning from users with heterogeneous preferences.
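As background (not part of the submission's own text), the criterion named in the abstract, first-order stochastic dominance, has a standard textbook definition sketched below; the notation (return distributions $F_\pi$, $F_{\pi'}$ and utility functions $u$) is illustrative and assumed here, not taken from the paper.

A policy $\pi$ first-order stochastically dominates a policy $\pi'$ when its return distribution satisfies
\[
  F_{\pi}(t) \;\le\; F_{\pi'}(t) \quad \text{for all } t,
\]
with strict inequality for some $t$. Equivalently, $\mathbb{E}\!\left[u(R_{\pi})\right] \ge \mathbb{E}\!\left[u(R_{\pi'})\right]$ for every nondecreasing utility function $u$, which is the sense in which a dominant policy improves outcomes for every user with a monotone utility, regardless of how those utilities differ.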
Area: Learning and Adaptation (LEARN)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 1527