Keywords: Pluralistic Alignment, LLMs, Robustness, Best-of-K
TL;DR: We propose a group-robust alignment objective for a flexible inference-time pluralistic alignment setting.
Abstract: The desirable behaviour of a chat agent can be described by multiple criteria, such as harmlessness, helpfulness, and conciseness, each of which can be scored by a reward model. Each user, or group of users, may assign different importance to each criterion, and in many practical pluralistic alignment settings it is difficult to know how an individual user or group would weigh one criterion against another. Instead of assuming knowledge of the weights among the criteria, we propose a robust alignment approach that maximises the worst-case criterion across the group of reward models. To test this approach, we use best-of-K rejection sampling to demonstrate the properties of an algorithm that employs our robust objective. Finally, we propose several promising avenues for future exploration that may lead to more practical algorithms than group-robust best-of-K rejection sampling.
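To make the objective concrete, a minimal sketch of group-robust best-of-K rejection sampling follows: among K sampled responses, pick $\arg\max_{y \in \{y_1,\dots,y_K\}} \min_i r_i(x, y)$, i.e. the candidate whose worst-case reward across the group of reward models is highest. The `generate` callable and the per-criterion scorers are hypothetical stand-ins, not an API from the paper.

```python
from typing import Callable, List, Sequence


def group_robust_best_of_k(
    prompt: str,
    generate: Callable[[str], str],  # samples one response from the base LLM
    reward_models: Sequence[Callable[[str, str], float]],  # one scorer per criterion
    k: int = 16,
) -> str:
    """Return the candidate maximising the minimum reward across all criteria
    (e.g. harmlessness, helpfulness, conciseness): argmax over K samples of
    min over reward models."""
    candidates: List[str] = [generate(prompt) for _ in range(k)]
    return max(
        candidates,
        key=lambda resp: min(rm(prompt, resp) for rm in reward_models),
    )
```

Because the selection rule depends only on the minimum score, no weights over criteria need to be known or assumed, which is the point of the robust objective.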
Submission Number: 70