Alleviating Shifted Distribution in Human Preference Alignment through Meta-Learning

Published: 01 Jan 2025, Last Modified: 01 Aug 2025, AAAI 2025, CC BY-SA 4.0
Abstract: The capability of the reward model (RM) is crucial for the success of Reinforcement Learning from Human Feedback (RLHF) in aligning with human preferences. However, as training progresses, the output distribution of the policy model shifts. The RM, initially trained on responses sampled from the early policy model's output distribution, gradually loses its ability to distinguish between responses drawn from the newly shifted distribution. This issue is further compounded by the fact that an RM trained on a specific data distribution struggles to generalize to examples outside that distribution. Both issues can be framed as a single challenge: the shifted distribution of the environment. To address this challenge, we introduce MetaRM, a novel method that leverages meta-learning to adapt the RM to the shifted environment distribution. MetaRM optimizes the RM in an alternating fashion: it preserves the preferences encoded in the original preference pairs while maximizing its discriminative power over new examples from the shifted distribution. Extensive experiments demonstrate that MetaRM iteratively improves human preference alignment by strengthening the RM's capacity to identify subtle differences among samples from shifted distributions.
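The abstract describes the alternating meta-update only at a high level. Below is a minimal sketch of one plausible reading in PyTorch: an inner step that sharpens the RM's discrimination on unlabeled responses from the shifted policy distribution, followed by an outer (meta) step that preserves the original preference ordering. All names here (RewardModel, spread_loss, preference_loss, metarm_step, the learning rates, and the use of negative reward variance as the "discrimination" objective with a MAML-style inner/outer split) are illustrative assumptions, not the paper's exact formulation.

# Sketch of one alternating MetaRM-style update (assumptions noted above).
import torch
import torch.nn as nn
from torch.func import functional_call

class RewardModel(nn.Module):
    """Toy stand-in for an RM head: response features -> scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def spread_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Negative variance of rewards: a gradient step on this loss pushes scores
    apart, i.e. sharpens discrimination on samples from the shifted distribution."""
    return -rewards.var()

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry pairwise loss on the original preference pairs."""
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

def metarm_step(rm, shifted_batch, chosen, rejected, inner_lr=1e-2, outer_lr=1e-3):
    """One alternating update: inner step on shifted-distribution samples,
    then an outer (meta) step that preserves the original preference ordering."""
    params = dict(rm.named_parameters())

    # Inner step: increase reward spread on responses from the current (shifted) policy.
    inner = spread_loss(functional_call(rm, params, (shifted_batch,)))
    grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
    adapted = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}

    # Outer step: Bradley-Terry loss on the original pairs, evaluated with the
    # adapted parameters and differentiated back to the original parameters.
    outer = preference_loss(functional_call(rm, adapted, (chosen,)),
                            functional_call(rm, adapted, (rejected,)))
    meta_grads = torch.autograd.grad(outer, list(params.values()))
    with torch.no_grad():
        for p, g in zip(params.values(), meta_grads):
            p -= outer_lr * g
    return inner.item(), outer.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    rm = RewardModel()
    shifted = torch.randn(32, 128)                                  # responses from the current policy
    chosen, rejected = torch.randn(16, 128), torch.randn(16, 128)   # original preference pairs
    print(metarm_step(rm, shifted, chosen, rejected))

The second-order gradient through the inner step (create_graph=True) is what couples the two objectives; a first-order approximation would trade fidelity for memory. How the paper actually defines the discrimination objective and schedules the alternation is not specified in the abstract.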