Keywords: game theory, RLHF, language models, Stackelberg game
TL;DR: We propose Stackelberg Alignment Reinforcement Learning from Human Feedback (STA-RLHF), which formalises RLHF as a Stackelberg game between the language model and the reward model.
Abstract: The alignment problem, namely endowing language models with human preferences, is a key AI challenge.
Most RLHF approaches treat the optimization of the language model and the reward model as separate problems.
We propose Stackelberg Alignment Reinforcement Learning from Human Feedback (STA-RLHF), which formalises RLHF as a Stackelberg game between the language model and the reward model.
The leader in our game is the language model, which aligns its behavior with human preferences by optimizing against a learned representation of those preferences, the reward model.
The follower, in turn, is the reward model, which learns this representation from human feedback.
We devise a nested gradient-based algorithm that searches for a Stackelberg equilibrium of our game, and show that the resulting language model outperforms other RLHF methods on a diverse set of synthetic tasks.
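The nested gradient-based search described above can be illustrated with a toy bilevel loop: an inner loop in which the follower (reward model) fits human preference data, and an outer loop in which the leader (policy) ascends its objective against the fitted follower. Everything below is a minimal illustrative sketch under assumed toy objectives (a linear reward model, a Bradley-Terry-style preference loss, and a quadratic penalty standing in for a KL regularizer); it is not the authors' actual STA-RLHF implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_model(response, phi):
    # Follower: a linear reward model over a 1-D "response" feature (toy assumption).
    return phi * response

def preference_loss(phi, preferences):
    # Bradley-Terry-style loss on (preferred, rejected) response pairs.
    loss = 0.0
    for win, lose in preferences:
        margin = reward_model(win, phi) - reward_model(lose, phi)
        loss += np.log(1.0 + np.exp(-margin))
    return loss / len(preferences)

def policy_objective(theta, phi):
    # Leader: the policy emits response = theta and maximizes reward,
    # with a quadratic penalty standing in for a KL regularizer.
    return reward_model(theta, phi) - 0.1 * theta ** 2

def nested_gradient_step(theta, phi, preferences,
                         lr_out=0.05, lr_in=0.5, inner_steps=10, eps=1e-4):
    # Inner loop: the follower fits the reward model to the feedback data.
    for _ in range(inner_steps):
        g = (preference_loss(phi + eps, preferences)
             - preference_loss(phi - eps, preferences)) / (2 * eps)
        phi -= lr_in * g
    # Outer step: the leader ascends its objective against the fitted follower.
    g = (policy_objective(theta + eps, phi)
         - policy_objective(theta - eps, phi)) / (2 * eps)
    theta += lr_out * g
    return theta, phi

# Synthetic feedback: responses near 1.0 are preferred over responses near 0.
preferences = [(1.0 + 0.1 * rng.standard_normal(), 0.1 * rng.standard_normal())
               for _ in range(32)]

theta, phi = 0.0, 0.0
for _ in range(200):
    theta, phi = nested_gradient_step(theta, phi, preferences)
print(f"policy theta = {theta:.2f}, reward phi = {phi:.2f}")
```

The gradients are taken by central finite differences purely to keep the sketch dependency-free; in practice both loops would use automatic differentiation over neural-network parameters.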
Submission Number: 15