STA-RLHF: Stackelberg Aligned Reinforcement Learning with Human Feedback

Published: 01 Jun 2024, Last Modified: 26 Jul 2024 · CoCoMARL 2024 Oral · License: CC BY 4.0
Keywords: game theory, RLHF, language models, Stackelberg game
TL;DR: We propose Stackelberg Alignment Reinforcement Learning from Human Feedback (STA-RLHF), which formalizes RLHF as a Stackelberg game between the language model and the reward model.
Abstract: The alignment problem, namely endowing language models with human preferences, is a key AI challenge. Most RLHF approaches treat the optimization of the language model and the reward model as separate problems. We propose Stackelberg Alignment Reinforcement Learning from Human Feedback (STA-RLHF), which formalizes RLHF as a Stackelberg game between the language model and the reward model. The leader in our game is the language model, which aligns its behavior with human preferences by optimizing against a representation of those preferences, the reward model. The follower is the reward model, which is learned from human feedback. We devise a nested gradient-based algorithm that searches for a Stackelberg equilibrium of our game, and show that the resulting language model outperforms other RLHF methods on a diverse set of synthetic tasks.
Submission Number: 15
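To picture the nested gradient-based search described in the abstract, here is a minimal toy sketch. It is not the paper's algorithm: the one-parameter "policy" (leader), the two-parameter reward model (follower), and the synthetic Bradley-Terry preference feedback are all hypothetical, chosen only to show inner follower updates nested inside each outer leader step.

```python
import numpy as np

# Toy nested-loop sketch of a Stackelberg-style RLHF interaction.
# All quantities here (policy_param, phi, the synthetic preference
# oracle) are illustrative assumptions, not the paper's setup.

rng = np.random.default_rng(0)
target = 2.0  # hidden "human" preference: responses near this value win

def true_pref_prob(a, b):
    """Bradley-Terry probability that response a is preferred to b."""
    score_a, score_b = -(a - target) ** 2, -(b - target) ** 2
    return 1.0 / (1.0 + np.exp(-(score_a - score_b)))

def reward(x, phi):
    """Follower's reward model: linear + quadratic features."""
    return phi[0] * x + phi[1] * x ** 2

policy_param = 0.0            # leader: the "response" the policy emits
phi = np.zeros(2)             # follower: reward-model parameters
lr_leader, lr_follower = 0.05, 0.1

for outer in range(200):
    # Inner (follower) loop: fit the reward model to fresh preference labels.
    for _ in range(10):
        a = policy_param + rng.normal(scale=1.0)
        b = policy_param + rng.normal(scale=1.0)
        label = float(rng.random() < true_pref_prob(a, b))  # simulated feedback
        logit = reward(a, phi) - reward(b, phi)
        p = 1.0 / (1.0 + np.exp(-logit))
        grad_logit = p - label                              # d(BCE)/d(logit)
        grad_phi = grad_logit * np.array([a - b, a ** 2 - b ** 2])
        phi -= lr_follower * grad_phi
    # Outer (leader) step: gradient ascent on the learned reward.
    grad_policy = phi[0] + 2.0 * phi[1] * policy_param
    policy_param += lr_leader * grad_policy

print(f"leader response {policy_param:.2f}; preferred target {target}")
```

Under these assumptions the follower's inner updates shape the reward landscape, and the leader's outer ascent drifts toward the preferred region, mimicking the leader-follower structure at a toy scale.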