Keywords: alignment, rlhf, preference optimization, game theory, human feedback, test-time improvement
TL;DR: A novel game-theoretic approach to RLHF that frames LLM alignment as a two-player sequential game.
Abstract: We propose Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditional on the Leader's action. This formulation departs from prior approaches such as Reinforcement Learning from Human Feedback (RLHF), which relies on real-valued reward models, and Nash Learning from Human Feedback (NLHF), which seeks to compute a Nash equilibrium. The sequential structure of SLHF naturally enables test-time improvement, as the Follower learns to best respond to the Leader's action. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Our experiments demonstrate that SLHF effectively aligns large language models with diverse, potentially intransitive, human preferences, and its test-time improvement generalizes across models without further training.
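For concreteness, here is a minimal sketch of the sequential solution concept the abstract describes, assuming a pairwise preference model $\mathcal{P}(y \succ y' \mid x)$ over responses to a prompt $x$; the notation and the exact objectives are illustrative assumptions, not the submission's own formulation.

% Illustrative only: the preference model, policies, and objectives below are assumed notation.
\begin{aligned}
\pi_F^\star(\cdot \mid x, y) &\in \arg\max_{\pi_F}\; \mathbb{E}_{y' \sim \pi_F(\cdot \mid x, y)}\big[\mathcal{P}(y' \succ y \mid x)\big] && \text{(Follower best-responds to the Leader's action } y\text{)} \\
\pi_L^\star(\cdot \mid x) &\in \arg\max_{\pi_L}\; \mathbb{E}_{y \sim \pi_L(\cdot \mid x),\; y' \sim \pi_F^\star(\cdot \mid x, y)}\big[\mathcal{P}(y \succ y' \mid x)\big] && \text{(Leader commits, anticipating the best response)}
\end{aligned}

Under this reading, the Follower's conditioning on the Leader's realized action $y$ is what would enable the test-time improvement the abstract mentions: at inference, a trained Follower can refine a committed response from the Leader (or, per the abstract, from another model) without further training.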
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Barna_Pásztor1
Track: Regular Track: unpublished work
Submission Number: 58