Track: Research Track
Keywords: alignment, rlhf, preference optimization, game theory, human feedback, test-time improvement
Abstract: We propose Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditioned on the Leader's action. This formulation departs from prior approaches such as Reinforcement Learning from Human Feedback (RLHF), which relies on assigning a scalar reward to each action, and Nash Learning from Human Feedback (NLHF), which seeks to compute a Nash equilibrium. SLHF decomposes preference optimization into a refinement problem for the Follower and an adversarial optimization problem for the Leader. The sequential structure of SLHF naturally enables test-time improvement: the Follower learns to refine the Leader's actions, and these refinements can be leveraged through iterative sampling. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Our experiments demonstrate that SLHF effectively aligns large language models with diverse, potentially intransitive, human preferences, and that its test-time improvement generalizes across models without further training.
Submission Number: 52
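
Below is a minimal, illustrative sketch (not the authors' implementation) of the test-time improvement loop the abstract describes: a Leader policy commits to an action, and a Follower policy, conditioned on the prompt and the Leader's action, proposes a refinement that can be re-applied through iterative sampling. The names leader, follower, and iterative_refine, and the toy string-based policies, are assumptions for illustration only.

from typing import Callable

# A Leader policy maps a prompt to an action; a Follower policy maps a prompt
# and the current action to a refined action.
LeaderPolicy = Callable[[str], str]
FollowerPolicy = Callable[[str, str], str]


def iterative_refine(prompt: str,
                     leader: LeaderPolicy,
                     follower: FollowerPolicy,
                     num_rounds: int = 3) -> str:
    """Sample an initial action from the Leader, then repeatedly let the
    Follower refine the current action, conditioned on the prompt."""
    action = leader(prompt)
    for _ in range(num_rounds):
        action = follower(prompt, action)
    return action


if __name__ == "__main__":
    # Toy stand-ins for language-model policies, for demonstration only.
    toy_leader = lambda prompt: f"draft answer to: {prompt}"
    toy_follower = lambda prompt, prev: prev + " [refined]"
    print(iterative_refine("Summarize SLHF.", toy_leader, toy_follower, num_rounds=2))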