Keywords: RLHF, Bi-Level Optimization, Reinforcement Learning
Abstract: Bilevel reinforcement learning (RL) models a leader that optimizes an outer objective while a follower solves an inner policy optimization problem. Penalty reformulations turn this constrained problem into a single-level surrogate whose minimizers approximate bilevel solutions, and recent work established principled penalties with closed-form gradients and first-order convergence guarantees. Yet existing algorithms are double-loop: each outer step calls an inner best-response oracle, incurring extra logarithmic overhead. We present \emph{PBRL-SL}, a \emph{single-loop} penalty method that dispenses with the inner oracle. A tracking policy follows the follower's optimal response with a single mirror-descent/policy-gradient step, and a Lyapunov argument absorbs the resulting gradient bias. Under standard regularity assumptions, PBRL-SL achieves $\tilde O(\lambda \varepsilon^{-2})$ projected-gradient stationarity, matching the iteration order of prior methods while being simpler to implement. We restate, in self-contained form, the penalty-landscape, differentiability, and smoothness facts used in our analysis, and discuss practical implications for RL from human feedback, incentive design, and Stackelberg games.
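The single-loop idea in the abstract can be illustrated on a toy problem. The sketch below is not the paper's PBRL-SL (which works with policies and mirror descent); it is a minimal, hypothetical single-loop penalty method on strongly convex quadratics, where the penalty surrogate is $F_\lambda(x,y) = f(x,y) + \lambda\,(g(x,y) - \min_{y'} g(x,y'))$ and the tracking variable `y` takes one gradient step per outer step instead of calling an inner best-response oracle. All objectives and step sizes here are illustrative assumptions.

```python
import numpy as np

# Toy single-loop penalty bilevel sketch (illustrative, not the paper's PBRL-SL).
# Outer: min_x f(x, y*(x))  s.t.  y*(x) = argmin_y g(x, y).
# Penalty surrogate: F_lam(x, y) = f(x, y) + lam * (g(x, y) - min_{y'} g(x, y')).
# Single loop: one tracking step on y per outer step, no inner best-response oracle.

rng = np.random.default_rng(0)

def f(x, y):  # leader objective (toy quadratic, assumed for illustration)
    return 0.5 * np.sum((x - y) ** 2) + 0.5 * np.sum(x ** 2)

def g(x, y):  # follower objective, strongly convex in y, with y*(x) = 2x
    return 0.5 * np.sum((y - 2.0 * x) ** 2)

def grad_f(x, y):  # (df/dx, df/dy)
    return (x - y) + x, -(x - y)

def grad_g(x, y):  # (dg/dx, dg/dy)
    return -2.0 * (y - 2.0 * x), (y - 2.0 * x)

lam, eta_x, eta_y = 10.0, 0.02, 0.05
x = rng.normal(size=3)
y = rng.normal(size=3)  # tracking variable for the follower's response

for t in range(2000):
    # Tracking step: one gradient step on y along the surrogate f + lam * g,
    # pulling y toward the follower's best response without an inner loop.
    _, fy = grad_f(x, y)
    _, gy = grad_g(x, y)
    y = y - eta_y * (fy + lam * gy)

    # Outer step on the penalty surrogate. By Danskin's theorem the x-gradient
    # of the inner value function is grad_x g evaluated at the exact minimizer
    # y*(x) = 2x, which is known in closed form for this toy problem.
    fx, _ = grad_f(x, y)
    gx, _ = grad_g(x, y)
    gx_at_opt, _ = grad_g(x, 2.0 * x)
    x = x - eta_x * (fx + lam * (gx - gx_at_opt))

# Both variables converge toward the surrogate's stationary point (the origin).
```

With a moderate penalty weight and step sizes below the smoothness thresholds, both iterates contract toward the unique stationary point; the tracking error introduced by the single inner step is exactly the gradient bias that the Lyapunov argument in the abstract is meant to absorb.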
Supplementary Material: zip
Submission Number: 337