Keywords: RLHF, Bi-Level Optimization, Reinforcement Learning
Abstract: Bilevel reinforcement learning (RL) models a leader that optimizes an outer objective while a follower solves an inner policy optimization problem. Penalty reformulations turn this constrained problem into a single-level surrogate whose minimizers approximate bilevel solutions, and recent work established principled penalties with closed-form gradients and first-order convergence guarantees. Yet existing algorithms are double-loop: each outer step calls an inner best-response oracle, incurring extra logarithmic overhead. We present \emph{PBRL-SL}, a \emph{single-loop} penalty method that dispenses with the inner oracle. A tracking policy follows the follower's optimal response with a single mirror-descent/policy-gradient step, and a Lyapunov argument absorbs the resulting gradient bias. Under standard regularity assumptions, PBRL-SL achieves $\tilde O(\lambda \varepsilon^{-2})$ projected-gradient stationarity, matching the iteration order of prior methods while being simpler to implement. We restate, in self-contained form, the penalty-landscape, differentiability, and smoothness facts used in our analysis, and discuss practical implications for RL from human feedback, incentive design, and Stackelberg games.
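The single-loop idea in the abstract can be illustrated on a toy problem. The sketch below is not the paper's PBRL-SL (which works with policies and mirror descent); it is a minimal, hypothetical single-loop penalty method on strongly convex quadratics, where the penalty surrogate is $F_\lambda(x,y) = f(x,y) + \lambda\,(g(x,y) - \min_{y'} g(x,y'))$ and the tracking variable `y` takes one gradient step per outer step instead of calling an inner best-response oracle. All objectives and step sizes here are illustrative assumptions.

```python
import numpy as np

# Toy single-loop penalty bilevel sketch (illustrative, not the paper's PBRL-SL).
# Outer: min_x f(x, y*(x))  s.t.  y*(x) = argmin_y g(x, y).
# Penalty surrogate: F_lam(x, y) = f(x, y) + lam * (g(x, y) - min_{y'} g(x, y')).
# Single loop: one tracking step on y per outer step, no inner best-response oracle.

rng = np.random.default_rng(0)

def f(x, y):  # leader objective (toy quadratic, assumed for illustration)
    return 0.5 * np.sum((x - y) ** 2) + 0.5 * np.sum(x ** 2)

def g(x, y):  # follower objective, strongly convex in y, with y*(x) = 2x
    return 0.5 * np.sum((y - 2.0 * x) ** 2)

def grad_f(x, y):  # (df/dx, df/dy)
    return (x - y) + x, -(x - y)

def grad_g(x, y):  # (dg/dx, dg/dy)
    return -2.0 * (y - 2.0 * x), (y - 2.0 * x)

lam, eta_x, eta_y = 10.0, 0.02, 0.05
x = rng.normal(size=3)
y = rng.normal(size=3)  # tracking variable for the follower's response

for t in range(2000):
    # Tracking step: one gradient step on y along the surrogate f + lam * g,
    # pulling y toward the follower's best response without an inner loop.
    _, fy = grad_f(x, y)
    _, gy = grad_g(x, y)
    y = y - eta_y * (fy + lam * gy)

    # Outer step on the penalty surrogate. By Danskin's theorem the x-gradient
    # of the inner value function is grad_x g evaluated at the exact minimizer
    # y*(x) = 2x, which is known in closed form for this toy problem.
    fx, _ = grad_f(x, y)
    gx, _ = grad_g(x, y)
    gx_at_opt, _ = grad_g(x, 2.0 * x)
    x = x - eta_x * (fx + lam * (gx - gx_at_opt))

# Both variables converge toward the surrogate's stationary point (the origin).
```

With a moderate penalty weight and step sizes below the smoothness thresholds, both iterates contract toward the unique stationary point; the tracking error introduced by the single inner step is exactly the gradient bias that the Lyapunov argument in the abstract is meant to absorb.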
Supplementary Material: zip
Submission Number: 337