A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning

Published: 23 Sept 2025, Last Modified: 01 Dec 2025 · ARLET · CC BY 4.0
Track: Research Track
Keywords: bi-level optimization, reinforcement learning, actor-critic algorithm
Abstract: We study a structured bi-level optimization problem in which the upper-level objective is a generic smooth function and the lower-level problem corresponds to policy optimization in a Markov Decision Process (MDP). The upper-level decision variable parameterizes the reward function of the lower-level MDP, and the upper-level objective is evaluated at the optimal policy induced by this reward. Such formulations arise naturally in contexts such as reward shaping and reinforcement learning (RL) from human feedback. Solving this bi-level problem is challenging due to the non-convexity of the lower-level objective and the difficulty of estimating the upper-level hyper-gradient. Existing methods often rely on second-order information, impose strong regularization on the lower-level RL problem, and/or use samples inefficiently through nested-loop procedures. In this work, we propose a single-loop, first-order actor-critic algorithm that optimizes the upper-level objective via a penalty-based reformulation. The algorithm adds an entropy regularization term with a decaying weight to the lower-level RL objective, which enables asymptotically unbiased estimation of the upper-level hyper-gradient without requiring an exact solution of the unregularized lower-level RL problem. Our main contribution is to establish finite-time and finite-sample convergence of the proposed algorithm for the original, unregularized bi-level optimization problem. We support the theoretical results with simulations in synthetic environments that numerically validate the method's convergence.
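
The sketch below is a rough, illustrative rendering of the single-loop structure described in the abstract, not the authors' algorithm: at every iteration it takes one first-order upper-level step on a penalty-style surrogate, one update of the main policy under entropy regularization with a decaying weight tau_t, and one update of an auxiliary policy that tracks the regularized lower-level optimum. The toy two-state MDP, the quadratic upper-level objective (matching a target policy `pi_target`), the penalty coefficient `lam`, the step sizes, and the use of exact occupancy and value computations in place of the paper's stochastic actor-critic estimates are all assumptions made for brevity.

```python
# Schematic sketch of a penalty-based, single-loop, first-order bi-level RL
# update on a toy MDP. Illustrative only; constants and the upper objective
# are assumptions, and exact evaluation replaces the stochastic critic.
import numpy as np

nS, nA, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.8, 0.2]]])          # transition kernel P[s, a, s']
rho = np.array([0.5, 0.5])                        # initial-state distribution
pi_target = np.array([[0.8, 0.2], [0.2, 0.8]])    # toy upper-level target policy


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def occupancy(pi):
    """Discounted state-action occupancy of pi (exact, via a linear solve)."""
    P_pi = np.einsum('sap,sa->sp', P, pi)         # state transition kernel under pi
    d_s = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, (1 - gamma) * rho)
    return d_s[:, None] * pi                      # d(s, a)


def soft_q(theta, pi, tau):
    """Entropy-regularized Q^pi for reward r = theta (exact evaluation,
    standing in for the stochastic critic to keep the sketch short)."""
    r = theta.reshape(nS, nA)
    v = np.zeros(nS)
    for _ in range(200):                          # fixed-point iteration
        q = r + gamma * (P @ v)
        v = (pi * (q - tau * np.log(pi + 1e-12))).sum(axis=-1)
    return q


def soft_pg(theta, logits, tau):
    """Gradient of the tau-regularized value w.r.t. softmax logits (up to 1/(1-gamma))."""
    pi = softmax(logits)
    adv = soft_q(theta, pi, tau) - tau * np.log(pi + 1e-12)
    adv -= (pi * adv).sum(axis=-1, keepdims=True)  # subtract the soft state value
    d_s = occupancy(pi).sum(axis=-1, keepdims=True)
    return d_s * pi * adv


theta = np.zeros(nS * nA)         # upper-level variable: reward parameters
logits = np.zeros((nS, nA))       # lower-level variable: main policy
logits_aux = np.zeros((nS, nA))   # auxiliary policy tracking the regularized optimum
lam, eta_th, eta_pi = 10.0, 0.05, 0.5             # assumed constants

for t in range(1, 3001):          # single loop: every variable moves once per iteration
    tau = 1.0 / np.sqrt(t)        # decaying entropy-regularization weight
    pi, pi_aux = softmax(logits), softmax(logits_aux)

    # Upper-level step: first-order gradient of the penalty term
    # lam * (V_tau(theta, pi_aux) - V_tau(theta, pi)) w.r.t. theta, computed from
    # occupancy measures only (no second-order information). The toy upper
    # objective depends on theta only through the policy, so its direct
    # theta-gradient is zero here.
    theta -= eta_th * lam * (occupancy(pi_aux) - occupancy(pi)).ravel()

    # Main-policy step: descend f(pi) + lam * (V_tau(pi_aux) - V_tau(pi)),
    # with the toy upper objective f(pi) = 0.5 * ||pi - pi_target||^2.
    g = pi - pi_target
    grad_f_logits = pi * (g - (pi * g).sum(axis=-1, keepdims=True))
    logits += eta_pi * (lam * soft_pg(theta, logits, tau) - grad_f_logits)

    # Auxiliary-policy step: ascend its own regularized value so that it
    # tracks the entropy-regularized lower-level optimum for the current theta.
    logits_aux += eta_pi * soft_pg(theta, logits_aux, tau)

print("induced policy:\n", softmax(logits).round(3))
print("learned reward parameters:\n", theta.reshape(nS, nA).round(3))
```

The decaying weight tau_t = 1/sqrt(t) reflects the point made in the abstract: because the entropy regularization vanishes over time, the iterates target the original, unregularized bi-level problem rather than a fixed-regularization surrogate.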
Submission Number: 20