On the Sample Complexity Bounds of Bilevel Reinforcement Learning

Mudit Gaur; Utsav Singh; Amrit Singh Bedi; Raghu Pasupathy; Vaneet Aggarwal

On the Sample Complexity Bounds of Bilevel Reinforcement Learning

Mudit Gaur, Utsav Singh, Amrit Singh Bedi, Raghu Pasupathy, Vaneet Aggarwal

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 posterEveryoneRevisionsBibTeXCC BY-NC-SA 4.0

Keywords: Bilevel Reinforcement Learning, Sample Complexity, Gradient Descent, Bilevel Optimization

TL;DR: Provide First Ever Sample complexity bounds for a first order bi-level algorithm.

Abstract: Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\mathcal{O}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-Łojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis. Our analysis also extends to general bi-level optimization settings with non-convex lower levels, where we achieve state-of-the-art sample complexity results of $\mathcal{O}(\epsilon^{-3})$ improving upon existing bounds of $\mathcal{O}(\epsilon^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.

Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)

Submission Number: 19200

Loading