Near-Optimal Distributionally Robust Reinforcement Learning with General $L_p$ Norms

Pierre Clavier; Laixi Shi; Erwan Le Pennec; Eric Mazumdar; Adam Wierman; Matthieu Geist

Published: 25 Sept 2024, Last Modified: 06 Nov 2024, NeurIPS 2024 poster, CC BY 4.0

Keywords: Robust Markov decision process, norm, sample complexity, strong duality

TL;DR: To address the challenges of robustness and sample efficiency in reinforcement learning (RL), this work studies the sample complexity of distributionally robust Markov decision processes (RMDPs).

Abstract: To address the challenges of the sim-to-real gap and sample efficiency in reinforcement learning (RL), this work studies distributionally robust Markov decision processes (RMDPs), which optimize the worst-case performance when the deployed environment lies within an uncertainty set around some nominal MDP. Despite recent efforts, the sample complexity of RMDPs has remained largely undetermined. While the statistical implications of distributional robustness in RL have been explored in some specific cases, the generalizability of the existing findings remains unclear, especially in comparison to standard RL. Assuming access to a generative model that samples from the nominal MDP, we examine the sample complexity of RMDPs using a class of generalized $L_p$ norms as the 'distance' function for the uncertainty set, under two commonly adopted conditions: $sa$-rectangularity and $s$-rectangularity. Our results imply that RMDPs with generalized $L_p$ norms can be more sample-efficient to solve than standard MDPs in both the $sa$- and $s$-rectangular cases, potentially inspiring more empirical research. We provide a near-optimal upper bound and a matching minimax lower bound for the $sa$-rectangular scenarios. For the $s$-rectangular cases, we improve the state-of-the-art upper bound and also derive a lower bound using the $L_\infty$ norm that verifies the tightness.
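
For concreteness, the $sa$-rectangular setting described in the abstract can be sketched in standard RMDP notation. The symbols below ($P^0$ for the nominal transition kernel, $\sigma$ for the uncertainty radius, $\mathcal{U}^{\sigma}$ for the uncertainty set, $\gamma$ for the discount factor, $r$ for the reward) are assumptions chosen for illustration and do not appear in the abstract itself; the plain $L_p$ norm is used here as one instance of the generalized norms the abstract refers to.

```latex
% sa-rectangular uncertainty set: an independent L_p ball around the
% nominal transition vector of every state-action pair (assumed notation).
\mathcal{U}^{\sigma}(P^0)
  = \bigotimes_{(s,a) \in \mathcal{S} \times \mathcal{A}}
    \left\{ P_{s,a} \in \Delta(\mathcal{S}) :
            \left\| P_{s,a} - P^0_{s,a} \right\|_p \le \sigma \right\}

% Robust value of a policy \pi: worst-case discounted return over the set.
V^{\pi,\sigma}(s)
  = \inf_{P \in \mathcal{U}^{\sigma}(P^0)}
    \mathbb{E}_{\pi, P}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)
      \,\middle|\, s_0 = s \right]
```

Under this reading, the $s$-rectangular case differs only in that the norm constraint is imposed jointly over all actions at a given state rather than separately for each state-action pair.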

Primary Area: Reinforcement learning

Submission Number: 7743
