Keywords: Robust Reinforcement Learning, Average Reward
Abstract: Robust reinforcement learning (RL) under the average-reward criterion is essential for long-term decision-making, particularly when the environment may differ from its specification. However, a significant gap exists in understanding the finite-sample complexity of these methods, as most existing work provides only asymptotic guarantees. This limitation hinders principled analysis and practical deployment, especially in data-limited scenarios. We close this gap by proposing **Robust Halpern Iteration (RHI)**, a new algorithm designed for robust Markov Decision Processes (MDPs) with transition uncertainty characterized by $\ell_p$-norm and contamination models. Our approach offers three key advantages over previous methods: *(1) Weaker structural assumptions:* RHI only requires the underlying robust MDP to be communicating, a less restrictive condition than the commonly assumed ergodicity or irreducibility; *(2) No prior knowledge:* our algorithm operates without requiring any prior knowledge of the robust MDP; *(3) State-of-the-art sample complexity:* to learn an $\epsilon$-optimal robust policy, RHI achieves a sample complexity of $\tilde{\mathcal O}\left(\frac{SA\mathcal H^{2}}{\epsilon^{2}}\right)$, where $S$ and $A$ denote the numbers of states and actions, and $\mathcal H$ is the robust optimal bias span. This is the tightest bound known for this setting. Our work thus provides an essential theoretical understanding of the sample efficiency of robust average-reward RL.
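For readers unfamiliar with Halpern (anchored) iteration, the sketch below illustrates the anchoring scheme applied to a contamination-model robust Bellman operator in the undiscounted setting. It is a minimal, model-based illustration only, not the RHI algorithm of the paper: the function names and parameters (`robust_bellman_contamination`, `rho`, `num_iters`) are hypothetical, and the sample-based updates and guarantees are those stated in the paper.

```python
import numpy as np

def robust_bellman_contamination(Q, P, r, rho):
    # Contamination-model robust Bellman operator (undiscounted form):
    # with probability (1 - rho) nature follows the nominal kernel P,
    # with probability rho it moves to the worst-case next state.
    # P: (S, A, S) nominal transitions, r: (S, A) rewards, rho in [0, 1).
    V = Q.max(axis=1)                                 # greedy state values
    worst_case = (1 - rho) * (P @ V) + rho * V.min()  # robust expectation
    return r + worst_case                             # shape (S, A)

def halpern_iteration(P, r, rho, num_iters=1000):
    # Halpern (anchored) fixed-point iteration for a nonexpansive operator T:
    #   Q_{k+1} = beta_k * Q_0 + (1 - beta_k) * T(Q_k),   beta_k = 1 / (k + 2).
    # The anchoring toward the initial point Q_0 is what drives the Bellman
    # residual down even though T is not a discounted contraction.
    S, A = r.shape
    Q0 = np.zeros((S, A))
    Q = Q0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        Q = beta * Q0 + (1 - beta) * robust_bellman_contamination(Q, P, r, rho)
    residual = robust_bellman_contamination(Q, P, r, rho) - Q
    gain_estimate = 0.5 * (residual.max() + residual.min())  # span midpoint
    return Q, gain_estimate
```

In the average-reward setting the robust Bellman operator is merely nonexpansive rather than a contraction, which is precisely the regime where the anchoring term $\beta_k Q_0$ matters; the span midpoint of the residual serves here as a rough estimate of the optimal gain.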
Primary Area: reinforcement learning
Submission Number: 21442