A Reduction Framework for Distributionally Robust Reinforcement Learning under Average Reward

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Robust reinforcement learning (RL) under the average reward criterion, which seeks to optimize long-term system performance in uncertain environments, remains largely unexplored. To address this challenge, we propose a reduction-based framework that transforms robust average reward optimization into the more extensively studied robust discounted reward optimization by employing a specific discount factor. Our framework provides two key advantages. **Data Efficiency**: we design a model-based reduction algorithm that achieves near-optimal sample complexity, enabling efficient identification of optimal robust policies. **Scalability**: by bypassing the inherent challenges of scaling up average reward optimization, our framework facilitates the design of scalable, convergent algorithms for robust average reward optimization that leverage function approximation. Our algorithmic design, supported by theoretical and empirical analyses, provides a concrete solution to robust average reward RL with the first data-efficiency and scalability guarantees, highlighting the framework's potential to optimize long-term performance under model uncertainty in practical problems.
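
The following is a minimal, illustrative sketch of the reduction idea only, not the paper's algorithm: it solves a robust *discounted* problem with a discount factor chosen close to 1 and reads off an approximate robust average-reward policy. The R-contamination uncertainty set, the radius `delta`, and the particular value of `gamma` are assumptions made for illustration; the paper's uncertainty sets, required discount factor, and sample-efficient/scalable algorithms may differ.

```python
import numpy as np

def robust_discounted_vi(P, r, gamma, delta, iters=5000, tol=1e-10):
    """Robust value iteration for a discounted MDP under an R-contamination
    uncertainty set of radius delta (illustrative choice).

    P: nominal transition kernel, shape (S, A, S); r: rewards, shape (S, A).
    For R-contamination, the worst-case kernel moves delta probability mass
    to the lowest-value state, so the inner minimization has a closed form.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Worst-case expected next value: nominal expectation mixed with the minimum value.
        worst = (1.0 - delta) * (P @ V) + delta * V.min()   # shape (S, A)
        Q = r + gamma * worst
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)
    return V, policy

# Reduction idea (informal): for gamma close enough to 1, the optimal robust
# discounted policy also optimizes the robust average reward, and
# (1 - gamma) * V approximates the robust gain.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 5, 3
    P = rng.dirichlet(np.ones(S), size=(S, A))   # nominal kernel, shape (S, A, S)
    r = rng.uniform(size=(S, A))                 # rewards
    gamma = 0.995                                # illustrative "specific" discount factor
    V, pi = robust_discounted_vi(P, r, gamma, delta=0.1)
    print("policy:", pi, "approx. robust average reward:", (1 - gamma) * V.mean())
```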
Lay Summary: For complex systems such as self-driving cars or automated financial portfolio management, each decision, e.g. turning the car down a road or buying/selling a security, must add up to strong "long-term average" performance. In simulation-based settings, such as video games, reinforcement learning typically performs well because the training and testing environments are similar, but in practical, real-world settings we observe a deterioration in performance commonly referred to as the "Sim-to-Real" gap. This can result from a myriad of factors, such as modeling errors or environmental perturbations. To address this, our work shows how to convert the difficult "long-term average" goal into the more manageable "discounted" step-by-step formulation by finding a specific discount factor. This allows us to find robust strategies that optimally trade off immediate and future rewards while accounting for various types of environmental uncertainty, e.g. safely driving to a location on a sunny day versus on a rainy day. Our method learns robust, long-term strategies that optimize the average reward in a data-efficient manner by constructing a model of the environment; we also provide a scalable variant that bypasses this model-building step to tackle large-scale, complex problems.
Primary Area: Reinforcement Learning
Keywords: Robust Reinforcement Learning, Average Reward
Submission Number: 2662