TL;DR: We provide algorithms and analyses for robust regularized Markov decision processes with linear function approximation.
Abstract: The Robust Regularized Markov Decision Process (RRMDP) is proposed to learn policies robust to dynamics shifts by adding a regularization term on the transition dynamics to the value function. Existing methods mostly use unstructured regularization, potentially leading to conservative policies under unrealistic transitions. To address this limitation, we propose a novel framework, the $d$-rectangular linear RRMDP ($d$-RRMDP), which introduces latent structure into both the transition kernels and the regularization. We focus on offline reinforcement learning, where an agent learns policies from a pre-collected dataset in the nominal environment. We develop the Robust Regularized Pessimistic Value Iteration (R2PVI) algorithm, which employs linear function approximation for robust policy learning in $d$-RRMDPs with $f$-divergence-based regularization terms on the transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, demonstrating that these bounds are influenced by how well the dataset covers the state-action space visited by the optimal robust policy under robustly admissible transitions. We establish information-theoretic lower bounds to verify that our algorithm is near-optimal. Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to baseline methods.
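To make the high-level description of R2PVI more concrete, the following is a minimal, purely illustrative sketch of pessimistic value iteration with linear (one-hot) features and a KL-dual "soft-min" regularized backup on synthetic offline data. The problem sizes, hyperparameters (`beta`, `lam`, `ridge`), the per-$(s,a)$ grouping of next-state samples, and the choice of KL regularization are all assumptions made for illustration; this is not the paper's implementation (see the linked repository for that), and a faithful $d$-rectangular treatment would apply the regularized backup feature-wise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic offline problem: nS states, nA actions, horizon H, and
# one-hot (state, action) features, so the feature dimension is d = nS * nA.
# All sizes and hyperparameters below are made up for illustration.
nS, nA, H, n = 5, 3, 4, 500
d = nS * nA
beta, lam, ridge = 0.1, 1.0, 1.0

def phi(s, a):
    """One-hot feature map phi(s, a) in R^d."""
    x = np.zeros(d)
    x[s * nA + a] = 1.0
    return x

# Offline dataset collected in a nominal environment: (s, a, r, s') tuples
# logged at each step h by a uniform behavior policy.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # nominal transition kernel
R = rng.uniform(size=(nS, nA))                  # rewards in [0, 1]
data = []
for h in range(H):
    s = rng.integers(nS, size=n)
    a = rng.integers(nA, size=n)
    r = R[s, a]
    s_next = np.array([rng.choice(nS, p=P[si, ai]) for si, ai in zip(s, a)])
    data.append((s, a, r, s_next))

def soft_min(values, lam):
    """KL-dual 'soft worst case' of next-state values: -lam * log E[exp(-V/lam)]."""
    return -lam * np.log(np.mean(np.exp(-values / lam)))

# Backward pessimistic value iteration.
V = np.zeros(nS)          # V_{H+1} = 0
greedy = []
for h in reversed(range(H)):
    s, a, r, s_next = data[h]
    Phi = np.stack([phi(si, ai) for si, ai in zip(s, a)])

    # Regularized robust Bellman target: reward plus a soft-min of the
    # next-state values among samples sharing the same (s, a) pair.
    targets = np.empty(n)
    for i in range(n):
        mask = (s == s[i]) & (a == a[i])
        targets[i] = r[i] + soft_min(V[s_next[mask]], lam)

    # Ridge regression of the targets onto the features.
    Lambda = Phi.T @ Phi + ridge * np.eye(d)
    w = np.linalg.solve(Lambda, Phi.T @ targets)

    # Pessimistic Q: linear estimate minus an elliptical-norm bonus,
    # truncated to the valid value range [0, H - h].
    Q = np.zeros((nS, nA))
    for si in range(nS):
        for ai in range(nA):
            x = phi(si, ai)
            bonus = beta * np.sqrt(x @ np.linalg.solve(Lambda, x))
            Q[si, ai] = np.clip(x @ w - bonus, 0.0, H - h)

    greedy.insert(0, Q.argmax(axis=1))   # greedy action per state at step h
    V = Q.max(axis=1)

print("Greedy actions per step (rows = h, columns = states):")
print(np.array(greedy))
```

The sketch shows the two ingredients the abstract highlights: the regularized robust backup (here via the KL dual form) and the pessimism bonus $\beta\sqrt{\phi(s,a)^\top \Lambda^{-1}\phi(s,a)}$, which penalizes state-action pairs poorly covered by the offline dataset.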
Lay Summary: When we teach computers to make decisions in uncertain situations—like guiding a robot or recommending medical treatments—we want them to make good choices even when things don’t go exactly as expected. However, there’s a tricky balance: if we prepare for every possible scenario, the system may become too cautious and perform poorly in real situations.
In our work, we address this challenge by introducing a new method that helps computers strike a better balance between safety and effectiveness. Instead of assuming that every possible change is equally likely, we add structure to how the computer models uncertainty. This helps it focus on realistic changes while ignoring highly unlikely ones. We call this new framework $d$-RRMDP. We also design a new algorithm called R2PVI, which can learn robust decision-making strategies from pre-collected data—without needing new interactions with the environment—and it does so efficiently.
Through both theoretical analysis and experiments, we show that our method learns decision strategies that are both robust to change and more practical than existing approaches. Our work not only introduces new tools but also provides insights that can guide future research in robust and reliable decision-making systems.
Link To Code: https://github.com/panxulab/Robust-Regularized-Pessimistic-Value-Iteration
Primary Area: Reinforcement Learning
Keywords: Reinforcement Learning, Robust Markov Decision Process
Submission Number: 11357