Near-Optimal Reinforcement Learning for Linear Distributionally Robust Markov Decision Processes

Published: 29 Sept 2025, Last Modified: 12 Oct 2025 · NeurIPS 2025 Reliable ML Workshop · CC BY 4.0
Keywords: Off-dynamics reinforcement learning; linear function approximation; distributionally robust Markov decision process
TL;DR: We provide a lower bound and a nearly optimal upper bound on the average suboptimality for the online setting of $d$-rectangular linear DRMDPs.
Abstract: We study off-dynamics reinforcement learning (RL), where the policy training and deployment environments differ. To handle this environmental perturbation, we focus on learning policies that are robust to uncertainty in the transition dynamics under the framework of distributionally robust Markov decision processes (DRMDPs), where both the nominal and perturbed dynamics are linear Markov decision processes. We propose a novel algorithm, We-DRIVE-U, that enjoys an average suboptimality of $\widetilde{\mathcal{O}}\big(dH \cdot \min\{1/\rho, H\}/\sqrt{K}\big)$, where $K$ is the number of episodes, $H$ is the horizon length, $d$ is the feature dimension, and $\rho$ is the uncertainty level. This result improves on the state of the art by a factor of $\mathcal{O}(dH/\min\{1/\rho,H\})$. We also construct a novel hard instance and derive the first information-theoretic lower bound in this setting. In stark contrast to standard linear MDPs, our lower bound depends on the uncertainty level $\rho$, revealing a unique feature of DRMDPs. Our algorithm also enjoys a 'rare-switching' design and thus requires only $\mathcal{O}(dH\log(1+H^2K))$ policy switches and $\mathcal{O}(d^2H\log(1+H^2K))$ oracle calls for solving dual optimization problems, which significantly improves the computational efficiency over existing algorithms, whose policy-switching and oracle complexities are both $\mathcal{O}(K)$.
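
To make the stated gains concrete, the comparison below composes the We-DRIVE-U bound with the claimed improvement factor; reading that factor multiplicatively is an assumption made here, since the prior rate itself is not quoted above:
$$
\underbrace{\widetilde{\mathcal{O}}\!\Big(\tfrac{dH\,\min\{1/\rho,\,H\}}{\sqrt{K}}\Big)}_{\text{We-DRIVE-U}} \times \mathcal{O}\!\Big(\tfrac{dH}{\min\{1/\rho,\,H\}}\Big) \;=\; \widetilde{\mathcal{O}}\!\Big(\tfrac{d^2H^2}{\sqrt{K}}\Big),
$$
i.e., the $\min\{1/\rho, H\}$ factors cancel in this composition. Similarly, the rare-switching design replaces the $\mathcal{O}(K)$ policy-switch and oracle complexities of existing methods with $\mathcal{O}(dH\log(1+H^2K))$ and $\mathcal{O}(d^2H\log(1+H^2K))$, respectively, i.e., logarithmic rather than linear growth in $K$.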
Submission Number: 56