Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction
TL;DR: Online learning of the distributionally robust Markov decision process
Abstract: Off-dynamics reinforcement learning (RL), where the training and deployment transition dynamics differ, can be formulated as learning in a robust Markov decision process (RMDP) that imposes uncertainty on the transition dynamics. Existing literature mostly assumes access either to a generative model that allows arbitrary state-action queries or to pre-collected datasets with good state coverage of the deployment environment, thereby bypassing the challenge of exploration. In this work, we study a more realistic and challenging setting in which the agent is limited to online interaction with the training environment. To capture the intrinsic difficulty of exploration in online RMDPs, we introduce the supremal visitation ratio, a novel quantity that measures the mismatch between the training dynamics and the deployment dynamics. We show that if this ratio is unbounded, online learning becomes exponentially hard. We propose the first computationally efficient algorithm that achieves sublinear regret in online RMDPs with $f$-divergence-based transition uncertainties. We also establish matching regret lower bounds, demonstrating that our algorithm attains the optimal dependence on both the supremal visitation ratio and the number of interaction episodes. Finally, we validate our theoretical results through comprehensive numerical experiments.
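For intuition, one plausible formalization of the supremal visitation ratio, written in terms of state-action occupancy measures (this specific form is an illustrative assumption based on the abstract, not the paper's stated definition), is
$$
C^{\star} \;=\; \sup_{\pi,\, h,\, (s,a)}\ \frac{d^{\pi}_{P,h}(s,a)}{d^{\pi}_{P^{0},h}(s,a)},
$$
where $P^{0}$ denotes the training (nominal) dynamics, $P$ ranges over the deployment dynamics allowed by the $f$-divergence uncertainty set, and $d^{\pi}_{\cdot,h}$ is the step-$h$ state-action occupancy measure under policy $\pi$. Under this reading, a bounded $C^{\star}$ means that every state-action pair that matters under the deployment dynamics is also reachable, with comparable probability, during online training.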
Lay Summary: When training an agent to make decisions by interacting with a simulated environment, it is crucial that the agent continues to perform well even when the real environment differs slightly from the simulator. But how can we determine whether such robust learning is possible, and under what conditions?
To address this question, we introduce a simple metric that compares how easily the agent can reach certain states in the training environment versus in the perturbed environment. This measure captures how difficult it is to use exploration in the nominal environment to gather enough information to estimate the perturbed environment. When this quantity remains bounded, we design a learning algorithm and prove that robust learning is achievable (a toy numerical sketch of this idea is given below).
Our findings offer a quantitative framework for assessing the impact of environmental changes on learning performance and help guide the development of algorithms that remain effective under uncertainty. We also provide sample complexity estimates for learning such a robust policy.
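The following is a minimal numerical sketch of how such a visitation ratio could be computed for a toy tabular MDP. It is purely illustrative: the function names, the restriction to a fixed policy and a single perturbed kernel, and the exact ratio being taken are our assumptions, not the paper's algorithm or definition.

```python
# Illustrative sketch (not the paper's method): compute state-action occupancy
# measures under a training kernel P0 and a hypothetical perturbed kernel P,
# then take the largest ratio between them for a fixed policy.
import numpy as np

def occupancy(P, pi, mu0, H):
    """Step-wise state-action occupancy d_h(s, a) over an H-step episode.

    P  : (S, A, S) transition kernel
    pi : (H, S, A) Markov policy; rows sum to 1 over actions
    mu0: (S,) initial state distribution
    """
    S, A, _ = P.shape
    d = np.zeros((H, S, A))
    state_dist = mu0.copy()
    for h in range(H):
        d[h] = state_dist[:, None] * pi[h]             # P(s_h = s, a_h = a)
        state_dist = np.einsum("sa,sat->t", d[h], P)   # propagate one step
    return d

def supremal_visitation_ratio(P0, P, pi, mu0, H, eps=1e-12):
    """sup over (h, s, a) of d^pi_P(s, a) / d^pi_P0(s, a) for a fixed policy."""
    d0 = occupancy(P0, pi, mu0, H)
    d1 = occupancy(P, pi, mu0, H)
    return float(np.max(d1 / np.maximum(d0, eps)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, H = 3, 2, 4
    P0 = rng.dirichlet(np.ones(S), size=(S, A))                    # training dynamics
    P = 0.9 * P0 + 0.1 * rng.dirichlet(np.ones(S), size=(S, A))    # mild perturbation
    pi = rng.dirichlet(np.ones(A), size=(H, S))                    # random Markov policy
    mu0 = np.ones(S) / S
    print(supremal_visitation_ratio(P0, P, pi, mu0, H))
```

In this toy setting the ratio stays close to 1 because the perturbation is mild; making the perturbed kernel concentrate on states the training dynamics rarely visit would blow the ratio up, matching the intuition that such shifts make online robust learning hard.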
Link To Code: https://github.com/panxulab/Online-Robust-Bellman-Iteration
Primary Area: Reinforcement Learning->Online
Keywords: online learning, distributionally robust Markov decision process, distribution shift
Submission Number: 7109