Abstract: Robust Markov Decision Processes (MDPs) address environmental uncertainty through distributionally robust optimization (DRO) by finding a policy that is optimal against the worst case within an uncertainty set of transition kernels. However, the vanilla DRO approach often proves overly conservative when the environments differ significantly, resulting in suboptimal performance. In this paper, we focus on obtaining a robust policy under environmental shift in a two-environment MDP setting: a source domain and a target domain, each characterized by its transition kernel. While the source domain MDP is predetermined, access to the target domain is limited to a few samples. Our goal is to derive a robust policy that performs well in the target domain. Underlying our approach is the construction of alternative uncertainty sets, obtained through constrained estimation. This estimation leverages the available target domain data and various forms of side information from the source domain to mitigate environmental discrepancies.
We demonstrate the convergence of this estimation in total variation under mild assumptions. Moreover, we establish error bounds and convergence results for both the robust and non-robust value functions. Through our analysis, we illustrate the efficacy of our method in adapting to domain shifts. We evaluate our approach on popular OpenAI Gym environments and real-world control problems, where it consistently achieves superior performance in both robust and non-robust scenarios in the target domain.
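As a rough illustration of the pipeline the abstract describes (estimate a target-domain transition kernel from a few samples using the source kernel as side information, then plan robustly around that estimate), here is a minimal tabular sketch. All names and choices here (`estimate_target_kernel`, `worst_case_value`, the shrinkage rule, the total-variation ball) are hypothetical stand-ins, not the paper's actual constrained estimator or uncertainty set.

```python
# Minimal sketch, assuming a small tabular MDP with known source kernel and rewards.
# Step 1: estimate the target kernel from a few samples, shrinking toward the source
#         kernel (the "side information"). Step 2: robust value iteration over a
#         total-variation ball centred at that estimate.
import numpy as np

S, A = 4, 2            # assumed numbers of states and actions
gamma = 0.9            # discount factor
rng = np.random.default_rng(0)

# Source kernel P_src[s, a] is a distribution over next states (assumed known).
P_src = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0.0, 1.0, size=(S, A))      # reward table (assumed known)

def estimate_target_kernel(counts, P_src, lam=0.5):
    """Constrained estimate: blend the empirical target kernel with the source
    kernel, so (s, a) pairs with few target samples fall back on side
    information. The blending weight is a modelling assumption, not the
    paper's estimator."""
    P_hat = np.empty_like(P_src)
    for s in range(S):
        for a in range(A):
            n = counts[s, a].sum()
            emp = counts[s, a] / n if n > 0 else P_src[s, a]
            w = n / (n + lam * S)            # more target data -> trust it more
            P_hat[s, a] = w * emp + (1.0 - w) * P_src[s, a]
    return P_hat

def worst_case_value(p, v, radius):
    """Worst-case expected next-state value over a total-variation ball of the
    given radius around the nominal distribution p: shift mass from the
    highest-value states onto the lowest-value state."""
    q = p.copy()
    budget = radius
    order = np.argsort(v)                    # ascending by value
    worst = order[0]
    for s in order[::-1]:                    # take mass from high-value states first
        if s == worst or budget <= 0:
            break
        move = min(q[s], budget)
        q[s] -= move
        q[worst] += move
        budget -= move
    return q @ v

def robust_value_iteration(P_hat, radius=0.1, iters=200):
    v = np.zeros(S)
    for _ in range(iters):
        q = np.array([[R[s, a] + gamma * worst_case_value(P_hat[s, a], v, radius)
                       for a in range(A)] for s in range(S)])
        v = q.max(axis=1)
    return v, q.argmax(axis=1)

# A handful of target-domain transitions (s, a, s'), mimicking the few-sample setting.
target_samples = [(0, 1, 2), (0, 1, 2), (1, 0, 3), (2, 1, 0)]
counts = np.zeros((S, A, S))
for s, a, s_next in target_samples:
    counts[s, a, s_next] += 1

P_hat = estimate_target_kernel(counts, P_src)
v_robust, policy = robust_value_iteration(P_hat)
print("robust policy:", policy)
```

The point of the sketch is only the structure: the uncertainty set is centred on a target-informed estimate rather than on the source kernel alone, which is what lets the resulting robust policy track the target domain instead of being uniformly conservative.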
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Michael_Bowling1
Submission Number: 3350