Robustify the Latent Space: Offline Distributionally Robust Reinforcement Learning with Linear Function Approximation

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Distributionally robust optimization, Offline Reinforcement Learning, Linear Function Approximation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Among the factors hindering the application of reinforcement learning (RL) to real-world problems, two are critical: limited data and the mismatch between the test environment (the real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper addresses both issues simultaneously with offline distributionally robust RL, where a distributionally robust policy is learned from historical data collected in the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and design a novel linear function approximation framework that robustifies the latent space. The framework is instantiated in two settings: one where the dataset is well-explored and one where it has weaker coverage. In addition, we introduce a value-shift algorithmic technique tailored to the distributionally robust setting, which contributes to our improved theoretical results and empirical performance. Sample complexities of $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ are established for the two settings, respectively, as the first non-asymptotic results in these settings, where $d$ denotes the dimension of the linear function space and $N$ the number of trajectories in the dataset. Diverse experiments corroborate our theoretical findings and demonstrate the superiority of our algorithms over their non-robust counterpart.
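As a point of reference, the sketch below illustrates what one distributionally robust value backup with linear features can look like. It is not the paper's algorithm: the abstract does not specify the uncertainty set or the value-shift technique, so the KL uncertainty ball (handled through its standard dual form), the feature map `phi`, the radius `rho`, and the dataset layout are purely illustrative assumptions.

```python
# Minimal, illustrative sketch: one distributionally robust least-squares backup
# with linear function approximation. NOT the authors' method; the KL ball, `phi`,
# `rho`, and the dataset layout are assumptions made only for concreteness.
import numpy as np

def kl_robust_expectation(next_values, rho, betas=np.geomspace(1e-3, 1e3, 200)):
    """Worst-case expectation over a KL ball around the empirical distribution:
    inf_{KL(P||P0)<=rho} E_P[V] = sup_{beta>0} -beta*log E_{P0}[exp(-V/beta)] - beta*rho,
    approximated here by a grid search over beta."""
    v = np.asarray(next_values, dtype=float)
    best = -np.inf
    for beta in betas:
        shifted = np.exp(-(v - v.min()) / beta)             # shift avoids overflow
        dual = v.min() - beta * np.log(shifted.mean()) - beta * rho
        best = max(best, dual)
    return best

def robust_lsvi_backup(dataset, w_next, phi, actions, gamma=0.99, rho=0.1, reg=1.0):
    """One backup: ridge-regress phi(s, a)^T w onto r + gamma * (robust value of s')."""
    feats, targets = [], []
    for s, a, r, next_state_samples in dataset:             # several s' samples per (s, a)
        v_next = [max(phi(ns, b) @ w_next for b in actions) for ns in next_state_samples]
        targets.append(r + gamma * kl_robust_expectation(v_next, rho))
        feats.append(phi(s, a))
    feats, targets = np.array(feats), np.array(targets)
    A = feats.T @ feats + reg * np.eye(feats.shape[1])      # regularized Gram matrix
    return np.linalg.solve(A, feats.T @ targets)            # updated weight vector w
```

Iterating `robust_lsvi_backup` from a zero weight vector gives a rough analogue of robust value iteration in the linear (latent) space; the paper's value-shift technique and data-coverage-dependent analysis are beyond what this sketch shows.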
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1966