Achieving the Asymptotically Minimax Optimal Sample Complexity of Offline Reinforcement Learning: A DRO-Based Approach
Abstract: Offline reinforcement learning aims to learn from pre-collected datasets without active exploration. This problem faces significant challenges, including limited data availability and distributional shift. Existing approaches adopt a pessimistic stance towards uncertainty by penalizing the rewards of under-explored state-action pairs so that value functions are estimated conservatively. In this paper, we show that a distributionally robust optimization (DRO) based approach can also address these challenges and is asymptotically minimax optimal. Specifically, we directly model the uncertainty in the transition kernel and construct an uncertainty set of statistically plausible transition kernels. We then show that the policy optimizing the worst-case performance over this uncertainty set has near-optimal performance in the underlying problem. We first design a metric-based uncertainty set of transition distributions such that, with high probability, the true transition kernel lies in this set. We prove that to achieve a sub-optimality gap of $\epsilon$, the sample complexity is $\mathcal{O}(S^2C^{\pi^*}\epsilon^{-2}(1-\gamma)^{-4})$, where $\gamma$ is the discount factor, $S$ is the number of states, and $C^{\pi^*}$ is the single-policy clipped concentrability coefficient, which quantifies the distribution shift. To achieve the optimal sample complexity, we further propose a less conservative value-function-based uncertainty set, which, however, does not necessarily contain the true transition kernel. We show that an improved sample complexity of $\mathcal{O}(SC^{\pi^*}\epsilon^{-2}(1-\gamma)^{-3})$ can be obtained, which asymptotically matches the minimax lower bound for offline reinforcement learning, and hence our approach is asymptotically minimax optimal.
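For concreteness, the DRO scheme described in the abstract can be instantiated in a tabular setting roughly as follows: estimate an empirical transition kernel from the offline dataset, place a metric ball of radius $\rho$ around each estimated conditional distribution, and run value iteration with the robust Bellman operator that takes the worst case over that ball, i.e. compute a policy maximizing $\min_{P \in \mathcal{P}} V^{\pi}_P$. The sketch below is illustrative only: it assumes a total-variation ball of a fixed radius, which is one possible metric-based choice and does not reproduce the paper's exact set construction, radius calibration, or the value-function-based set used to attain the optimal rate; the names `P_hat`, `r`, `rho`, and `gamma` are assumptions of this sketch.

```python
import numpy as np

def worst_case_return(p_hat, v, rho):
    """Worst-case expectation of v over the total-variation ball
    {p : TV(p, p_hat) <= rho}: move up to rho probability mass from
    the highest-value states onto the lowest-value state."""
    p = p_hat.astype(float).copy()
    lo = int(np.argmin(v))
    budget = rho                                 # total mass we may relocate (TV radius)
    for s in np.argsort(v)[::-1]:                # take mass from high-value states first
        if budget <= 1e-12:
            break
        if s == lo:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[lo] += moved
        budget -= moved
    return float(p @ v)

def robust_value_iteration(P_hat, r, gamma, rho, iters=2000, tol=1e-8):
    """Value iteration with the robust Bellman operator
    (T V)(s) = max_a [ r(s,a) + gamma * min_{p in ball(P_hat[s,a], rho)} p @ V ],
    applied to an empirical kernel P_hat of shape (S, A, S) and rewards r of shape (S, A)."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = np.array([[r[s, a] + gamma * worst_case_return(P_hat[s, a], V, rho)
                       for a in range(A)] for s in range(S)])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return Q.argmax(axis=1), V                   # greedy robust policy and its robust value
```

The inner minimization over the total-variation ball is a linear program whose optimum is attained by greedily shifting mass toward the lowest-value next state, which is what `worst_case_return` implements; the resulting greedy policy is the worst-case-optimal policy for this particular (assumed) uncertainty set, not necessarily the estimator analyzed in the paper.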
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Gergely_Neu1
Submission Number: 1830