Worst-Case Offline Reinforcement Learning with Arbitrary Data Support

Published: 25 Sept 2024, Last Modified: 06 Jan 2025 · NeurIPS 2024 poster · CC BY-NC-SA 4.0
Keywords: offline reinforcement learning, worst-case learning, support free, sample complexity analysis
Abstract: We propose an offline reinforcement learning (RL) method with a performance guarantee that requires no assumptions on the data support. Under such conditions, estimating or optimizing the conventional performance metric is generally infeasible due to the discrepancy between the data distribution and the target policy's distribution. To address this issue, we employ a worst-case policy value as a new metric and constructively show that a sample complexity bound of $O(\epsilon^{-2})$ is attainable without any data-support conditions, where $\epsilon>0$ is the policy suboptimality measured in the new metric. Moreover, since the new metric generalizes the conventional one, the algorithm can address standard offline RL tasks without modification. In this context, our sample complexity bound can be seen as a strict improvement over previous bounds obtained under single-policy concentrability and single-policy realizability.
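The abstract does not spell out the metric's definition; as a rough illustrative sketch only (not the paper's construction), a worst-case policy value typically contrasts with the conventional value as follows, where the true model $M^{*}$, the model set $\mathcal{M}_D$, the discount factor $\gamma$, and the reward $r_t$ are notation assumed here:

$$J(\pi) \;=\; \mathbb{E}_{\pi,\,M^{*}}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right], \qquad \underline{J}(\pi) \;=\; \inf_{M \in \mathcal{M}_D} \mathbb{E}_{\pi,\,M}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right],$$

where $\mathcal{M}_D$ is a set of environment models consistent with the offline dataset $D$. On regions the data covers, the infimum is pinned down by the observations, so when coverage is complete a metric of this form reduces to the conventional value $J(\pi)$, which is one way such a worst-case metric can generalize the standard one.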
Primary Area: Reinforcement learning
Submission Number: 4883