Keywords: Offline Reinforcement Learning, Markov Decision Process, Nonlinear Function Approximation, Generalized Eluder Dimension
Abstract: Offline reinforcement learning, where the agent aims to learn the optimal policy from data collected by a behavior policy, has attracted increasing attention in recent years. While offline RL with linear function approximation has been extensively studied, with optimal results achieved under various assumptions, the theoretical understanding of offline RL with nonlinear function approximation remains limited. Specifically, most existing works on offline RL with nonlinear function approximation either have a poor dependency on the function class complexity or require an inefficient planning phase.
In this paper, we propose an oracle-efficient algorithm, VAPVI, for offline RL with nonlinear function approximation. Our algorithm enjoys a regret bound with a tight dependence on the function class complexity and achieves minimax-optimal instance-dependent regret when specialized to linear function approximation. In our theoretical analysis, we introduce a new coverage assumption for general function approximation, bridging the minimum-eigenvalue assumption and the uncertainty measure widely used in online nonlinear RL. Our algorithmic design includes 1) a variance-based weighted regression scheme for general function classes; 2) a variance estimation subroutine; and 3) a pessimistic value iteration planning phase. To the best of our knowledge, this is the first statistically optimal algorithm for nonlinear offline RL.
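To make the algorithmic ingredients concrete, the following is a minimal sketch of variance-weighted regression combined with a pessimism bonus, specialized to the linear setting the abstract mentions. All names (`Phi`, `sigma2`, `beta`, `lam`) and the specific data are illustrative assumptions, not the paper's actual VAPVI implementation; the variance estimates are taken as given, standing in for the variance estimation subroutine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline dataset in the linear specialization:
# features phi(s_i, a_i), regression targets y_i = r_i + V_next(s'_i),
# and per-sample variance estimates sigma2_i (from a variance subroutine).
n, d = 200, 3
Phi = rng.normal(size=(n, d))
theta_true = np.array([1.0, -0.5, 0.2])          # illustrative ground truth
sigma2 = 0.5 + rng.random(n)                     # assumed variance estimates
y = Phi @ theta_true + rng.normal(size=n) * np.sqrt(sigma2)

lam, beta = 1.0, 1.0                             # ridge parameter, bonus multiplier (hypothetical)

# Variance-weighted ridge regression:
#   Lambda = sum_i phi_i phi_i^T / sigma2_i + lam * I
#   theta_hat = Lambda^{-1} sum_i phi_i y_i / sigma2_i
W = Phi / sigma2[:, None]
Lambda = W.T @ Phi + lam * np.eye(d)
theta_hat = np.linalg.solve(Lambda, W.T @ y)

# Pessimistic value iteration step at a query state-action pair:
# subtract an uncertainty bonus beta * ||phi||_{Lambda^{-1}} from the estimate.
phi_q = rng.normal(size=d)
bonus = beta * np.sqrt(phi_q @ np.linalg.solve(Lambda, phi_q))
q_pessimistic = phi_q @ theta_hat - bonus
```

Down-weighting high-variance samples tightens the confidence set, and the subtracted bonus keeps the planner pessimistic on state-action pairs poorly covered by the behavior policy.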
Supplementary Material: pdf
Submission Number: 9462