**Keywords:** Offline Reinforcement Learning, Neural Networks

**TL;DR:** A provably and computationally efficient algorithm for offline RL with deep neural networks

**Abstract:** We propose a novel offline reinforcement learning (RL) algorithm, namely PErturbed-Reward Value Iteration (PERVI), which amalgamates the randomized value function idea with the pessimism principle. Most current offline RL algorithms explicitly construct statistical confidence regions to obtain pessimism via lower confidence bounds (LCB), which cannot easily scale to complex problems where a neural network is used to estimate the value functions. Instead, PERVI obtains pessimism implicitly by simply perturbing the offline data multiple times with carefully designed i.i.d. Gaussian noise to learn an ensemble of estimated state-action values, and then acting greedily with respect to the minimum of the ensemble. The estimated state-action values are obtained by fitting a parametric model (e.g., a neural network) to the perturbed datasets using gradient descent. As a result, PERVI needs only $\mathcal{O}(1)$ time complexity for action selection, while LCB-based algorithms require at least $\Omega(K^2)$, where $K$ is the total number of trajectories in the offline data. We also propose a novel data-splitting technique that helps remove the potentially large log covering number from the learning bound. We prove that PERVI yields a provable uncertainty quantifier with overparameterized neural networks and achieves an $\tilde{\mathcal{O}}\left( \frac{ \kappa H^{5/2} \tilde{d} }{\sqrt{K}} \right)$ sub-optimality, where $\tilde{d}$ is the effective dimension, $H$ is the horizon length, and $\kappa$ measures the distributional shift. We corroborate the statistical and computational efficiency of PERVI with an empirical evaluation on a wide range of synthetic and real-world datasets. To the best of our knowledge, PERVI is the first offline RL algorithm that is both provably and computationally efficient in general Markov decision processes (MDPs) with neural network function approximation.
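For intuition, the snippet below is a minimal sketch of the perturbed-reward ensemble idea described in the abstract, not the authors' full algorithm: it regresses toward a one-step target rather than running $H$-step value iteration, and the ensemble size `M`, noise scale `sigma`, network architecture, and training budget are all illustrative assumptions.

```python
# Minimal sketch of the perturbed-reward ensemble idea from the abstract.
# NOT the authors' exact algorithm: M, sigma, the network, the number of
# gradient steps, and the one-step regression target are assumptions.
import torch
import torch.nn as nn

def fit_q_network(states, actions, targets, n_actions, steps=200, lr=1e-3):
    """Fit a small Q-network to (state, action) -> target pairs by gradient descent."""
    net = nn.Sequential(nn.Linear(states.shape[1], 64), nn.ReLU(),
                        nn.Linear(64, n_actions))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        q = net(states).gather(1, actions.view(-1, 1)).squeeze(1)
        loss = ((q - targets) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return net

def perturbed_reward_ensemble(states, actions, rewards, n_actions, M=10, sigma=0.1):
    """Train M Q-networks, each on a copy of the data with i.i.d. Gaussian reward noise."""
    ensemble = []
    for _ in range(M):
        noisy_rewards = rewards + sigma * torch.randn_like(rewards)  # perturb the data
        ensemble.append(fit_q_network(states, actions, noisy_rewards, n_actions))
    return ensemble

def act(ensemble, state):
    """Pessimistic greedy action: argmax of the elementwise minimum over the ensemble."""
    with torch.no_grad():
        q_min = torch.stack([net(state) for net in ensemble]).min(dim=0).values
    return q_min.argmax(dim=-1)

# Toy usage on synthetic offline data: K transitions, d-dim states, A actions.
K, d, A = 512, 4, 3
states, actions, rewards = torch.randn(K, d), torch.randint(0, A, (K,)), torch.randn(K)
ensemble = perturbed_reward_ensemble(states, actions, rewards, A)
print(act(ensemble, torch.randn(1, d)))
```

Note that acting greedily on the ensemble minimum costs the same regardless of the dataset size $K$, which is the source of the $\mathcal{O}(1)$ action-selection complexity the abstract contrasts with the $\Omega(K^2)$ cost of LCB-based methods.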

**Anonymous Url:** I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

**No Acknowledgement Section:** I certify that there is no acknowledgement section in this submission for double blind review.

**Code Of Ethics:** I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

**Submission Guidelines:** Yes

**Please Choose The Closest Area That Your Submission Falls Into:** Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)
