**Keywords:**Offline Reinforcement Learning, Statistical Reinforcement Learning, Horizon-free, Reinforcement Learning Theory

**TL;DR:**We provide a simple yet sharp analysis for offline policy evaluation and optimization that escape from the polynomial dependency on the horizon on both tabular MDP and linear MDP with anchor points assumption.

**Abstract:**We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDP). For tabular MDP with $S$ states and $A$ actions, or linear MDP with anchor points and feature dimension $d$, given the collected $K$ episodes data with minimum visiting probability of (anchor) state-action pairs $d_m$, we obtain nearly horizon $H$-free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by 1. Specifically: • For offline policy evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} \right)$ error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and does not have additional dependency on $\mathrm{poly}(H, S, A, d)$ in higher-order term. • For offline policy optimization, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} + \frac{\min(S, d)}{Kd_m}\right)$ sub-optimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a high-order term, improving upon the best known result by [Cui and Yang 2020] that has additional $\mathrm{poly} (H, S, d)$ factors in the main term. To the best of our knowledge, these are the first set of nearly horizon-free bounds for episodic time-homogeneous offline tabular MDP and linear MDP with anchor points. Central to our analysis is a simple yet effective recursion based method to bound a "total variance" term in the offline scenarios, which could be of individual interest.

**Supplementary Material:**pdf

**Code Of Conduct:**I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

24 Replies

Loading