Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs


22 Sept 2022, 12:36 (modified: 18 Nov 2022, 16:30) | ICLR 2023 Conference Blind Submission | Readers: Everyone
Keywords: Reward Free Exploration, Representation Learning, Sample Complexity, Model-Based RL
TL;DR: We propose a novel reward-free reinforcement learning algorithm under low-rank MDPs that improves on the sample complexity of previous work. We also provide a lower bound. Finally, we study representation learning via reward-free reinforcement learning.
Abstract: In reward-free reinforcement learning (RL), an agent first explores the environment without any reward information, in order to achieve certain learning goals afterwards for any given reward. While reward-free RL has been well studied in the tabular setting, where minimax-optimal sample complexity has been achieved, the theoretical study of reward-free RL with complicated function approximation is still limited. In this paper, we focus on reward-free RL under low-rank MDP models, which capture representation learning in RL. We propose a new model-based algorithm, coined RAFFLE, and show that it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification via reward-free exploration, with a sample complexity of $\tilde{O}(\frac{H^3d^2K(d^2+K)}{\epsilon^2})$, where $d$, $H$ and $K$ respectively denote the representation dimension, episode horizon, and action space cardinality. This significantly improves on the sample complexity of $\tilde{O}(\frac{H^{22}K^9d^7}{\epsilon^{10}})$ in Agarwal et al. (2020) for the same learning goals. We further provide a sample complexity lower bound of $\tilde{\Omega}(\frac{HdK}{\epsilon^2})$ that holds for any reward-free algorithm under low-rank MDPs, which matches our upper bound in the dependence on $\epsilon$, as well as on $K$ in the large $d$ regime. Comparing this lower bound for low-rank MDPs with the upper bound for linear MDPs in Wang et al. (2020) implies that reward-free RL under low-rank MDPs is strictly harder than under linear MDPs. Finally, we complete our study by reusing RAFFLE to learn the representation. We estimate the representation individually, with access only to the learned transition kernels from RAFFLE and without interacting with the true environment, and then theoretically characterize the closeness between the learned and the ground-truth representations. The learned representation can be further used for few-shot RL, as in supervised learning (Du et al., 2021b).
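As a reading aid (not part of the submission), the stated bounds can be compared directly. The sketch below uses only the expressions quoted in the abstract and suppresses constants and logarithmic factors; the condition $d^2 \ge K$ is an assumed way of making "the large $d$ regime" concrete.

```latex
% Upper bound from the abstract, expanded into its two terms:
\[
  \frac{H^3 d^2 K (d^2 + K)}{\epsilon^2}
  = \frac{H^3 d^4 K}{\epsilon^2} + \frac{H^3 d^2 K^2}{\epsilon^2}.
\]
% Assumption: d^2 >= K. Then d^2 + K <= 2 d^2, so the upper bound is at most
\[
  \frac{2 H^3 d^4 K}{\epsilon^2},
\]
% which is linear in K and scales as 1/\epsilon^2,
% the same dependence as the lower bound H d K / \epsilon^2.
% Ratio of the upper bound to the lower bound:
\[
  \frac{H^3 d^2 K (d^2 + K) / \epsilon^2}{H d K / \epsilon^2}
  = H^2 d \, (d^2 + K).
\]
```

The remaining gap between the two bounds is therefore in the dependence on $H$ and $d$, and on $K$ only when $K$ exceeds $d^2$.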
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (e.g., decision and control, planning, hierarchical RL, robotics)