Deployment Efficient Reward-Free Exploration with Linear Function Approximation

Zihan Zhang; Lin Yang; Yuxin Chen; Jason D. Lee; Simon Shaolei Du; Ruosong Wang

Deployment Efficient Reward-Free Exploration with Linear Function Approximation

Zihan Zhang, Lin Yang, Yuxin Chen, Jason D. Lee, Simon Shaolei Du, Ruosong Wang

28 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Linear MDP, Deployment Complexity, Sample Complexity

TL;DR: We design a reward-free exploration algorithm for linear MDPs with nearly optimal deployment efficiency without relying on any additional assumptions.

Abstract: We study deployment efficient reward-free exploration with linear function approximation, where the goal is to explore a linear Markov Decision Process (MDP) without revealing the reward function, while minimizing the number of exploration policies used during the algorithm. We design a new reinforcement learning (RL) algorithm whose sample complexity is polynomial in the feature dimension and horizon length, while achieving nearly optimal deployment efficiency for linear MDPs under the reward-free exploration setting. More specifically, our algorithm explores a linear MDP in a reward-free manner, while using at most $H$ exploration policies during its execution where $H$ is the horizon length. Compared to previous algorithms with similar deployment efficiency guarantees, the sample complexity of our algorithm does not depend on the reachability coefficient or the explorability coefficient of the underlying MDP, which can be arbitrarily small for certain MDPs. Our result addresses an open problem proposed in prior work. To achieve such a result, we show how to truncate state-action pairs of the underlying linear MDP in a data-dependent manner, and devise efficient offline policy evaluation and offline policy optimization algorithms in the truncated linear MDP. We further show how to implement reward-free exploration mechanisms in the linear function approximation setting by carefully combines these offline RL algorithms without sacrificing the deployment efficiency.

Primary Area: learning theory

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 13139

Loading