Conservative Exploration in Linear MDPs under Episode-wise Constraints

Ruiquan Huang; Donghao Li; Cong Shen; Ashley Prater-Bennette; Jing Yang

22 Sept 2022 (modified: 13 Feb 2023), ICLR 2023 Conference Withdrawn Submission, Readers: Everyone

Keywords: Conservative Exploration, Sample Complexity, Linear MDP, Offline and Online RL

TL;DR: We study conservative exploration with an offline dataset during online learning for linear MDPs and prove that the regret of our algorithm matches that of its constraint-free counterpart.

Abstract: This paper investigates conservative exploration in reinforcement learning, where the performance of the learning agent is guaranteed to stay above a certain threshold throughout the learning process. It focuses on the episodic linear Markov Decision Process (MDP) setting, where the transition kernels and the reward functions are assumed to be linear. Given knowledge of an existing safe baseline policy, two algorithms based on Least-Squares Value Iteration (LSVI) (Bradtke and Barto, 1996; Osband et al., 2016), coined StepMix-LSVI and EpsMix-LSVI, are proposed to balance exploitation and exploration while ensuring, with high probability, that the conservative constraint is not violated in any episode. Theoretical analysis shows that both algorithms achieve the same regret order as LSVI-UCB, their constraint-free counterpart from Jin et al. (2020), indicating that obeying the stringent episode-wise conservative constraint does not compromise the learning performance of these algorithms. We further extend the analysis to the setting where the baseline policy is not given a priori but must be learned from an offline dataset, and prove that a similar safety guarantee and regret bound can be achieved if the offline dataset is sufficiently large. Experimental results corroborate the theoretical analysis and demonstrate the effectiveness of the proposed conservative exploration strategies.
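
For readers unfamiliar with the constraint-free counterpart mentioned in the abstract, the following is a minimal sketch of a single LSVI-UCB-style regression step (ridge regression plus an elliptical exploration bonus, in the spirit of Jin et al., 2020). It is not StepMix-LSVI or EpsMix-LSVI, whose mixing rules are not described on this page. The function name `lsvi_ucb_step` and the parameters `beta`, `lam`, and `H` are illustrative assumptions, not taken from the submission.

```python
import numpy as np

def lsvi_ucb_step(phi, targets, phi_query, beta=1.0, lam=1.0, H=1.0):
    """One optimistic least-squares step (hypothetical helper, for illustration).

    phi:       (n, d) features of previously visited (state, action) pairs
    targets:   (n,)   regression targets, e.g. reward + next-step value estimate
    phi_query: (m, d) features at which to evaluate the optimistic Q estimate
    beta, lam, H: bonus scale, ridge parameter, value cap (all assumed values)
    """
    d = phi.shape[1]
    # Regularized Gram matrix: Lambda = lam * I + sum_i phi_i phi_i^T
    Lambda = lam * np.eye(d) + phi.T @ phi
    # Ridge-regression weights: w = Lambda^{-1} sum_i phi_i * target_i
    w = np.linalg.solve(Lambda, phi.T @ targets)
    # Exploration bonus: beta * sqrt(phi^T Lambda^{-1} phi) for each query row
    lambda_inv_q = np.linalg.solve(Lambda, phi_query.T)  # shape (d, m)
    bonus = beta * np.sqrt(np.sum(phi_query.T * lambda_inv_q, axis=0))
    # Optimistic Q estimate, truncated at the maximum attainable value H
    q = np.minimum(phi_query @ w + bonus, H)
    return w, q

# Toy usage with two one-hot features (made-up data, not from the paper).
phi = np.eye(2)
targets = np.array([1.0, 0.0])
w, q = lsvi_ucb_step(phi, targets, phi_query=np.eye(2))
```

In this sketch the bonus shrinks as more data is collected along a feature direction, which is what drives exploration; a conservative variant would additionally have to decide when to fall back on the baseline policy, and that logic is deliberately left out here.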

Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (e.g., decision and control, planning, hierarchical RL, robotics)

Supplementary Material: zip