Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition

Zhong Zheng; Haochen Zhang; Lingzhou Xue

Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition

Zhong Zheng, Haochen Zhang, Lingzhou Xue

Published: 22 Jan 2025, Last Modified: 10 Mar 2025ICLR 2025 SpotlightEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Reinforcement Learning, Q-Learning, Regret

TL;DR: This paper analyzes the gap-dependent regrets and policy switching costs of two Q-Learning algorithms with variance reduction.

Abstract: We study the gap-dependent bounds of two important algorithms for on-policy $Q$-learning for finite-horizon episodic tabular Markov Decision Processes (MDPs): UCB-Advantage (Zhang et al. 2020) and Q-EarlySettled-Advantage (Li et al. 2021). UCB-Advantage and Q-EarlySettled-Advantage improve upon the results based on Hoeffding-type bonuses and achieve the {almost optimal} $\sqrt{T}$-type regret bound in the worst-case scenario, where $T$ is the total number of steps. However, the benign structures of the MDPs such as a strictly positive suboptimality gap can significantly improve the regret. While gap-dependent regret bounds have been obtained for $Q$-learning with Hoeffding-type bonuses, it remains an open question to establish gap-dependent regret bounds for $Q$-learning using variance estimators in their bonuses and reference-advantage decomposition for variance reduction. We develop a novel error decomposition framework to prove gap-dependent regret bounds of UCB-Advantage and Q-EarlySettled-Advantage that are logarithmic in $T$ and improve upon existing ones for $Q$-learning algorithms. Moreover, we establish the gap-dependent bound for the policy switching cost of UCB-Advantage and improve that under the worst-case MDPs. To our knowledge, this paper presents the first gap-dependent regret analysis for $Q$-learning using variance estimators and reference-advantage decomposition and also provides the first gap-dependent analysis on policy switching cost for $Q$-learning.

Supplementary Material: zip

Primary Area: reinforcement learning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 953

Loading