Revisiting Overestimation Bias of Q-learning: Breaking Bias Propagation Chains Does Well

18 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Q-learning, Propagation Bias, Alternating Q-learning, Adaptive Alternating Q-learning
TL;DR: Breaking Bias Propagation Chains Does Well
Abstract: This paper revisits the overestimation bias of Q-learning from a new perspective, namely breaking bias propagation chains. We make five-fold contributions. First, we analyze the estimation bias propagation chains of Q-learning and find that the bias propagated from previous steps, rather than the bias introduced at the current step, dominates the maximum Q-value estimation bias and slows convergence. Second, we propose a novel positive-negative bias alternating algorithm called \underline{A}lternating \underline{Q}-learning (AQ). It breaks the unidirectional estimation bias propagation chains by alternately executing Q-learning and Double Q-learning. We show theoretically that there exist two suitable alternating parameters that eliminate the propagation bias. Third, we design an adaptive alternating strategy for AQ, obtaining \underline{Ada}ptive \underline{A}lternating \underline{Q}-learning (AdaAQ). It applies a softmax strategy over the absolute value of the TD error to choose between Q-learning and Double Q-learning for each state-action pair. Fourth, we extend AQ and AdaAQ to large-scale settings with function approximation, covering both discrete- and continuous-action Deep Reinforcement Learning (DRL). Fifth, both discrete- and continuous-action DRL experiments show that our method substantially outperforms several baselines; tabular MDP experiments provide fundamental insights into why our method achieves superior performance.
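To make the alternating idea in the abstract concrete, below is a minimal tabular sketch assembled only from the description above. It is not the authors' implementation: the two-table layout, the alternation schedule `(k_q, k_dq)`, the softmax temperature `tau`, and all function names are hypothetical illustrative choices.

```python
# Hypothetical tabular sketch of the AQ / AdaAQ idea described in the abstract.
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma, tau = 0.1, 0.99, 1.0  # illustrative hyperparameters

# Two value tables as in Double Q-learning; for simplicity only Q_a is updated here.
Q_a = np.zeros((n_states, n_actions))
Q_b = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s2):
    """Standard max-operator target (tends toward overestimation)."""
    target = r + gamma * Q_a[s2].max()
    td = target - Q_a[s, a]
    Q_a[s, a] += alpha * td
    return td

def double_q_update(s, a, r, s2):
    """Decoupled selection/evaluation target (tends toward underestimation)."""
    a_star = Q_a[s2].argmax()          # select with Q_a
    target = r + gamma * Q_b[s2, a_star]  # evaluate with Q_b
    td = target - Q_a[s, a]
    Q_a[s, a] += alpha * td
    return td

def aq_update(s, a, r, s2, t, k_q=3, k_dq=1):
    """AQ-style alternation (illustrative schedule): run k_q steps of Q-learning,
    then k_dq steps of Double Q-learning, so positive and negative biases can
    offset along the propagation chain instead of accumulating in one direction."""
    if t % (k_q + k_dq) < k_q:
        return q_learning_update(s, a, r, s2)
    return double_q_update(s, a, r, s2)

def adaaq_choice(abs_td_q, abs_td_dq):
    """AdaAQ-style choice (one plausible reading): a softmax over the absolute
    TD errors of the two candidate updates decides, per (s, a), whether to
    apply the Q-learning or the Double Q-learning update."""
    logits = np.array([abs_td_q, abs_td_dq]) / tau
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return np.random.choice(2, p=p)  # 0 -> Q-learning, 1 -> Double Q-learning
```

The sketch only illustrates the alternating and adaptive mechanisms; the paper's actual update rules, alternating parameters, and the precise form of the softmax strategy are specified in the full text.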
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10438