VQ-learning: Towards Unbiased Action Value Estimation in Reinforcement Learning

TMLR Paper2599 Authors

29 Apr 2024 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Q-learning, a well-known Reinforcement Learning algorithm, is prone to overestimation of action values in stochastic settings. This overestimation is mainly due to the use of the max operator when updating the Q function. Deep Q-learning (DQN) suffers from the same problem, which is further aggravated by noisy learning environments and can lead to substantial degradation of reward performance. In this work, we introduce a simple yet effective method called VQ-learning, along with an extended version using function approximation, called Deep VQ-Networks (DVQN), which regulates the estimation of action values and effectively tackles the issue of biased value estimation. While Double Q-learning has been proposed to tackle the same issue, we show that VQ-learning provides better sample efficiency, even when the preconditions for overestimation bias are eliminated. We also evaluate DVQN on the Atari 100k benchmark and demonstrate that it consistently outperforms Deep Q-learning, Deep Double Q-learning, Clipped Deep Double Q-learning, Averaged DQN, and Dueling Deep Q-learning in terms of reward performance and sample efficiency. Moreover, our experimental results show that DVQN serves as a better backbone network than DQN when combined with an additional representation learning objective.
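For context, the overestimation the abstract refers to arises in the standard tabular Q-learning update, where the bootstrapped target takes a max over next-state action values. The sketch below shows that standard update only; it is not the paper's VQ-learning rule, which is not reproduced on this page, and the small usage example at the end is purely hypothetical.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One-step tabular Q-learning update; Q is a (num_states, num_actions) array.

    The target uses max_{a'} Q(s', a'). When the Q estimates are noisy
    (e.g., in stochastic environments), this max tends to pick inflated
    estimates, producing the overestimation bias discussed in the abstract.
    """
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical usage on a tiny MDP with 5 states and 3 actions.
Q = np.zeros((5, 3))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```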
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Lihong_Li1
Submission Number: 2599