Data Efficient Deep Reinforcement Learning With Action-Ranked Temporal Difference Learning

Published: 01 Jan 2024, Last Modified: 08 Oct 2024. IEEE Trans. Emerg. Top. Comput. Intell., 2024. License: CC BY-SA 4.0
Abstract: In value-based deep reinforcement learning (RL), value function approximation errors lead to suboptimal policies. Temporal difference (TD) learning is one of the most widely used methods for approximating the state-action ( $Q$ ) value function. In TD learning, it is critical to estimate the $Q$ values of greedy actions accurately, because a more accurate target $Q$ value improves the accuracy of the learned $Q$ values overall. To this end, we propose an action-ranked TD learning method that improves the performance of deep RL by weighting each TD error according to the rank of its corresponding action's $Q$ value among all $Q$ values at that state. The proposed method provides more accurate target values for TD learning, which in turn makes the $Q$ value estimates more accurate. We apply the proposed method to a representative value-based deep RL algorithm, and the results show that it outperforms the baselines on 31 out of 40 Atari games. Furthermore, we extend the proposed method to multi-agent deep RL. To adaptively determine the hyperparameter of action-ranked TD learning, we also propose a meta action-ranked TD learning method. A series of experiments quantitatively verifies that our methods outperform baselines on Atari games, StarCraft-II, and Grid World environments.
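The snippet below is a minimal sketch, not the authors' implementation, of the idea stated in the abstract: weight each TD error by the rank of the taken action's $Q$ value among all $Q$ values at that state, assuming a DQN-style agent in PyTorch. The specific weighting function and the hyperparameter name `alpha` are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of action-ranked TD weighting for a DQN-style agent.
# Assumptions (not from the paper): the weight (1 - rank/num_actions)**alpha
# and the hyperparameter name `alpha` are illustrative only.
import torch


def action_ranked_td_loss(q_net, target_net, batch, gamma=0.99, alpha=0.5):
    """Rank-weighted TD loss: TD errors of higher-ranked (larger-Q) actions
    receive larger weights."""
    s, a, r, s_next, done = batch              # states, actions, rewards, next states, done flags
    q_all = q_net(s)                           # (B, num_actions) Q values at each state
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q value of the taken action

    with torch.no_grad():
        # Standard bootstrapped target from the greedy action's target-network Q value.
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next

        # Rank of the taken action's Q value at this state:
        # rank 0 = greedy (largest Q), rank grows as the Q value decreases.
        rank = (q_all > q_sa.unsqueeze(1)).sum(dim=1).float()
        num_actions = q_all.shape[1]
        # Assumed weighting scheme: near-greedy actions get weights close to 1,
        # low-ranked actions get smaller weights, shaped by `alpha`.
        weight = (1.0 - rank / num_actions) ** alpha

    td_error = target - q_sa
    return (weight * td_error.pow(2)).mean()
```

In this sketch the weights only rescale each sample's squared TD error, so the loss can be dropped into an ordinary DQN training loop in place of the usual unweighted TD loss.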