Value-Based Continuous Control Without Concrete State-Action Value Function

Published: 01 Jan 2021, Last Modified: 12 May 2023. ICSI (2) 2021.
Abstract: In value-based reinforcement learning for continuous control, actions with a higher expected return (state-action value, also known as Q) should be selected as the action decision. However, limited by the expressiveness of deep Q functions, researchers typically introduce an independent policy function to approximate the preference implied by the Q function. These methods, named actor-critic, implement value-based continuous control in an effective but compromised way. In Maximum Entropy Reinforcement Learning, however, the policy function and the Q function are so closely related that each has a closed-form solution in terms of the other. Building on this fact, we propose a value-based continuous control algorithm without a concrete Q function, which infers a temporary Q function from the policy when needed. Compared to the current maximum entropy actor-critic method, our method saves the training of a Q network and a step of policy optimization, which improves time efficiency while matching state-of-the-art data efficiency in experiments.
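The closed-form relationship the abstract relies on can be sketched numerically. In maximum-entropy RL, the optimal policy satisfies pi(a|s) = exp((Q(s,a) - V(s)) / alpha), so Q(s,a) = alpha * log pi(a|s) + V(s), i.e. Q is recoverable from the policy up to a state-dependent constant. The snippet below is a minimal illustration with assumed values (the temperature `alpha` and the toy Q-values are hypothetical, and this is not the paper's actual implementation, which uses deep policy networks):

```python
import numpy as np

alpha = 0.5                       # entropy temperature (assumed for illustration)
q = np.array([1.0, 2.0, 0.5])     # hypothetical Q-values for 3 actions in one state

# Soft state value and the induced maximum-entropy policy
v = alpha * np.log(np.sum(np.exp(q / alpha)))
pi = np.exp((q - v) / alpha)      # pi(a|s) = exp((Q - V) / alpha), sums to 1

# Infer a "temporary" Q back from the policy alone, using the closed form
q_inferred = alpha * np.log(pi) + v

assert np.allclose(q_inferred, q)
```

Because the policy already encodes the Q-values up to the constant V(s), a separate Q network is not strictly required for action selection, which is the observation the proposed method exploits.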
