Abstract: To mitigate the extrapolation error arising from the offline reinforcement learning (RL) paradigm, existing methods typically make the learned Q-function over-conservative or enforce global policy constraints. In this article, we propose a dual behavior-regularized offline deterministic Actor-Critic (DBRAC) that simultaneously performs behavior regularization on the coupled, iteratively alternating policy evaluation (PE) and policy improvement (PI) phases of the policy iteration process. In the PE phase, the difference between the Q-function and the behavior value is taken as an anti-exploration behavior-value regularization term that drives the Q-function toward its true Q-value, significantly reducing the conservatism of the learned Q-function. In the PI phase, the estimated action variances of the behavior policy in different states are used to set the weight and threshold of a mild, local behavior-cloning regularization term, which standardizes the local improvement potential of the learned policy. Experiments on the well-known Datasets for Deep Data-Driven RL (D4RL) benchmark demonstrate that DBRAC quickly learns more competitive task-solving policies in various offline situations with different data qualities, significantly outperforming state-of-the-art offline RL baselines.
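The two regularization terms described in the abstract can be sketched as loss functions; a minimal NumPy illustration follows. All function names, the squared-difference form of the anti-exploration term, and the specific variance-based weight and threshold choices are illustrative assumptions made here, not the paper's actual implementation.

```python
import numpy as np

def pe_loss(q_sa, td_target, v_behavior, alpha=0.1):
    """Critic (PE) loss sketch: standard TD error plus an anti-exploration
    behavior-value regularizer built from the difference between the
    Q-function and the behavior value (alpha is an assumed coefficient)."""
    td_error = (q_sa - td_target) ** 2
    reg = alpha * (q_sa - v_behavior) ** 2  # pulls Q toward the behavior value
    return np.mean(td_error + reg)

def pi_loss(q_pi, pi_action, behavior_action, behavior_var, base_w=1.0):
    """Actor (PI) loss sketch: maximize Q under a mild, local behavior-cloning
    penalty whose weight and threshold both depend on the estimated action
    variance of the behavior policy (assumed forms for illustration)."""
    weight = base_w / (behavior_var + 1e-6)       # high variance -> weaker BC pull
    threshold = behavior_var                      # allow deviation up to the variance
    deviation = (pi_action - behavior_action) ** 2
    bc = weight * np.maximum(deviation - threshold, 0.0)  # penalize only beyond threshold
    return np.mean(-q_pi + bc)
```

With this thresholded form, actions that stay within the behavior policy's local variance incur no cloning penalty, while large deviations in low-variance (confidently covered) states are penalized strongly, matching the "mild-local" intent described above.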