Careful at Estimation and Bold at Exploration for Deterministic Policy Gradient Algorithm

18 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: exploration, actor critic, out of distribution, deterministic policy
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a novel exploration method for continuous-action RL tasks.
Abstract: Exploration strategies in continuous action spaces often rely on heuristics because of the challenge of dealing with an infinite set of possible actions. Previous research has established the advantages of policy-based exploration in deterministic policy reinforcement learning (DPRL) for continuous action spaces. However, policy-based exploration in DPRL suffers from two notable issues: unguided exploration and an exclusive policy, both stemming from the soft policy learning schema commonly used for DPRL policy learning. In response to these challenges, we introduce a novel approach called Bold Actor Conservative Critic (BACC), which leverages Q-values to guide out-of-distribution exploration. We extend the dynamic Boltzmann softmax update theorem to the double Q function framework, incorporating modified weights and Q values. This extension enables us to derive an exploration policy directly from the modified weights. Furthermore, the theorem justifies using the minimum Q value as an intermediate step in the policy gradient computation for policy optimization. In practice, we construct such an exploration policy over a limited set of actions and train a parameterized policy by minimizing the expected KL-divergence between the target policy and a policy constructed from the minimum Q value. To evaluate the effectiveness of our approach, we conduct experiments on the MuJoCo and Roboschool benchmarks, showing superior performance compared to previous state-of-the-art methods across a range of environments. Notably, our method excels in the highly complex Humanoid environment, demonstrating its efficacy in tackling challenging continuous action space exploration problems.
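The following is a minimal sketch of the exploration step described in the abstract: a Boltzmann (softmax) distribution over a limited candidate-action set, weighted by the minimum of two Q estimates. All names, the Gaussian candidate construction, and the hyperparameters are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def bold_exploration_action(policy_action, q1_fn, q2_fn, state,
                            num_candidates=16, noise_scale=0.3,
                            temperature=1.0, rng=None):
    """Sample an exploration action from a Boltzmann distribution over a
    small candidate set, weighted by min(Q1, Q2).

    Sketch only: `q1_fn` / `q2_fn` are assumed to map (state, actions) to a
    vector of Q estimates; the perturbation scheme below is a placeholder
    for the paper's "limited set of actions".
    """
    rng = rng or np.random.default_rng()
    # Candidate set: the deterministic policy action plus perturbed copies.
    perturbed = policy_action + noise_scale * rng.standard_normal(
        (num_candidates, policy_action.shape[-1]))
    candidates = np.vstack([policy_action[None, :], perturbed])
    # Conservative value estimate: elementwise minimum over the double-Q pair.
    q_min = np.minimum(q1_fn(state, candidates), q2_fn(state, candidates))
    # Boltzmann weights over the candidates (numerically stable softmax).
    logits = q_min / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Sample the exploration action according to these weights.
    idx = rng.choice(len(candidates), p=weights)
    return candidates[idx], weights
```

In the training loop described in the abstract, the parameterized policy would then be updated by minimizing the expected KL-divergence toward this min-Q-weighted distribution; that optimization step is omitted here.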
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1151