TEAC: Integrating Trust Region and Max Entropy Actor Critic for Continuous Control

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: Reinforcement Learning, Trust region methods, Maximum Entropy Reinforcement Learning, Deep Reinforcement Learning
Abstract: Trust region methods and maximum entropy methods are two state-of-the-art branches of reinforcement learning (RL) for continuous control, valued for stability and exploration, respectively. This paper integrates both branches in a unified framework, thereby benefiting from both. We first transform the original RL objective into a constrained optimization problem and then propose trust entropy actor-critic (TEAC), an off-policy algorithm that learns stable and sufficiently explored policies for continuous states and actions. TEAC trains the critic by minimizing a refined Bellman error and updates the actor by minimizing a KL-divergence loss derived from the closed-form solution to the Lagrangian. We prove that policy evaluation and policy improvement in TEAC are guaranteed to converge. We compare TEAC with four state-of-the-art methods on six MuJoCo tasks; the results show that TEAC outperforms them in both efficiency and effectiveness.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: We propose a novel off-policy trust entropy actor-critic method to learn stable and sufficiently explored policies for continuous states and actions.
Reviewed Version (pdf): https://openreview.net/references/pdf?id=bzTQQZQ6ix
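
For intuition, the following is a minimal PyTorch sketch of the kind of max-entropy actor-critic update the abstract describes: the critic regresses a soft Bellman target, and the actor loss is, up to a state-dependent constant, the KL divergence between the policy and the Boltzmann distribution induced by the soft Q-function. This is an illustrative sketch only, not the authors' TEAC implementation; the network sizes, function names, and the fixed temperature alpha are assumptions, and the paper's refined Bellman error and trust-region constraint are not reproduced here.

# Illustrative sketch only (not the authors' TEAC code); assumed names and
# hyperparameters throughout.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * act_dim))

    def sample(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
        action = dist.rsample()                   # reparameterized sample
        log_prob = dist.log_prob(action).sum(-1)  # log pi(a|s); tanh squashing omitted
        return action, log_prob

class QNetwork(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def losses(policy, q, q_target, batch, alpha=0.2, gamma=0.99):
    obs, act, rew, next_obs, done = batch
    # Critic: soft Bellman target with an entropy bonus on the next action.
    with torch.no_grad():
        next_act, next_logp = policy.sample(next_obs)
        target = rew + gamma * (1.0 - done) * (q_target(next_obs, next_act)
                                               - alpha * next_logp)
    critic_loss = ((q(obs, act) - target) ** 2).mean()
    # Actor: minimizing E[alpha * log pi(a|s) - Q(s, a)] is, up to a constant,
    # minimizing KL(pi || exp(Q / alpha) / Z), the closed-form Boltzmann policy.
    new_act, logp = policy.sample(obs)
    actor_loss = (alpha * logp - q(obs, new_act)).mean()
    return critic_loss, actor_loss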