TEAC: Integrating Trust Region and Max Entropy Actor Critic for Continuous Control

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: Reinforcement Learning, Trust region methods, Maximum Entropy Reinforcement Learning, Deep Reinforcement Learning
Abstract: Trust region methods and maximum entropy methods are two state-of-the-art branches of reinforcement learning (RL) for continuous control, valued for stability and exploration, respectively. This paper integrates both branches in a unified framework, thereby benefiting from both. We first transform the original RL objective into a constrained optimization problem and then propose trust entropy actor-critic (TEAC), an off-policy algorithm that learns stable and sufficiently explored policies for continuous states and actions. TEAC trains the critic by minimizing a refined Bellman error and updates the actor by minimizing a KL-divergence loss derived from the closed-form solution to the Lagrangian. We prove that policy evaluation and policy improvement in TEAC are guaranteed to converge. We compare TEAC with four state-of-the-art methods on six MuJoCo tasks; the results show that TEAC outperforms them in both efficiency and effectiveness.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: We propose a novel off-policy trust entropy actor-critic method to learn stable and sufficiently explored policies for continuous states and actions.
Reviewed Version (pdf): https://openreview.net/references/pdf?id=bzTQQZQ6ix
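
For intuition, the following is a minimal PyTorch sketch of the kind of max-entropy actor-critic update the abstract describes: the critic regresses a soft Bellman target, and the actor loss is, up to a state-dependent constant, the KL divergence between the policy and the Boltzmann distribution induced by the soft Q-function. This is an illustrative sketch only, not the authors' TEAC implementation; the network sizes, function names, and the fixed temperature alpha are assumptions, and the paper's refined Bellman error and trust-region constraint are not reproduced here.

# Illustrative sketch only (not the authors' TEAC code); assumed names and
# hyperparameters throughout.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * act_dim))

    def sample(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
        action = dist.rsample()                   # reparameterized sample
        log_prob = dist.log_prob(action).sum(-1)  # log pi(a|s); tanh squashing omitted
        return action, log_prob

class QNetwork(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def losses(policy, q, q_target, batch, alpha=0.2, gamma=0.99):
    obs, act, rew, next_obs, done = batch
    # Critic: soft Bellman target with an entropy bonus on the next action.
    with torch.no_grad():
        next_act, next_logp = policy.sample(next_obs)
        target = rew + gamma * (1.0 - done) * (q_target(next_obs, next_act)
                                               - alpha * next_logp)
    critic_loss = ((q(obs, act) - target) ** 2).mean()
    # Actor: minimizing E[alpha * log pi(a|s) - Q(s, a)] is, up to a constant,
    # minimizing KL(pi || exp(Q / alpha) / Z), the closed-form Boltzmann policy.
    new_act, logp = policy.sample(obs)
    actor_loss = (alpha * logp - q(obs, new_act)).mean()
    return critic_loss, actor_loss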