Value Improved Actor Critic Algorithms

Published: 01 Aug 2024, Last Modified: 09 Oct 2024 · EWRL17 · CC BY 4.0
Keywords: Actor Critic, Dynamic Programming, Policy Improvement, TD3, DDPG, Reinforcement Learning
TL;DR: We extend the Actor Critic framework with an additional policy improvement step used in the value update, improving upon or matching the performance of the respective baselines
Abstract: Many modern reinforcement learning algorithms build on the actor-critic (AC) framework: iterative improvement of a policy (the actor) using *policy improvement operators* and iterative approximation of the policy's value (the critic). In contrast, the popular value-based algorithm family employs improvement operators in the value update to iteratively improve the value function directly. In this work, we propose a general extension to the AC framework that employs two separate improvement operators: one applied to the policy in the spirit of policy-based algorithms and one applied to the value in the spirit of value-based algorithms, which we dub Value-Improved AC (VI-AC). We design two practical VI-AC algorithms based on the popular online off-policy AC algorithms TD3 and DDPG. We evaluate VI-TD3 and VI-DDPG on the MuJoCo benchmark and find that both improve upon or match the performance of their respective baselines in all environments tested.
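To make the idea concrete, below is a minimal sketch of what a value-improved critic target could look like. It is not the paper's implementation: it assumes the value-side improvement operator is a simple sample-max (evaluate several perturbed actor actions with the critic and back up the best one), and the names `value_improved_target`, `n_samples`, and `sigma` are illustrative. The paper's actual operators may differ.

```python
# Hedged sketch of a value-improved critic target in the spirit of VI-TD3.
# Assumption (not from the abstract): the improvement operator in the value
# update is a sample-max over Gaussian perturbations of the actor's action.
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 8, 2, 32
gamma, n_samples, sigma = 0.99, 10, 0.2

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

def value_improved_target(next_obs, reward, done):
    with torch.no_grad():
        # Candidate actions: the actor's action plus Gaussian perturbations.
        a = actor(next_obs)                                   # (B, act_dim)
        noise = sigma * torch.randn(n_samples, *a.shape)
        cands = (a.unsqueeze(0) + noise).clamp(-1.0, 1.0)     # (K, B, act_dim)
        # Improvement step on the value side: keep the best candidate per state,
        # instead of backing up the actor's action directly as in TD3/DDPG.
        obs_rep = next_obs.unsqueeze(0).expand(n_samples, -1, -1)
        q = critic(torch.cat([obs_rep, cands], dim=-1)).squeeze(-1)  # (K, B)
        q_improved = q.max(dim=0).values                      # (B,)
        return reward + gamma * (1.0 - done) * q_improved

# Usage with dummy transitions:
next_obs = torch.randn(batch, obs_dim)
reward, done = torch.randn(batch), torch.zeros(batch)
target = value_improved_target(next_obs, reward, done)        # (B,) TD targets
```

The policy-side improvement (the usual actor update toward higher Q-values) is unchanged; only the bootstrap target gains the extra improvement step.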
Supplementary Material: zip
Submission Number: 8