Value Improved Actor Critic Algorithms

Published: 17 Jul 2025, Last Modified: 06 Sept 2025, EWRL 2025 Poster, CC BY 4.0
Keywords: Reinforcement Learning, Dynamic Programming, Actor Critic, Policy Improvement
TL;DR: We extend actor critic and approximate policy iteration approaches with an additional policy improvement step, applied only to the policy evaluated by the value function.
Abstract: To learn approximately optimal acting policies for decision problems, modern Actor Critic algorithms rely on deep Neural Networks (DNNs) to parameterize the acting policy and on greedification operators to iteratively improve it. The reliance on DNNs suggests a gradient-based improvement, which per step is much less greedy than the improvement achievable with greedier operators, such as the greedy update used by Q-learning algorithms. On the other hand, slow changes to the policy can also benefit the stability of the learning process, resulting in a tradeoff between greedification and stability. To better address this tradeoff, we propose to decouple the acting policy from the policy evaluated by the critic. This allows the agent to improve the critic's policy separately (i.e. \textit{value improvement}) with greedier updates, while maintaining the slow gradient-based improvement of the parameterized acting policy. We investigate the convergence of this approach using the popular analysis scheme of generalized Policy Iteration in the finite-horizon setting. Empirically, incorporating value improvement into the popular off-policy actor-critic algorithms TD3 and SAC significantly improves on or matches the performance of their respective baselines, across different environments from the DeepMind continuous control domain, with negligible compute and implementation cost.
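To make the decoupling concrete, below is a minimal, hedged sketch (not the authors' code) of one way a value-improvement step could enter a TD3-style critic update: the bootstrap target is computed under a greedified version of the actor's policy, while the actor itself is still improved with the usual slow gradient step. All names (actor_target, critic_target, the perturbation-and-argmax greedification, and the hyperparameters) are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch: value-improved TD target for an off-policy actor-critic agent.
# The greedification here is a simple sample-and-pick-the-best step around
# the target actor's action; the paper's operator may differ.
import torch


def value_improved_target(critic_target, actor_target, next_obs,
                          reward, not_done, gamma=0.99,
                          n_samples=8, noise_std=0.1, max_action=1.0):
    """Bootstrap target evaluated under a greedier policy than the actor."""
    with torch.no_grad():
        # Action from the slowly (gradient-based) improved target actor.
        base_action = actor_target(next_obs)                       # [B, A]

        # Value improvement: perturb the actor's action and keep the
        # candidate with the highest estimated Q-value.
        candidates = base_action.unsqueeze(1) + noise_std * torch.randn(
            base_action.size(0), n_samples, base_action.size(1),
            device=base_action.device)
        candidates = candidates.clamp(-max_action, max_action)     # [B, N, A]

        B, N, A = candidates.shape
        flat_obs = next_obs.unsqueeze(1).expand(B, N, -1).reshape(B * N, -1)
        q_vals = critic_target(flat_obs, candidates.reshape(B * N, A))
        greedy_q = q_vals.reshape(B, N).max(dim=1).values          # [B]

        # Standard TD target, but bootstrapped on the greedified policy.
        return reward + not_done * gamma * greedy_q
```

In this sketch the acting policy and its gradient update are left untouched; only the policy implicitly evaluated by the critic's target is made greedier, which is the decoupling the abstract describes.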
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Yaniv_Oren1
Track: Regular Track: unpublished work
Submission Number: 17