Keywords: actor-critic, mirror descent, off-policy, policy optimization
TL;DR: We investigate the empirical feasibility of mirror-descent updates in off-policy actor-critic.
Abstract: Many policy gradient methods prevent drastic changes to policies during learning. This is commonly achieved through a Kullback-Leibler (KL) divergence term. Recent work has established a theoretical connection between this heuristic and Mirror Descent (MD), offering insight into the empirical successes of existing policy gradient and actor-critic algorithms. This insight has further motivated the development of novel algorithms that better adhere to the principles of MD, alongside a growing body of theoretical research on policy mirror descent. In this study, we examine the empirical feasibility of MD-based policy updates in off-policy actor-critic. Specifically, we introduce principled MD adaptations of three widely used actor-critic algorithms and systematically evaluate their empirical effectiveness. Our findings indicate that, while MD-style policy updates do not seem to exhibit significant practical advantages over conventional approaches to off-policy actor-critic, they can somewhat mitigate sensitivity to step-size selection with widely used deep-learning optimizers.
Submission Number: 211
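To make the kind of update discussed in the abstract concrete, below is a minimal, illustrative sketch (not the paper's method or any of its three adaptations) of a KL-regularized, mirror-descent-style actor update for a discrete-action policy in PyTorch. All names (`PolicyNet`, `md_style_actor_loss`, `kl_coef`) are hypothetical; the point is only that the proximal KL term penalizes deviation from the previous policy, which is the regularization that mirror-descent analyses of policy optimization formalize.

```python
# Illustrative sketch only: a KL-regularized (mirror-descent-style) actor loss.
# Hypothetical names throughout; not taken from the paper under review.
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Small categorical policy: observation -> action logits."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))


def md_style_actor_loss(policy, old_policy, obs, actions, advantages, kl_coef=0.1):
    """Importance-weighted surrogate with a KL(pi_new || pi_old) penalty.

    The KL term keeps the updated policy close to the previous one; in the
    mirror-descent view this plays the role of the Bregman divergence induced
    by the negative-entropy mirror map.
    """
    dist = policy(obs)
    with torch.no_grad():
        old_dist = old_policy(obs)

    # Off-policy importance-weighted policy-gradient surrogate.
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = ratio * advantages

    # Proximal penalty toward the previous policy.
    kl = torch.distributions.kl_divergence(dist, old_dist)

    # Gradient ascent on (surrogate - kl_coef * kl) == descent on its negation.
    return -(surrogate - kl_coef * kl).mean()
```

In practice the penalty coefficient (here `kl_coef`) corresponds to an inverse step size in the mirror-descent formulation, which is why such updates can reduce sensitivity to the optimizer's learning rate.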