Keywords: actor-critic, mirror descent, off-policy, policy optimization
TL;DR: We investigate the empirical feasibility of mirror-descent updates in off-policy actor-critic.
Abstract: Many policy gradient methods prevent drastic changes to policies during learning. This is commonly achieved through a
Kullback-Leibler (KL) divergence term. Recent work has established a theoretical connection between this heuristic and
Mirror Descent (MD), offering insight into the empirical successes of existing policy gradient and actor-critic
algorithms. This insight has further motivated the development of novel algorithms that better adhere to the principles
of MD, alongside a growing body of theoretical research on policy mirror descent. In this study, we examine the
empirical feasibility of MD-based policy updates in off-policy actor-critic. Specifically, we introduce principled MD
adaptations of three widely used actor-critic algorithms and systematically evaluate their empirical effectiveness. Our
findings indicate that, while MD-style policy updates do not appear to offer significant practical advantages over conventional
off-policy actor-critic updates, they can somewhat reduce sensitivity to step-size selection with widely used deep-learning optimizers.
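For context, the KL-regularized policy improvement step referenced in the abstract is typically written as a mirror-descent update. The following is a generic sketch; the notation ($\eta$ for the step size, $Q^{\pi_k}$ for the critic's action-value estimate) is assumed for illustration and is not taken from the paper itself.

```latex
% Generic policy mirror-descent update (illustrative sketch; notation assumed).
% At iteration k, the new policy maximizes the critic's value estimate while a
% KL term penalizes deviation from the current policy \pi_k; \eta is the step size.
\[
\pi_{k+1}(\cdot \mid s)
  \;=\; \arg\max_{\pi(\cdot \mid s)}
  \Big\{ \eta \, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\big[ Q^{\pi_k}(s, a) \big]
        \;-\; D_{\mathrm{KL}}\!\big( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \big) \Big\}.
\]
```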
Submission Number: 211