Keywords: actor-critic, mirror descent, off-policy, policy optimization
TL;DR: We investigate the empirical feasibility of mirror-descent updates in off-policy actor-critic.
Abstract: Many policy gradient methods prevent drastic changes to policies during learning. This is commonly achieved through a Kullback-Leibler (KL) divergence term. Recent work has established a theoretical connection between this heuristic and Mirror Descent (MD), offering insight into the empirical successes of existing policy gradient and actor-critic algorithms. This insight has further motivated the development of novel algorithms that better adhere to the principles of MD, alongside a growing body of theoretical research on policy mirror descent. In this study, we examine the empirical feasibility of MD-based policy updates in off-policy actor-critic. Specifically, we introduce principled MD adaptations of three widely used actor-critic algorithms and systematically evaluate their empirical effectiveness. Our findings indicate that, while MD-style policy updates do not seem to exhibit significant practical advantages over conventional approaches to off-policy actor-critic, they can somewhat mitigate sensitivity to step-size selection with widely used deep-learning optimizers.
Submission Number: 211
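To make the kind of update discussed in the abstract concrete, below is a minimal, illustrative sketch (not the paper's method or any of its three adaptations) of a KL-regularized, mirror-descent-style actor update for a discrete-action policy in PyTorch. All names (`PolicyNet`, `md_style_actor_loss`, `kl_coef`) are hypothetical; the point is only that the proximal KL term penalizes deviation from the previous policy, which is the regularization that mirror-descent analyses of policy optimization formalize.

```python
# Illustrative sketch only: a KL-regularized (mirror-descent-style) actor loss.
# Hypothetical names throughout; not taken from the paper under review.
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Small categorical policy: observation -> action logits."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))


def md_style_actor_loss(policy, old_policy, obs, actions, advantages, kl_coef=0.1):
    """Importance-weighted surrogate with a KL(pi_new || pi_old) penalty.

    The KL term keeps the updated policy close to the previous one; in the
    mirror-descent view this plays the role of the Bregman divergence induced
    by the negative-entropy mirror map.
    """
    dist = policy(obs)
    with torch.no_grad():
        old_dist = old_policy(obs)

    # Off-policy importance-weighted policy-gradient surrogate.
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = ratio * advantages

    # Proximal penalty toward the previous policy.
    kl = torch.distributions.kl_divergence(dist, old_dist)

    # Gradient ascent on (surrogate - kl_coef * kl) == descent on its negation.
    return -(surrogate - kl_coef * kl).mean()
```

In practice the penalty coefficient (here `kl_coef`) corresponds to an inverse step size in the mirror-descent formulation, which is why such updates can reduce sensitivity to the optimizer's learning rate.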