StaQ: a Finite Memory Approach to Discrete Action Policy Mirror Descent

ICLR 2026 Conference Submission25527 Authors

20 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | CC BY 4.0

Keywords: reinforcement learning; entropy regularization; policy mirror descent; function approximators

TL;DR: We study a variant of PMD that keeps only the last M Q-functions in memory, and show that it does not bias convergence and retains PMD's error-averaging effect.

Abstract: In Reinforcement Learning (RL), regularization with a Kullback-Leibler divergence that penalizes large deviations between successive policies has emerged as a popular tool in both theory and practice. This family of algorithms, often referred to as Policy Mirror Descent (PMD), has the property of averaging out the policy evaluation errors that are bound to occur when using function approximators. However, exact PMD has remained a mostly theoretical framework, as its closed-form solution involves the sum of all past Q-functions, which is generally intractable. A common practical approximation of PMD is to follow the natural policy gradient, but this potentially introduces errors in the policy update. In this paper, we propose and analyze PMD-like algorithms for discrete action spaces that keep only the last $M$ Q-functions in memory. We show theoretically that for a finite and large enough $M$, an RL algorithm can be derived that introduces no errors from the policy update, yet keeps the desirable PMD property of averaging out policy evaluation errors. Using an efficient GPU implementation, we then show empirically on several medium-scale RL benchmarks such as MuJoCo and MinAtar that increasing $M$ improves performance up to a threshold beyond which performance becomes indistinguishable from that of exact PMD. This reinforces the theoretical finding that an infinite sum might be unnecessary, and that keeping the last $M$ Q-functions in memory is a practical alternative to the natural policy gradient instantiation of PMD.
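To make the finite-memory idea concrete, the following is a minimal tabular sketch and not the submission's implementation. It assumes the pure KL-regularized case with a uniform initial policy and a fixed step size `eta`, where the exact PMD policy is proportional to `exp(eta * sum of all past Q-functions)`, and it truncates that sum to the last `M` Q-tables by simply dropping the oldest one. The class name `FiniteMemoryPolicy`, its methods, and the unweighted sum are illustrative assumptions; the actual algorithm may weight or correct the truncated sum differently.

```python
from collections import deque

import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last (action) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


class FiniteMemoryPolicy:
    """Hypothetical helper: softmax policy over the sum of the last M Q-tables."""

    def __init__(self, memory_size, eta):
        self.eta = eta  # assumed fixed step size
        # A deque with maxlen evicts the oldest Q-table automatically.
        self.q_memory = deque(maxlen=memory_size)

    def push(self, q_table):
        """Store the latest Q estimate, shape (num_states, num_actions)."""
        self.q_memory.append(np.asarray(q_table, dtype=float))

    def probs(self):
        """Return pi(a|s) proportional to exp(eta * sum of stored Q-tables)."""
        if not self.q_memory:
            raise ValueError("push at least one Q-table first")
        q_sum = np.stack(list(self.q_memory)).sum(axis=0)
        return softmax(self.eta * q_sum)


# Usage: one state, two actions, memory of M = 2 Q-tables.
policy = FiniteMemoryPolicy(memory_size=2, eta=1.0)
policy.push([[10.0, 0.0]])  # evicted once a third table arrives
policy.push([[0.0, 1.0]])
policy.push([[0.0, 1.0]])
pi = policy.probs()
```

In this sketch, memory and compute per policy evaluation grow linearly in `M` rather than with the number of iterations, which is the trade-off the abstract describes.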

Primary Area: reinforcement learning

Submission Number: 25527
