Optimistic Actor-Critic with Parametric Policies: Unifying Sample Efficiency and Practicality

Published: 23 Sept 2025, Last Modified: 01 Dec 2025 · ARLET · CC BY 4.0
Track: Research Track
Keywords: Actor-Critic, Strategic Exploration, Sample Complexity Guarantees
Abstract: Although actor-critic (AC) methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods that require complicated algorithmic modifications. Moreover, the AC methods analyzed for finite-horizon MDPs often construct ``implicit'' policies without explicitly parameterizing them, further exacerbating the mismatch between theory and practice. To address these limitations, we propose an optimistic AC framework with parametric policies that is both practical and equipped with theoretical guarantees for episodic linear MDPs. In particular, we introduce a tractable regression objective for the actor to train log-linear policies. This enables us to control the error between the parameterized actor and the easier-to-analyze implicit policies induced by natural policy gradient. To train the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. This yields a principled yet flexible exploration scheme without any additional assumptions on the MDP. We prove that our algorithm achieves an $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity in the on-policy setting and an $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the off-policy setting. Our algorithm matches prior theoretical work in achieving state-of-the-art sample efficiency while being more aligned with practice.
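To make the two algorithmic ingredients described above concrete, the sketch below illustrates (i) a log-linear actor $\pi_\theta(a|s) \propto \exp(\theta^\top \phi(s,a))$ updated by a regression-style (cross-entropy) surrogate toward an NPG-induced target distribution, and (ii) an approximate Thompson sample of critic weights obtained by running Langevin Monte Carlo on a ridge-regularized least-squares loss. All function names, hyperparameters, and the specific surrogate losses are illustrative assumptions for exposition, not the paper's actual objectives.

```python
import numpy as np

def log_linear_policy(theta, phi_sa):
    """pi_theta(a|s) proportional to exp(theta^T phi(s,a)); phi_sa has shape (A, d)."""
    logits = phi_sa @ theta
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def actor_regression_step(theta, batch_phi, batch_target_probs, lr=0.1):
    """Move the log-linear actor toward a target (e.g., NPG-induced) action
    distribution by minimizing a cross-entropy surrogate (illustrative only)."""
    grad = np.zeros_like(theta)
    for phi_sa, target in zip(batch_phi, batch_target_probs):
        probs = log_linear_policy(theta, phi_sa)
        # Gradient of -sum_a target(a) log pi_theta(a|s) for a softmax policy
        grad += phi_sa.T @ (probs - target)
    return theta - lr * grad / len(batch_phi)

def lmc_critic_sample(features, targets, w, step=1e-3, n_steps=50,
                      prior_reg=1.0, temperature=1.0, rng=None):
    """One approximate Thompson sample of critic weights via Langevin Monte Carlo.

    Performs noisy gradient descent on the ridge-regularized least-squares loss
    L(w) = ||X w - y||^2 + prior_reg * ||w||^2, injecting Gaussian noise so the
    iterates approximately sample from a Gibbs posterior over w.
    """
    rng = np.random.default_rng() if rng is None else rng
    X, y = features, targets
    for _ in range(n_steps):
        grad = 2.0 * X.T @ (X @ w - y) + 2.0 * prior_reg * w
        noise = np.sqrt(2.0 * step * temperature) * rng.standard_normal(w.shape)
        w = w - step * grad + noise
    return w
```

Randomness in the sampled critic weights plays the role of optimism here: value estimates computed from the perturbed weights can overestimate, which drives exploration without explicit bonus terms. This is a minimal sketch under the stated assumptions, not the algorithm as specified in the paper.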
Submission Number: 129