TL;DR: Accelerating Quantum Reinforcement Learning with a Quantum Natural Policy Gradient Based Approach
Abstract: We address the problem of quantum reinforcement learning (QRL) in the model-free setting with quantum oracle access to the Markov Decision Process (MDP). This paper introduces a Quantum Natural Policy Gradient (QNPG) algorithm, which replaces the random sampling used in classical Natural Policy Gradient (NPG) estimators with a deterministic gradient-estimation approach, enabling seamless integration into quantum systems. While this modification introduces a bounded bias into the estimator, the bias decays exponentially with the truncation level. This paper demonstrates that the proposed QNPG algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ for queries to the quantum oracle, significantly improving over the classical lower bound of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for queries to the MDP.
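As a rough illustration of the kind of truncation the abstract refers to (not the paper's exact QNPG estimator), consider a discounted policy gradient cut off at a fixed horizon $H$; the constants below are problem-dependent assumptions, not results from the paper:
$$\nabla J(\theta) = \mathbb{E}\!\left[\sum_{h=0}^{\infty} \gamma^{h}\, \nabla_{\theta}\log \pi_{\theta}(a_h \mid s_h)\, Q^{\pi_\theta}(s_h, a_h)\right], \qquad \widehat{\nabla} J_H(\theta) = \mathbb{E}\!\left[\sum_{h=0}^{H-1} \gamma^{h}\, \nabla_{\theta}\log \pi_{\theta}(a_h \mid s_h)\, Q^{\pi_\theta}(s_h, a_h)\right],$$
$$\bigl\|\nabla J(\theta) - \widehat{\nabla} J_H(\theta)\bigr\| \;\le\; C\,\gamma^{H} \quad \text{for a problem-dependent constant } C,$$
so a fixed (deterministic) truncation level $H$ leaves a bias that decays geometrically, i.e., exponentially, in $H$, consistent with the abstract's claim.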
Lay Summary: Reinforcement-learning agents often need huge amounts of experience before they behave well; for the widely used natural policy-gradient method, the sample complexity in the classical setting is lower bounded by $\tilde{O}(\epsilon^{-2})$, which is prohibitive in data-hungry fields like robotics and finance.
We show that a future quantum computer can do better. Our Quantum Natural Policy Gradient (QNPG) algorithm prepares many candidate trajectories in quantum superposition, replaces random-length sampling with a fixed-length deterministic trick that is friendly to quantum hardware, and applies a quantum variance-reduction routine. Together these ideas yield a provably correct model-free algorithm whose sample complexity scales as $\tilde{O}(\epsilon^{-1.5})$, beating the best classical bound, and which works with large, parameterized policies rather than just tiny tabular ones.
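For readers who want a concrete picture of the fixed-length truncation inside a natural-gradient update, the sketch below runs exactly H-step rollouts on a toy random MDP and preconditions the gradient with an empirical Fisher matrix. It is entirely classical and illustrative: the MDP, softmax policy class, and all constants are assumptions, and the quantum ingredients of QNPG (superposed trajectories, quantum mean estimation / variance reduction) are not reproduced here.

import numpy as np

rng = np.random.default_rng(0)

# Toy random MDP: S states, A actions, discount gamma, fixed truncation level H.
S, A, gamma, H = 4, 2, 0.9, 30
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.uniform(size=(S, A))                 # reward table

def softmax_policy(theta, s):
    """Action distribution of a tabular softmax policy at state s."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) with respect to the full parameter table."""
    g = np.zeros_like(theta)
    g[s] = -softmax_policy(theta, s)
    g[s, a] += 1.0
    return g

def truncated_npg_step(theta, n_rollouts=200, lr=0.1, ridge=1e-3):
    """One natural-gradient step using only fixed-length (H-step) rollouts."""
    d = theta.size
    grad = np.zeros(d)
    fisher = np.zeros((d, d))
    for _ in range(n_rollouts):
        s = rng.integers(S)
        disc_rewards, feats = [], []
        # Roll out exactly H steps: the deterministic truncation.
        for h in range(H):
            pi = softmax_policy(theta, s)
            a = rng.choice(A, p=pi)
            g = grad_log_pi(theta, s, a).ravel()
            feats.append((h, g))
            disc_rewards.append(gamma ** h * R[s, a])
            s = rng.choice(S, p=P[s, a])
        disc_rewards = np.array(disc_rewards)
        # REINFORCE-style estimate: each log-prob gradient is weighted by the
        # discounted reward-to-go within the truncated window.
        for h, g in feats:
            grad += g * disc_rewards[h:].sum()
            fisher += gamma ** h * np.outer(g, g)
    grad /= n_rollouts
    fisher = fisher / n_rollouts + ridge * np.eye(d)
    # Natural gradient: precondition the gradient by the regularized Fisher matrix.
    return theta + lr * np.linalg.solve(fisher, grad).reshape(theta.shape)

theta = np.zeros((S, A))
for _ in range(20):
    theta = truncated_npg_step(theta)
probs = np.array([softmax_policy(theta, s) for s in range(S)])
print("learned policy (rows = states):")
print(np.round(probs, 2))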
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Quantum Machine Learning, Reinforcement Learning
Submission Number: 12496