Quantum Speedups in Regret Analysis of Infinite Horizon Average-Reward Markov Decision Processes

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
Abstract: This paper investigates the potential of quantum acceleration for solving infinite horizon average-reward Markov Decision Processes (MDPs). We introduce a quantum framework for the agent's interaction with an unknown MDP, extending the conventional interaction paradigm, and design an optimism-driven tabular Reinforcement Learning algorithm that harnesses quantum signals acquired by the agent through efficient quantum mean estimation. Our theoretical analysis shows that the quantum advantage in mean estimation translates into an exponential improvement in the regret guarantees for infinite horizon Reinforcement Learning: the proposed quantum algorithm achieves a regret bound of $\tilde{\mathcal{O}}(1)$\footnote{$\tilde{\mathcal{O}}(\cdot)$ conceals logarithmic factors in $T$.}, a significant improvement over the $\tilde{\mathcal{O}}(\sqrt{T})$ bound of classical counterparts, where $T$ is the length of the time horizon.
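For intuition on where the speedup enters (a sketch in our own notation, not taken from the paper): to estimate a mean $\mu$ of a distribution with standard deviation $\sigma$, classical empirical averaging over $n$ samples and quantum mean estimation over $n$ oracle queries obey, respectively,

$$ |\hat{\mu}_n - \mu| \le \tilde{\mathcal{O}}\!\left(\frac{\sigma}{\sqrt{n}}\right) \qquad \text{vs.} \qquad |\hat{\mu}_n - \mu| \le \tilde{\mathcal{O}}\!\left(\frac{\sigma}{n}\right), $$

so reaching accuracy $\epsilon$ costs $\tilde{\mathcal{O}}(1/\epsilon^2)$ classical samples but only $\tilde{\mathcal{O}}(1/\epsilon)$ quantum queries, a quadratic saving.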
Lay Summary: Reinforcement learning systems that operate for unlimited periods (e.g., warehouse robots or power grid controllers) aim to maximize their long-run average reward. In this infinite horizon average-reward setting, classical reinforcement learning agents inevitably accumulate regret that grows like $\tilde{\mathcal{O}}(\sqrt{T})$. We design Quantum-UCRL, the first algorithm that lets an agent tap a "quantum transition oracle." The oracle encodes all possible next states in a single superposition; a quantum mean-estimation routine then extracts their statistics with quadratically fewer samples. A momentum-style update reuses information that would normally be destroyed by quantum measurement, and a new martingale-free analysis proves the method works. Together, these ideas drive the worst-case regret down to $\tilde{\mathcal{O}}(1)$, i.e., only logarithmically dependent on time, beating the classical $\tilde{\mathcal{O}}(\sqrt{T})$ lower bound by an exponential margin.
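To see how a quadratic estimation speedup can become an exponential regret gap, here is a minimal back-of-the-envelope sketch. It assumes UCRL-style doubling epochs and heuristic error rates; it is an illustration, not the paper's Quantum-UCRL algorithm or analysis.

```python
import numpy as np

# Hypothetical sketch (not the paper's Quantum-UCRL): in UCRL-style
# algorithms a policy is replayed until a state-action visit count
# doubles, so epoch lengths grow as n_k = 2, 4, 8, ...
T = 10**6
epoch_lengths = 2.0 ** np.arange(1, int(np.log2(T)) + 1)

# Classical concentration: estimation error after n samples ~ 1/sqrt(n),
# so an epoch of length n_k adds ~ n_k * (1/sqrt(n_k)) = sqrt(n_k) regret.
classical_regret = np.sum(np.sqrt(epoch_lengths))   # order sqrt(T)

# Quantum mean estimation: error after n oracle queries ~ 1/n,
# so each epoch adds ~ n_k * (1/n_k) = 1 regret, one unit per epoch.
quantum_regret = len(epoch_lengths)                 # order log2(T)

print(f"classical ~ {classical_regret:.0f} (same order as sqrt(T) = {T**0.5:.0f})")
print(f"quantum   ~ {quantum_regret} (log2(T) = {np.log2(T):.0f})")
```

Summed over all epochs, the classical errors accumulate to order $\sqrt{T}$ while the quantum errors contribute roughly one unit per doubling epoch, i.e., order $\log T$, matching the $\tilde{\mathcal{O}}(1)$ claim up to logarithmic factors.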
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Quantum Machine Learning, Reinforcement Learning
Submission Number: 12534