Quantum Algorithm for Online Learning of MDPs with Continuous State Space

Andris Ambainis; Joao F. Doriguello; Debbie Lim

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0.

Keywords: Quantum algorithm, Markov decision processes, Online algorithms, Quantum reinforcement learning

TL;DR: This paper proposes a quantum online algorithm for learning Markov decision processes with continuous state space, achieving an $O(\sqrt{T})$ regret, improving upon the best known classical result of $O(T^{2/3})$.

Abstract: We propose a novel quantum online algorithm for learning Markov Decision Processes (MDPs) with continuous state space in the average reward model. Our algorithm is based on the line of work on classical online UCCRL algorithms by Ortner and Ryabko (NeurIPS'12). To the best of our knowledge, our work is the first to consider MDPs with continuous state space in the fault-tolerant quantum setting. In the case where the state space is one-dimensional, we show that, via quantum-accessible environments, our quantum algorithm obtains an $\tilde O(T^{1/2})$ regret, improving upon the $\tilde O(T^{2/3})$ bound of Lakshmanan, Ortner, and Ryabko (PMLR'15), where $T$ is the number of iterations of the algorithm. For a general $d$-dimensional state space, the regret is bounded by $\tilde O(T^{1-1/(2d)})$. Our quantum algorithm uses quantum extended value iteration as a subroutine; this subroutine is our second main contribution and may be of independent interest. We show that quantum extended value iteration achieves a subquadratic speedup in the size of the discretized state space $\mathcal{S}$ and a quadratic speedup in the size of the action space $\mathcal{A}$, as compared to its classical counterpart. As our third contribution, we study the limiting behaviour of the sequence of value functions generated by quantum extended value iteration. We show that the sequence converges to the optimal average reward $\rho^*$ up to $\epsilon$ additive error, for some small $\epsilon>0$.
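
As a quick sanity check on the stated bounds, the sketch below evaluates the regret exponent $1 - 1/(2d)$ for a few dimensions and confirms that the $d = 1$ case reduces to $1/2$, below the classical one-dimensional exponent of $2/3$. The function and constant names are hypothetical, chosen only for illustration; the code reads the exponent as $1 - 1/(2d)$, which is the reading consistent with the one-dimensional bound, and it ignores the logarithmic factors hidden by $\tilde O$.

```python
from fractions import Fraction


def quantum_regret_exponent(d: int) -> Fraction:
    """Exponent a in the stated bound ~O(T^a), read as a = 1 - 1/(2d).

    Hypothetical helper for illustration; d is the state-space dimension.
    """
    if d < 1:
        raise ValueError("state-space dimension d must be >= 1")
    return 1 - Fraction(1, 2 * d)


# Classical exponent quoted in the abstract for the one-dimensional case only.
CLASSICAL_1D_EXPONENT = Fraction(2, 3)

for d in (1, 2, 3):
    print(f"d={d}: regret exponent {quantum_regret_exponent(d)}")
```

The comparison with $2/3$ is meaningful only for $d = 1$, since the abstract quotes the classical bound for the one-dimensional case alone.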

Primary Area: reinforcement learning

Submission Number: 10711
