Reinforcement Learning with Thompson Sampling: No-Regret Performance over Finite Horizons

Published: 12 Jun 2025, Last Modified: 09 Jul 2025 · EXAIT@ICML 2025 Poster · CC BY 4.0
Track: Theory
Keywords: Thompson sampling, reinforcement learning, Gaussian processes, regret bounds
TL;DR: We derive no-regret guarantees for Thompson sampling in episodic reinforcement learning with Gaussian process modelling.
Abstract: How should agents explore efficiently over extended horizons in sequential decision-making problems? While Thompson sampling (TS) is a widely used exploration strategy, its theoretical performance in reinforcement learning (RL) in the presence of complex temporal structure remains poorly understood. We take a step towards filling this gap by studying TS in episodic RL using models with Gaussian marginal distributions, namely joint Gaussian process (GP) priors over rewards and transitions. To characterize the impact of temporal structure on performance, we derive a regret bound of $\tilde{\mathcal{O}}(\sqrt{KH\Gamma(KH)})$ over $K$ episodes of horizon $H$, where $\Gamma(\cdot)$ captures the complexity of the GP model. Our analysis accounts for the way uncertainty compounds through recursive updates and offers insights into how uncertainty and temporal structure influence exploration.
Serve As Reviewer: ~Jasmine_Bayrooti1, ~Sattar_Vakili1
Submission Number: 69
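To make the setup described in the abstract concrete, here is a minimal sketch of the episodic posterior-sampling loop that GP-based Thompson sampling follows: each episode, sample a model from the GP posterior, plan against the sample with finite-horizon value iteration, execute the resulting policy, and refit the posterior on the observed data. This is an illustration under simplifying assumptions, not the authors' implementation: rewards are modelled with a GP (via scikit-learn's GaussianProcessRegressor) while transitions are assumed known, whereas the paper places a joint GP prior over rewards and transitions. The environment size, kernel, and noise level are hypothetical.

```python
"""Sketch of GP-based Thompson sampling (posterior-sampling RL) in a small
episodic MDP. Assumptions: tabular states/actions, known transitions, GP model
of rewards only; all constants below are illustrative."""
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
S, A, H, K = 5, 2, 10, 50                     # states, actions, horizon H, episodes K
P = rng.dirichlet(np.ones(S), size=(S, A))    # known transition kernel (simplifying assumption)
true_r = rng.uniform(0, 1, size=(S, A))       # unknown reward function to be learned
noise = 0.1

# The GP is defined over simple (state, action) feature vectors.
X_all = np.array([[s, a] for s in range(S) for a in range(A)], dtype=float)

X_data, y_data = [], []
for k in range(K):
    # 1) Sample a reward function from the GP posterior (prior if no data yet).
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=noise**2)
    if X_data:
        gp.fit(np.array(X_data), np.array(y_data))
    r_sample = gp.sample_y(X_all, random_state=k).reshape(S, A)

    # 2) Finite-horizon value iteration on the sampled model gives the greedy policy.
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r_sample + P @ V                  # Q[s, a] = r_sample(s, a) + E[V(s')]
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)

    # 3) Execute the policy for one episode, recording noisy reward observations.
    s = 0
    for h in range(H):
        a = pi[h, s]
        r_obs = true_r[s, a] + noise * rng.standard_normal()
        X_data.append([s, a])
        y_data.append(r_obs)
        s = rng.choice(S, p=P[s, a])
    # 4) The posterior update happens when the GP is refit at the start of the next episode.
```

In the paper's setting, the known kernel P above would instead be replaced by a transition model sampled from a GP posterior alongside the rewards, which is exactly where uncertainty compounds through the recursive value-iteration updates that the regret analysis has to control.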