Keywords: reinforcement learning, posterior sampling, average-reward MDP
TL;DR: We propose a reinforcement learning algorithm for infinite-horizon average-reward problems. Because it does not rely on state visitation counts, it scales gracefully to high-dimensional state spaces.
Abstract: Existing posterior sampling algorithms for continuing reinforcement learning (RL) rely on maintaining state-action visitation counts, making them unsuitable for complex environments with high-dimensional state spaces. We develop the first extension of posterior sampling for RL (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into scalable agent designs. Our approach, continuing PSRL (CPSRL), determines when to resample a new model of the environment from the posterior distribution based on a simple randomization scheme. We establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret in the tabular setting, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the {\it reward averaging time}, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze this random resampling approach. Our simulations demonstrate CPSRL's effectiveness in high-dimensional state spaces where traditional algorithms fail.
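The randomized resampling scheme described in the abstract can be illustrated, very roughly, with the Python sketch below. It is a minimal illustration under several assumptions that are not taken from the submission: a Dirichlet posterior over transition probabilities, an independent per-step coin flip with probability `resample_prob` as the resampling rule, relative value iteration as the planner, and a tabular `env` exposing `reset()` and `step(a)`. The names `cpsrl_sketch`, `plan_average_reward`, and `resample_prob` are hypothetical and chosen only for this example.

```python
import numpy as np

def plan_average_reward(P, R, iters=200):
    """Relative value iteration for an average-reward MDP.

    A standard planner used here purely as a stand-in for whichever
    planning routine the paper's algorithm assumes.
    P has shape (S, A, S); R has shape (S, A).
    """
    S, A = R.shape
    h = np.zeros(S)
    for _ in range(iters):
        Q = R + np.einsum("sap,p->sa", P, h)     # one-step lookahead values
        h_new = Q.max(axis=1)
        h = h_new - h_new[0]                     # subtract a reference state to keep h bounded
    return Q.argmax(axis=1)                      # greedy policy w.r.t. the sampled model

def cpsrl_sketch(env, S, A, T, resample_prob=0.01, rng=None):
    """A hypothetical CPSRL-style loop in a tabular setting.

    The per-step Bernoulli resampling rule and the Dirichlet posterior are
    illustrative assumptions, not the paper's exact construction.
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # Dirichlet posterior over next-state distributions for each (s, a).
    dirichlet_counts = np.ones((S, A, S))
    # Running statistics for estimating mean rewards.
    reward_sum = np.zeros((S, A))
    visit_count = np.zeros((S, A))

    policy = rng.integers(0, A, size=S)          # start with an arbitrary policy
    s = env.reset()
    for _ in range(T):
        # Randomized resampling: with small probability, draw a fresh model
        # from the posterior and replan against it; otherwise keep acting
        # under the current policy.
        if rng.random() < resample_prob:
            P_hat = np.array([[rng.dirichlet(dirichlet_counts[s_, a_])
                               for a_ in range(A)] for s_ in range(S)])
            R_hat = reward_sum / np.maximum(visit_count, 1)
            policy = plan_average_reward(P_hat, R_hat)

        a = policy[s]
        s_next, r = env.step(a)

        # Posterior and statistics updates from the observed transition.
        dirichlet_counts[s, a, s_next] += 1
        reward_sum[s, a] += r
        visit_count[s, a] += 1
        s = s_next
```

In this sketch the resampling probability plays the role that an episode or switching schedule plays in episodic PSRL; how it should relate to the reward averaging time $\tau$ is determined by the paper's analysis, not by this example.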
Submission Number: 277