Keywords: reinforcement learning, posterior sampling, average-reward MDP
TL;DR: We propose a reinforcement learning algorithm for infinite-horizon average-reward problems. Because it does not rely on state visitation counts, it scales gracefully to high-dimensional state spaces.
Abstract: Existing posterior sampling algorithms for continuing reinforcement learning (RL) rely on maintaining state-action visitation counts, making them unsuitable for complex environments with high-dimensional state spaces. We develop the first extension of posterior sampling for RL (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into scalable agent designs. Our approach, continuing PSRL (CPSRL), determines when to resample a new model of the environment from the posterior distribution based on a simple randomization scheme. We establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret in the tabular setting, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the {\it reward averaging time}, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze this random resampling approach. Our simulations demonstrate CPSRL's effectiveness in high-dimensional state spaces where traditional algorithms fail.
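The randomized resampling scheme described in the abstract can be illustrated, very roughly, with the Python sketch below. It is a minimal illustration under several assumptions that are not taken from the submission: a Dirichlet posterior over transition probabilities, an independent per-step coin flip with probability `resample_prob` as the resampling rule, relative value iteration as the planner, and a tabular `env` exposing `reset()` and `step(a)`. The names `cpsrl_sketch`, `plan_average_reward`, and `resample_prob` are hypothetical and chosen only for this example.

```python
import numpy as np

def plan_average_reward(P, R, iters=200):
    """Relative value iteration for an average-reward MDP.

    A standard planner used here purely as a stand-in for whichever
    planning routine the paper's algorithm assumes.
    P has shape (S, A, S); R has shape (S, A).
    """
    S, A = R.shape
    h = np.zeros(S)
    for _ in range(iters):
        Q = R + np.einsum("sap,p->sa", P, h)     # one-step lookahead values
        h_new = Q.max(axis=1)
        h = h_new - h_new[0]                     # subtract a reference state to keep h bounded
    return Q.argmax(axis=1)                      # greedy policy w.r.t. the sampled model

def cpsrl_sketch(env, S, A, T, resample_prob=0.01, rng=None):
    """A hypothetical CPSRL-style loop in a tabular setting.

    The per-step Bernoulli resampling rule and the Dirichlet posterior are
    illustrative assumptions, not the paper's exact construction.
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # Dirichlet posterior over next-state distributions for each (s, a).
    dirichlet_counts = np.ones((S, A, S))
    # Running statistics for estimating mean rewards.
    reward_sum = np.zeros((S, A))
    visit_count = np.zeros((S, A))

    policy = rng.integers(0, A, size=S)          # start with an arbitrary policy
    s = env.reset()
    for _ in range(T):
        # Randomized resampling: with small probability, draw a fresh model
        # from the posterior and replan against it; otherwise keep acting
        # under the current policy.
        if rng.random() < resample_prob:
            P_hat = np.array([[rng.dirichlet(dirichlet_counts[s_, a_])
                               for a_ in range(A)] for s_ in range(S)])
            R_hat = reward_sum / np.maximum(visit_count, 1)
            policy = plan_average_reward(P_hat, R_hat)

        a = policy[s]
        s_next, r = env.step(a)

        # Posterior and statistics updates from the observed transition.
        dirichlet_counts[s, a, s_next] += 1
        reward_sum[s, a] += r
        visit_count[s, a] += 1
        s = s_next
```

In this sketch the resampling probability plays the role that an episode or switching schedule plays in episodic PSRL; how it should relate to the reward averaging time $\tau$ is determined by the paper's analysis, not by this example.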
Submission Number: 277