Keywords: Reinforcement learning theory, regret analysis, Langevin, log Sobolev inequality, PSRL, MDP
TL;DR: We show that isoperimetry of the posterior distribution suffices to design exact and approximate PSRL algorithms with sublinear regret.
Abstract: In Reinforcement Learning theory, we often rely on restrictive assumptions, such as linearity or an RKHS structure on the model, or Gaussianity and log-concavity of the posteriors over models, to design algorithms with provably sublinear regret. But RL in practice is known to work for a wider range of distributions and models. Thus, we study whether we can design efficient low-regret RL algorithms for any isoperimetric distribution, which includes and extends the standard setups in the literature to non-log-concave and perturbed distributions. Specifically, we show that the well-known PSRL (Posterior Sampling-based RL) algorithm yields sublinear regret if the sequence of posterior distributions satisfies the Log-Sobolev Inequality (LSI), which is a form of isoperimetry, with linearly growing constants. Further, for cases where we cannot compute or sample from an exact posterior, we propose a Langevin sampling-based algorithm design scheme, namely LaPSRL. We show that LaPSRL also achieves order-optimal regret if the posteriors satisfy LSI. Finally, we deploy a version of LaPSRL with a Langevin sampling algorithm, SARAH-LD, and numerically demonstrate its performance in different bandit and MDP environments. Experimental results validate the generality of LaPSRL across environments and its competitive performance with respect to the baselines.
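To make the abstract's algorithmic idea concrete, below is a minimal, hedged sketch of a PSRL-style loop in which exact posterior sampling is replaced by unadjusted Langevin dynamics. This is not the paper's LaPSRL or SARAH-LD; all names and parameters (langevin_sample, step sizes, the Gaussian bandit model, and its prior) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: PSRL-style loop with approximate posterior sampling
# via unadjusted Langevin dynamics on a K-armed Gaussian bandit.
# Assumptions: N(0,1) prior on arm means, Gaussian rewards with known noise_std.
import numpy as np

def langevin_sample(grad_log_post, theta0, step=1e-2, n_steps=200, rng=None):
    """Approximate posterior sample via Langevin dynamics:
    theta <- theta + step * grad_log_post(theta) + sqrt(2*step) * noise."""
    rng = rng or np.random.default_rng()
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        theta = (theta + step * grad_log_post(theta)
                 + np.sqrt(2.0 * step) * rng.standard_normal(theta.shape))
    return theta

def run_psrl_bandit(true_means, horizon=500, noise_std=1.0, seed=0):
    """PSRL-style loop: sample a model, act greedily on it, update statistics."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)   # number of pulls per arm
    sums = np.zeros(k)     # cumulative reward per arm
    theta = np.zeros(k)    # current Langevin iterate (warm-started each round)
    total_reward = 0.0
    for _ in range(horizon):
        # Gradient of the log-posterior under the assumed Gaussian model/prior.
        def grad_log_post(th):
            return -th + (sums - counts * th) / noise_std**2
        # Approximate posterior sampling step (stands in for exact PSRL sampling).
        theta = langevin_sample(grad_log_post, theta, rng=rng)
        # Act optimally with respect to the sampled model.
        arm = int(np.argmax(theta))
        reward = true_means[arm] + noise_std * rng.standard_normal()
        # Posterior update reduces to updating sufficient statistics here.
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward

if __name__ == "__main__":
    print(run_psrl_bandit(true_means=[0.1, 0.5, 0.9]))
```

In the paper's setting, the Langevin sampler would be replaced by a variance-reduced variant such as SARAH-LD, and the bandit model by a general MDP; the sketch only shows where approximate sampling plugs into the PSRL loop.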
Submission Number: 34