Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning

Souradip Chakraborty; Amrit Bedi; Alec Koppel; Pratap Tokekar; Furong Huang; Dinesh Manocha

Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning

Souradip Chakraborty, Amrit Bedi, Alec Koppel, Pratap Tokekar, Furong Huang, Dinesh Manocha

Published: 29 Nov 2022, Last Modified: 06 Jul 2025SBM 2022 PosterReaders: Everyone

Keywords: Model based Reinforcement Learning, Posterior Sampling Reinforcement Learning, Kernelized Stein Discrepancy, Bayesian RL, Score-based RL, Posterior Compression

TL;DR: An efficient Bayes regret for Posterior sampling reinforcement learning with Kernelized Stein Discrepancy

Abstract: Model-based reinforcement learning (MBRL) exhibits favorable performance in practice, but its theoretical guarantees are mostly restricted to the setting when the transition model is Gaussian or Lipschitz and demands a posterior estimate whose representational complexity grows unbounded with time. In this work, we develop a novel MBRL method (i) which relaxes the assumptions on the target transition model to belong to a generic family of mixture models; (ii) is applicable to large-scale training by incorporating a compression step such that the posterior estimate consists of a \emph{Bayesian coreset} of only statistically significant past state-action pairs; and (iii) {exhibits a Bayesian regret of $\mathcal{O}(dH^{1+({\alpha}/{2})}T^{1-({\alpha}/{2})})$ with coreset size of $\Omega(\sqrt{T^{1+\alpha}})$, where $d$ is the aggregate dimension of state action space, $H$ is the episode length, $T$ is the total number of time steps experienced, and $\alpha\in (0,1]$ is the tuning parameter which is a novel introduction into the analysis of MBRL in this work}. To achieve these results, we adopt an approach based upon Stein's method, which allows distributional distance to be evaluated in closed form as the kernelized Stein discrepancy (KSD). Experimentally, we observe that this approach is competitive with several state-of-the-art RL methodologies, and can achieve up to $50\%$ reduction in wall clock time in some continuous control environments.

Student Paper: Yes

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 4 code implementations](https://www.catalyzex.com/paper/posterior-coreset-construction-with/code)

1 Reply

Loading