A Variational Formulation of Reinforcement Learning in Infinite-Horizon Markov Decision Processes

Published: 17 Jun 2024 · Last Modified: 17 Jun 2024 · FoRLaC Poster · CC BY 4.0
Abstract: Reinforcement learning in infinite-horizon Markov decision processes (MDPs) is typically framed as maximization of the expected discounted return. In this paper, we formulate an alternative principle for optimal sequential decision-making in infinite-horizon MDPs: variational Bayesian inference in transdimensional probabilistic models. In particular, we specify a probabilistic model of random size and consider the variational problem of approximating the posterior distribution over state--action trajectories, conditioned on trajectories that reflect a desired behavior. We derive a tractable variational objective for infinite-horizon settings, prove a variational dynamic-discount policy iteration theorem, show that fixed-discount-factor KL-regularized reinforcement learning objectives are special cases of dynamic-discount variational objectives, and prove that learning dynamic discount factors is optimal.
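As a minimal sketch (not drawn from the paper itself), the fixed-discount KL-regularized objective that the abstract identifies as a special case is conventionally written as below; the notation here (policy $\pi$, prior policy $\pi_0$, reward $r$, temperature $\alpha$, constant discount $\gamma$) is an assumption, and the paper's dynamic-discount formulation would presumably replace the fixed $\gamma$ with a learned, state--action-dependent discount:
\[
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, a_t) \;-\; \alpha\, \mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\pi_0(\cdot \mid s_t)\big)\Big)\right].
\]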
Format: Short format (up to 4 pages + refs, appendix)
Publication Status: No
Submission Number: 75