MDP Planning as Policy Inference

12 Apr 2026 (modified: 05 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: We formulate episodic Markov decision process (MDP) planning as Bayesian inference over policies. The primary contribution is conceptual: the policy itself is treated as the latent variable, and expected return defines an unnormalized posterior density over policies. This preserves the standard expected-return objective, in contrast to trajectory-centric planning-as-inference formulations, which introduce auxiliary optimality variables, and to entropy-regularized policy optimization methods, which solve a different objective. In the exact formulation, the posterior over deterministic policies induces a stochastic policy, which we define as the optimal stochastic policy under preference uncertainty. For discrete MDPs with stochastic transitions, we study variational sequential Monte Carlo (VSMC) as one approximate inference method for this posterior, introducing policy consistency under state revisitation and coupled transition randomness across particles. Experiments on grid worlds, Blackjack, Triangle Tireworld, and Academic Advising examine the consequences of inference over policies and compare the induced behavior with that of entropy-regularized policy optimization. The results support the view that MDP planning can be naturally cast as Bayesian inference over policies.
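To make the abstract's central claim concrete, here is one standard way an expected-return objective can define an unnormalized posterior over policies, offered as a minimal sketch under assumed notation rather than as the paper's own formulation: the expected episodic return $J(\pi)$, the policy prior $p(\pi)$ over deterministic policies, and the inverse-temperature parameter $\eta > 0$ are all illustrative symbols introduced here.

```latex
% Minimal sketch (assumed notation, not necessarily the paper's):
% exponentiating expected return against a policy prior yields an
% unnormalized posterior density over policies for MDP \mathcal{M}.
p(\pi \mid \mathcal{M}) \;\propto\; p(\pi)\,\exp\!\bigl(\eta\, J(\pi)\bigr),
\qquad
J(\pi) \;=\; \mathbb{E}_{\pi,\mathcal{M}}\!\left[\sum_{t=0}^{T-1} r_t\right].
```

Under this reading, higher-return deterministic policies receive more posterior mass, and one natural construction of the induced stochastic policy at a state would marginalize this posterior over the action each deterministic policy selects there; whether the paper uses exactly this construction is not stated in the abstract.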
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Omar_Rivasplata1
Submission Number: 8371