Abstract: We explore the application of expressions for the expected maxima of sets of random variables to compute the mean of the distribution of fixed points of the Bellman optimality equation for large state space Markov decision processes (MDPs) under a Bayesian framework. Current approaches to computing the statistics of the value function in reinforcement learning rely on bounds and estimates that do not exploit statistical properties of the Bellman equation that can arise in the large system limit. Specifically, we utilise a recently developed mean field theory called \emph{dynamic mean field programming} to compute the moments of the value function. Under certain conditions on the MDP structure and beliefs, this mean field theory is exact. Computing the solution to the mean field equations, however, relies on computing expected maxima, and previous approaches were limited to identically distributed rewards. We improve the existing estimates and generalise to non-identical settings. We analyse the resulting approximations to the mean field equations, establishing Lyapunov stability and contractive properties.
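For orientation only (a standard identity, not taken from the submission): for independent random variables $X_1,\dots,X_n$ with distribution functions $F_1,\dots,F_n$, the expected maximum, when it exists, can be written as
\[
\mathbb{E}\!\left[\max_{1\le i\le n} X_i\right]
= \int_{0}^{\infty}\Bigl(1-\prod_{i=1}^{n}F_i(x)\Bigr)\,dx
\;-\;\int_{-\infty}^{0}\prod_{i=1}^{n}F_i(x)\,dx .
\]
In the identically distributed case the product collapses to $F(x)^n$; the non-identical setting addressed in the abstract requires handling the full product $\prod_i F_i(x)$.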
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: The new revision includes updates to:
Section 1. Introduction
Section 2. Background (2.2 and 2.3 in particular)
Sections 4 and 5 (new simulations added).
Assigned Action Editor: ~Jean_Barbier2
Submission Number: 1216