TL;DR: We propose a near sample-optimal, model-free algorithm for weakly communicating average-reward MDPs that is implementable without prior knowledge.
Abstract: We study a new model-free algorithm to compute $\varepsilon$-optimal policies for average-reward Markov decision processes, in the weakly communicating setting. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\|h\|^{2}/\varepsilon^{2})$ both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor $\|h\|$. Although the complexity bound involves the span seminorm $\|h\|$ of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule which guarantees with probability 1 that the procedure terminates in finite time. We also analyze how these techniques can be adapted for discounted MDPs.
Lay Summary: We propose a new algorithm, SAVIA+, for solving average-reward Markov decision processes (MDPs) in the generative model setting. To the best of our knowledge, SAVIA+ is the most efficient model-free algorithm to date, nearly matching the lower bound on sample complexity. Unlike many prior methods, it does not require any prior knowledge of the optimal solution. Our approach combines two key ideas: recursive sampling and Halpern's anchored iteration. We also show how these techniques can be extended to discounted MDPs.
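Neither summary spells out the anchored update itself, so the following minimal sketch illustrates the general idea of a Halpern-anchored value iteration run on an empirical model estimated from a generative sampler. It is an illustrative approximation, not the paper's SAVIA+ procedure: the function name `anchored_value_iteration`, the interface of `sample_next_state`, the one-shot estimation of `P_hat`, the relative-value normalization, and all parameter choices are assumptions made for this example, and the paper's recursive sampling scheme and stopping rule are not reproduced here.

```python
import numpy as np

def anchored_value_iteration(r, sample_next_state, n_states, n_actions,
                             n_samples=100, n_iters=200):
    """Illustrative Halpern-anchored (relative) value iteration for an
    average-reward MDP accessed through a generative model.

    r                 : (n_states, n_actions) reward array.
    sample_next_state : callable (s, a, size) -> array of next-state indices,
                        standing in for the generative model (assumed interface).
    """
    # Build a one-shot empirical transition model P_hat from the sampler.
    # (The paper uses a more refined recursive sampling scheme; a single
    # batch per state-action pair is a simplification for illustration.)
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            samples = sample_next_state(s, a, n_samples)
            P_hat[s, a] = np.bincount(samples, minlength=n_states) / n_samples

    h0 = np.zeros(n_states)   # anchor point of the Halpern iteration
    h = h0.copy()
    for k in range(n_iters):
        # Empirical average-reward Bellman operator, followed by a
        # relative-value normalization to keep the iterates bounded.
        T_h = (r + P_hat @ h).max(axis=1)
        T_h -= T_h[0]
        # Halpern anchoring: pull the iterate back toward the anchor h0
        # with a weight that vanishes at rate 1/(k + 2).
        beta = 1.0 / (k + 2)
        h = beta * h0 + (1.0 - beta) * T_h

    # Greedy policy with respect to the final approximate bias vector.
    policy = (r + P_hat @ h).argmax(axis=1)
    return h, policy
```

In practice one would replace the fixed `n_samples` and `n_iters` with choices driven by the target accuracy $\varepsilon$ and a data-dependent stopping rule, which is exactly the role of the recursive sampling and termination guarantee described in the abstract.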
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Average-reward MDP, Generative model, Sample complexity analysis, Value iteration, Halpern iteration, Variance reduction sampling, Reinforcement learning theory
Submission Number: 7949