Keywords: Weakly-Coupled MDPs; Sample Complexity; Average Reward; Reinforcement Learning
TL;DR: We study learning in average-reward weakly coupled Markov decision processes (WCMDPs) with heterogeneous arms.
Abstract: We study learning in average-reward weakly coupled Markov decision processes (WCMDPs) with heterogeneous arms. Naive approaches incur computational and sample complexity that grow exponentially in the number of subsystems. We analyze a plug-in approach built on an efficient planning algorithm and show that it attains the first finite-sample (PAC) optimality-gap guarantees with polynomial sample complexity. This result rests on a new framework that combines a Lyapunov analysis of a reference policy with a Lyapunov drift transfer technique, which can be viewed as a generalization of the classical simulation lemma.
Submission Number: 189