IMED-RL: Regret optimal learning of ergodic Markov decision processes

Fabien Pesquerel; Odalric-Ambrym Maillard

IMED-RL: Regret optimal learning of ergodic Markov decision processes

Fabien Pesquerel, Odalric-Ambrym Maillard

Published: 31 Oct 2022, Last Modified: 13 Jan 2023NeurIPS 2022 AcceptReaders: Everyone

Keywords: Reinforcement Learning, sequential learning, average-reward, lower bound, regret minimization, upper bound, optimal, Bandit, ergodic

TL;DR: We propose IMED-RL, a learning policy that is asymptotically optimal with respect to regret minimization problem under the average-reward criterion in ergodic MDPs with unknown reward and transition.

Abstract: We consider reinforcement learning in a discrete, undiscounted, infinite-horizon Markov decision problem (MDP) under the average reward criterion, and focus on the minimization of the regret with respect to an optimal policy, when the learner does not know the rewards nor transitions of the MDP. In light of their success at regret minimization in multi-armed bandits, popular bandit strategies, such as the optimistic \texttt{UCB}, \texttt{KL-UCB} or the Bayesian Thompson sampling strategy, have been extended to the MDP setup. Despite some key successes, existing strategies for solving this problem either fail to be provably asymptotically optimal, or suffer from prohibitive burn-in phase and computational complexity when implemented in practice. In this work, we shed a novel light on regret minimization strategies, by extending to reinforcement learning the computationally appealing Indexed Minimum Empirical Divergence (\texttt{IMED}) bandit algorithm. Traditional asymptotic problem-dependent lower bounds on the regret are known under the assumption that the MDP is \emph{ergodic}. Under this assumption, we introduce \texttt{IMED-RL} and prove that its regret upper bound asymptotically matches the regret lower bound. We discuss both the case when the supports of transitions are unknown, and the more informative but a priori harder-to-exploit-optimally case when they are known. Rewards are assumed light-tailed, semi-bounded from above. Last, we provide numerical illustrations on classical tabular MDPs, \textit{ergodic} and \textit{communicative} only, showing the competitiveness of \texttt{IMED-RL} in finite-time against state-of-the-art algorithms. \texttt{IMED-RL} also benefits from a lighter complexity.

Supplementary Material: zip

13 Replies

Loading