Keywords: Markov decision processes, Regret, Average reward, Minimax, Optimism, Model-based
TL;DR: We provide the first tractable algorithm that achieves minimax optimal regret in average reward MDPs.
Abstract: In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs).
However, existing algorithms suffer from either sub-optimal regret guarantees or computational inefficiency.
In this paper, we present the first *tractable* algorithm with minimax optimal regret of $\mathrm{O}\left(\sqrt{\mathrm{sp}(h^*) S A T \log(SAT)}\right)$ where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S\times A$ is the size of the state-action space and $T$ the number of learning steps.
Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$.
Our algorithm relies on a novel subroutine, **P**rojected **M**itigated **E**xtended **V**alue **I**teration (`PMEVI`), to compute bias-constrained optimal policies efficiently.
This subroutine can be applied to various previous algorithms to obtain improved regret bounds.
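To make the idea concrete, below is a minimal Python sketch of extended value iteration with a span-projection step, in the spirit of (but not identical to) `PMEVI`. The function names (`optimistic_backup`, `span_projection`, `projected_evi`), the L1-ball confidence sets, and the simple clipping projection are illustrative assumptions; the actual mitigation step and bias constraints in the paper are more refined. Throughout, $\mathrm{sp}(h) := \max_s h(s) - \min_s h(s)$.

```python
import numpy as np

def optimistic_backup(h, p_hat, conf_radius):
    """Inner maximisation of extended value iteration: pick the transition
    vector within an L1 ball of radius conf_radius around p_hat that
    maximises the expected value of h. Standard trick: shift probability
    mass onto the best state, taking it from the worst states first."""
    order = np.argsort(h)                 # states sorted by value, ascending
    p = p_hat.copy()
    best = order[-1]
    p[best] = min(1.0, p_hat[best] + conf_radius / 2.0)
    excess = p.sum() - 1.0
    for s in order:                       # drain excess mass from low-value states
        if excess <= 0:
            break
        removed = min(p[s], excess) if s != best else 0.0
        p[s] -= removed
        excess -= removed
    return p @ h

def span_projection(h, span_bound):
    """Project h onto {h : sp(h) <= span_bound} by clipping from above,
    where sp(h) = max_s h(s) - min_s h(s)."""
    return np.minimum(h, h.min() + span_bound)

def projected_evi(r_hat, p_hat, conf_radius, span_bound, n_iter=200):
    """Simplified projected extended value iteration (illustrative only).

    r_hat:       (S, A) estimated mean rewards
    p_hat:       (S, A, S) estimated transition kernel
    conf_radius: (S, A) L1 confidence radii
    span_bound:  upper bound on sp(h*) (estimated online in PMEVI)
    Returns a policy greedy w.r.t. the optimistic, span-projected values.
    """
    S, A = r_hat.shape
    h = np.zeros(S)
    for _ in range(n_iter):
        q = np.array([[r_hat[s, a]
                       + optimistic_backup(h, p_hat[s, a], conf_radius[s, a])
                       for a in range(A)] for s in range(S)])
        h_new = q.max(axis=1)
        h_new -= h_new.min()              # renormalise (average-reward setting)
        h = span_projection(h_new, span_bound)
    return q.argmax(axis=1)
```

The key departure from plain extended value iteration is the `span_projection` call: by constraining each iterate's span, the bias estimate cannot blow up, which is what removes the need for prior knowledge of $\mathrm{sp}(h^*)$ when the bound is learned online.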
Supplementary Material: zip
Primary Area: Reinforcement learning
Submission Number: 15501