Near-Optimal Randomized Exploration for Tabular Markov Decision Processes

Zhihan Xiong; Ruoqi Shen; Qiwen Cui; Maryam Fazel; Simon Shaolei Du

Near-Optimal Randomized Exploration for Tabular Markov Decision Processes

Zhihan Xiong, Ruoqi Shen, Qiwen Cui, Maryam Fazel, Simon Shaolei Du

Published: 31 Oct 2022, Last Modified: 11 Jan 2023NeurIPS 2022 AcceptReaders: Everyone

Keywords: randomized exploration, reinforcement learning theory, tabular MDP

TL;DR: We propose the first randomized algorithm that nearly achieves minimax optimal regret in tabular MDPs.

Abstract: We study algorithms using randomized value functions for exploration in reinforcement learning. This type of algorithms enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}\left(H\sqrt{SAT}\right)$ regret bound for episodic time-inhomogeneous Markov Decision Process where $S$ is the size of state space, $A$ is the size of action space, $H$ is the planning horizon and $T$ is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions, and for the first time, matches the $\Omega\left(H\sqrt{SAT}\right)$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursive formula for the absolute value of estimation errors to analyze the regret.

Supplementary Material: pdf

18 Replies

Loading