Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Tiancheng Jin; Tal Lancewicki; Haipeng Luo; Yishay Mansour; Aviv Rosenberg

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg

Published: 31 Oct 2022, Last Modified: 15 Jan 2023NeurIPS 2022 AcceptReaders: Everyone

Keywords: regret minimization, adversarial mdp, delay, reinforcement learning, online learning

TL;DR: This paper is the first to achieve near-optimal regret in adversarial MDPs with delayed bandit feedback.

Abstract: The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.

Supplementary Material: pdf

11 Replies

Loading