Keywords: Regret, Adversarial Delays, Online Learning, Contextual Bandits with Delays
TL;DR: We present and analyze regret minimization algorithms for contextual bandits where the reward observations arrive with delay.
Abstract: In this paper we present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem in the presence of delayed feedback, a scenario where reward observations arrive with delays chosen by an adversary. We study two fundamental frameworks, distinguished by the function classes used to derive regret bounds for CMAB. First, for a finite policy class $ \Pi $, we establish an optimal regret bound of $ O \left( \sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|} \right) $, where $ K $ is the number of arms, $ T $ is the number of rounds, and $ D $ is the sum of delays. Second, assuming a finite contextual reward function class $ \mathcal{F} $ and access to an online least-squares regression oracle $\mathcal{O}$ over $\mathcal{F}$, we achieve a regret bound of $\widetilde{O}(\sqrt{KT\cdot (\mathcal{R}_T(\mathcal{O})+\log (\delta^{-1}))} + \eta D + d_m)$ that holds with probability at least $1-\delta$, where $d_m$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret, and $\eta$ is a stability parameter associated with the oracle.
Submission Number: 15