Keywords: contextual bandits, bandits, sequential learning, regret bounds
Abstract: We consider the adversarial linear contextual bandit setting, which
allows the loss functions associated with each of the $K$ arms to change
over time without restriction. Assuming the $d$-dimensional contexts are
drawn from a fixed known distribution, the worst-case expected regret
over the course of $T$ rounds is known to scale as $\tilde O(\sqrt{Kd
T})$. Under the additional assumption that the density of the contexts
is log-concave, we obtain a second-order bound of order $\tilde
O(K\sqrt{d V_T})$ in terms of the cumulative second moment of the
learner's losses $V_T$, and a closely related first-order bound of order
$\tilde O(K\sqrt{d L_T^*})$ in terms of the cumulative loss of the best
policy $L_T^*$. Since $V_T$ or $L_T^*$ may be significantly smaller than
$T$, these bounds improve on the worst-case regret whenever the environment
is relatively benign. Our results are obtained using a truncated version
of the continuous exponential weights algorithm over the probability
simplex, which we analyse by exploiting a novel connection to the linear
bandit setting without contexts.
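To fix intuition for the exponential-weights family the abstract builds on, the following is a minimal sketch of the classical discrete EXP3-style exponential weights algorithm for finite-armed adversarial bandits. It is not the paper's method: the submission uses a truncated *continuous* exponential weights distribution over the probability simplex for linear contextual bandits, whose truncation and loss estimators are not specified in the abstract. The callback `get_loss`, the learning rate `eta`, and the loss range $[0,1]$ are illustrative assumptions.

```python
import numpy as np

def exp3(K, T, get_loss, eta=0.1, rng=None):
    """EXP3-style exponential weights for a K-armed adversarial bandit.

    Discrete, context-free analogue of the exponential-weights family;
    the paper's truncated continuous variant over the simplex is not
    reproduced here.
    """
    rng = np.random.default_rng() if rng is None else rng
    cum_est_loss = np.zeros(K)  # cumulative importance-weighted loss estimates
    total_loss = 0.0
    for t in range(T):
        # Exponential-weights distribution over arms (shifted for stability).
        w = np.exp(-eta * (cum_est_loss - cum_est_loss.min()))
        p = w / w.sum()
        arm = rng.choice(K, p=p)
        loss = get_loss(t, arm)  # adversarial loss, assumed to lie in [0, 1]
        total_loss += loss
        # Unbiased importance-weighted estimate of the chosen arm's loss.
        cum_est_loss[arm] += loss / p[arm]
    return total_loss
```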
Supplementary Material: pdf
Submission Number: 11499