Keywords: bandits, online learning
TL;DR: We study a new bandit model in which the agent can modify the distribution of the arms a fixed number of times.
Abstract: Motivated by the idea that large algorithmic infrastructures, such as neural networks, must carefully plan retraining procedures due to budget constraints, this work proposes a new multi-armed bandit framework in which the agent can modify the distribution of the arms up to $M$ times during the interaction. Each modification, referred to as a _retraining step_, improves the arm distributions so as to either increase the reward obtained or make the optimal arm easier to identify. Specifically, we analyze two settings: in the first (Improvable Arms), we assume that each retraining step increases the mean of every arm and reduces its variance. In the second (Decreasing Biases), we assume that the reward observations are obfuscated by a bias, which each retraining step helps to eliminate. For both models, we present successive-elimination-based algorithms and analyze their regret. We also prove regret lower bounds, showing that our algorithms achieve optimal regret rates with respect to the time horizon $T$.
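For intuition, a minimal sketch (not the authors' algorithm) of successive elimination with a limited retraining budget in the spirit of the Improvable Arms setting is shown below; the arm means, the checkpoints at which retraining is spent, and the `improve_gain` / `var_decay` parameters are hypothetical illustration choices.

```python
import numpy as np

# Sketch: successive elimination with a retraining budget M.
# Each (hypothetical) retraining step raises all arm means and shrinks the noise.
rng = np.random.default_rng(0)

K, T, M = 5, 10_000, 3                # arms, horizon, retraining budget
means = rng.uniform(0.2, 0.8, K)      # hypothetical initial arm means
sigma = 1.0                           # initial noise scale
improve_gain, var_decay = 0.05, 0.5   # hypothetical effect of one retraining step

active = list(range(K))
counts = np.zeros(K)
sums = np.zeros(K)
retrains_left = M

for t in range(1, T + 1):
    arm = active[t % len(active)]     # play active arms in round-robin
    reward = means[arm] + sigma * rng.normal()
    counts[arm] += 1
    sums[arm] += reward

    # Eliminate arms whose upper confidence bound falls below the best
    # lower confidence bound among the active arms.
    if (counts[active] > 0).all():
        mu_hat = sums[active] / counts[active]
        rad = np.sqrt(2 * np.log(T) / counts[active])
        best_lcb = np.max(mu_hat - rad)
        active = [a for a, m, r in zip(active, mu_hat, rad) if m + r >= best_lcb]

    # Spend one retraining step at arbitrary (hypothetical) checkpoints.
    if retrains_left > 0 and t in {T // 8, T // 4, T // 2}:
        means += improve_gain
        sigma *= var_decay
        retrains_left -= 1

print("surviving arms:", active)
```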
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Matilde_Tullii1
Track: Regular Track: unpublished work
Submission Number: 36