Keywords: Satisficing Bandits, Multi-Armed Bandits, Online Learning
TL;DR: We study a multi-armed bandit problem in which the learner aims for satisficing rather than optimal performance, and we propose new upper and lower bounds for this problem.
Abstract: In this work, we consider a variation of the stochastic multi-armed bandit problem in which the learner is not necessarily trying to compete with the best arm, whose performance is not known ahead of time, but is satisfied with playing any arm that performs above a known satisficing threshold $S$. Michel et al. (2023) proposed the \textit{satisficing regret} as the corresponding performance measure, which scales in terms of the gaps between the expected performance of insufficient arms and the threshold $S$, rather than in terms of their gaps to the best arm. While Michel et al. propose an algorithm that achieves time-independent satisficing regret, their bound degrades when arms lie too close to the threshold. Is this dependency unavoidable?
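Concretely, one natural formalization consistent with this description (the paper's exact definition may differ) charges each pull of an arm with mean $\mu_{A_t}$ below the threshold by its gap to $S$, rather than by its gap to the best mean $\mu^\ast$:
$$\mathrm{Reg}_S(T) \;=\; \mathbb{E}\left[\sum_{t=1}^{T} \big(S - \mu_{A_t}\big)_{+}\right], \qquad (x)_+ := \max(x,0),$$
so pulls of arms whose means already exceed $S$ contribute nothing.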
The first contribution of our work is an alternative and more general lower bound for the $K$-armed satisficing bandit problem, which highlights how the position of the threshold relative to the arms affects the bound.
Then, we introduce an algorithm robust against unbalanced gaps, which enjoys a nearly matching time-independent upper bound. We also propose an alternative definition of the satisficing regret, which may be better suited to measuring algorithm performance on these difficult instances, and we derive a lower bound for this regret.
Finally, we include experiments to compare these different regret measures and our proposed algorithms empirically.
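To make the objective concrete, here is a minimal, illustrative sketch — not the algorithm proposed in the paper — of a UCB-style learner that commits to any arm whose lower confidence bound clears the known threshold $S$, together with the satisficing regret it accumulates. The arm means, noise level, and horizon are assumed example values.

```python
# Illustrative sketch (not the paper's proposed algorithm): a UCB-style
# rule that commits to an arm once its lower confidence bound certifies
# it as satisficing, and the satisficing regret it incurs.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.55, 0.7])  # true arm means (assumed example values)
S = 0.5                          # known satisficing threshold
T = 10_000                       # horizon (assumed example value)

counts = np.zeros(len(mu))
sums = np.zeros(len(mu))
sat_regret = 0.0

for t in range(1, T + 1):
    if t <= len(mu):             # pull each arm once to initialize
        a = t - 1
    else:
        means = sums / counts
        width = np.sqrt(2.0 * np.log(t) / counts)
        lcb, ucb = means - width, means + width
        ok = np.flatnonzero(lcb >= S)
        # commit to a certified-satisficing arm if one exists,
        # otherwise keep exploring optimistically
        a = ok[0] if ok.size else int(np.argmax(ucb))
    reward = rng.normal(mu[a], 0.1)    # Gaussian rewards (assumed)
    counts[a] += 1
    sums[a] += reward
    sat_regret += max(S - mu[a], 0.0)  # only sub-threshold pulls count

print(f"satisficing regret after {T} rounds: {sat_regret:.2f}")
```

Note that the commit rule makes the regret stop growing once any arm is certified above $S$, which is why time-independent bounds are plausible here, while a small gap $|\,\mu_k - S\,|$ makes certification slow.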
Supplementary Material: pdf
Submission Number: 124