From Restless to Contextual: A Thresholding Bandit Reformulation for Finite-horizon Improvement
Abstract: This paper addresses the poor finite-horizon performance of existing online *restless bandit* (RB) algorithms, which stems from the prohibitive sample complexity of learning a full *Markov decision process* (MDP) for each agent. We argue that strong finite-horizon performance requires *rapid convergence* to a *high-quality* policy. Thus motivated, we reformulate online RBs as a *budgeted thresholding contextual bandit*, which simplifies the learning problem by encoding long-term state transitions into a scalar reward. We prove the first non-asymptotic optimality guarantee for an oracle policy in a simplified finite-horizon setting. We then propose a practical learning policy for the heterogeneous-agent, multi-state setting and show that it attains sublinear regret and *faster convergence* than existing methods. Faster convergence translates directly into higher cumulative reward, as validated empirically by significant gains over state-of-the-art algorithms in large-scale heterogeneous environments. Our work provides a new pathway toward practical, sample-efficient learning in finite-horizon RBs.
Submission Number: 1844
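To make the reformulation concrete, the sketch below illustrates one possible selection step of a budgeted thresholding contextual bandit: each agent's context is mapped to a scalar reward estimate standing in for its long-term value, agents whose estimates clear a threshold become candidates, and at most a budget's worth of candidates are activated. All names and parameters here (`theta_hat`, `threshold`, `budget`, the linear reward model) are illustrative assumptions for a minimal sketch, not the paper's actual algorithm.

```python
import numpy as np

# Minimal sketch of one selection round (hypothetical model and parameters).
rng = np.random.default_rng(0)

n_agents = 10        # heterogeneous agents
context_dim = 4      # per-agent context features
budget = 3           # at most `budget` agents may be activated per round
threshold = 0.5      # only agents whose estimate clears this are candidates

theta_hat = rng.normal(size=context_dim)              # assumed learned reward model
contexts = rng.normal(size=(n_agents, context_dim))   # current per-agent contexts

# Scalar reward estimates: stand-ins for the long-term effect of activating each agent.
estimates = contexts @ theta_hat

# Thresholding step: keep agents whose estimate exceeds the threshold,
# then activate the top-`budget` of those candidates.
candidates = np.flatnonzero(estimates > threshold)
ranked = candidates[np.argsort(estimates[candidates])[::-1]]
activated = ranked[:budget]

print("activated agents:", activated.tolist())
```

In a learning loop, the reward model would be updated from the observed scalar feedback of activated agents; the point of the sketch is only that the per-round decision reduces to ranking scalar estimates under a budget rather than planning over each agent's full MDP.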