Bandit Learning for Online Scheduling with Immediate Decision

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: multi-armed bandit, online scheduling
Abstract: Online scheduling has been extensively studied in computer science and economics owing to its broad applications. Motivated by streaming task processing in domains such as IoT data streams and cloud resource allocation, we investigate an online scheduling setting in which the scheduler must immediately decide whether to accept an incoming task. Consider a system with $M$ identical machines. At each time step, multiple tasks arrive, and each machine must immediately assign itself to a task or remain idle. Tasks that are not processed immediately are abandoned and cannot be revisited. Upon completion, a task yields a reward, which may be stochastic and initially unknown. Through repeated task completions, the scheduler can learn the reward distributions over time. In this work, we formalize this problem as online scheduling with immediate decision. We first analyze the setting with known rewards, for which we derive a worst-case competitive ratio and propose a near-optimal online algorithm. For the case of unknown, random rewards, we design an efficient bandit algorithm that balances exploration and exploitation, achieving $O(\log T)$ regret over a time horizon $T$. Experimental results demonstrate the efficacy of the proposed algorithms.
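To make the setting concrete, below is a minimal illustrative sketch of an immediate-decision scheduler that uses a UCB-style exploration bonus. It is not the authors' algorithm; the class name `UCBImmediateScheduler` and its methods are hypothetical, and the sketch assumes tasks come in discrete types with i.i.d. rewards per type.

```python
import math
from collections import defaultdict


class UCBImmediateScheduler:
    """Hypothetical sketch: each of M machines immediately takes one of the
    arriving task types, ranked by a UCB index; unassigned tasks are abandoned."""

    def __init__(self, num_machines):
        self.num_machines = num_machines
        self.counts = defaultdict(int)    # completions observed per task type
        self.means = defaultdict(float)   # empirical mean reward per task type
        self.t = 0                        # total completions observed so far

    def _ucb(self, task_type):
        # Unseen task types get an infinite index so they are explored first.
        if self.counts[task_type] == 0:
            return float("inf")
        bonus = math.sqrt(2 * math.log(max(self.t, 2)) / self.counts[task_type])
        return self.means[task_type] + bonus

    def assign(self, arriving_task_types):
        """Immediately accept at most num_machines of the arriving tasks,
        choosing those with the highest UCB index; the rest are lost."""
        ranked = sorted(arriving_task_types, key=self._ucb, reverse=True)
        return ranked[: self.num_machines]

    def update(self, task_type, reward):
        # Incremental mean update after a task of this type completes.
        self.t += 1
        self.counts[task_type] += 1
        n = self.counts[task_type]
        self.means[task_type] += (reward - self.means[task_type]) / n
```

A usage step under these assumptions: call `assign` when a batch of tasks arrives, run the accepted tasks on the machines, then call `update` once for each observed reward. The real algorithm in the paper may differ, e.g., in how the exploration bonus interacts with the competitive-ratio analysis for the known-reward case.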
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 11728