MSMR: Bandit with Minimal Switching Cost and Minimal Marginal Regret

ICLR 2026 Conference Submission 19880 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Regret, Switching Cost, CMAB
Abstract: Effectively balancing switching costs and regret remains a fundamental challenge in bandit learning, especially when the arms exhibit similar expected rewards. Traditional upper confidence bound (UCB) -based algorithms struggle with this trade-off by frequently switching during exploration, incurring high cumulative switching costs. Recent approaches attempt to reduce switching by introducing structured exploration or phase-based selection, yet they often do so at the expense of increased regret due to excessive exploitation of suboptimal arms. In this paper, we propose a new unified framework for bandit problems with switching costs, containing several classical algorithms, applicable to both Multi-Armed Bandits (MAB) and Combinatorial Multi-Armed Bandits (CMAB). Our approach is built on three key components: initial concentrated exploration, near-optimal exploitation, and predictive selection, which together achieve a principled balance between switching cost and regret. Based on this framework, we introduce the Minimal Switching Cost and Minimal Marginal Regret (MSMR) family of algorithms. Theoretically, we show that MSMR algorithms achieve a regret upper bound of $\mathcal{O}(\log n)$ over horizon $n$, incur only $\mathcal{O}((\log n)^{1-\varepsilon})$ switching cost, and its marginal loss has an upper bound of $\mathcal{O}(\lambda \sqrt{\log n})$ by setting $\varepsilon = 1/2$, where $\lambda$ and $\varepsilon \in (0,1)$ are hyper-parameters. Experiments show that MSMR algorithms reduce switching costs to 1.0\% (MAB) and 1.3\% (CMAB) of those incurred by standard baselines, while maintaining comparable regret, demonstrating their practical effectiveness.
Supplementary Material: pdf
Primary Area: learning theory
Submission Number: 19880