Boltzmann Exploration for Heavy-Tailed Bandits

Published: 03 Feb 2026 · Last Modified: 23 Apr 2026 · AISTATS 2026 Poster · CC BY 4.0
TL;DR: We propose a Boltzmann-style algorithm for heavy-tailed bandits that admits closed-form action-selection probabilities.
Abstract: We study the stochastic multi-armed bandit problem with heavy-tailed rewards, assuming only that each arm's reward distribution has a finite $p$-th moment for $p\in(1,2]$. Although prior work has proposed algorithms that are robust to heavy-tailed rewards, these methods do not admit closed-form action-selection probabilities. This hinders efficient offline evaluation and can introduce bias in inverse propensity weighting (IPW) estimators. We propose heavy Boltzmann exploration (H-BE), a Boltzmann-style randomized policy whose action-selection probabilities remain available in closed form under heavy-tailed noise. Theoretically, we show that H-BE achieves the minimax-optimal gap-independent regret bound $O(\nu^{\frac{1}{p}} K^{1-\frac{1}{p}} T^{\frac{1}{p}})$. It also attains the gap-dependent regret bound $O(\sum_{i:\Delta_i>0}{\log(T \Delta_i^{\frac{p}{p-1}}/K)}/{\Delta_i^{\frac{1}{p-1}}})$, where $\nu$ bounds the $p$-th moment, $K$ is the number of arms, $T$ is the horizon, and $\Delta_i$ is the suboptimality gap of arm $i$. Empirically, H-BE attains competitive cumulative regret relative to state-of-the-art baselines, while its explicit propensities enable more stable and efficient offline evaluation.
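The abstract does not specify H-BE's exact robust estimator or temperature schedule, so the Python sketch below is only a plausible instantiation of the stated idea, not the paper's algorithm: a softmax (Boltzmann) policy over truncated mean estimates, whose action-selection probabilities are available in closed form and can therefore be logged for unbiased IPW offline evaluation. The helper names `truncated_mean` and `boltzmann_probs`, the truncation levels, and the temperature schedule are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: the truncation-based estimator and the
# temperature schedule are assumptions, not the paper's H-BE spec.

rng = np.random.default_rng(0)

def truncated_mean(rewards, p, nu, t):
    """Robust mean estimate under a finite p-th moment bound nu:
    clip each observation at a level that grows with its index."""
    n = len(rewards)
    if n == 0:
        return 0.0
    idx = np.arange(1, n + 1)
    levels = (nu * idx / np.log(t + 1)) ** (1.0 / p)  # assumed truncation levels
    return np.clip(rewards, -levels, levels).mean()

def boltzmann_probs(estimates, counts, t, p):
    """Closed-form action-selection probabilities: a softmax over the
    robust estimates with a per-arm temperature term (assumed schedule)."""
    eta = (counts / (t + 1)) ** ((p - 1) / p)
    logits = eta * estimates
    logits -= logits.max()                            # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Toy run: K arms, heavy-tailed noise with tail index alpha > p,
# so the p-th moment is finite as the abstract assumes.
K, T, p, nu, alpha = 5, 2000, 1.5, 2.0, 1.8
means = rng.uniform(0.0, 1.0, K)
history = [[] for _ in range(K)]
logged = []                                           # (arm, reward, propensity)

for t in range(1, T + 1):
    counts = np.array([max(len(h), 1) for h in history])
    est = np.array([truncated_mean(np.array(h), p, nu, t) for h in history])
    probs = boltzmann_probs(est, counts, t, p)
    a = rng.choice(K, p=probs)
    noise = (rng.pareto(alpha) + 1.0) - alpha / (alpha - 1.0)  # zero-mean, heavy-tailed
    r = means[a] + noise
    history[a].append(r)
    logged.append((a, r, probs[a]))

# Because the propensities are explicit, an IPW estimate of any target
# policy's value is immediate; here we evaluate "always pull arm 0".
ipw = np.mean([(a == 0) * r / pi for a, r, pi in logged])
print(f"IPW value of arm 0: {ipw:.3f} (true mean {means[0]:.3f})")
```

The final two lines illustrate the abstract's offline-evaluation claim: with logged closed-form propensities, the IPW estimator needs no propensity modeling, avoiding the bias the abstract attributes to policies without explicit action-selection probabilities.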
Submission Number: 524