Boltzmann Exploration for Heavy-Tailed Bandits
TL;DR: We propose a Boltzmann-style algorithm for heavy-tailed bandits that admits closed-form action-selection probabilities.
Abstract: We consider the stochastic multi-armed bandit problem with heavy-tailed rewards, assuming only that each arm's reward distribution has a finite $p$-th moment for $p\in(1,2]$. Although prior work has proposed algorithms robust to heavy-tailed rewards, these methods do not admit closed-form action-selection probabilities, hindering efficient offline evaluation and potentially introducing bias in inverse propensity weighting (IPW) estimators. We propose heavy Boltzmann exploration (H-BE), a Boltzmann-style randomized policy whose action-selection probabilities are available in closed form even under heavy-tailed noise. Theoretically, we establish that H-BE attains the minimax-optimal gap-independent regret bound $O(\nu^{\frac{1}{p}} K^{1-\frac{1}{p}} T^{\frac{1}{p}})$ and a gap-dependent regret bound $O(\sum_{i:\Delta_i>0}{\log(T \Delta_i^{\frac{p}{p-1}}/K)}/{\Delta_i^{\frac{1}{p-1}}})$, where $\nu$ bounds the $p$-th moment, $K$ is the number of arms, $T$ is the horizon, and $\Delta_i$ denotes the suboptimality gap of arm $i$. Empirically, H-BE shows competitive cumulative regret relative to state-of-the-art baselines, and its explicit propensities enable more stable and efficient IPW-based offline evaluation.
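Below is a minimal sketch of the kind of pipeline the abstract describes: a Boltzmann (softmax) policy over robust mean estimates, whose closed-form propensities are logged and then reused directly in an IPW off-policy estimate. The truncated-mean estimator, the $\sqrt{t}$ inverse-temperature schedule, and the Student-t reward simulator are illustrative assumptions, not the paper's exact H-BE construction.

```python
# Illustrative sketch only: the estimator, temperature schedule, and reward
# model below are assumptions, not the paper's H-BE specification.
import numpy as np

def truncated_mean(rewards, p, nu, t):
    """Robust mean under a finite p-th moment bound nu (illustrative choice)."""
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    b = (nu * n / np.log(t + 2.0)) ** (1.0 / p)  # truncation level grows with n
    return float(np.mean(np.clip(r, -b, b)))

def propensities(est_means, eta):
    """Closed-form Boltzmann (softmax) action-selection probabilities."""
    z = eta * (est_means - est_means.max())  # max-shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def ipw_value(actions, rewards, probs, arm):
    """IPW estimate of one arm's mean reward from logged propensities."""
    a, r, q = map(np.asarray, (actions, rewards, probs))
    return float(np.mean((a == arm) * r / q))

rng = np.random.default_rng(0)
K, T, p, nu = 3, 2000, 1.5, 1.0
true_means = np.array([0.0, 0.3, 0.5])
hist = [[] for _ in range(K)]
log_a, log_r, log_q = [], [], []
for t in range(1, T + 1):
    est = np.array([truncated_mean(hist[i], p, nu, t) if hist[i] else 0.0
                    for i in range(K)])
    pi = propensities(est, eta=np.sqrt(t))  # hypothetical eta schedule
    a = int(rng.choice(K, p=pi))
    # Student-t noise with df=1.8 has a finite p-th moment for p < 1.8.
    reward = true_means[a] + rng.standard_t(1.8)
    hist[a].append(reward)
    log_a.append(a); log_r.append(reward); log_q.append(pi[a])

print([round(ipw_value(log_a, log_r, log_q, i), 3) for i in range(K)])
```

Because each logged propensity `pi[a]` is exactly the probability the policy sampled from, the IPW estimate requires no propensity-fitting step, which is the practical advantage of closed-form action-selection probabilities the abstract highlights.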
Submission Number: 524