Stochastic Bandits on Mixture Distributions: Metrics & Regret Bounds

Published: 03 Feb 2026, Last Modified: 06 Feb 2026. AISTATS 2026 Poster. License: CC BY 4.0
TL;DR: Stochastic multi-armed bandits whose reward distributions are multimodal Gaussian mixtures: how does one rank the arms and subsequently minimize the regret?
Abstract: Multimodal reward distributions naturally arise in real-world applications such as targeted recommendations to heterogeneous sub-populations and selective unit-level interventions. These settings challenge standard mean- or risk-based bandit approaches, requiring metrics that quantify the merit of mixture parameters without prior knowledge of the modes. We consider the bandit setting where the reward associated with an arm is sampled from a finite mixture of Gaussians, which is strictly more general than the unimodal setting. We rank arms using functions of the mixture parameters and propose methods to minimize the cumulative regret with respect to the induced ranking. We show that the achievable pseudo-regret has a lower bound of the order $\Omega(\mathsf{T}^{1/2})$ and propose an explore-then-exploit algorithm based on expectation maximization (ETE-EM) which achieves a regret of $\widetilde{\mathsf{O}}(\mathsf{T}^{2/3})$. Further, we show that a modification of Thompson sampling (TS-EM) achieves a Bayes regret of $\widetilde{\mathsf{O}}(\mathsf{T}^{1/2})$. Experiments validate our approach in practice, where we benchmark against both algorithms designed for sub-Gaussian bandits and naive clustering-based extensions of empirical-CDF methods, showing that our approach achieves consistently lower regret across the choice of metrics.
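To make the explore-then-exploit idea concrete, the following is a minimal sketch (not the paper's actual ETE-EM algorithm, whose exploration schedule and metrics are specified in the full text): pull every arm uniformly during an exploration phase, fit a two-component 1-D Gaussian mixture to each arm's samples with a small EM loop, rank the arms by a user-supplied function of the fitted mixture parameters, and commit to the top-ranked arm for the remaining budget. The function names (`em_gmm_1d`, `ete_em`), the quantile-based EM initialization, and the example metric (largest component mean) are all illustrative assumptions.

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture to samples x via EM.

    Illustrative sketch: means are initialized at spread-out quantiles
    of the data, variances at the overall sample variance.
    """
    mu = np.quantile(x, np.linspace(0.25, 0.75, k))   # spread-out init
    var = np.full(k, x.var() + 1e-6)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
                 / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: reweighted parameter updates
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

def ete_em(pull, n_arms, horizon, n_explore, metric, seed=0):
    """Explore uniformly, fit a mixture per arm, rank by `metric`, commit.

    `pull(arm, rng)` returns one reward sample; `metric(w, mu, var)`
    maps fitted mixture parameters to a scalar score (assumed ranking).
    """
    rng = np.random.default_rng(seed)
    samples = [np.array([pull(a, rng) for _ in range(n_explore)])
               for a in range(n_arms)]
    scores = [metric(*em_gmm_1d(s)) for s in samples]
    best = int(np.argmax(scores))
    # Exploit: spend the remaining budget on the top-ranked arm
    for _ in range(horizon - n_arms * n_explore):
        pull(best, rng)
    return best, scores
```

For instance, with two arms whose rewards are bimodal mixtures (modes at 0/5 versus 1/3) and the "largest component mean" metric, the sketch identifies the arm containing the mode at 5. The lower-bound/upper-bound results in the paper concern how `n_explore` should scale with the horizon; here it is simply a fixed parameter.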
Submission Number: 486