Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration

Published: 30 Apr 2026, Last Modified: 27 Jan 2026 · AISTATS 2026 · CC BY 4.0
Abstract: We introduce the first variance-aware algorithms for contextual dueling bandits that combine shallow exploration strategies with neural networks for nonlinear utility approximation. A key theoretical challenge is the absence of a closed-form estimator, which led prior work to require an extremely large network width \( m \) (i.e., \( m = \widetilde{\Omega}(T^{14}) \)). We overcome this constraint with a novel analytical approach that combines iterative self-improvement with spectral analysis. Our analysis significantly reduces the network width requirement to \( m = \widetilde{\Omega}(T^{6}) \), and shows that our algorithms achieve a sublinear regret of \( \widetilde{\mathcal{O}}\left(d\sqrt{\sum_{t=1}^{T} \sigma_t^2} + \sqrt{dT}\right) \) under both the UCB and TS frameworks. Empirical results show that the proposed algorithms are not only computationally efficient, exhibiting sublinear regret in practical settings, but also achieve state-of-the-art performance on both synthetic and real-world tasks.
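The dueling-bandit loop the abstract describes can be illustrated with a minimal sketch. This is not the paper's algorithm: linear features stand in for the last-layer neural gradient features that "shallow exploration" operates on, Bradley-Terry feedback models the pairwise preference, and the constants `lam` and `beta` are hypothetical choices rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8    # feature dimension (stand-in for last-layer gradient features)
K = 5    # candidate arms per round
T = 200  # number of rounds
lam, beta = 1.0, 1.0  # ridge regularizer and exploration bonus (hypothetical)

theta_star = rng.normal(size=d) / np.sqrt(d)  # unknown utility parameters
theta_hat = np.zeros(d)                       # current utility estimate
V = lam * np.eye(d)                           # covariance of observed pair differences

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for t in range(T):
    X = rng.normal(size=(K, d)) / np.sqrt(d)  # contextual arm features
    Vinv = np.linalg.inv(V)
    # UCB over ordered arm pairs: estimated utility gap plus an
    # elliptical-norm bonus on the pairwise difference feature
    best, pair = -np.inf, (0, 1)
    for i in range(K):
        for j in range(K):
            if i == j:
                continue
            z = X[i] - X[j]
            ucb = theta_hat @ z + beta * np.sqrt(z @ Vinv @ z)
            if ucb > best:
                best, pair = ucb, (i, j)
    i, j = pair
    z = X[i] - X[j]
    # Bradley-Terry preference feedback: P(i beats j) = sigmoid(u_i - u_j)
    y = float(rng.random() < sigmoid(theta_star @ z))
    # online logistic step on the pairwise difference, then covariance update
    theta_hat += 0.5 * (y - sigmoid(theta_hat @ z)) * z
    V += np.outer(z, z)
```

Confining the confidence set to the (last-layer) feature covariance `V` is what makes the exploration "shallow": the deep representation is only used to produce features, while optimism is computed in closed form over them.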