Keywords: ensemble sampling, nonlinear contextual bandits, randomized exploration
TL;DR: We introduce ensemble-sampling algorithms for nonlinear contextual bandits (GLM-ES, Neural-ES) with anytime variants and prove near-state-of-the-art regret bounds along with strong empirical performance.
Abstract: We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for the two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (GLM-ES) for generalized linear bandits and Neural Ensemble Sampling (Neural-ES) for neural contextual bandits. Both methods maintain multiple estimators of the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\mathcal{O}(d^{3/2} \sqrt{T} + d^{9/2})$ for GLM-ES and $\mathcal{O}(\widetilde{d} \sqrt{T})$ for Neural-ES, where $d$ is the dimension of the feature vectors, $\widetilde{d}$ is the effective dimension of the neural tangent kernel matrix, and $T$ is the number of rounds. These regret bounds match the state-of-the-art results for randomized exploration algorithms in nonlinear contextual bandit settings. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. Practically, we remove the fixed-time-horizon assumption by developing anytime versions of our algorithms, suitable when $T$ is unknown. Finally, we empirically evaluate GLM-ES, Neural-ES, and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.
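The abstract's core mechanism, as described, is that each ensemble member refits a maximum-likelihood estimate on its own randomly perturbed copy of the observed data, and the agent acts greedily with respect to a uniformly sampled member. The minimal Python sketch below illustrates that loop on a toy logistic-reward (GLM) bandit; the problem setup, perturbation scale `sigma_p`, regularization `lam`, and warm-started gradient refit are illustrative assumptions, not the paper's actual GLM-ES specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logistic (GLM) contextual bandit: feature dim, arms/round, horizon, ensemble size
d, K, T, M = 5, 10, 1000, 10
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_mle(X, y, theta_init, lam, steps=5, lr=0.5):
    """A few warm-started gradient steps on the regularized logistic MLE objective."""
    theta = theta_init.copy()
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) + lam * theta
        theta -= lr * grad / max(len(y), 1)
    return theta

lam = 1.0        # ridge regularization (assumed value)
sigma_p = 0.5    # reward-perturbation scale (assumed value)
histories = [{"X": [], "y": []} for _ in range(M)]
thetas = np.zeros((M, d))
regret = 0.0

for t in range(T):
    contexts = rng.normal(size=(K, d)) / np.sqrt(d)   # fresh arm features each round
    m = rng.integers(M)                               # sample one ensemble member
    a = int(np.argmax(contexts @ thetas[m]))          # act greedily w.r.t. that member
    p_a = sigmoid(contexts[a] @ theta_star)
    r = float(rng.random() < p_a)                     # Bernoulli reward
    regret += sigmoid(contexts @ theta_star).max() - p_a

    # Each member stores an independently perturbed copy of the observation,
    # then refits its estimator on its perturbed history.
    for j in range(M):
        histories[j]["X"].append(contexts[a])
        histories[j]["y"].append(r + sigma_p * rng.normal())
        X = np.asarray(histories[j]["X"])
        y = np.asarray(histories[j]["y"])
        thetas[j] = fit_mle(X, y, thetas[j], lam)

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

The sketch only conveys the perturbed-data ensemble idea; the paper's algorithms, perturbation distributions, and the neural variant (Neural-ES) differ in their details.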
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15973