Provably Correct SGD-Based Exploration for Generalized Stochastic Bandit Problem

Published: 01 Jan 2024, Last Modified: 26 Jul 2025, SmartNets 2024, CC BY-SA 4.0
Abstract: Bandit problems are widely used in wireless communication systems, which involve generalized reward models and can suffer from high computational complexity. Despite the success of applying stochastic gradient descent (SGD) in stochastic bandits to reduce computational complexity, several limitations persist in the state of the art. First, existing work considers only linear reward models, which are impractical in wireless communication. Second, existing algorithms are guaranteed only by expected regret bounds, which may be ineffective when many actions are sub-optimal. Third, existing SGD-based approaches introduce bias into the estimation through a greedy action-selection strategy, deviating from the conventional SGD approach that samples uniformly. To address these limitations, we propose an online SGD-based algorithm with a high-probability regret guarantee, applicable to stochastic bandits with general parametric reward functions. We develop an action-elimination strategy that gradually eliminates sub-optimal actions and selects an action uniformly at random from the current action subset; this strategy guarantees an unbiased estimate of the model parameters. Theoretically, we prove that our algorithm achieves regret $O(d \sqrt{n \log (n / \delta)})$ with probability at least $1-\delta$, where $n$ is the number of time steps and $d$ is the dimension of the model parameters, matching existing near-optimal regret bounds for UCB-type algorithms. We further conduct experiments to demonstrate the advantage of our algorithm.
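To make the algorithmic idea concrete, below is a minimal Python sketch of an action-elimination bandit with SGD parameter updates, in the spirit the abstract describes. All details here are illustrative assumptions rather than the paper's actual algorithm: the sigmoid link `mu`, the $1/\sqrt{t}$ step size, the elimination schedule `elim_every`, and the stylized confidence width are stand-ins for the paper's specific choices.

```python
import numpy as np

def sgd_elimination_bandit(actions, reward_fn, n_rounds,
                           mu=lambda z: 1.0 / (1.0 + np.exp(-z)),
                           eta=0.1, delta=0.05, elim_every=100):
    """Illustrative action-elimination bandit with SGD updates (not the paper's algorithm).

    actions   : (K, d) array of feature vectors, one per arm.
    reward_fn : callable(x) -> noisy scalar reward (the environment).
    mu        : link function of an assumed generalized linear model mu(x @ theta).
    """
    K, d = actions.shape
    theta = np.zeros(d)            # parameter estimate
    active = list(range(K))        # surviving (non-eliminated) arms
    total_reward = 0.0

    for t in range(1, n_rounds + 1):
        # Uniform sampling over the active set keeps the SGD updates unbiased,
        # in contrast to a greedy selection rule.
        k = active[np.random.randint(len(active))]
        x = actions[k]
        r = reward_fn(x)
        total_reward += r

        # One SGD step on the squared loss (r - mu(x @ theta))^2;
        # for the sigmoid link, mu'(z) = mu(z) * (1 - mu(z)).
        z = x @ theta
        grad = -2.0 * (r - mu(z)) * mu(z) * (1.0 - mu(z)) * x
        theta -= (eta / np.sqrt(t)) * grad

        # Periodically drop arms whose estimated reward falls below the best
        # estimate by more than an (assumed) confidence width.
        if t % elim_every == 0 and len(active) > 1:
            est = np.array([mu(actions[j] @ theta) for j in active])
            width = np.sqrt(d * np.log(n_rounds / delta) / t)   # stylized radius
            keep = est >= est.max() - 2.0 * width
            active = [j for j, ok in zip(active, keep) if ok]

    return theta, active, total_reward

# Tiny usage example with a synthetic environment (hypothetical setup).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))                     # 20 arms, 5-dim features
theta_star = rng.normal(size=5)
env = lambda x: 1.0 / (1.0 + np.exp(-(x @ theta_star))) + 0.1 * rng.normal()
theta_hat, survivors, _ = sgd_elimination_bandit(A, env, n_rounds=5000)
```

The point the sketch isolates is the separation of concerns the abstract argues for: a uniform-random choice over the surviving action set (which keeps each gradient step an unbiased estimate) and a periodic elimination step that shrinks that set toward the optimal action.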