Keywords: bandits, policy gradient
TL;DR: We present a novel regret analysis of a simple policy gradient algorithm for bandits, characterizing its regret regimes as a function of the learning rate.
Abstract: Recent works of Mei et al. (2023, 2024) have deepened the theoretical understanding of the *Stochastic Gradient Bandit* (SGB) algorithm, showing that using a constant learning rate guarantees asymptotic convergence to the optimal policy, and that sufficiently *small* learning rates can yield logarithmic regret. However, whether logarithmic regret holds beyond small learning rates remains unclear. In this work, we take a step towards characterizing the regret *regimes* of SGB as a function of its learning rate. For two-armed bandits, we identify a sharp threshold, scaling with the sub-optimality gap $\Delta$, below which SGB achieves *logarithmic* regret on all instances, and above which it can incur *polynomial* regret on some instances.
This result highlights the necessity of knowing (or estimating) $\Delta$ to ensure logarithmic regret with a constant learning rate.
For general $K$-armed bandits, we further show that the learning rate must scale inversely with $K$ to avoid polynomial regret. We introduce novel techniques to derive regret upper bounds for SGB, laying the groundwork for future advances in the theory of gradient-based bandit algorithms.
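For concreteness, the minimal sketch below illustrates the kind of algorithm the abstract refers to: a softmax policy over the arms updated with a REINFORCE-style stochastic gradient estimate and a constant learning rate. It is not the authors' implementation; the Bernoulli reward model, the learning rate `eta`, and the helper name `sgb_regret` are illustrative assumptions.

```python
import numpy as np

def sgb_regret(true_means, eta=0.05, horizon=10_000, seed=0):
    """Sketch of a stochastic gradient bandit run with a constant learning rate.

    A softmax policy over K arms is updated with the REINFORCE estimate of the
    reward gradient; rewards are assumed Bernoulli for illustration.
    """
    rng = np.random.default_rng(seed)
    K = len(true_means)
    theta = np.zeros(K)                           # softmax logits
    best, regret = max(true_means), 0.0
    for _ in range(horizon):
        pi = np.exp(theta - theta.max())
        pi /= pi.sum()                            # policy pi_theta
        a = rng.choice(K, p=pi)                   # sample an arm
        r = float(rng.random() < true_means[a])   # Bernoulli reward
        # grad log pi_theta(a) = e_a - pi_theta, scaled by the observed reward
        theta += eta * r * ((np.arange(K) == a) - pi)
        regret += best - true_means[a]
    return regret

# Example: a two-armed instance with sub-optimality gap Delta = 0.2.
# print(sgb_regret([0.7, 0.5], eta=0.05))
```

Sweeping `eta` above and below a gap-dependent threshold on such an instance is one simple way to probe the regret regimes described in the abstract.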
Supplementary Material: zip
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 1760