Breaking the Total Variance Barrier: Sharp Sample Complexity for Linear Heteroscedastic Bandits with Fixed Action Set
Keywords: linear bandits, heteroscedastic noise, simple regret
Abstract: Recent years have witnessed increasing interest in tackling heteroscedastic noise in bandits and reinforcement learning \citep{zhou2021nearly, zhao2023variance, jia2024does, pacchiano2025second}. In these works, the cumulative variance of the noise $\Lambda = \sum_{t=1}^T \sigma_t^2$, where $\sigma_t^2$ is the variance of the noise at round $t$, has been used to characterize the statistical complexity of the problem, yielding simple regret bounds of order $\tilde{\mathcal O}(d \sqrt{\Lambda / T^2})$ for linear bandits with heteroscedastic noise \citep{zhou2021nearly, zhao2023variance}.
However, a closer look reveals that $\Lambda$ remains of the same order even if the noise is close to zero in half of the rounds, which indicates that the $\sqrt{\Lambda}$-dependence is not optimal.
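To make the point concrete, here is a minimal worked illustration (not taken from the abstract, but following its definitions): suppose $\sigma_t^2 = \sigma^2$ in half of the rounds and $\sigma_t^2 = \epsilon^2$ with $\epsilon \ll \sigma$ in the other half. Then
$$\Lambda = \sum_{t=1}^T \sigma_t^2 = \frac{T}{2}\big(\sigma^2 + \epsilon^2\big) = \Theta(T\sigma^2),$$
so a $\sqrt{\Lambda}$-based simple regret bound stays at order $d\sigma/\sqrt{T}$ no matter how small $\epsilon$ is, whereas the harmonic-mean quantity
$$\Big[\sum_{t=1}^T \frac{1}{\sigma_t^2}\Big]^{-\frac{1}{2}} \le \Big(\frac{T}{2\epsilon^2}\Big)^{-\frac{1}{2}} = \epsilon\sqrt{\frac{2}{T}}$$
vanishes as $\epsilon \to 0$.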
In this paper, we revisit the linear bandit problem with heteroscedastic noise. We consider the setting where the action set is fixed throughout the learning process. We propose a novel variance-adaptive algorithm VAEE (Variance-Aware Exploration with Elimination) for large action sets, which actively explores actions that maximize the information gain among a candidate set of actions that have not been eliminated. With this active-exploration strategy, we show that VAEE achieves a *simple regret* bound with a nearly *harmonic-mean*-dependent rate, i.e., $\tilde{\mathcal O}\Big(d\Big[\sum_{t = 1}^T \frac{1}{\sigma_t^2} - \sum_{i = 1}^{\tilde{O}(d)} \frac{1}{[\sigma^{(i)}]^2} \Big]^{-\frac{1}{2}}\Big)$, where $d$ is the dimension of the feature space and $\sigma^{(i)}$ is the $i$-th smallest variance among $\{\sigma_t\}_{t=1}^T$. For finitely many actions, we propose a variance-aware variant of G-optimal-design-based exploration, which achieves a
$\tilde{\mathcal O}\bigg(\sqrt{d \log |\mathcal A|}\Big[\sum_{t = 1}^T \frac{1}{\sigma_t^2} - \sum_{i = 1}^{\tilde{O}(d)} \frac{1}{[\sigma^{(i)}]^2}\Big]^{-\frac{1}{2}}\bigg)$ simple regret bound. We also establish a nearly matching lower bound for the fixed action set setting, indicating that the *harmonic-mean*-dependent rate is unavoidable. To the best of our knowledge, this is the first work that breaks the $\sqrt{\Lambda}$ barrier for linear bandits with heteroscedastic noise.
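As a rough sketch of how variance-aware exploration with elimination over a fixed, finite action set could look in code, the snippet below maintains inverse-variance weighted least-squares statistics, plays the non-eliminated action with the largest weighted information gain, and eliminates actions by comparing confidence intervals. The weight choice, the gain score, and the confidence parameter `beta` are illustrative assumptions; this is not the paper's exact VAEE or G-optimal-design procedure.

```python
# Hypothetical sketch of variance-aware exploration with elimination for a
# fixed, finite linear action set. The inverse-variance weights, the
# information-gain score, and the elimination rule are assumptions made for
# illustration, not the paper's exact algorithm.
import numpy as np

def variance_aware_elimination(actions, theta_star, sigmas, beta=2.0, reg=1.0, seed=0):
    """actions: (K, d) fixed action set; sigmas: per-round noise std devs."""
    rng = np.random.default_rng(seed)
    K, d = actions.shape
    V = reg * np.eye(d)          # weighted design matrix
    b = np.zeros(d)              # weighted sum of x_t * y_t
    active = np.arange(K)        # indices of non-eliminated actions

    for sigma in sigmas:
        w = 1.0 / max(sigma**2, 1e-12)          # inverse-variance weight
        V_inv = np.linalg.inv(V)

        # Play the active action with the largest weighted information gain,
        # measured here by w * ||x||_{V^{-1}}^2.
        gains = np.array([w * a @ V_inv @ a for a in actions[active]])
        x = actions[active[np.argmax(gains)]]

        # Observe a noisy reward and update the weighted least-squares statistics.
        y = x @ theta_star + rng.normal(0.0, sigma)
        V += w * np.outer(x, x)
        b += w * x * y

        # Eliminate actions whose optimistic value falls below the best
        # pessimistic value under the current confidence widths.
        theta_hat = np.linalg.solve(V, b)
        V_inv = np.linalg.inv(V)
        means = actions[active] @ theta_hat
        widths = beta * np.sqrt(np.einsum('ij,jk,ik->i', actions[active], V_inv, actions[active]))
        active = active[means + widths >= np.max(means - widths)]
        if len(active) == 1:
            break

    theta_hat = np.linalg.solve(V, b)
    return actions[active[np.argmax(actions[active] @ theta_hat)]]

# Example usage on a small random instance.
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
theta = rng.normal(size=5)
sigmas = rng.uniform(0.01, 1.0, size=500)   # heteroscedastic noise levels
print(variance_aware_elimination(A, theta, sigmas))
```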
Supplementary Material: pdf
Primary Area: reinforcement learning
Submission Number: 12713