Variance-aware decision making with linear function approximation under heavy-tailed rewards

Published: 18 Apr 2024, Last Modified: 18 Apr 2024. Accepted by TMLR.
Abstract: This paper studies how to achieve variance-aware regret bounds for online decision-making in the presence of heavy-tailed rewards with only finite variances. For linear stochastic bandits, we address the issue of heavy-tailed rewards by modifying adaptive Huber regression and proposing AdaOFUL. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{\mathcal{O}}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$, as if the rewards were uniformly bounded, where $\nu_{t}^2$ is the conditional variance of the reward at round $t$, $d$ is the feature dimension, and $T$ is the number of online rounds. Building upon AdaOFUL, we propose VARA for linear MDPs, which achieves a variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$. Here, $H$ is the length of episodes, $K$ is the number of episodes, and $\mathcal{G}^*$ is a smaller instance-dependent quantity that can be bounded by other instance-dependent quantities when additional structural conditions on the MDP are satisfied. Overall, our modified adaptive Huber regression algorithm may serve as a useful building block in the design of algorithms for online problems with heavy-tailed rewards.
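To make the Huber-regression building block concrete, here is a minimal illustrative sketch of (non-adaptive) Huber regression solved by iteratively reweighted least squares. This is not the paper's AdaOFUL or its adaptive Huber regression: the robustification threshold `tau`, the ridge term, and the IRLS solver are simplifying assumptions chosen for the sketch, whereas the paper's algorithm sets the threshold adaptively across online rounds.

```python
import numpy as np

def huber_weights(residuals, tau):
    # Huber IRLS weights: 1 for residuals inside the threshold tau,
    # tau/|r| for residuals outside (down-weighting heavy-tailed outliers).
    r = np.abs(residuals)
    return np.where(r <= tau, 1.0, tau / np.maximum(r, 1e-12))

def huber_regression(X, y, tau, n_iter=50):
    """Illustrative sketch: minimize sum_t Huber_tau(y_t - x_t^T theta)
    via iteratively reweighted least squares (IRLS).
    tau is a fixed threshold here; an adaptive scheme would tune it per round."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = huber_weights(y - X @ theta, tau)
        # Weighted least-squares update; small ridge term for numerical stability.
        A = (X * w[:, None]).T @ X + 1e-6 * np.eye(X.shape[1])
        theta = np.linalg.solve(A, (X * w[:, None]).T @ y)
    return theta
```

With heavy-tailed noise (e.g. Student-t with 2 degrees of freedom, which has infinite variance only marginally beyond the finite-variance regime studied here), such a robust estimator remains stable where ordinary least squares can be badly skewed by extreme rewards.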
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url:
Changes Since Last Submission: Initial submission: we used an additional Times font package in the first submission, so the font was not correct; we fixed the font in this new submission. After review: 2023.11.23, we revised the paper according to the reviewers' feedback and highlighted the revisions in a different color. 2024.1.4, we corrected some typos and polished the proof on page 30. 2024.4.7, we submitted the camera-ready paper.
Assigned Action Editor: ~Nishant_A_Mehta1
Submission Number: 1715