Adaptively Phased Algorithm for Linear Contextual Bandits

TMLR Paper405 Authors

03 Sept 2022 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: We propose a novel algorithm for the linear contextual bandit problem when the set of arms is finite. The minimax expected regret for this problem was recently shown to be \(\Omega(\sqrt{dT\log T\log K})\) with \(T\) rounds, \(d\)-dimensional contexts, and \(K\leq 2^{d/2}\) arms per round. Previous phased algorithms attain this lower bound in the worst case up to logarithmic factors \citep{Auer, Chu11} or iterated logarithmic factors \citep{Li19}, but require a priori knowledge of the time horizon \(T\) to construct the phases, which limits their use in practice. In this paper we propose a novel phased algorithm that does not require a priori knowledge of \(T\) but instead constructs the phases adaptively. We show that the proposed algorithm guarantees a regret upper bound of order \(O(d^{\alpha}\sqrt{T\log T(\log K+\log T)})\), where \(\frac{1}{2}\leq \alpha\leq 1\). The proposed algorithm can be viewed as a generalization of Rarely Switching OFUL \citep{Abbasi-Yadkori}, capitalizing on a tight confidence bound for the parameter in each phase, obtained through independent rewards within the same phase.
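To make the adaptive-phase idea concrete, the following is a minimal sketch, not the authors' algorithm, of the rarely-switching mechanism from Rarely Switching OFUL \citep{Abbasi-Yadkori} that the paper says it generalizes: the parameter estimate is recomputed (a new phase begins) only when the determinant of the Gram matrix has grown by a constant factor. The threshold `C`, the helper `select_arm`, and the simulated environment are all illustrative assumptions, not details from the paper.

```python
# Sketch of determinant-triggered phase switching (Rarely Switching OFUL style).
# This is illustrative only; the paper's adaptively phased algorithm differs.
import numpy as np

d, C = 5, 1.0                       # context dimension; switching threshold (assumed)
V = np.eye(d)                       # regularized Gram matrix V_t
b = np.zeros(d)                     # running sum of reward-weighted contexts
theta_hat = np.zeros(d)             # ridge estimate, frozen within a phase
logdet_at_switch = np.linalg.slogdet(V)[1]

def select_arm(contexts, theta, V, beta=1.0):
    """Optimistic (UCB-style) arm choice using the frozen phase estimate."""
    scores = [x @ theta + beta * np.sqrt(x @ np.linalg.solve(V, x))
              for x in contexts]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
theta_star = rng.normal(size=d) / np.sqrt(d)    # unknown true parameter (simulated)

for t in range(1000):
    contexts = rng.normal(size=(10, d))         # K = 10 candidate arms this round
    a = select_arm(contexts, theta_hat, V)
    x = contexts[a]
    r = x @ theta_star + 0.1 * rng.normal()     # noisy linear reward
    V += np.outer(x, x)
    b += r * x
    # Open a new phase only once det(V_t) exceeds (1 + C) * det at last switch.
    if np.linalg.slogdet(V)[1] > logdet_at_switch + np.log(1 + C):
        theta_hat = np.linalg.solve(V, b)
        logdet_at_switch = np.linalg.slogdet(V)[1]
```

Because the determinant can grow only \(O(d\log T)\) times by a constant factor, this criterion yields rare switches; the paper's contribution is to build the phases themselves this adaptively, so that no a priori knowledge of \(T\) is needed.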
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lihong_Li1
Submission Number: 405