Adaptively Phased Algorithm for Linear Contextual Bandits

TMLR Paper405 Authors

03 Sept 2022 (modified: 17 Sept 2024), Rejected by TMLR, CC BY 4.0

Abstract: We propose a novel algorithm for the linear contextual bandit problem when the set of arms is finite. The minimax expected regret for this problem was recently shown to be $\Omega(\sqrt{dT\log T\log K})$ with $T$ rounds, $d$-dimensional contexts, and $K\leq 2^{d/2}$ arms per round. Previous phased algorithms attain this lower bound in the worst case up to logarithmic factors \citep{Auer, Chu11} or iterated logarithmic factors \citep{Li19}, but they require a priori knowledge of the time horizon $T$ to construct the phases, which limits their use in practice. In this paper we propose a phased algorithm that does not require a priori knowledge of $T$ and instead constructs the phases adaptively. We show that the proposed algorithm guarantees a regret upper bound of order $O(d^{\alpha}\sqrt{T\log T(\log K+\log T)})$, where $\frac{1}{2}\leq \alpha\leq 1$. The proposed algorithm can be viewed as a generalization of Rarely Switching OFUL \citep{Abbasi-Yadkori} that capitalizes on a tight confidence bound for the parameter in each phase, obtained through independent rewards in the same phase.
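To make the idea of horizon-free phases concrete, the following is a minimal sketch of the determinant-growth switching rule used by Rarely Switching OFUL, which the abstract names as the special case being generalized. It is not the algorithm proposed in the paper, whose phase construction is not specified in the abstract. The function name `phase_starts`, the regularizer `lam`, and the growth threshold `C` are illustrative assumptions.

```python
import numpy as np


def phase_starts(contexts, lam=1.0, C=1.0):
    """Rounds at which a new phase begins under a determinant-growth rule.

    Illustrative sketch only (hypothetical helper, not the paper's method).
    A new phase starts once det(V_t) exceeds (1 + C) times the determinant
    recorded at the start of the current phase. The horizon T is never used.
    """
    d = contexts.shape[1]
    V = lam * np.eye(d)  # regularized Gram matrix of the chosen contexts
    _, logdet_at_start = np.linalg.slogdet(V)
    starts = [0]
    for t, x in enumerate(contexts):
        V += np.outer(x, x)  # rank-one update with the context played at round t
        _, logdet = np.linalg.slogdet(V)
        # Compare in log space to avoid overflow of the determinant.
        if logdet > np.log1p(C) + logdet_at_start:
            starts.append(t + 1)  # the next round opens a new phase
            logdet_at_start = logdet
    return starts
```

Because each switch multiplies the determinant by more than $1+C$, the number of phases grows only logarithmically in the determinant of the final Gram matrix, and running the rule on a prefix of the rounds yields the same phase boundaries as running it on the full sequence.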

Submission Length: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Lihong_Li1

Submission Number: 405
