Abstract: In contextual bandits, a major challenge is to develop algorithms that are both theoretically sound and empirically efficient for general function classes. We present a novel algorithm called \emph{regularized optimism in the face of uncertainty (ROFU)} for general contextual bandit problems. It exploits an optimization oracle to compute a well-founded upper confidence bound (UCB). Theoretically, for general function classes under very mild assumptions, ROFU achieves a near-optimal regret bound of $\tilde{O}(\sqrt{T})$. Practically, a major advantage of ROFU is that the optimization oracle can be implemented at low computational cost. We can therefore easily extend ROFU to contextual bandits with deep neural networks as the function class, where it outperforms strong baselines including UCB and Thompson sampling variants.
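To make the optimization-oracle idea concrete, below is a minimal sketch of how a ROFU-style UCB might be computed for a neural reward model: starting from the fitted network, take a few gradient-ascent steps on the predicted reward regularized by the loss on past data, and use the resulting prediction as the optimistic estimate. The objective form (reward minus a weighted squared loss), the function name `rofu_ucb`, and the hyperparameters `eta`, `steps`, and `lr` are illustrative assumptions, not the paper's exact specification.

```python
# Hypothetical sketch of a ROFU-style UCB computed via gradient ascent.
# The regularized objective, step count, and weight `eta` are assumptions
# made for illustration only.
import copy
import torch
import torch.nn as nn

def rofu_ucb(model, context_action, past_x, past_r, eta=1.0, steps=5, lr=1e-2):
    """Return an optimistic reward estimate for one (context, action) feature vector."""
    # Work on a copy so the fitted model used for training is left untouched.
    opt_model = copy.deepcopy(model)
    optimizer = torch.optim.SGD(opt_model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        pred = opt_model(context_action.unsqueeze(0)).squeeze()
        data_loss = nn.functional.mse_loss(opt_model(past_x).squeeze(-1), past_r)
        # Maximize predicted reward while staying close to the data fit:
        # minimize the negated regularized objective.
        objective = -(pred - eta * data_loss)
        objective.backward()
        optimizer.step()
    with torch.no_grad():
        return opt_model(context_action.unsqueeze(0)).item()

# Usage: pick the action whose regularized-optimistic estimate is largest.
if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
    past_x = torch.randn(64, 4)          # past (context, action) features
    past_r = torch.randn(64)             # observed rewards
    candidates = torch.randn(3, 4)       # features of candidate actions
    ucbs = [rofu_ucb(model, c, past_x, past_r) for c in candidates]
    chosen = max(range(len(ucbs)), key=lambda a: ucbs[a])
    print("UCB estimates:", ucbs, "-> chosen action:", chosen)
```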
Supplementary Material: zip