Regret Minimization in Linear Bandits with Offline Data via Extended D-Optimal Exploration

TMLR Paper 6773 Authors

02 Dec 2025 (modified: 12 Dec 2025) · Under review for TMLR · CC BY 4.0
Abstract: We consider the problem of online regret minimization in stochastic linear bandits with access to prior observations (\emph{i.e.,} offline data) from the underlying bandit model. This setting is highly relevant to applications where extensive offline data is often available, such as recommendation systems, personalized healthcare, and online advertising; consequently, it has been studied intensively in recent works such as~\cite{banerjee2022artificial, wagenmaker2022leveraging, agrawal2023optimal, hao2023leveraging, cheung2024leveraging}. We introduce the Offline-Online Phased Elimination (OOPE) algorithm, which effectively incorporates the offline data to substantially reduce the online regret compared to prior work. To leverage offline information prudently, OOPE uses an extended D-optimal design within each exploration phase. We show that OOPE achieves an online regret of $\tilde{O}(\sqrt{d_{\text{eff}} T \log \left(|\mathcal{A}|T\right)}+d^2)$, where $\mathcal{A}$ is the action set, $d$ is the dimension, and $T$ is the online horizon. Here $d_{\text{eff}} \, (\leq d)$ is the \emph{effective problem dimension}, which measures the number of poorly explored directions in the offline data and depends on the eigen-spectrum $(\lambda_k)_{k \in [d]}$ of the Gram matrix of the offline data; the eigen-spectrum is thus a quantitative measure of the \emph{quality} of the offline data. If the offline data is poorly explored ($d_{\text{eff}} \approx d$), we recover the established regret bounds for purely online linear bandits. Conversely, when the offline data is abundant ($T_{\text{off}} \gg T$) and well-explored ($d_{\text{eff}} = o(1)$), the online regret is substantially reduced. Additionally, we provide the first known minimax regret lower bounds in this setting that depend explicitly on the quality of the offline data. These lower bounds establish the optimality of our algorithm\footnote{Optimal up to log factors in $T, T_{\text{off}}$ and additive terms in $d$.} in regimes where the offline data is either well-explored or poorly explored. Finally, by using a Frank-Wolfe approximation to the extended optimal design, we further improve the $O(d^{2})$ term to $O\left(\frac{d^{2}}{d_{\text{eff}}} \min \{ d_{\text{eff}},1\} \right)$, a substantial improvement in high dimensions when the offline data is of moderate quality ($d_{\text{eff}} = \Omega(1)$).
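For intuition, here is a minimal NumPy sketch of the two ingredients the abstract leans on: a Frank-Wolfe (Fedorov-Wynn style) approximation to an extended D-optimal design, and a proxy for the effective dimension computed from the eigen-spectrum of the offline Gram matrix. The abstract does not spell out the paper's exact definitions, so the additive-Gram objective $\log\det(V_{\text{off}} + n \sum_a \pi(a)\, a a^{\top})$, the function names, and the $\min\{1, T/\lambda_k\}$ proxy for $d_{\text{eff}}$ are illustrative assumptions, not the authors' construction.

```python
import numpy as np

def extended_d_optimal_fw(actions, V_off, n, n_iters=200):
    """Frank-Wolfe sketch of an *extended* D-optimal design over a finite
    action set (rows of `actions`). Assumption: the offline Gram matrix
    V_off enters as a fixed additive term in the log-det objective."""
    K, d = actions.shape
    pi = np.full(K, 1.0 / K)                 # start from the uniform design
    for k in range(n_iters):
        V = V_off + n * actions.T @ (pi[:, None] * actions)
        V_inv = np.linalg.inv(V)
        # leverage a^T V^{-1} a is proportional to the log-det gradient
        lev = np.einsum("ij,jk,ik->i", actions, V_inv, actions)
        gamma = 2.0 / (k + 2)                # standard Frank-Wolfe step size
        pi *= 1.0 - gamma
        pi[np.argmax(lev)] += gamma          # move toward the best vertex
    return pi

def effective_dimension_proxy(V_off, T):
    """Hypothetical proxy for d_eff: eigenvalue lambda_k contributes
    min(1, T / lambda_k), so well-explored directions (lambda_k >> T)
    contribute ~0 and unexplored directions contribute 1."""
    lam = np.linalg.eigvalsh(V_off)
    return float(np.sum(np.minimum(1.0, T / np.maximum(lam, 1e-12))))

# toy usage: random unit-norm actions and a synthetic offline Gram matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
A /= np.linalg.norm(A, axis=1, keepdims=True)
X_off = A[rng.integers(0, 50, size=5000)]    # synthetic offline pulls
V_off = X_off.T @ X_off                      # offline Gram matrix
pi = extended_d_optimal_fw(A, V_off, n=1000)
print(effective_dimension_proxy(V_off, T=1000))
```

Note how the sketch reproduces the qualitative behavior described above: if the offline eigenvalues all dwarf $T$, the proxy is $o(1)$, while a rank-deficient or poorly spread $V_{\text{off}}$ pushes it toward $d$, and the design weights concentrate on the directions the offline data left unexplored.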
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Chicheng_Zhang1
Submission Number: 6773