Abstract: We consider the problem of online regret minimization in stochastic linear bandits with access to prior observations (\emph{i.e.}, offline data) from the underlying bandit model. This setting is highly relevant to numerous applications where extensive offline data is often available, such as recommendation systems, personalized healthcare, and online advertising; consequently, it has been studied intensively in recent works such as~\cite{banerjee2022artificial, wagenmaker2022leveraging, agrawal2023optimal, hao2023leveraging, cheung2024leveraging}. We introduce the Offline-Online Phased Elimination (OOPE) algorithm, which effectively incorporates the offline data to substantially reduce the online regret compared to prior work. To leverage offline information prudently, OOPE uses an extended D-optimal design within each exploration phase. We show that OOPE achieves an online regret of $\tilde{O}(\sqrt{d_{\text{eff}} T \log \left(|\mathcal{A}|T\right)}+d^2)$, where $\mathcal{A}$ is the action set, $d$ is the dimension, and $T$ is the online horizon. Here $d_{\text{eff}} \hspace{0.1cm} (\leq d)$ is the \emph{effective problem dimension}, which measures the number of poorly explored directions in the offline data and depends on the eigen-spectrum $(\lambda_k)_{k \in [d]}$ of the Gram matrix of the offline data; the eigen-spectrum is thus a quantitative measure of the \emph{quality} of the offline data. If the offline data is poorly explored ($d_{\text{eff}} \approx d$), we recover the established regret bounds for purely online linear bandits. Conversely, when the offline data is abundant ($T_{\text{off}} \gg T$) and well-explored ($d_{\text{eff}} = o(1)$), the online regret reduces substantially. Additionally, we provide the first known minimax regret lower bounds in this setting that depend explicitly on the quality of the offline data.
These lower bounds establish the optimality of our algorithm\footnote{Optimal up to log factors in $T, T_{\text{off}}$ and additive constants in $d$.} in regimes where the offline data is either well-explored or poorly explored. Finally, by using a Frank-Wolfe approximation to the extended optimal design, we further improve the $O(d^{2})$ term to $O\left(\frac{d^{2}}{d_{\text{eff}} } \min \{ d_{\text{eff}},1\} \right)$, which can be a substantial improvement in high dimensions when the offline data is of moderate quality ($d_{\text{eff}} = \Omega(1)$).
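The abstract describes $d_{\text{eff}}$ as counting the poorly explored directions of the offline data via the eigen-spectrum of its Gram matrix. As a rough illustration only (the paper's precise definition of $d_{\text{eff}}$ is not given here), the sketch below counts eigenvalues of the offline Gram matrix that fall below a hypothetical exploration threshold; the function name and threshold are assumptions for illustration:

```python
import numpy as np

def effective_dimension(offline_features, threshold):
    """Hypothetical proxy for d_eff: the number of directions whose
    offline Gram-matrix eigenvalue is below `threshold`, i.e. directions
    the offline data has explored poorly. Not the paper's exact formula."""
    V = offline_features.T @ offline_features   # Gram matrix of offline data
    eigvals = np.linalg.eigvalsh(V)             # eigen-spectrum (lambda_k)
    return int(np.sum(eigvals < threshold))

# Offline data concentrated in the first 2 of 5 coordinate directions:
rng = np.random.default_rng(0)
X = np.zeros((1000, 5))
X[:, :2] = rng.normal(size=(1000, 2))
print(effective_dimension(X, threshold=1.0))    # -> 3 poorly explored directions
```

Under this proxy, well-explored offline data (all eigenvalues large) gives a small count, while offline data confined to a subspace leaves the orthogonal directions uncounted by exploration, matching the abstract's reading of $d_{\text{eff}}$ as a quality measure.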
Certifications: J2C Certification
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: In response to the reviewers' comments, we have made a number of changes to the paper. The following changes are pertinent to your comments:
$\underline{\text{Intuition for $d_{eff}$}}:$ We have added a paragraph at the bottom of page 7 detailing the aforementioned intuition for $d_{eff}$ and how it can be estimated. We also provide examples where we compute $d_{eff}$ for specific action sets and offline data distributions. We hope this clarifies the notion of $d_{eff}$ and why it plays an important role in our analysis.
$\underline{\text{Making OOPE horizon-free}}:$ We add Remark 4.7 at the bottom of page 11 describing how to make OOPE horizon-free.
$\underline{\text{Gap between upper and lower bounds}}:$ We add Remark 4.12 explaining in detail why we believe the gap between the upper and lower bounds is only analytical and stems from a potentially loose bound employed in Lemma 4.4.
$\underline{\text{Benchmark against pure online phased elimination}}:$ We updated Figure 1 to include pure online phased elimination, benchmarking the benefits obtained by employing OOPE in the OO setting.
$\underline{\text{Experimental results comparing OOPE and OOPE-FW}}:$ We have added an experimental section where we extensively compare OOPE and OOPE-FW across varying $d$, $T$, $\lambda(V_{\pi_{off}})$, and $d_{eff}$; the parameter settings are given in Table 5 and the results in Figure 4. We give a detailed discussion of the obtained results.
Code: https://github.com/vtnahsus/Offline_Online_Linear_Bandits_TMLR_paper_2026
Assigned Action Editor: ~Chicheng_Zhang1
Submission Number: 6773