Online Regret Minimization in Linear Bandits with Offline Data

Published: 25 May 2026, Last Modified: 27 May 2026, DEMO 2026 Poster, CC BY 4.0
Keywords: Offline to Online, Linear Bandits, Regret Minimization, D-optimal design
TL;DR: Tight upper and lower regret bounds for the offline-to-online setting in linear bandits.
Abstract: We study hybrid offline-to-online regret minimization in stochastic linear bandits, where an agent leverages prior offline logs to accelerate online adaptation. To safely and optimally incorporate this historical data, we introduce Offline-Online Phased Elimination (OOPE), an algorithm that uses an extended D-optimal experimental design. We show that OOPE achieves an online regret of $\tilde{O}\left(\sqrt{d_{eff} T \log\left(|\mathcal{A}|T\right)}+d^2\right)$, where the "effective dimension" $d_{eff} \leq d$ quantitatively captures the quality and coverage of the offline dataset via the eigenspectrum of its Gram matrix. This bound smoothly bridges the gap between purely online learning ($T_{off}=o(T)$, $d_{eff}=d$) and regimes with abundant, well-explored offline data ($T \ll T_{off}$, $d_{eff}=o(d)$), where regret is substantially reduced. Furthermore, we derive the first minimax lower bounds for this setting that explicitly depend on offline data quality, establishing that OOPE is near-optimal in both well-explored and poorly explored regimes. Finally, we propose a Frank-Wolfe variant (OOPE-FW) that strictly improves the additive $O(d^2)$ support term, yielding better performance when offline data provides moderate coverage.
Submission Number: 78
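To make the idea of a D-optimal design that accounts for offline data concrete, the following is a minimal sketch, not the OOPE or OOPE-FW procedure from the paper. It assumes the design objective is $\log\det\left(V_{off} + n \sum_a \pi(a)\, a a^\top\right)$, where $V_{off}$ is the offline Gram matrix and $n$ is an online sampling budget, and it maximizes this objective with a standard Frank-Wolfe loop. The function name, the `ridge` parameter, the step-size schedule, and the toy data are all illustrative assumptions, and the sketch does not attempt to reproduce the paper's definition of $d_{eff}$.

```python
import numpy as np

def extended_d_optimal_design(actions, V_off, n_online, n_iters=200, ridge=1e-6):
    """Frank-Wolfe sketch for max_pi log det(V_off + n_online * sum_a pi[a] a a^T).

    actions : (K, d) array, one action vector per row.
    V_off   : (d, d) offline Gram matrix (assumed positive semidefinite).
    All names and the objective itself are assumptions made for illustration.
    """
    K, d = actions.shape
    V0 = V_off + ridge * np.eye(d)        # small ridge keeps the matrix invertible
    pi = np.full(K, 1.0 / K)              # start from the uniform design
    for k in range(n_iters):
        # Information matrix of the current design, including offline data.
        M = V0 + n_online * (actions.T * pi) @ actions
        M_inv = np.linalg.inv(M)
        # Gradient of log det w.r.t. pi[a] is proportional to a^T M^{-1} a.
        scores = np.einsum("ij,jk,ik->i", actions, M_inv, actions)
        best = int(np.argmax(scores))     # linear maximization over the simplex
        gamma = 2.0 / (k + 2)             # standard Frank-Wolfe step size
        pi *= (1.0 - gamma)
        pi[best] += gamma
    return pi

# Toy example: offline data already covers the first coordinate well,
# so the design should spend the online budget on the second action.
actions = np.eye(2)
V_off = np.diag([100.0, 0.0])
pi = extended_d_optimal_design(actions, V_off, n_online=10)
```

In this toy case the offline Gram matrix has a large eigenvalue along the first direction and none along the second, so the resulting design concentrates on the uncovered direction. This illustrates, under the stated assumptions, how good offline coverage can reduce the amount of online exploration needed.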