Information-Directed Offline-to-Online Reinforcement Learning

Keru Chen

Published: 25 May 2026, Last Modified: 11 Jun 2026 | DEMO 2026 Poster | CC BY 4.0

Keywords: offline-to-online reinforcement learning, information-directed sampling, residual uncertainty

TL;DR: IDS targets residual uncertainty in offline-to-online RL, outperforming Thompson Sampling by balancing regret and information gain.

Abstract: Decision-making from offline datasets typically warm-starts a policy or score model from fixed offline data and then refines it with limited online interaction. Offline data reduces uncertainty, but it does not remove the need for exploration; it changes what remains to be explored. We formalise this residual uncertainty by the conditional mutual information $I(\chi;\tau_{1:T}\mid\mathcal{D}_N)$ between a learning target $\chi$ and the online trajectories $\tau_{1:T}$, conditioned on the offline dataset $\mathcal{D}_N$. This view leads naturally to information-directed sampling (IDS), a family parameterised by $\eta\ge 0$ that selects actions by trading off instantaneous regret against information gain. We prove a generic offline-to-online Bayesian regret bound for IDS through a ratio certificate: any information-ratio bound satisfied by a reference Thompson-sampling policy over the same randomised policy class is inherited by IDS. In a known-dynamics Bayesian linear-reward model, the conditional mutual information has a log-determinant form, and vanilla IDS ($\eta=0$) satisfies the regret bound $\widetilde O\!\left(Hd\min\left\{\sqrt T,\,T\sqrt{C^\dagger_{\beta,\mathrm{IDS}_0}(N,T)/N}\right\}\right)$, where the coverage coefficient is tied to the visitation distribution induced by vanilla IDS itself. We also identify a warm-start regime with a dominated but informative probe in which vanilla IDS selects the probe while Thompson sampling never does, giving a constant-factor Bayesian regret separation. Controlled bandit experiments and D4RL offline-to-online RL experiments validate this mechanism: IDS is most beneficial when offline data is informative but leaves biased or low-probability residual uncertainty that targeted online actions can resolve, a regime shared by offline RL, offline black-box optimisation, and Bayesian optimisation.
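
The log-determinant form and the regret-versus-information trade-off can be illustrated with a minimal sketch. It is not the paper's algorithm and rests on assumptions not stated in the abstract: a one-step linear-Gaussian bandit, a learning target equal to the reward parameter `theta`, an isotropic Gaussian prior, and the deterministic ratio-minimising variant of vanilla IDS (squared expected regret divided by information gain). All function names and parameters (`posterior_precision`, `residual_information`, `vanilla_ids_action`, `prior_precision`, `noise_var`, `n_samples`) are hypothetical, and the $\eta>0$ members of the family are not modelled because the abstract does not specify them.

```python
import numpy as np


def posterior_precision(features, prior_precision=1.0, noise_var=1.0):
    """Posterior precision of theta after rewards are observed at the feature rows.

    Assumes theta ~ N(0, I / prior_precision) and Gaussian reward noise.
    """
    d = features.shape[1]
    return prior_precision * np.eye(d) + features.T @ features / noise_var


def residual_information(precision, new_features, noise_var=1.0):
    """Information (nats) that fixed new_features add about theta:
    0.5 * [logdet(P + X^T X / noise_var) - logdet(P)]."""
    updated = precision + new_features.T @ new_features / noise_var
    _, logdet_new = np.linalg.slogdet(updated)
    _, logdet_old = np.linalg.slogdet(precision)
    return 0.5 * (logdet_new - logdet_old)


def vanilla_ids_action(actions, precision, mean, noise_var=1.0,
                       n_samples=2000, rng=None):
    """Pick argmin_a regret(a)^2 / gain(a) under the current posterior.

    Simplification: deterministic choice, with gain measured about theta
    rather than about the optimal action.
    """
    rng = np.random.default_rng(rng)
    cov = np.linalg.inv(precision)
    thetas = rng.multivariate_normal(mean, cov, size=n_samples)
    values = thetas @ actions.T  # shape (n_samples, n_actions)
    # Monte Carlo estimate of E[max_a' r(a')] - E[r(a)] for each action a.
    regret = values.max(axis=1).mean() - values.mean(axis=0)
    gain = np.array([
        residual_information(precision, a[None, :], noise_var)
        for a in actions
    ])
    ratio = regret ** 2 / np.maximum(gain, 1e-12)
    return int(np.argmin(ratio)), regret, gain
```

In this sketch, passing the offline feature matrix to `posterior_precision` plays the role of conditioning on $\mathcal{D}_N$: directions that the offline data already covers contribute little to `residual_information`, so the ratio favours actions that probe the directions left uncertain.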

Submission Number: 39
