Abstract: We study online learning in contextual bandit problems where
the loss function is assumed to belong to a known parametric function class.
We propose a new analytic framework for this setting that bridges the Bayesian
theory of information-directed sampling due to Russo and Van Roy [2018] and
the worst-case theory of Foster, Kakade, Qian, and Rakhlin [2021] based on the
decision-estimation coefficient. Drawing from both lines of work, we propose an
algorithmic template called Optimistic Information-Directed Sampling and show
that it can achieve instance-dependent regret guarantees similar to those
achievable by the classic Bayesian IDS method, but with the major advantage of not
requiring any Bayesian assumptions. The key technical innovation of our analysis
is the introduction of an optimistic surrogate model for the regret, which we use
to define a frequentist version of the information ratio of Russo and Van Roy [2018]
and a less conservative version of the decision-estimation coefficient of Foster et al.
[2021].
Format: Long format (up to 8 pages + refs, appendix)
Publication Status: Yes
Submission Number: 62