Keywords: Contextual bandits, Reinforcement learning, Adaptive control, Iterative learning, Statistical learning
TL;DR: In this paper, we show that the Greedy algorithm for contextual bandits with imperfect observations attains a logarithmic growing regret in terms of the horizon.
Abstract: Contextual bandits are widely-used models in reinforcement learning for incorporating both generic and idiosyncratic factors in reward functions. The existing approaches rely on full observation of the context vectors, while the problem of learning optimal arms from partially observed contexts remains immature. We show that in the latter setting, decisions can be made more guarded to minimize the risk of pulling sub-optimal arms. More precisely, efficiency is established for Greedy policies that treat the estimates of the unknown parameter and of the unobserved contexts as their true values. That includes non-asymptotic high probability regret bounds that grow logarithmically with the time horizon and linearly with the number of arms. Numerical results that showcase the efficacy of avoiding exploration are provided as well.