Balancing exploration and exploitation in Partially Observed Linear Contextual Bandits via Thompson Sampling
Keywords: Reinforcement Learning, Contextual Bandits, Partial Observations
TL;DR: This paper analyzes Thompson sampling for contextual bandits with partial observations.
Abstract: Contextual bandits constitute a popular framework for studying the exploration-exploitation trade-off under finitely many options with side information. In the majority of existing works, contexts are assumed to be perfectly observed, while in practice it is more reasonable to assume that they are only partially observed. In this work, we study reinforcement learning algorithms for contextual bandits with partial observations. First, we consider different structures for partial observability and their corresponding optimal policies. Subsequently, we present and analyze reinforcement learning algorithms for partially observed contextual bandits with noisy linear observation structures. For these algorithms, which utilize Thompson sampling, we establish estimation accuracy and regret bounds under different structural assumptions.
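To illustrate the setting the abstract describes, the following is a minimal, hypothetical sketch (not the paper's algorithm) of Thompson sampling for a linear contextual bandit where the context is observed only through a noisy linear sensing map. All names, dimensions, and noise scales below are illustrative assumptions; the agent forms a point estimate of the latent context from the observation and runs standard Gaussian Thompson sampling on per-arm reward parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K, T = 3, 4, 2000          # context dim, number of arms, horizon (assumed)
sigma_obs, sigma_rew = 0.1, 0.1  # observation- and reward-noise scales (assumed)

# Known, well-conditioned sensing matrix: observation y = A x + noise.
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))
theta = rng.standard_normal((K, d))  # unknown per-arm reward parameters

# Per-arm Gaussian posterior statistics: precision B[k] and vector f[k];
# the posterior mean is B[k]^{-1} f[k].
B = np.stack([np.eye(d) for _ in range(K)])
f = np.zeros((K, d))

regret = 0.0
for t in range(T):
    x = rng.standard_normal(d)                        # latent context (unobserved)
    y = A @ x + sigma_obs * rng.standard_normal(d)    # partial/noisy observation
    x_hat = np.linalg.solve(A, y)                     # point estimate of the context

    # Thompson sampling: draw one parameter sample per arm, act greedily.
    scores = np.empty(K)
    for k in range(K):
        cov = np.linalg.inv(B[k])
        sample = rng.multivariate_normal(cov @ f[k], cov)
        scores[k] = sample @ x_hat
    a = int(np.argmax(scores))

    r = theta[a] @ x + sigma_rew * rng.standard_normal()
    regret += np.max(theta @ x) - theta[a] @ x        # instantaneous regret

    # Posterior update for the chosen arm, using the estimated context.
    B[a] += np.outer(x_hat, x_hat)
    f[a] += r * x_hat

avg_regret = regret / T
```

The key difference from the fully observed case is the extra estimation step for `x_hat`; the paper's analysis concerns how this estimation error propagates into the regret under different observation structures.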
Submission Number: 123