Keywords: Contextual Multi-Armed Bandit, Bayesian Belief MDP, Value Iteration, Deep Value Function Approximation, Online Learning
TL;DR: We formulate contextual bandit problems as Bayesian belief-MDPs and solve them by value iteration—exactly on small grids, and via a dual-stream neural network value function for scale—achieving low regret on pricing and arm-uncertainty benchmarks.
Abstract: We present a Bayesian value-iteration framework for contextual multi-armed bandit problems that treats the agent's posterior distribution over the unknown payoff parameters as the state of a Markov Decision Process. We place finite-dimensional priors on the unknown reward parameters and on the exogenous context transition kernel. Value iteration on the resulting belief-MDP yields an optimal policy. We illustrate the approach in an airline seat-pricing simulation. To address the curse of dimensionality, we approximate the value function with a dual-stream deep neural network and benchmark our deep value-iteration algorithm on a standard contextual bandit instance.
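To make the belief-MDP formulation concrete, here is a minimal sketch of exact finite-horizon value iteration on the belief state, assuming two Bernoulli arms whose success probabilities depend on a binary exogenous context with a known transition kernel, and independent Beta(1,1) priors per (arm, context) pair. The horizon, kernel, and function names are illustrative assumptions, not the paper's implementation.

from functools import lru_cache

HORIZON = 12                      # illustrative finite horizon (assumption)
P_CTX = ((0.9, 0.1),              # P(next context | current context = 0)
         (0.2, 0.8))              # P(next context | current context = 1)

@lru_cache(maxsize=None)
def V(belief, ctx, t):
    """Bellman backup on the belief-MDP; the state is the tuple of per-(arm,
    context) Beta-posterior counts together with the observed context."""
    if t == HORIZON:
        return 0.0
    return max(Q(belief, ctx, a, t) for a in range(len(belief)))

def Q(belief, ctx, arm, t):
    s, f = belief[arm][ctx]
    p = (s + 1) / (s + f + 2)     # posterior mean of Beta(s + 1, f + 1)
    def step(counts):
        # Bayes-update the pulled arm's counts in the current context, then
        # average over the exogenous context transition.
        row = belief[arm][:ctx] + (counts,) + belief[arm][ctx + 1:]
        b = belief[:arm] + (row,) + belief[arm + 1:]
        return sum(P_CTX[ctx][c] * V(b, c, t + 1) for c in range(2))
    return p * (1.0 + step((s + 1, f))) + (1 - p) * step((s, f + 1))

if __name__ == "__main__":
    prior = (((0, 0), (0, 0)),    # arm 0: (successes, failures) per context
             ((0, 0), (0, 0)))    # arm 1
    print("V_0 =", V(prior, 0, 0))
    print("first action in context 0:",
          max(range(2), key=lambda a: Q(prior, 0, a, 0)))

Enumerating posterior-count vectors this way is exact but grows combinatorially with the horizon, which is the curse of dimensionality the abstract mentions. For the scaled variant, the following sketch shows one plausible dual-stream value-function approximator; the stream split (one branch for a fixed-size belief summary, one for the context) and all layer sizes are guesses for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class DualStreamValueNet(nn.Module):
    """Illustrative two-stream approximator of V(belief, context): each
    stream embeds its input separately before a joint value head."""
    def __init__(self, belief_dim, context_dim, hidden=64):
        super().__init__()
        self.belief_stream = nn.Sequential(
            nn.Linear(belief_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.context_stream = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, belief, context):
        z = torch.cat([self.belief_stream(belief),
                       self.context_stream(context)], dim=-1)
        return self.head(z).squeeze(-1)  # scalar value per batch element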
Submission Number: 18