Keywords: Contextual Multi-Armed Bandit, Bayesian Belief MDP, Value Iteration, Deep Value Function Approximation, Online Learning
TL;DR: We formulate contextual bandit problems as Bayesian belief-MDPs and solve them by value iteration—exactly on small grids, and via a dual-stream neural network value function for scale—achieving low regret on pricing and arm-uncertainty benchmarks.
Abstract: We present a Bayesian value-iteration framework for contextual multi-armed bandit problems that treats the agent's posterior distribution over the unknown payoff parameters as the state of a Markov Decision Process. We place finite-dimensional priors on the unknown reward parameters and on the exogenous context transition kernel. Value iteration on the resulting belief-MDP yields an optimal policy. We illustrate the approach in an airline seat-pricing simulation. To address the curse of dimensionality, we approximate the value function with a dual-stream deep neural network and benchmark our deep value-iteration algorithm on a standard contextual bandit instance.
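To make the belief-MDP formulation concrete, here is a minimal sketch of exact finite-horizon value iteration on the belief state, assuming two Bernoulli arms whose success probabilities depend on a binary exogenous context with a known transition kernel, and independent Beta(1,1) priors per (arm, context) pair. The horizon, kernel, and function names are illustrative assumptions, not the paper's implementation.

from functools import lru_cache

HORIZON = 12                      # illustrative finite horizon (assumption)
P_CTX = ((0.9, 0.1),              # P(next context | current context = 0)
         (0.2, 0.8))              # P(next context | current context = 1)

@lru_cache(maxsize=None)
def V(belief, ctx, t):
    """Bellman backup on the belief-MDP; the state is the tuple of per-(arm,
    context) Beta-posterior counts together with the observed context."""
    if t == HORIZON:
        return 0.0
    return max(Q(belief, ctx, a, t) for a in range(len(belief)))

def Q(belief, ctx, arm, t):
    s, f = belief[arm][ctx]
    p = (s + 1) / (s + f + 2)     # posterior mean of Beta(s + 1, f + 1)
    def step(counts):
        # Bayes-update the pulled arm's counts in the current context, then
        # average over the exogenous context transition.
        row = belief[arm][:ctx] + (counts,) + belief[arm][ctx + 1:]
        b = belief[:arm] + (row,) + belief[arm + 1:]
        return sum(P_CTX[ctx][c] * V(b, c, t + 1) for c in range(2))
    return p * (1.0 + step((s + 1, f))) + (1 - p) * step((s, f + 1))

if __name__ == "__main__":
    prior = (((0, 0), (0, 0)),    # arm 0: (successes, failures) per context
             ((0, 0), (0, 0)))    # arm 1
    print("V_0 =", V(prior, 0, 0))
    print("first action in context 0:",
          max(range(2), key=lambda a: Q(prior, 0, a, 0)))

Enumerating posterior-count vectors this way is exact but grows combinatorially with the horizon, which is the curse of dimensionality the abstract mentions. For the scaled variant, the following sketch shows one plausible dual-stream value-function approximator; the stream split (one branch for a fixed-size belief summary, one for the context) and all layer sizes are guesses for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class DualStreamValueNet(nn.Module):
    """Illustrative two-stream approximator of V(belief, context): each
    stream embeds its input separately before a joint value head."""
    def __init__(self, belief_dim, context_dim, hidden=64):
        super().__init__()
        self.belief_stream = nn.Sequential(
            nn.Linear(belief_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.context_stream = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, belief, context):
        z = torch.cat([self.belief_stream(belief),
                       self.context_stream(context)], dim=-1)
        return self.head(z).squeeze(-1)  # scalar value per batch element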
Submission Number: 18