Leveraging priors on distribution functions for multi-arm bandits

Published: 17 Jul 2025, Last Modified: 06 Sept 2025 · EWRL 2025 Poster · CC BY 4.0
Keywords: Bayesian nonparametric statistics, Online learning, multi-arm bandits
Abstract: We introduce Dirichlet Process Posterior Sampling (DPPS), a Bayesian non-parametric algorithm for multi-arm bandits based on Dirichlet Process (DP) priors. Like Thompson sampling, DPPS is a probability-matching algorithm, i.e., it plays an arm based on its posterior probability of being optimal. Instead of assuming a parametric class for the reward-generating distribution of each arm and then putting a prior on the parameters, in DPPS the reward-generating distribution is directly modeled using DP priors. DPPS provides a principled approach to incorporating prior belief about the bandit environment, and in the noninformative limit of the DP priors (i.e., the Bayesian Bootstrap), we recover Non-Parametric Thompson Sampling (NPTS), a popular non-parametric bandit algorithm, as a special case of DPPS. We employ the stick-breaking representation of the DP priors and show excellent empirical performance of DPPS in challenging synthetic and real-world bandit environments. Finally, using an information-theoretic analysis, we show non-asymptotic optimality of DPPS in the Bayesian regret setup.
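The abstract's recipe (DP posterior per arm, sampled via stick breaking, with probability matching over the sampled means) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' implementation: the truncation level, the uniform base measure in the usage example, and the function names are assumptions made here for concreteness. It uses the standard conjugacy of the DP, i.e. that observing rewards r_1..r_n under a DP(α, G0) prior yields a DP(α + n, (αG0 + Σ_i δ_{r_i})/(α + n)) posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_posterior_mean_sample(rewards, alpha, base_sampler, trunc=100):
    """Draw one plausible mean reward for an arm from its DP posterior.

    The posterior after observing rewards r_1..r_n is
    DP(alpha + n, (alpha*G0 + sum_i delta_{r_i}) / (alpha + n)).
    We draw a truncated stick-breaking sample of that random measure
    and return its mean. `trunc` is an illustrative truncation level.
    """
    n = len(rewards)
    # Stick-breaking weights for a DP with concentration alpha + n.
    betas = rng.beta(1.0, alpha + n, size=trunc)
    w = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    w /= w.sum()  # fold the truncated tail back into the weights
    # Atoms come from the posterior base measure: with probability
    # alpha/(alpha+n) from the prior G0, otherwise a uniformly chosen
    # observed reward.
    if n > 0:
        from_prior = rng.random(trunc) < alpha / (alpha + n)
    else:
        from_prior = np.ones(trunc, dtype=bool)
    atoms = np.empty(trunc)
    atoms[from_prior] = base_sampler(int(from_prior.sum()))
    if n > 0:
        atoms[~from_prior] = rng.choice(rewards, size=int((~from_prior).sum()))
    return float(w @ atoms)

def dpps_select(history, alpha, base_sampler):
    """Probability matching: play the arm whose sampled posterior mean is largest."""
    samples = [dp_posterior_mean_sample(h, alpha, base_sampler) for h in history]
    return int(np.argmax(samples))

# Usage: two arms with rewards in [0, 1] and a uniform prior base measure G0.
base = lambda k: rng.uniform(0.0, 1.0, size=k)
arm = dpps_select([[0.0, 0.1, 0.2], [0.8, 0.9, 1.0]], alpha=1.0, base_sampler=base)
```

Note that as α → 0 the prior atoms vanish and the stick-breaking weights over the observed rewards become a Bayesian-bootstrap (flat Dirichlet) reweighting, which is exactly the NPTS special case the abstract mentions.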
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Sumit_Vashishtha1
Track: Fast Track: published work
Publication Link: https://openreview.net/forum?id=WzC1Hr3Kak#discussion
Submission Number: 69