Keywords: nonlinear contextual bandits, diffusion model, Langevin Monte Carlo, posterior sampling
TL;DR: Learn a diffusion prior from past tasks and sample the new task's reward-model parameters via conditional reverse diffusion (DLTS/DPSG), yielding better exploration and lower regret in multi-task nonlinear contextual bandits.
Abstract: We study multi-task nonlinear contextual bandits, where different tasks share the same reward structure but are characterized by distinct model parameters drawn from a common unknown prior distribution. The goal is to leverage information from past tasks to minimize regret on a new task with limited online interactions. Thompson Sampling (TS) is a popular approach for solving contextual bandits, maintaining a posterior over the model parameter that is updated each round using a hand-specified conjugate prior (e.g., Gaussian) and the observed rewards. However, such priors cannot capture the rich cross-task structure in multi-task settings, leading to misspecified posteriors and suboptimal exploration. To address this, we train a diffusion model on data from past tasks to learn a flexible prior distribution over task parameters. In a new bandit task, parameters are estimated via a conditional reverse-diffusion process, where each step combines: (i) an unconditional drift from the diffusion prior, (ii) a likelihood-driven drift from the interaction history, and (iii) a noise term enabling randomized exploration. We instantiate this framework in two ways. DLTS integrates history into the diffusion prior at every reverse step to form a conditional posterior, from which approximate samples are drawn. DPSG first performs unconditional reverse sampling from the pretrained diffusion prior and then applies a single history-guided gradient correction. Both methods adhere to the same framework but differ in how they incorporate interaction history from the new task: DLTS explicitly constructs the conditional posterior, while DPSG provides a lightweight approximation by coupling unconditional sampling with one corrective step. In theory, we formalize oracle TS (OTS) and its diffusion counterpart (ODTS) and prove they are equivalent when the diffusion prior matches the true prior. 
We bound the per-round expected regret gap between ODTS and OTS by the cumulative score estimation error across diffusion levels. Our empirical evaluation shows that the proposed methods are competitive with specialized baselines in linear settings and, by benefiting from the diffusion prior, outperform baselines in challenging nonlinear bandit environments.
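The three-term update described in the abstract (prior drift, likelihood-driven drift, exploration noise) can be illustrated with a minimal sketch. This is not the DLTS or DPSG update itself: it is a plain unadjusted Langevin step, and several pieces are assumptions made for illustration only. The learned diffusion score network is replaced by the analytic score of a Gaussian prior, the reward model is linear-Gaussian rather than nonlinear, and all names (`guided_langevin_step`, `prior_score`, `loglik_grad`) are hypothetical.

```python
import numpy as np

def guided_langevin_step(theta, prior_score, loglik_grad, step, rng):
    """One Langevin-style update combining the three terms from the abstract."""
    # (i) prior drift + (ii) likelihood-driven drift from the interaction history
    drift = prior_score(theta) + loglik_grad(theta)
    # (iii) Gaussian noise, which provides the randomized exploration
    noise = rng.standard_normal(theta.shape)
    return theta + step * drift + np.sqrt(2.0 * step) * noise

# Toy task (assumed): theta ~ N(mu0, s0^2 I), reward r = x^T theta + N(0, sigma^2).
rng = np.random.default_rng(0)
d, n, sigma, s0 = 2, 20, 0.5, 1.0
mu0 = np.zeros(d)
theta_true = rng.normal(mu0, s0)
X = rng.standard_normal((n, d))                      # observed contexts
r = X @ theta_true + sigma * rng.standard_normal(n)  # observed rewards

# Stand-in for a learned score network: exact score of the Gaussian prior.
prior_score = lambda th: -(th - mu0) / s0**2
# Gradient of the log-likelihood of the history (X, r) with respect to theta.
loglik_grad = lambda th: X.T @ (r - X @ th) / sigma**2

theta = rng.standard_normal(d)
chain = []
for _ in range(5000):
    theta = guided_langevin_step(theta, prior_score, loglik_grad, 1e-3, rng)
    chain.append(theta)
samples = np.array(chain[1000:])  # discard burn-in iterations

# Closed-form posterior mean, available only because this toy is conjugate.
precision = np.eye(d) / s0**2 + X.T @ X / sigma**2
post_mean = np.linalg.solve(precision, mu0 / s0**2 + X.T @ r / sigma**2)
```

In this conjugate toy the chain average can be compared against `post_mean` as a sanity check. With a learned diffusion prior and a nonlinear reward model no such closed form exists, which is the setting the abstract targets.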
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 23333