Multi-Objective Bandits with Hierarchical Preferences: A Thompson Sampling Approach

ICLR 2026 Conference Submission 16056 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Thompson Sampling, multi-objective optimization, bandit
TL;DR: We present the first Thompson Sampling framework for multi-objective bandits with hierarchical preferences.
Abstract: This paper studies multi-objective bandits with hierarchical preferences, a class of bandit problems where arms are evaluated according to multiple objectives, each with a distinct priority level. The agent aims to maximize the most critical objective first, followed by the second most important, and so on for subsequent objectives. We address this problem using Thompson Sampling (TS), a well-known Bayesian decision-making strategy. Although TS has been extensively studied in single-objective bandit settings, its effectiveness in lexicographic bandits remains an open question. To fill this gap, we propose two TS-based algorithms for lexicographic bandits: **(i)** For Gaussian reward distributions, we introduce a multi-armed bandit algorithm that achieves a *problem-dependent regret bound* of $O\bigl(\sum_{i\in[m]}\sum_{a:\Delta^i(a)>0}\frac{\log(mKT)}{\Delta^i(a)}\bigr)$, where $\Delta^i(a)$ denotes the suboptimality gap for objective $i\in[m]$ and arm $a\in[K]$, and $m$ is the number of objectives. **(ii)** For unknown reward distributions, we design a stochastic linear bandit algorithm with a *minimax regret bound* of $\widetilde{O}(d^{3/2}\sqrt{T})$, where $d$ is the dimension of the context vectors. These results highlight the adaptability of the TS strategy to the lexicographic bandit problem, offering efficient solutions under varying degrees of knowledge about the rewards. Experiments strongly support our theoretical findings.
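To make the setting concrete, below is a minimal illustrative sketch of how a lexicographic Thompson Sampling loop can be organized for the Gaussian case described in the abstract. It is not the authors' algorithm: the environment callable `env`, the unit-noise Gaussian posterior with a standard-normal prior, and the fixed tolerance `eps` used to filter candidate arms objective-by-objective are all simplifying assumptions made for illustration.

```python
import numpy as np


def lexicographic_thompson_sampling(env, K, m, T, eps=0.05, seed=0):
    """Illustrative lexicographic TS sketch for K arms and m prioritized objectives.

    env(a) is assumed to return an m-dimensional reward vector for arm a.
    Posteriors are independent Gaussians with unit observation noise and a
    standard-normal prior; the candidate-filtering tolerance `eps` is a
    simplification, not the paper's calibrated threshold.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(K)                 # number of pulls per arm
    sums = np.zeros((K, m))              # summed rewards per (arm, objective)
    history = []

    for _ in range(T):
        # Posterior over each (arm, objective) mean: N(sum/(n+1), 1/(n+1)).
        post_mean = sums / (counts[:, None] + 1.0)
        post_std = 1.0 / np.sqrt(counts[:, None] + 1.0)
        theta = rng.normal(post_mean, post_std)          # shape (K, m)

        # Lexicographic selection: keep arms whose sampled value is within
        # eps of the best on objective i before moving to objective i+1.
        candidates = np.arange(K)
        for i in range(m):
            best = theta[candidates, i].max()
            candidates = candidates[theta[candidates, i] >= best - eps]
            if len(candidates) == 1:
                break
        a = int(rng.choice(candidates))

        r = env(a)                        # m-dimensional reward vector
        counts[a] += 1
        sums[a] += r
        history.append((a, r))
    return history


if __name__ == "__main__":
    # Toy example: 3 arms, 2 objectives; arm 1 is lexicographically optimal.
    rng = np.random.default_rng(1)
    means = np.array([[0.9, 0.1], [0.9, 0.5], [0.4, 0.9]])
    env = lambda a: means[a] + rng.normal(0.0, 1.0, size=2)
    hist = lexicographic_thompson_sampling(env, K=3, m=2, T=1000)
```

The key design point the sketch tries to convey is the two-stage structure: posterior sampling is done per objective exactly as in single-objective TS, and the hierarchy enters only through the arm-selection rule, which prunes candidates objective by objective in priority order.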
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 16056