Stochastic Multi-Objective Multi-Armed Bandits: Regret Definition and Algorithm

Published: 25 Feb 2026, Last Modified: 25 Feb 2026
Accepted by TMLR (CC BY 4.0)
Abstract: Multi-armed bandit (MAB) problems are widely applied to online optimization tasks that require balancing exploration and exploitation. In practical scenarios, these tasks often involve multiple conflicting objectives, giving rise to multi-objective multi-armed bandits (MO-MAB). Existing MO-MAB approaches predominantly rely on the Pareto regret metric introduced in Drugan and Nowé (2013). However, this metric has notable limitations, particularly in accounting for all Pareto-optimal arms simultaneously. To address these challenges, we propose a novel and comprehensive regret metric that ensures balanced performance across conflicting objectives. Additionally, we introduce the concept of Efficient Pareto-Optimal arms, which are specifically designed for online optimization. Based on our new metric, we develop a two-phase MO-MAB algorithm that achieves sublinear regret for both Pareto-optimal and efficient Pareto-optimal arms.
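For readers unfamiliar with the Pareto regret metric the abstract builds on, here is a minimal Python sketch of the standard construction from Drugan and Nowé (2013). It is not code from the paper: the arm mean matrix `mu`, the helper names, and the grid search for the gap are illustrative assumptions. An arm is Pareto-optimal if no other arm weakly dominates it in every objective with a strict improvement in at least one, and an arm's Pareto suboptimality gap is the smallest uniform boost that makes it non-dominated.

```python
import numpy as np

def dominates(a, b):
    """True if reward vector a weakly dominates b with at least one strict improvement."""
    return np.all(a >= b) and np.any(a > b)

def pareto_front(mu):
    """Indices of Pareto-optimal arms among rows of mu (arms x objectives)."""
    return [i for i in range(len(mu))
            if not any(dominates(mu[j], mu[i]) for j in range(len(mu)) if j != i)]

def pareto_gap(mu, i, grid=np.linspace(0.0, 5.0, 501)):
    """Smallest eps on the grid such that mu[i] + eps (in every objective) is non-dominated.

    A crude numerical stand-in for the exact infimum in the Pareto regret definition.
    """
    for eps in grid:
        shifted = mu[i] + eps
        if not any(dominates(mu[j], shifted) for j in range(len(mu))):
            return eps
    return grid[-1]

# Toy example with two objectives: arms 0-2 are Pareto-optimal, arm 3 is dominated.
mu = np.array([[0.9, 0.2],
               [0.2, 0.9],
               [0.5, 0.5],
               [0.3, 0.3]])
print(pareto_front(mu))       # [0, 1, 2]
print(pareto_gap(mu, 3))      # 0.2: boosting arm 3 by 0.2 reaches the front
```

Cumulative Pareto regret over a horizon is then the sum of these gaps for the arms actually pulled; it is zero whenever any Pareto-optimal arm is played, which is the limitation the abstract points to, since it cannot distinguish an algorithm that covers all Pareto-optimal arms from one that exploits a single one.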
Certifications: J2C Certification
Beyond Pdf: zip
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In Revisions #2 and #3, we address the comments of the second and third reviewers; please see the files "response_to_reviewer #2.pdf" and "response_to_reviewer #3.pdf" for details.
Supplementary Material: pdf
Assigned Action Editor: ~Stefan_Magureanu1
Submission Number: 6238