\section{Related Work}
Traditional Multi-Armed Bandit (MAB) problems focus on maximizing a single cumulative reward \citep{slivkins2019introduction}, but many real-world applications involve multiple conflicting objectives. This has led to the development of Multi-Objective Multi-Armed Bandits (MO-MAB), where the goal is to maximize several objectives simultaneously by pulling Pareto-optimal arms. A key challenge in this area is the evaluation of MO-MAB algorithms, particularly the design of an appropriate regret metric.
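
For concreteness, Pareto optimality can be stated as follows. This is the standard definition written in illustrative notation of our own, which is an assumption and need not match the notation used in later sections: $D$ is the number of objectives and $\mu_i = (\mu_i^1, \dots, \mu_i^D)$ is the mean reward vector of arm $i$. Arm $i$ dominates arm $j$, written $\mu_i \succ \mu_j$, when
\[
\mu_i^d \ge \mu_j^d \ \mbox{for all } d \in \{1,\dots,D\}
\quad \mbox{and} \quad
\mu_i^d > \mu_j^d \ \mbox{for at least one } d .
\]
An arm is Pareto-optimal if no other arm dominates it, and the set of such arms forms the Pareto front. For example, with $D = 2$ and mean vectors $(0.8, 0.2)$, $(0.3, 0.7)$, and $(0.2, 0.1)$, the first two arms are Pareto-optimal, while the third is dominated by both.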

Drugan and Nowé \citeyear{drugan2013designing} introduced the first Pareto regret framework for MO-MAB, combining scalarization with an exploration-exploitation algorithm called \textit{Pareto-UCB1}, for which they established logarithmic regret bounds. Their empirical studies \citep{drugan2014pareto} validated the algorithm on multi-objective Bernoulli distributions. However, challenges remain in balancing performance across objectives and ensuring diversity along the Pareto front. These limitations of the Pareto regret metric and Pareto-UCB1 are discussed in Section~\ref{Drugan regret}. Before turning to these issues, we review related work and how recent studies have built on the Pareto regret framework.

Mahdavi et al. \citeyear{mahdavi2013stochastic} applied stochastic convex optimization techniques to MO-MAB, introducing scalarization functions (e.g., Chebyshev and linear scalarization) to address the complexity of optimizing multiple objectives under uncertainty. Yahyaa et al. \citeyear{yahyaa2014annealing} developed the \textit{Annealing-Pareto} algorithm, which adapts the annealing concept to MO-MAB, dynamically adjusting exploration intensity to improve the trade-off between exploration and exploitation. Yahyaa and Manderick \citeyear{yahyaa2015thompson} applied Thompson sampling to MO-MAB, selecting arms based on their posterior distributions to manage uncertainty.
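
As a rough illustration of the scalarization functions mentioned above, the forms commonly used in the MO-MAB literature can be written as follows. The notation is ours and is an assumption rather than a quotation from the cited works: $\mu = (\mu^1,\dots,\mu^D)$ is a mean reward vector, $w = (w^1,\dots,w^D)$ is a weight vector with $w^d \ge 0$ and $\sum_{d=1}^{D} w^d = 1$, and $z = (z^1,\dots,z^D)$ is a reference point dominated by every reward vector.
\[
f_w^{\mathrm{lin}}(\mu) = \sum_{d=1}^{D} w^d \mu^d ,
\qquad
f_w^{\mathrm{Cheb}}(\mu) = \min_{1 \le d \le D} w^d \left( \mu^d - z^d \right) .
\]
Both are maximized over arms. Linear scalarization can only recover Pareto-optimal arms on the convex hull of the front, whereas the Chebyshev form can in principle reach arms in non-convex regions by varying $w$.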

Busa-Fekete et al. \citeyear{busa2017multi} proposed a MAB framework using the generalized Gini index, allowing the algorithm to prioritize better trade-offs and balance multiple objectives more effectively. They provided both theoretical analysis and empirical results showing improvements in exploration and exploitation. Öner et al. \citeyear{oner2018combinatorial} studied the combinatorial MO-MAB problem, where multiple arms can be selected in each round. Their algorithm combines exploration-exploitation strategies with combinatorial optimization to address conflicting objectives and constraints. Lu et al. \citeyear{lu2019multi} developed a framework for multi-objective generalized linear bandits, applying regret minimization techniques and offering both theoretical guarantees and empirical validation.

Xu and Klabjan \citeyear{xu2023pareto} extended the Pareto regret framework to adversarial settings in MAB, presenting algorithms for both stochastic and adversarial scenarios. However, their focus on optimizing a single objective means that only one direction of the Pareto front is considered, which leads to suboptimal solutions in multi-objective contexts.

Turgay et al. \citeyear{turgay2018multi} integrated contextual information into MO-MAB, optimizing the decision-making process by considering contextual relationships between arms and objectives. H{\"u}y{\"u}k and Tekin \citeyear{huyuk2021multi} developed an algorithm incorporating lexicographic ordering (prioritizing objectives) and satisficing (ensuring objectives exceed thresholds). Their analysis offered insights into improving the efficiency of MO-MAB. Xue et al. \citeyear{xuemultiobjective} extended this work by generalizing lexicographically ordered MO-MAB from priority-based regret to general regret.

Cheng et al. \citeyear{cheng2024hierarchize} proposed an algorithm that hierarchizes Pareto dominance to improve regret minimization by prioritizing objectives and refining exploration strategies. Ararat and Tekin \citeyear{ararat2023vector} introduced a framework based on a polyhedral ordering cone to define directional preferences among vector rewards, replacing traditional Pareto dominance, and established gap-dependent and worst-case sample complexity bounds for it. Crépon et al. \citeyear{garivier2024sequential} studied the problem of learning and identifying Pareto-optimal arms in the stochastic multi-armed bandit framework, and presented an algorithm with guarantees on the suboptimality of its output relative to the true Pareto front. Building upon this, Karag{\"o}zl{\"u} et al. \citeyear{karagozlu2024learning} addressed learning the Pareto-optimal set under incomplete preferences in a pure exploration setting, providing an algorithm to identify the Pareto-optimal arms.

In general, approaches that aim to minimize Pareto regret, as defined by \citet{drugan2013designing}, either directly (e.g., \citealp{xu2023pareto}) or indirectly (e.g., \citealp{drugan2014pareto}), often focus on optimizing along a single objective direction. While this can yield non-dominated solutions in one dimension, it fails to capture the entire Pareto front. To address these limitations, we propose a new regret metric that comprehensively evaluates performance across multiple objectives.