\section{Multi-objective Multi-Armed Bandits}
%
\subsection{Domination and Efficient Pareto Optimal Arms}
The main objective of solving multi-objective optimization problems is to identify a set of solutions that are close to the Pareto-Optimal (PO) solutions, while simultaneously ensuring a high degree of diversity within the objective space. This involves two orthogonal goals: first, to approximate the PO set as closely as possible, guaranteeing that the solutions reflect the best trade-offs among the competing objectives; and second, to maintain sufficient diversity among these solutions, covering various regions of the objective space. Such diversity is crucial for effectively addressing different preferences among decision-makers \citep{coello2007evolutionary}. These goals, and particularly the first one, can be extended naturally to online optimization and stochastic multi-objective multi-armed bandits (MO-MAB).
Given an instance of an MO-MAB problem with \( n \) arms, denoted by \(\mathcal{A}=\{a_1,a_2,\dots,a_n \}\), and $D$ (maximization) objectives \( \mathcal{F}=\{f_1, f_2, \dots, f_D\} \), let the reward vector of arm \( a \in \mathcal{A}\) at time (or round) \( t \) be denoted as:
\[
\mathbf{r}^t(a) = (r_{1}^t(a), r_{2}^t(a), \dots, r_{D}^t(a)),
\]
where \( r_{d}^t(a) \) is the reward of arm \( a \) in the \( d \)-th objective (or dimension) at time \( t \). In the stochastic setting, the expected cumulative reward of arm \( a \) over \( T \) rounds is
\[
\mathbb{E} \left[ \sum_{t=1}^T \mathbf{r}^t(a) \right] = (T \mu_{1}(a), T \mu_{2}(a), \dots, T\mu_{D}(a)),
\]
where $\mu_d(a)$ is the expected reward of arm $a$ in dimension $d$.

In the multi-objective context, an arm \( a \) is said to \textit{dominate} another arm \( b \) (denoted as \( a \succ b \)) if and only if \( a \) is at least as good as \( b \) in all objectives, and is strictly better in at least one objective. Formally, for two reward vectors \( \mathbf{r}^t(a) = (r_{1}^t(a), r_{2}^t(a), \dots, r_{D}^t(a)) \) and \( \mathbf{r}^t(b) = (r_{1}^t(b), r_{2}^t(b), \dots, r_{D}^t(b)) \), arm \( a \) dominates \( b \) if:

\begin{itemize}
    \item \(
r_{d}^t(a) \geq r_{d}^t(b) \quad \text{for all } d \in [D], \quad \text{and} 
\)

\item \(
r_{d}^t(a) > r_{d}^t(b) \quad \text{for at least one } d\in [D],
\)
\end{itemize}
where $[D]$ denotes the first $D$  positive integers, $[D]=\{1, 2, \dots, D\}$. Similarly, \( a \) is said to \textit{weakly dominate} \( b \) (denoted as \( a \succeq b \)) if and only if \( a \) is at least as good as \( b \) in all objectives, i.e., \(
r_{d}^t(a) \geq r_{d}^t(b) \quad \text{for all } d \in [D]\).
An arm \( a \in \mathcal{A} \) is \textit{non-dominated} if there does not exist any other arm \( b \in  \mathcal{A}\) such that \( b \) dominates \( a \) (i.e., \(\forall b \in  \mathcal{A}: b \nsucc a\)). 
Finally, the set of all non-dominated arms is called the \textit{Pareto-optimal} (PO) set and is denoted by \( \mathcal{A}^* \). % which includes all arms that are not dominated by any other arms in \( \mathcal{A}\). %These arms represent the best trade-offs across objectives, as improving one objective further would necessarily degrade performance in at least one other objective.
 The image of the PO arms in the objective space is called the \textit{Pareto front}.
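
To illustrate these definitions, consider a small hypothetical instance (the numerical values are chosen here purely for illustration) with $D=2$ and four arms whose reward vectors at some round $t$ are
\[
\mathbf{r}^t(a_1) = (1,0), \quad \mathbf{r}^t(a_2) = (0,1), \quad \mathbf{r}^t(a_3) = (0.2,0.2), \quad \mathbf{r}^t(a_4) = (0.5,0).
\]
Arm $a_1$ dominates $a_4$, since $1 > 0.5$ in the first objective and $0 \geq 0$ in the second. Every other pair is incomparable; for example, $a_3$ is better than $a_4$ in the second objective ($0.2 > 0$) but worse in the first ($0.2 < 0.5$). Hence, in this instance, \( \mathcal{A}^* = \{a_1, a_2, a_3\} \).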



In online optimization and MAB, the iterative nature of the process shifts the objective from single-instance optimization to maximizing cumulative rewards over a given number of rounds. Unlike classical optimization, where a single decision is made, the decision-maker in this context selects (or pulls) from available choices iteratively over \( T \) rounds. Therefore, we introduce the concept of the \textit{Efficient Pareto-Optimal} (EPO) set, which represents an efficient subset of the PO set, specifically tailored to maximize cumulative performance in an iterative decision-making setting. To this end, we need to extend the definition of domination to a subset of arms.

Let \(S=\left ( a^1,a^2,\dots ,a^T \right) \) be the sequence of arms selected by a decision-maker (arm \( a^t \) is pulled at round \( t \in [T] \)). The cumulative reward vector of $S$ is then given by:

\[
\bar{\mathbf{r}}(S) = \sum_{t=1}^T \mathbf{r}^t(a^t).
\]

We can restrict the sequence of selected arms to any subset of arms. Let \( X \subseteq \mathcal{A}^* \) and \( Y \subseteq \mathcal{A}^* \) be two different subsets of PO arms, and \( S_X \) and \( S_Y \) be two sequences of arms selected (only) from $X$ and $Y$, respectively. 
We say \( X \) weakly dominates \( Y \), denoted \( X \succeq Y \), if for any sequence \( S_Y \), there exists a sequence \( S_X \) such that the cumulative reward vector of \( S_X \) weakly dominates that of \( S_Y \).
Formally,

\[
X \succeq Y \iff \forall S_Y \in Y, \, \exists S_X \in X \text{ such that } \bar{\mathbf{r}}(S_X) \succeq \bar{\mathbf{r}}(S_Y).
\]

Since the domination relation between sets involves convex combinations of arms with real-valued weights, while arm selections in each round are discrete (integer-valued), the domination relationship requires a sufficiently large number of rounds to achieve the desired weight approximation through integer pulls. More precisely, for the domination to hold, there must exist a threshold \(T_0\) beyond which the integer constraints allow for a sufficiently accurate rational approximation of the required convex weights:

\begin{align*}
X \succeq Y \iff{} & \forall S_Y \in Y, \, \exists S_X \in X \colon \\
& \exists T_0 \in \mathbb{N}, \, \forall T \geq T_0, \; \bar{\mathbf{r}}(S_X) \succeq \bar{\mathbf{r}}(S_Y),
\end{align*}

where \(T_0\) depends on the precision required to approximate the optimal convex weights through integer-valued arm pulls.
In terms of the domination relation between sets of individual arms, two scenarios are possible: either one set dominates the other, or the sets are non-dominated with respect to each other. %Specifically, if \( X \cap Y \neq \emptyset \), then \( X \) and \( Y \) are non-dominated with respect to each other. 
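Now, we can define the EPO set \(\mathcal{EA}^*\) as

\begin{align}
\mathcal{EA}^* = \left\{ a^* \in \mathcal{A}^* \mid \nexists S \subseteq \mathcal{A}^* \setminus \{a^*\}, \ \bar{\mathbf{r}}(S) \succeq \bar{\mathbf{r}}(\{a^*\}) \right\},
\end{align}
where, with a slight abuse of notation, \( S \) ranges over sequences of arms drawn from \( \mathcal{A}^* \setminus \{a^*\} \) and \( \bar{\mathbf{r}}(\{a^*\}) \) is the cumulative reward vector obtained by pulling only \( a^* \).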



As an example, consider a simple instance of an MO-MAB problem with three PO arms $a$, $b$ and $c$ with reward vectors \( \mathbf{r}_a = (1,0) \), \( \mathbf{r}_b = (0,1) \), and \( \mathbf{r}_c = (\epsilon, \epsilon) \) for some small positive value \( \epsilon \). In this scenario, when a decision-maker alternates between arms \( a \) and \( b \), the average reward vector achieved is \( \left( \frac{1}{2}, \frac{1}{2} \right) \) (the argument for the cumulative reward is analogous). However, if the decision-maker pulls only arm \( c \), the resulting average reward vector is \( \left( \epsilon, \epsilon \right) \), which is strictly smaller in both objectives whenever \( \epsilon < \frac{1}{2} \). In this sense, \( \{a, b\} \succ \{c\} \).
%in terms of average reward and under the condition that the arms are pulled \textit{fairly}. Similarly, \( \{a, b\} \) also dominates \( \{a, c\} \) and \( \{b, c\} \).
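
As a rough numerical illustration of the threshold \( T_0 \) in this example, take \( \epsilon = 0.45 \) (a value assumed here only for concreteness) and work with expected rewards. Pulling \( c \) in all \( T \) rounds yields the cumulative vector \( (0.45\,T, 0.45\,T) \), whereas pulling \( a \) and \( b \) for \( n_a \) and \( n_b \) rounds yields \( (n_a, n_b) \). Weak domination therefore requires integers \( n_a, n_b \) with
\[
n_a \geq 0.45\,T, \qquad n_b \geq 0.45\,T, \qquad n_a + n_b = T,
\]
which is possible if and only if \( 2 \lceil 0.45\,T \rceil \leq T \). For even \( T \), the choice \( n_a = n_b = T/2 \) always works. For odd \( T \), the most balanced split is \( n_a = (T-1)/2 \), \( n_b = (T+1)/2 \), and \( (T-1)/2 \geq 0.45\,T \) holds exactly when \( T \geq 10 \). For instance, \( T = 9 \) fails because \( \lceil 4.05 \rceil = 5 \) and \( 2 \cdot 5 > 9 \), while \( T = 11 \) succeeds with \( (n_a, n_b) = (5, 6) \). Hence \( T_0 = 10 \) in this instance. By the same argument, for \( \mathbf{r}_c = (\epsilon, \epsilon) \) with \( \epsilon < \frac{1}{2} \), the value \( T_0 = \lceil 1/(1-2\epsilon) \rceil \) suffices.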

\begin{comment}
    


\textbf{Here}, \textit{dominates} means that the subset \( \{a, b\} \) always**** results in a higher average reward compared to any other possible subset of PO solutions. In other words, for any priority weight \( w = (w_1, w_2) \) on the objectives, where \( w_1 + w_2 = 1 \) and \( w_1, w_2 \geq 0 \), the linear combination of the average rewards for \( \{a, b\} \) is greater than that for any other subset. Formally,

\begin{align*}
\bar r_w(\{a,b\}) = w_1 \bar r_1(\{a,b\}) +  w_1 \bar r_2 (\{a,b\}) \geq \\
\bar r_w(\{a,b,c\}) = w_1 \bar r_1(\{a,b,c\}) +  w_1 \bar r_2 (\{a,b,c\}),
\end{align*}

where \( \bar{r}_i(s) \) represents the average reward in the \( i \)-th objective when all arms in the set \( s \) are pulled. Since this inequality holds for any priority weight configuration that a decision-maker may consider, \( \{a, b\} \) always dominates \( \{a, b, c\} \) or any subset that includes arm \( c \). Thus, \( \{a, b\} \) is an \textit{efficient PO set}.
This dominance relationship holds as long as \( \epsilon \leq \frac{1}{2} \). In this example, the Pareto front is non-convex for \( \epsilon \leq \frac{1}{2} \) and convex when \( \frac{1}{2} < \epsilon \leq 1 \).

The concept of the efficient PO set is particularly relevant due to the iterative nature of online optimization. In an offline setting, where a decision-maker selects only one arm in a single shot, a solution like \( c \) can still serve as a trade-off option and might be chosen by the decision-maker. However, in iterative settings, subsets that maximize cumulative rewards over $T$ rounds, such as \( \{a, b\} \) in this example, become the more effective choices.
\end{comment}

%The above simple example has a symmetry on the position of the three PO arms $a$, $b$ and $c$. So, a fairly (e.g., each is pulled \(\frac{T}{2}\) times) pulling of arms results in \textit{always} set $(\{a,b\})$ dominates arm $c$. Now, what if the arms in set $(\{a,b\})$ are not pulled, fairly? Assume $a$ is pulled $n_a$ times and $b$ is pulled $n_b$ times, where $n_a + n_b = T$. In this case, the average reward vector will be a linear combination of the reward vector of $a$ and $b$. As shown in Figure \ref{EfficientPareto1}, this average vector dominates any other arm which lies below and left side of it (see the gray region). However, changing $n_a$ and $n_b$ can cover all the region below the PO front, but for a fixed $n_a$ and $n_b$, there are two rectangular regions not dominated by the average vector $n_a a + n_b b$. On the other hand, for any arm $c$, there is a range of $n_a$ and $n_b$ that yield the average vector $n_a a + n_b b$ dominates $c$ (see Figure \ref{EfficientPareto2}). Note that. if $c$ lies above the Pareto front line between $a$ and $b$, no linear combination of $a$ and $b$ can dominate it. Now, let generalize these concept and formally define Efficient PO set.

 %%%
In this example, the arms are positioned symmetrically. Because of this symmetry, if arms \( a \) and \( b \) are pulled equally often, the cumulative reward vector of \( \{a, b\} \) \textit{always} dominates that of arm \( c \).
In the general case, suppose arm \( a \) is pulled \( n_a \) times and arm \( b \) is pulled \( n_b \) times, where \( n_a + n_b = T \). Then the resulting average reward vector is a convex combination of the reward vectors of \( a \) and \( b \) with weights \(w_a = \frac{n_a}{T}\) and \(w_b = \frac{n_b}{T}\), respectively. As illustrated in the left panel of Figure~\ref{EfficientPareto12}, this average vector dominates any arm whose reward vector lies within the gray region below and to the left of it. Although varying \( n_a \) and \( n_b \) can cover the entire area below the line segment between \( a \) and \( b \) on the Pareto front, for any fixed combination of \( n_a \) and \( n_b \) there remain two triangular regions that are not dominated by it.
Conversely, for any arm \( c \) below the segment connecting \( a \) and \( b \) on the Pareto front, there exists a range of values for \( n_a \) and \( n_b \) such that the average vector \( \frac{n_a}{T} a + \frac{n_b}{T} b \) dominates \( c \) (illustrated in the right panel of Figure~\ref{EfficientPareto12}). 

\begin{figure}[h]
    \centering
    \subfloat[]{
        \begin{tikzpicture}[scale=1.8]
            \fill[gray!20] (0.6,0.4) -- (0.6,0) -- (0,0) -- (0,0.4) -- cycle;

            \draw[->] (-0.2,0) -- (1.3,0) node[right] {$f_1$};
            \draw[->] (0,-0.2) -- (0,1.3) node[above] {$f_2$};

            \draw[gray, thin] (0.6,0.4) -- (0.6,0);
            \draw[gray] (0.6,0.4) -- (0,0.4);

            \draw[blue, thick] (1,0) -- (0,1);

            \filldraw[black] (1,0) circle (2pt) node[below right] {$a$};
            \filldraw[blue] (0.6,0.4) circle (1.5pt) node[above right] {$w_a a + w_b b$};
            \filldraw[black] (0,1) circle (2pt) node[above left] {$b$};
        \end{tikzpicture}
        \label{EfficientPareto1}
    }
    \hspace{0.5cm}
    \subfloat[]{
        \begin{tikzpicture}[scale=1.8]
            \fill[gray!20] (0.2,0.2) -- (0.2,1.1) -- (1.1,0.2) -- cycle;

            \draw[->] (-0.2,0) -- (1.5,0) node[right] {$f_1$};
            \draw[->] (0,-0.2) -- (0,1.5) node[above] {$f_2$};

            \draw[blue, thick] (1,0) -- (0.2,0.2);
            \draw[blue, thick] (0.2,0.2) -- (0,1);

            \draw[gray, thin] (0.2,0.2) -- (0.2,1.1);
            \draw[gray] (0.2,0.2) -- (1.1,0.2);

            \draw[black, dashed] (1,0) -- (0,1);
            \draw[black, very thick] (0.2,0.8) -- (0.8,0.2);

            \filldraw[black] (1,0) circle (2pt) node[below right] {$a$};
            \filldraw[black] (0.2,0.2) circle (2pt) node[below left] {$c$};
            \filldraw[black] (0,1) circle (2pt) node[above left] {$b$};
        \end{tikzpicture}
        \label{EfficientPareto2}
    }
    \caption{Dominance regions in MO-MAB. (a) Dominating region of a linear combination of two PO arms with rewards $(1,0)$ and $(0,1)$. (b) An MO-MAB instance with three arms: $a=(1,0)$, $b=(0,1)$, and $c=(0.2,0.2)$. $c$ is dominated by linear combinations of $a$ and $b$ in the gray region.}
    \label{EfficientPareto12}
\end{figure}
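
For the instance in the right panel of Figure~\ref{EfficientPareto12}, this range can be written out explicitly; the following is a short worked computation using the values from the caption. The average vector is \( w_a (1,0) + w_b (0,1) = (w_a, w_b) \) with \( w_b = 1 - w_a \), and it weakly dominates \( c = (0.2, 0.2) \) if and only if
\[
w_a \geq 0.2 \quad \text{and} \quad 1 - w_a \geq 0.2 \quad \Longleftrightarrow \quad 0.2 \leq w_a \leq 0.8 .
\]
In terms of pulls, this reads \( 0.2\,T \leq n_a \leq 0.8\,T \); for example, with \( T = 10 \) any \( n_a \in \{2, \dots, 8\} \) is admissible. The endpoints correspond to the average vectors \( (0.2, 0.8) \) and \( (0.8, 0.2) \), which appear to be the endpoints of the thick segment drawn in the right panel.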





Note that, in this example, \( \{a, b\} \succ \{c\} \) holds as long as \( c \) lies below the line segment connecting \( a \) and \( b \). If \( c \) lies above it, no pair \( n_a \), \( n_b \) can dominate \( c \); that is, no convex combination of $a$ and $b$ dominates \(c\). In general, an arm $a$ at a given time $t$ is weakly dominated by a convex combination of \(a_1,a_2,\dots, a_l\) if and only if there exists \(\alpha=(\alpha_1,\alpha_2,\dots,\alpha_l)\) with \( \alpha_i \geq 0 \) for all \( i = 1, \dots, l \) such that 
\(
\sum_{i=1}^l \alpha_i r^t(a_i) \succeq r^t(a)
\), where 
\(
\sum_{i=1}^l \alpha_i = 1
\).


Here, $r^t(a_i)$ is replaced by its expectation $\mu(a_i)$ in the stochastic MO-MAB. Thus, the set \( \mathcal{EA}^* \) consists of those PO arms that lie in convex position on the Pareto front. Therefore, after finding $\mathcal{A}^*$, the arms of $\mathcal{EA}^*$ can be found in polynomial time by solving, for each candidate arm $a$, a linear program with $l = n^*-1$ variables and $D$ domination constraints (in addition to the normalization and non-negativity of the weights), where $n^* = |\mathcal{A}^*|$. In particular, we do not need to compute the convex hull of the arms, whose worst-case complexity grows exponentially with the dimension $D$. If this linear program has no feasible solution \(\alpha=(\alpha_1,\alpha_2,\dots,\alpha_l)\), arm $a$ belongs to $\mathcal{EA}^*$; otherwise, it returns at least one vector $\alpha$ for which \(\sum_{i=1}^l \alpha_i r^t(a_i) \) weakly dominates $r^t(a)$.
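
A minimal sketch of this feasibility problem, written under the assumption that the expected rewards \( \mu_d \) are known (in practice they would be replaced by estimates), is the following. For a candidate arm \( a \in \mathcal{A}^* \) with \( \mathcal{A}^* \setminus \{a\} = \{a_1, \dots, a_l\} \) and \( l = n^* - 1 \):
\begin{align*}
\text{find} \quad & \alpha = (\alpha_1, \dots, \alpha_l) \in \mathbb{R}^l \\
\text{such that} \quad & \sum_{i=1}^{l} \alpha_i \, \mu_d(a_i) \geq \mu_d(a), \quad \forall d \in [D], \\
& \sum_{i=1}^{l} \alpha_i = 1, \qquad \alpha_i \geq 0, \quad i = 1, \dots, l .
\end{align*}
For the three-arm example with \( \mathbf{r}_a = (1,0) \), \( \mathbf{r}_b = (0,1) \), and \( \mathbf{r}_c = (\epsilon, \epsilon) \), testing arm \( c \) gives the constraints \( \alpha_a \geq \epsilon \), \( \alpha_b \geq \epsilon \), and \( \alpha_a + \alpha_b = 1 \), which are feasible if and only if \( \epsilon \leq \frac{1}{2} \) (e.g., \( \alpha = (\frac{1}{2}, \frac{1}{2}) \)); in that case \( c \notin \mathcal{EA}^* \). Testing arm \( a \) gives, in the first objective, \( 0 \cdot \alpha_b + \epsilon \, \alpha_c \geq 1 \), which cannot hold with \( \alpha_c \leq 1 \) when \( \epsilon < 1 \); the program is infeasible, so \( a \in \mathcal{EA}^* \), and by symmetry \( b \in \mathcal{EA}^* \).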



\textbf{A practical example.} Multi-objective Vehicle Routing Problems (VRPs) constitute a major operational challenge for large-scale logistics providers such as DHL, Amazon, and national postal services, where 1,000--10,000 daily delivery tasks must be optimized under competing objectives with unknown priority weights. These include minimizing unassigned tasks, travel distance and duration, territorial overlap among routes, and soft-constraint violations (e.g., time-window penalties), while simultaneously maximizing route compactness; these objectives span incommensurable scales (monetary costs versus spatial quality concepts). At such scales, exact optimization is computationally prohibitive, rendering metaheuristic methods the only viable solution approach.
Contemporary metaheuristics primarily employ ruin-and-recreate operators that partially dismantle incumbent solutions and reconstruct them using insertion heuristics. Modern systems combine diverse ruin strategies (e.g., cluster removal, worst-job removal, neighbor removal) with multiple recreate strategies (e.g., regret-based, cheapest, or gap-based insertion), producing more than sixty operator configurations whose performance is highly problem-dependent and shaped by spatial structure, temporal constraints, and regulatory requirements \citep{vidal2012heuristics}. Each configuration can be viewed as an arm in a multi-armed bandit framework, yielding multi-dimensional rewards across objectives. Unlike experimental testing phases, each iteration directly impacts operational performance: poor exploration translates to immediate costs and service degradation. The solver must therefore maintain diverse efficient strategies while minimizing cumulative regret over thousands of iterations.

Conventional Pareto-based multi-armed bandit formulations are insufficient in this context because they treat all non-dominated arms equivalently. Consequently, a configuration offering negligible (e.g., 0.01\%) uniform improvements is regarded as equally valuable as one delivering substantial progress (e.g., a 0.5\% reduction in distance) despite secondary objective trade-offs. Yet cumulative solution quality over generations depends on achieving meaningful aggregated gains. Two complementary Pareto-efficient strategies that each produce substantial improvements in distinct objectives can jointly yield far greater cumulative progress than uniformly weak Pareto strategies. The notion of efficient Pareto-optimal arms directly addresses this limitation by distinguishing genuinely promising configurations from those providing only trivial non-dominated gains.

\begin{comment}
    

This example motivates the concept of an \textit{Efficient PO set}, which we define as the subset of PO arms that maximize cumulative rewards for some weighted linear combination of objectives over time. Formally, we define the Efficient PO set \( \mathcal{EA}^* \subseteq \mathcal{A}^* \) as follows:

\[
\mathcal{EA}^* = \left\{ a^* \in \mathcal{A}^* : \forall s \subseteq \mathcal{A}^*, \exists w \text{ such that } \bar{r}_w(\{a^*\}) > \bar{r}_w(s) \right\}
\]

where \( w = (w_1, w_2, \dots, w_m) \) is a priority weight vector for the objectives, with \( \sum_{i=1}^m w_i = 1 \) and \( w_i \geq 0 \) for all \( i \in [D] \). The set \( \mathcal{EA}^* \) thus represents those PO arms that can be optimal for some combination of objective weights, capturing the arms that lie on the convex hull of the Pareto front. In contrast, any PO arm that lies in a non-convex region of the Pareto front will be dominated by some linear combination of its neighboring arms. 


Therefore, we define the \textit{efficient PO set} \( \mathcal{EA}^* \subseteq \mathcal{A}^* \) as the subset of PO arms that remains optimal under some weighted linear combination of objectives, given any priority weights. Formally, this is the set of Pareto arms that lie on the convex hull of the Pareto front. Specifically, any PO arm that lies in a non-convex region of the Pareto front is dominated by a linear combination of other arms (at least by the linear combination of its neighboring arms on the Pareto front). Thus,

\[
\mathcal{EA}^* = \left\{ a^* \in \mathcal{A}^* : \forall s \subseteq \mathcal{A}^*, \exists w \text{ such that } \bar{r}_w(\{a^*\}) > \bar{r}_w(s) \right\},
\]

where \( w = (w_1, w_2, \dots, w_m) \) is a priority weight vector on the objectives, with \( \sum_{i=1}^m w_i = 1 \) and \( w_i \geq 0 \) for all \( i \in [D] \).

\end{comment}








\subsection{Definition of Regret in MO-MAB}
%
In MO-MAB problems, since there is more than one PO arm, the definition of \textit{regret} is more complex. Most studies over the last decade have relied on the definition introduced in \citet{drugan2013designing}, which is discussed in Section \ref{Drugan regret}. In this paper, we propose a more comprehensive definition of regret that is better suited to MO-MAB. Our definition is formulated to be applicable to both PO arms and EPO arms.


An algorithm $Alg$ that chooses an arm \( a^t \) at time $t =1,2,\dots, T$ has regret \( R\) compared to an arm \( a^* \in \mathcal{A}^* \) (or alternatively an EPO arm \( a^* \in \mathcal{EA}^* \)) if %for all \( d \in [D]\), the following holds:
\[
\mathbb{E} \left[ \sum_{t=1}^T r_{d}^t(a^*) - \sum_{t=1}^T r_d^t(a^t) \right] \leq R, \quad \forall d \in [D].
\]
In the stochastic setting, we can replace \( \mathbb{E} \left[ \sum_{t=1}^T r_{d}^t(a^*) \right] \) with \(T \mu_d(a^*)\). We denote this by \(R_{\text{Alg},a^*}(T) \leq {R}\). %, where \( \mu_d(a^*) \) is the mean reward of arm $a^*$ in objective $f_k$.
%For example, it is sufficient to find a solution (pulling an arm $a^t$, in $t=1,2,\ldots,T$) that has a regret $R_d$ comparing to \(a^*)\) in just one of the objectives, say $d$. So, there are no guarantees that such a solution is better than \(a^*)\) in terms of the other objectives. However, here in the above definition, we emphasize that the solution has at most regret $R_d$ comparing to \(a^*)\) in all the objectives, $d = 1, 2, \dots, D$.
Notably, the above definition of regret is stronger than the definition of ``non-domination'' used in \citet{drugan2013designing}. Here, $R_{\text{Alg},a^*}(T) \leq R$ guarantees that the outcome is not only non-dominated by $a^*$ but also weakly dominates it once every objective is increased by the regret value. Specifically, the inequality must hold across all objectives, rather than in just one.
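
To see what this requirement demands, consider a hypothetical stochastic instance with two PO arms with means \( \mu(a) = (1,0) \) and \( \mu(b) = (0,1) \), and an algorithm that pulls \( a \) in every round (both the instance and the algorithm are assumptions made here for illustration). Compared to \( a^* = a \), the expected difference is zero in both objectives. Compared to \( a^* = b \), however,
\[
T\mu_1(b) - T\mu_1(a) = -T, \qquad T\mu_2(b) - T\mu_2(a) = T,
\]
so the bound can only hold with \( R \geq T \). Although \( a \) is itself PO, and hence not dominated by \( b \), its regret with respect to \( b \) grows linearly in the second objective. This suggests that no algorithm pulling a single arm per round can keep the regret small with respect to every PO arm simultaneously.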

Now, let us extend $R_{\text{Alg},a^*}(T)$ to the case where \( Alg \) selects a set of arms \( A^t \) at each round \( t = 1, 2, \ldots, T \) and we compare it to all PO arms. In this case, we say that \( Alg \) is \( R \)-regret, and denote it by \( R_{\text{Alg}}(T) \leq R \), if the following regret properties are met:


\subsection*{1. Coverage-Regret}  
For any PO arm \( a^* \in \mathcal{A}^* \) (or alternatively EPO arm \( a^* \in \mathcal{EA}^* \)), there exists an arm \( a^t \in A^t \) such that:

\begin{align}
 \mathbb{E} \left[ \sum_{t=1}^T r_{d}^t(a^*) - \sum_{t=1}^T r_d^t(a^t) \right] \leq R, \quad \forall d \in [D].
\end{align}
or, equivalently in the stochastic setting,
\begin{align}
 T \mu_d(a^*) - \mathbb{E} \left[ \sum_{t=1}^T r_d^t(a^t) \right] \leq R, \quad \forall d \in [D].
\end{align}

This property ensures that the algorithm can effectively approximate any PO (EPO) arm in terms of the reward vectors \( r_d \) across all dimensions \( d \) and over \( T \) iterations.

\begin{comment}

\subsection*{2. Convergence Property}  
The set \( A^T \) converges to the set of PO arms \( \mathcal{A}^* \) (or \( \mathcal{EA}^* \)) as \( T \to \infty \).

This property ensures that the probability of \( Alg \) identifies and pulls PO (or EPO) arms increases as \( T \to \infty \).

\end{comment}

\subsection*{2. Cumulative Adjustment-Regret}  
The cumulative minimal adjustment required for the arms in \( A^t \) to weakly dominate some PO arm (or EPO arm) satisfies:
\[
\mathbb{E} \left[ \sum_{t=1}^T \sum_{a^t \in A^t} \min \{ \epsilon \geq 0 \mid \exists a^* \in \mathcal{A}^* : a^t + \epsilon \succeq a^* \} \right] \leq |\mathcal{A}^*| R.
\]
Alternatively, we can replace $\mathcal{A}^*$ with $\mathcal{EA}^*$.

This property bounds the cumulative regret associated with the adjustments needed for the arms selected by the algorithm to weakly dominate the optimal set \( \mathcal{A}^* \) (or \( \mathcal{EA}^* \)).
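
For a single selected arm, the inner minimum admits a closed form. The following sketch assumes that \( a^t + \epsilon \) denotes adding \( \epsilon \) to every component of the expected reward vector of \( a^t \):
\[
\epsilon(a^t) = \min_{a^* \in \mathcal{A}^*} \, \max \Big\{ 0, \ \max_{d \in [D]} \big( \mu_d(a^*) - \mu_d(a^t) \big) \Big\}.
\]
As an example with hypothetical values, let \( \mu(a^t) = (0.5, 0.3) \) and suppose the PO arms have means \( (1, 0.4) \) and \( (0.4, 1) \). With respect to \( (1, 0.4) \), the required adjustment is \( \max\{0, 0.5, 0.1\} = 0.5 \); with respect to \( (0.4, 1) \), it is \( \max\{0, -0.1, 0.7\} = 0.7 \). Hence \( \epsilon(a^t) = 0.5 \), and indeed \( (0.5 + 0.5, \, 0.3 + 0.5) = (1, 0.8) \succeq (1, 0.4) \). A selected arm that is itself PO contributes \( \epsilon(a^t) = 0 \) to the sum.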

\subsection*{Regret Interpretation}
%
We define two complementary regret properties that jointly characterize algorithm performance: coverage-regret ensures completeness by requiring that all PO (or EPO) arms are well-approximated, while cumulative adjustment-regret enforces minimality by penalizing unnecessarily large sets of selected arms. The two properties thus measure different aspects of solution quality.
The coverage-regret property evaluates how well the algorithm approximates the PO arms in the Pareto front. Specifically, it measures the discrepancy (in terms of rewards across all dimensions \( d \)) between a PO arm \( a^* \in \mathcal{A}^* \) (or \( a^* \in \mathcal{EA}^* \)) and the nearest arm \( a^t \in A^t \). For each iteration \( t \), the algorithm selects an arm \( a^t \in A^t \) that minimizes:
\[
\epsilon = \max_{d \in [D]} \left( r_d^t(a^*) - r_d^t(a^t) \right).
\]
Thus, the selected arm \( a^t \) ensures that the adjusted (virtual) arm \( a^t + \epsilon \) weakly dominates \( a^* \). The total expected regret over \( T \) iterations is bounded by \( R \), indicating the quality of the coverage provided by the algorithm for the Pareto front.
As a complement, the cumulative adjustment-regret property evaluates the inverse relationship, quantifying the cumulative minimal adjustment required for the arms \( A^t \) to weakly dominate the optimal set \( \mathcal{A}^* \) (or \( \mathcal{EA}^* \)). While a larger set \( A^t \) reduces the coverage-regret, it increases the cumulative adjustment-regret due to the additional adjustments needed to align the selected arms with the optimal arms. This trade-off implies that \( A^t \) should ideally approximate the Pareto front with a size proportional to \( |\mathcal{A}^*| \), ensuring bounded cumulative adjustment-regret.

%\subsection*{Trade-Off Between Coverage and cumulative adjustment-regret}  
The interplay between coverage-regret and cumulative adjustment-regret highlights the balance required in selecting a representative set \( A^t \). Increasing the size of \( A^t \) improves coverage (reducing the coverage-regret) but may result in a higher cumulative adjustment-regret. To optimize performance, \( A^t \) should approximate \( \mathcal{A}^* \) as closely as possible while maintaining a manageable size, leading to a bounded cumulative adjustment-regret of \( |\mathcal{A}^*| R \).
%\subsection*{Relation to Single-Objective Multi-Armed Bandits }  
For single-objective MAB problems, $D=1$, the above definitions reduce to the classical notion of regret: both coverage-regret and cumulative adjustment-regret collapse to the standard formulation $\mathbb{E} \left[ \sum_{t=1}^T r^t(a^*) - \sum_{t=1}^T r^t(a^t) \right] \leq R$, where $a^*$ is the unique optimal arm. Our regret framework is comprehensive by design, requiring algorithms to compete against $\mathcal{A}^*$ across all dimensions $d=1,\ldots, D$ simultaneously, rather than merely excelling in a subset of objectives. 
This stringent requirement distinguishes our approach from previous formulations, where superiority in just one dimension suffices for non-domination. While the bound appears dimension-independent in its functional form, the underlying algorithmic complexity scales with $D$ due to the requirement for comprehensive coverage. Furthermore, in multi-objective settings, the regret value in the coverage-regret property is non-negative when applied to  $\mathcal{EA}^*$ but may be negative for $\mathcal{A}^*$ due to the non-convexity of the Pareto front. This behavior arises because the iterative decision-making process in MAB allows selecting subsets of arms that collectively achieve superior rewards compared to individual non-convex PO arms.
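
The possibility of a negative value can be seen in the three-arm example with \( \mathbf{r}_a = (1,0) \), \( \mathbf{r}_b = (0,1) \), and \( \mathbf{r}_c = (\epsilon, \epsilon) \), where \( \epsilon < \frac{1}{2} \), so that \( c \in \mathcal{A}^* \setminus \mathcal{EA}^* \). As a sketch, assume that \( T \) is even and that the algorithm alternates between \( A^t = \{a\} \) and \( A^t = \{b\} \) (an illustrative choice). Compared to \( a^* = c \), for each objective \( d \in \{1, 2\} \),
\[
T \mu_d(c) - \mathbb{E} \left[ \sum_{t=1}^T r_d^t(a^t) \right] = \epsilon\, T - \frac{T}{2} = - \left( \frac{1}{2} - \epsilon \right) T < 0 .
\]
The alternating sequence therefore outperforms the non-convex PO arm \( c \) in both objectives simultaneously, by an amount that grows linearly in \( T \).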

