\section{A Simple Multi-Objective MAB Algorithm}

We introduce the first MO-MAB algorithm with a sublinear regret bound that satisfies both the coverage-regret and cumulative adjustment-regret properties, achieving a regret of 
\(
R = O\left( T^\frac{2}{3} (n \log T)^\frac{1}{3} \right).
\)
We also show that the algorithm's output converges to the PO set as \( T \to \infty \). We first present and analyze the algorithm for PO arms and then extend it to handle efficient PO (EPO) arms as well.
The algorithm has two phases. In the exploration phase, each arm is pulled a fixed number of times, $T'$, to estimate its average reward vector. In the exploitation phase, a minimum-size set of arms $B$ that covers all arms is identified and pulled until round $T$. The pseudocode is given in Algorithm \ref{SMOMABA}. The notation $a + 2r$ denotes adding the scalar $2r$ to every coordinate of arm $a$'s reward vector.


\begin{algorithm}[h]
%\scriptsize
\caption{A Simple MO-MAB Algorithm}
\label{SMOMABA}
\begin{algorithmic}[1]
\STATE \textbf{Input:} Number of arms $n$, time horizon $T$
\STATE Set $T' = \left( \frac{T}{n}\right)^\frac{2}{3} \left( 2 \log T \right)^\frac{1}{3}$
\STATE Pull each arm $T'$ times, and compute the average reward vector $\bar{\mu}(a)$ for all arms $a \in \mathcal{A}$
\STATE Set the confidence radius $r = \sqrt{\frac{2 \log T}{T'}}$
\STATE Remove every arm $a \in \mathcal{A}$ for which there exists some $a' \in \mathcal{A}$ with $a' \succeq a + 2r$
\FOR{each arm $a \in \mathcal{A}$}
    \STATE Compute the list $Dom(a) = \{ a' \in \mathcal{A} : a + 2r \succeq a' \}$
\ENDFOR
\STATE Compute a minimum-size set of arms $B$ such that $\bigcup_{b \in B} Dom(b) = \mathcal{A}$
\FOR{$t = T' + 1$ \textbf{to} $T$}
    \STATE Pull all arms $b \in B$
\ENDFOR
\STATE \textbf{return} $B$
\end{algorithmic}
\end{algorithm}
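
For a sense of scale, the following illustrative calculation instantiates Steps 2 and 4 of Algorithm \ref{SMOMABA}. The values $n = 10$ and $T = 10^6$, the use of the natural logarithm, and the rounding are assumptions made only for this example.
\[
T' = \left( \frac{10^6}{10} \right)^\frac{2}{3} \left( 2 \log 10^6 \right)^\frac{1}{3} \approx 2154 \times 3.02 \approx 6.5 \times 10^{3},
\qquad
r = \sqrt{\frac{2 \log 10^6}{T'}} \approx \sqrt{\frac{27.6}{6.5 \times 10^{3}}} \approx 0.065 .
\]
Under these assumptions, Step 5 removes an arm only if some other arm exceeds its empirical mean by at least $2r \approx 0.13$ in every coordinate.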






\begin{comment}
    % The version of algorithm that defines number of domination of an arm i by an arm j (instead of number of pulls)

\begin{algorithm}[H]
\caption{Multi-Objective UCB1 (MO-UCB)}
\label{MO_UCB1}
\begin{algorithmic}[1]
\State \textbf{Input:} Number of arms $n$, time horizon $T$, number of objectives $m$
\State Pull each arm $i \in [n]$ once and update $\mathbf{r}_i$, $n_i$ and $n_{ij}$ as
\begin{itemize}[left=2em]  % Adjust the left margin of the entire list
    \item $n_i = 1$ (number of times arm $i$ has been pulled)
    \item $\mathbf{r_i} = (r_{i1}, r_{i2}, ..., r_{im})$, where $r_{ij}$ is the cumulated reward of arm $i$ in the $j$-th objective.
    \item $n_{ij} = 0$ (number of times arm $i$ dominates arm $j$)
\end{itemize}

\State \textbf{Set} $N = n$ (total number of pulls).

\For{$t = n+1$ \textbf{to} $T$}
    \State Compute \(\hat{\mu}_{ik} = \frac{r_{ik}}{n_i}, \forall i \in [n]\)
    \State Compute \(UCB_k(i,j) = \hat{\mu}_{ik} + \sqrt{\frac{2 \log T}{1+n_{ij}}}\)
    \State Find the Pareto set \( A^t \) such that \( \forall i \in A^t, \, \forall j \neq i, \)
            \State $\quad $ (\textit{i}) \( \forall k \in [D]: UCB_k(i,j) \geq UCB_k(j,i) \), and
            \State $\quad $ (\textit{ii}) \( \exists k \in [D]: UCB_k(i,j) > UCB_k(j,i) \)
    \State Extract the efficient PO arms $EA^t$ from $A^t$ 
    \For{each arm $i \in EA^t$}
    \State Pull arm $i$ and observe the reward $o_i = (o_{i1}, o_{i2}, ..., o_{im})$
    \State Update $n_i = n_i + 1$
    \State Update the cumulative rewards $r_i = r_i + o_i$
    \For {each \( j \neq i \)} 
        $n_{ij} = n_{ij} + 1$ \textbf{if} 
            \State (\textit{i}) \( \forall k \in [D]: UCB_k(i,j) \geq UCB_k(j,i) \), and
            \State (\textit{ii}) \( \exists k \in [D]: UCB_k(i,j) > UCB_k(j,i) \)
    \EndFor
    \EndFor
    \State $N = N + |EA^t|$
\EndFor
\State return $EA^T$

\end{algorithmic}
\end{algorithm}

\end{comment}


%It is notable, the proposed MO-UCB algorithm treats all PO arms equally, meaning that selecting and pulling a PO arm does not depend on its position in the Pareto front. This \textit{fairness} implies that the algorithm behaves symmetrically concerning all PO solutions. We call this property 

%\textbf{Fairness property}. Since the algorithm \ref{SMOMABA} pulls all non-dominated arms in $EA^T$, so, by design, it is fair in selecting an efficient PO arms. 

%Indeed, the probability of an arm \(a^* \in \mathcal{P}^*\) is pulled by MO-UCB depends on the number of arms which are close to \(a^*\) in the objective space, particularly, the arms which can dominates \(a^*\) with an improvement by the exploration term \(O(\sqrt{\frac{\log t}{n_a}})\). Thus, for example, when we focus on an extreme PO arm with the highest $f_1$ value and show some facts such that all other arms can be close to it in the worst case, these facts can be extended to other PO arms as well.




%Note that, it is assumed $T$ is known, otherwise, we can apply doubling technique.


\textbf{Complexity.} 
%
After \(T'\) rounds, Step 5 of Algorithm \ref{SMOMABA} removes every arm that is clearly dominated by some other arm. This step takes \(O(D n^2)\) time %(or by more advanced algorithms in $O(n \log^{D-1} n)$ time \citet{jensen2003reducing}) 
and is valid because, at this stage, all arms are \textit{clean} with radius \(r\), as established in Theorem \ref{Coverage Theorem}. Importantly, no Pareto-optimal (PO) arm is discarded in this step. The main purpose of Step 5 is to reduce the algorithm's expected running time; for worst-case analysis, it can be omitted without affecting the regret guarantees.

Next, the algorithm computes \( \text{Dom}(a) \) for each arm \(a\), that is, the set of all arms \(a'\) weakly dominated by \(a + 2r\). In Step 9 it then solves a minimum set cover problem over these sets: it finds the smallest number of arms that, once each is shifted by \(2r\) in every coordinate, together weakly dominate all remaining arms. We call this set the \textit{minimum set covering} of arms.
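
As a small illustration of Steps 5 to 9 (the three empirical mean vectors and the value $2r = 0.1$ below are hypothetical and chosen only for this example), let $D = 2$ and
\[
\bar{\mu}(a_1) = (0.90, 0.20), \quad \bar{\mu}(a_2) = (0.85, 0.25), \quad \bar{\mu}(a_3) = (0.20, 0.90).
\]
No arm exceeds another by $0.1$ in both coordinates, so Step 5 removes nothing. Shifting each arm by $2r$ gives
\[
a_1 + 2r = (1.00, 0.30), \quad a_2 + 2r = (0.95, 0.35), \quad a_3 + 2r = (0.30, 1.00),
\]
and therefore
\[
Dom(a_1) = Dom(a_2) = \{a_1, a_2\}, \qquad Dom(a_3) = \{a_3\}.
\]
A minimum cover is $B = \{a_1, a_3\}$ (or, equivalently, $\{a_2, a_3\}$), so only two of the three arms are pulled in the exploitation phase.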

The set cover problem is a well-known NP-hard problem \citep{vazirani2003approximation}. In our setting it can be solved exactly in \(O(2^n \cdot n)\) time in the worst case, assuming no arms are removed in Step 5. Consequently, the worst-case running time of Algorithm \ref{SMOMABA} is \(O(2^n \cdot n)\). In expectation, however, the number of PO arms is polylogarithmic in $n$, namely $O((\log n)^{D-1})$ \citep{bentley1978average}. The average running time of Algorithm \ref{SMOMABA} is then $O(2^{(\log n)^{D-1}} \cdot (\log n)^{D-1})$ even when the exact minimum set cover is computed, e.g., $O(n \log n)$ time for $D=2$.
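
To see where the $D = 2$ figure comes from, substitute $k = (\log n)^{D-1}$ surviving arms into the exact set-cover cost $O(2^k \cdot k)$. This is only a sketch: it assumes that the logarithm is taken to base $2$, that the hidden constant in the polylogarithmic bound is $1$, and that the number of surviving arms equals its expected order.
\[
2^{(\log_2 n)^{D-1}} \cdot (\log_2 n)^{D-1} \Big|_{D=2} = 2^{\log_2 n} \cdot \log_2 n = n \log_2 n .
\]
For $D = 3$ the same substitution gives $2^{(\log_2 n)^2} (\log_2 n)^2 = n^{\log_2 n} (\log_2 n)^2$, which is quasi-polynomial rather than polynomial in $n$.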

To address the worst-case time complexity of Algorithm \ref{SMOMABA}, we can employ an approximation algorithm with approximation ratio \(O(\log n)\) \citep{vazirani2003approximation}. This reduces the running time to polynomial while returning a cover whose size is within a factor of \(O(\log n)\) of the optimal set cover size. We therefore have two options:
\begin{enumerate}
    \item Compute the optimal set cover in \(O(2^n \cdot n)\) time.
    \item Utilize an \(O(\log n)\)-approximation algorithm to obtain a set cover in polynomial time.
\end{enumerate}
The approximate solution is guaranteed to be at most an \(O(\log n)\) factor larger than the optimal cover. This trade-off between optimality and computational cost lets the choice depend on the requirements and constraints of the application.
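
One standard way to realize the second option is the greedy set-cover heuristic, whose cover is at most $H_n \leq 1 + \ln n$ times larger than a minimum cover. The sketch below is an illustration under that assumption and is not necessarily the approximation routine intended here; the working set $U$ and the label of the listing are introduced only for this example.

\begin{algorithm}[h]
\caption{Greedy set cover for Step 9 (illustrative sketch)}
\label{GreedyCoverSketch}
\begin{algorithmic}[1]
\STATE \textbf{Input:} Remaining arms $\mathcal{A}$ and the lists $Dom(a)$ for all $a \in \mathcal{A}$
\STATE Set $B = \emptyset$ and $U = \mathcal{A}$ \COMMENT{$U$ holds the arms not yet covered}
\WHILE{$U \neq \emptyset$}
    \STATE Choose $b \in \arg\max_{a \in \mathcal{A}} |Dom(a) \cap U|$
    \STATE Set $B = B \cup \{b\}$ and $U = U \setminus Dom(b)$
\ENDWHILE
\STATE \textbf{return} $B$
\end{algorithmic}
\end{algorithm}

Each iteration covers at least one new arm because $a \in Dom(a)$ for every $a$, so the loop ends after at most $|\mathcal{A}|$ iterations and runs in time polynomial in $n$.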

The two options affect both the running time and the regret of Algorithm \ref{SMOMABA}. The first option, which computes the exact minimum set cover, results in an exponential-time algorithm with a regret bound of \(R = O\left( T^\frac{2}{3} (n \log T)^\frac{1}{3} \right)\). The second option, which uses the \(O(\log n)\)-approximation, yields a polynomial-time algorithm with a regret bound of \(R = O\left( \log n \cdot T^\frac{2}{3} (n \log T)^\frac{1}{3} \right)\).

The regret analysis of the proposed algorithm is summarized in the following theorems, with detailed proofs deferred to the Appendix.

%\textbf{Definition:} A \textit{Selection Sequence} \( \{ a^t \}_{t=1}^T \) is selecting exactly one arm $a^t \in P^t$ for $t=1,2,\ldots, T$.

\begin{theorem}
\label{Coverage Theorem}
The coverage-regret of Algorithm \ref{SMOMABA} is \(R = O\left( T^\frac{2}{3} (n\log T)^\frac{1}{3}\right)\).
% In other words, for any PO solution \( a^* \in \mathcal{P^*}\), there exists an arm \(a^t \in P^t\) such that 
% \[
% \sum_{t=1}^T r_{d}(a^*) - \sum_{t=1}^T r_d(a^t) \leq R_d \quad \forall d = 1, 2, \dots, D
% \]
% , where \(R_d = O( \sqrt{K T \log T})\) for all \(d=1,2,..,D\).
\end{theorem}



\begin{theorem}
\label{Convergence Theorem}
The outcome of Algorithm \ref{SMOMABA} converges to the PO arms \( \mathcal{A}^* \) as \( T \to \infty \).
\end{theorem}





\begin{comment}

Because of the fairness property of MO-UCB, assume \(a^*\) is the arm with maximum $f_1$ (or \(\mu_{a^*1}\)) value. Without loss of generality, assume there is no other solution with such $f_1$ \footnote{If there is another arm $b$ such that \(\mu_{1} = \mu_{b1}\), we can ignore the dimension $f_1$ and follow the proof by the next dimension that the true mean reward arm $a$ is greater than the true mean reward $a$. Note that, since we assumed $a$ is a PO arm, such a dimension will always exist.}. 
So, at least in one of the objectives, i.e., $d=1$, \(\mu_{a^*1}\) is greater than any other arm. Thus, in round $T$, the arm \({a^*}\) is not in $P^T$, if there is a solution $a$ such that $a$ dominates $a^*$. As discussed in the proof of \ref{Coverage Theorem}, for such an arm $a$, the following inequality is hold 

\[
\Delta_1^t(a^*,a^t) = \mu_1(a^*) - \mu_1(a^t) \leq 2r^t(a^t) = 2 \sqrt{\frac{2 \log T}{n^t(a^t)}}.
\]

Since \(\Delta_1^t(a^*,a^t)\) is positive, 

\[
n^t(a^t) \leq \frac{8}{(\Delta_1^t(a^*,a^t))^2} \log T
\]

Considering \(T \to \infty \), \(\frac{8}{(\Delta_1^t(a^*,a^t))^2}\) is a constant, and so therefore \(n^t(a^t) = O(\log T)\). Therefore, number of times that an arm \(a^t\) dominates \(a^*\) and is pulled limited to \( O(\log T)\). Considering all the arms, this will limited to \( O(K \log T)\) times. So, the probability of \( a^*\) does not appear in \(P^T\) is \( O(\frac{K \log T}{T})\). As a result when \(T \to \infty \), the probability of \( a^*\) appears in $P^T$ approaches to 1. Simultaneously, since the probability of a dominated arm \(a\) appears in \(P^T\) is \( O(\frac{K \log T}{T})\), which approaches to 0 by \(T \to \infty \).

% We show that the number of times that \(UCB_{a^*1}\) is less than \(UCB_{a1}\) is at most 
% $\frac{8}{\Delta^2} \log T$. Then in total, \(a^*\) can be at most \(\frac{8K}{\Delta^2} \log T\) times is dominated by other arms. Which means probability of dominated is less than  \(O(\frac{\log T}{T})\) which approaches to 0 by increasing $T$.


    


Thus, the exploration term diminishes relative to the average reward, leading to:
   \[
   \text{UCB}_{id}^t \to \mu_{id} \text { for all } d\in [D] \quad \text{as } n_i \to \infty
   \]
where  



UCB estimates converge to the true means.
   Then MO-UCB identifies the non-dominated set \( Q^t \) based on these converging UCB scores. As \( T \) increases, the UCB scores increasingly reflect the true Pareto relationships among the arms, ensuring that only those arms that are truly non-dominated remain in \( Q^t \). Finally, it  filters these non-dominated arms to form the strong non-dominated set \( P^t \), which includes arms that are not \(\epsilon\)-dominated by any other arm. With \( \epsilon_i = 2 \sqrt{\frac{\log T}{n_i}} \), the tolerance becomes negligible as \( T \) grows, thus allowing identification of PO arms as found in $Q^T$.
\end{comment}


%\textbf{Proposition:} As \( T \to \infty \), the set \( B \) converges to the efficient PO set \( \mathcal{EA}^* \), and consequently, the average regret of algorithm \ref{SMOMABA} approaches zero.



\begin{lemma}
\label{lemmaSetCover}
    If the clean event $C$ holds, the size of an optimal solution to the minimum set covering of arms, computed in Step 9 of Algorithm \ref{SMOMABA}, is at most \( |\mathcal{A}^*| \).
\end{lemma}



\begin{theorem}
\label{cumulative adjustment-regret Theorem}
Algorithm \ref{SMOMABA} satisfies the cumulative adjustment-regret property with regret \(R = O\left( T^\frac{2}{3} (n\log T)^\frac{1}{3}\right)\).
\end{theorem}




\begin{theorem}
\label{Efficient Pareto Theorem}
If, in Algorithm \ref{SMOMABA}, the non-efficient arms are removed after the minimum arm covering set \( B \) is computed, then Theorem \ref{Coverage Theorem}, Theorem \ref{Convergence Theorem}, and Theorem \ref{cumulative adjustment-regret Theorem} remain valid for the efficient PO arms \(\mathcal{EA}^*\).
\end{theorem}



Before concluding the theoretical results, we highlight an additional property of the proposed MO-MAB algorithm. It relates to the notion of \textit{diversity} in classical multi-objective optimization problems (MOPs) \citep{coello2007evolutionary} and has implications for computing regret in MAB approaches that compare the outcome with a single best arm. While our analysis considers all PO arms, the conclusions apply equally to EPO arms.

Diversity, a critical secondary objective in MOPs, concerns identifying a representative subset of PO solutions that is as diverse as possible in the objective space. Since multi-objective optimization algorithms typically return a limited subset of a potentially vast set of PO solutions, adequate diversity in this subset is crucial. Algorithm \ref{SMOMABA} addresses this by selecting a minimum set of covering arms, where each arm is shifted by \( 2r \) with \( r = \sqrt{\frac{2 \log T}{T'}} \). Hence, if several PO arms lie in the same region of the Pareto front, the algorithm keeps only a minimum subset of them that, after the \( 2r \) shift, dominates all others in that region. %For instance, consider two arms \( a \) and \( b \) such that both \( a + r \succeq b \) and \( b + r \succeq a \) hold. In this scenario, the algorithm will select only one of these arms for pulling after iteration \( t = T' \). Hence, the algorithm effectively ensures that, within a radius \( r \) in the objective space, only a single arm is selected.


Consider a (maximal) diverse set of PO arms, denoted as \( \text{DPO} \), characterized by a radius \( r = \sqrt{\frac{2 \log T}{T'}} \). 
Define a virtual arm \( b^* \) whose reward is the coordinate-wise average of the arms in DPO:

\[
r_d(b^*) = \frac{1}{|\text{DPO}|} \sum_{a^* \in \text{DPO}} r_d(a^*), \quad \forall d \in [D].
\]

The arm \( b^* \) can be interpreted as the expected reward achievable by a decision-maker who restricts their choices to a diverse subset of Pareto-optimal arms with radius \( r \). This strategy covers all regions of the Pareto front efficiently. Equivalently, $b^*$ is the expected per-round reward of a decision-maker who chooses one arm of DPO uniformly at random in each round.
Importantly, this definition extends the notion of the ``best'' arm from single-objective multi-armed bandit (MAB) problems to the multi-objective setting. Unlike a simple weighted linear combination of objectives, it averages the reward vectors of the diverse PO arms rather than directly aggregating the objective values.
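
For example (with hypothetical values chosen only for illustration), if $D = 2$ and DPO consists of two arms with mean rewards $(0.9, 0.2)$ and $(0.2, 0.9)$, then
\[
\left( r_1(b^*), r_2(b^*) \right) = \tfrac{1}{2}\,(0.9, 0.2) + \tfrac{1}{2}\,(0.2, 0.9) = (0.55, 0.55),
\]
which is exactly the expected reward vector of a decision-maker who picks one of the two arms uniformly at random in each round.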



Now consider a variant of Algorithm \ref{SMOMABA} in which, after the $T'$ exploration rounds and the computation of the minimum covering set $B$, a single arm is drawn at random from $B$ and pulled in each round $t=T'+1, T'+2, \dots, T$. The total number of pulls is then $nT'+(T-T')$. In the following, we show that this variant achieves regret \(R = O\left( T^\frac{2}{3} (n\log T)^\frac{1}{3}\right)\) with respect to the arm $b^*$ introduced above.

\begin{theorem}
\label{Theorem b_star}
  The regret of the single-arm variant of Algorithm \ref{SMOMABA} with respect to the average best arm $b^*$ is \(R = O\left( T^\frac{2}{3} (n\log T)^\frac{1}{3}\right)\).
\end{theorem}


\begin{proof}
    The proof is similar to the proof of Theorem \ref{cumulative adjustment-regret Theorem}. Under the clean event, since all arms are pulled in each of the first $T'$ rounds, the regret incurred in these rounds is at most $nT'$ in total (Term 1). Now let $b^t$ denote the arm selected at random from the computed minimum covering set $B$ in round $t$, so that

    \[
\text{Term 2} = \mathbb{E} \left[ \sum_{t=T'+1}^T \min \{ \epsilon \geq 0 : b^t + \epsilon \succeq b^* \} \right].
\]

As discussed, under the clean event the arms in $B$ form a minimum cover of the DPO set. That is, each arm $b\in B$, once shifted by $2r$, dominates some Pareto-optimal arm in DPO. Hence, when an arm is drawn uniformly at random from $B$ in each round $t=T'+1, T'+2, \dots, T$, the expected gap to $r_d(b^*)$ is $O(2r)$ per round for all $d \in [D]$. The total regret can therefore be bounded as

\[
\mathbb{E}[R] \leq n T' + O( (T-T') \cdot 2 r) \leq n T' + O( T \cdot 2 r).
\]

Substituting \( T'= \left( \frac{T}{n}\right)^\frac{2}{3} \left ( 2 \log T \right)^\frac{1}{3} \) and \( r = \sqrt{\frac{2 \log T}{T'}} \) yields
\(
\mathbb{E}[R] = O\left( T^\frac{2}{3} (n\log T)^\frac{1}{3}\right).
\)
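
As a sanity check on this substitution (a sketch that uses only the definitions of $T'$ and $r$ given above), the two terms of the bound can be evaluated explicitly:
\[
n T' = n \left( \frac{T}{n}\right)^\frac{2}{3} (2 \log T)^\frac{1}{3} = T^\frac{2}{3} (2 n \log T)^\frac{1}{3},
\]
\[
r = \sqrt{\frac{2 \log T}{T'}} = \left( \frac{2 n \log T}{T} \right)^\frac{1}{3}
\quad\Longrightarrow\quad
2 r T = 2\, T^\frac{2}{3} (2 n \log T)^\frac{1}{3}.
\]
Both terms are therefore of order $T^\frac{2}{3} (n \log T)^\frac{1}{3}$. In particular $T r = n T'$, so this choice of $T'$ balances the exploration and exploitation terms.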
  
\end{proof}

\textbf{Proposition.} As \( T \to \infty \), we have \( r \to 0 \), and DPO coincides with the set of all Pareto-optimal arms. Thus, Theorem \ref{Theorem b_star} also holds when the \textit{best} arm is defined as the mean reward of all Pareto-optimal arms.
