\section{Introduction}
%
Multi-armed bandits (MAB) form a foundational framework in machine learning, with wide-ranging applications in online optimization tasks such as recommendation systems and online advertising \citep{slivkins2019introduction}. The objective in MAB problems is to sequentially select actions (arms) so as to balance exploration (learning about unknown arms) and exploitation (maximizing cumulative reward). In many real-world scenarios, decisions involve trade-offs among multiple conflicting objectives, which has led to the study of multi-objective multi-armed bandits (MO-MAB). In MO-MAB, the goal is to identify Pareto-optimal (PO) solutions, which represent the best trade-offs among the objectives, making this problem highly relevant in areas such as multi-criteria decision-making \citep{wei2021multi}.

Despite growing interest in MO-MAB, most existing work relies on the Pareto regret metric introduced by \citet{drugan2013designing}. Although this metric is a useful starting point and has been widely adopted, it has several significant limitations. Specifically, it measures only the minimum distance of an arm’s reward vector to the Pareto front in one direction, neglecting performance across the other objectives. As a result, an algorithm that optimizes a single objective can be evaluated as highly effective even though it fails to balance the remaining objectives (see, e.g., \citet{xu2023pareto}). Moreover, the metric inadequately penalizes poor performance in unoptimized objectives and does not ensure diversity in the objective space (for a detailed discussion of this metric, see Section~\ref{Drugan regret}). Consequently, existing regret metrics are insufficient for fully evaluating the performance of MO-MAB algorithms.

To address these limitations, a regret definition that accounts for all objectives simultaneously is needed to evaluate algorithms accurately in multi-objective settings. In this paper, we make the following key contributions:

\begin{itemize} 
\item We propose a novel and comprehensive regret metric for MO-MAB that overcomes the limitations of existing metrics by simultaneously considering all objectives.
\item We introduce the concept of \emph{Efficient Pareto-Optimal} (EPO) arms, tailored to online optimization settings, to better capture the round-based nature of MO-MAB problems.
\item We develop a two-phase explore-exploit algorithm that achieves sublinear regret for both PO and EPO arms. For $n$ arms over $T$ rounds, it offers two implementations: an exponential-time variant with regret \( O\left( T^{2/3} (n \log T)^{1/3} \right) \), and a polynomial-time variant achieving \( O\left( \log n \cdot T^{2/3} (n \log T)^{1/3} \right) \).
\end{itemize}
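
For intuition about the $T^{2/3}$ rate in the last contribution, the following is a heuristic sketch of the standard single-objective explore-first trade-off, not the analysis developed in this paper. It assumes rewards in $[0,1]$ and introduces a hypothetical parameter $m$, the number of exploration pulls per arm. Exploration costs at most $nm$ in regret, while the estimation error after $m$ pulls is of order $\sqrt{\log T / m}$ per exploitation round, so the total regret $R(T)$ satisfies, up to constants,
\[
  R(T) \;\lesssim\; nm + T \sqrt{\frac{\log T}{m}} .
\]
Balancing the two terms gives
\[
  m = \left( \frac{T}{n} \right)^{2/3} (\log T)^{1/3}
  \quad\Longrightarrow\quad
  R(T) = O\left( T^{2/3} (n \log T)^{1/3} \right),
\]
which matches the form of the bound stated for the exponential-time variant.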