% !TeX root = ..\freeExp.tex
\section{Introduction}\label{sec:intro}

% \color{blue}

% Revision Plans:

% \begin{enumerate}
%     \item Add new paragraphs to highlight the challenge of (1) devising an algorithm to utilize the free exploration; (2) how to address the unique analysis challenges in our new bandits model.
%     \item If have time, enhance the motivations: (1) more detailed examples; (2) pound the usage of ``agent-specific cost'' terminology.
% \end{enumerate}

% \subsection{Motivate the Model from Agent-Specific Cost}

% In many real world MAB applications, an agent needs to pay cost for pulling an arm. For example, in cognitive radio, a smartphone (agent) spends different amounts of electric power (cost) to access channels (arm) in different frequency band. The quality of the channel when the smartphone accessing it corresponds to reward.
% Typically, the traditional MAB frameworks subtract the cost from the reward and take the modified reward into account.
% However, when it comes to \MATOB, this modification results in a heterogeneous reward environment. Because in pulling the same arm, different agents can pay different costs, called \emph{agent-specific costs,} and thus have different modified rewards.
% For example, in the cognitive radio, the agent-specific costs model the different power costs for smartphones in geographical locations to access channels.

% \begin{tabular}{|c|c|}
%     \hline
%     Application                & Agent-Specific Cost                                              \\ \hline
%     Online Advertising         & platforms pay different costs of placing an ad                   \\
%     Online Routing             & users suffer different latencies in using the same path          \\
%     Online Resource Allocation & edges accesses the cloud resource with differ power consumptions \\ \hline
% \end{tabular}

% \color{black}

Multi-armed bandit (MAB)~\citep{lai1985asymptotically,bubeck2012regret}
is a classic sequential decision-making problem.
In the stochastic MAB, an agent faces a set \(\mathcal{K}\coloneqq \{1,2,\dots,K\}\) (\(K{\in}\N^+\)) of arms, where each arm \(k\) is associated with a reward random variable with unknown mean \(\mu(k)\).
The agent sequentially pulls arms from \(\mathcal{K}\) over \(T\in\N^+\) decision rounds
and observes the reward of each pulled arm.
The goal of the agent is to maximize its total reward over all decision rounds, which is equivalent to minimizing the total \emph{regret}, i.e., the difference between
the cumulative reward of always pulling the optimal arm \(k_*\) (the arm with the highest mean)
and the cumulative reward of the agent's sequential choices. To achieve this goal, the agent needs to balance exploration and exploitation, i.e., either optimistically
choose an arm with high uncertainty in its reward (exploration),
or myopically pull the arm with the highest empirical mean
reward (exploitation).



Multi-agent MAB (\MATOB) is an extension of the basic MAB,
where a group of \(M\in\N^+\) agents
(denoted as \(\mathcal{M} \coloneqq \{1,2,\dots, M\}\))
pulls arms from the same arm set \(\mathcal{K}\).
This model has been studied in various settings,
e.g.,
federated bandits~\citep{shi2021federated,shi2021federatedpersonal,zhu2021federated,huang2021federated},
cooperative pure exploration~\citep{hillel2013distributed,tao2019collaborative,karpov2020collaborative},
multi-agent MAB with collision~\citep{boursier2019sic,mehrabian2020practical,shi2021heterogeneous},
and cooperative multi-agent MAB~\citep{landgren2016distributed,martinez2019decentralized,wang2020optimal,wang2020distributed}.

The majority of prior works on \MATOB, with a few exceptions (see Appendix~\ref{sec:related-works}), study a homogeneous reward setting, where the reward distribution of an arm is the same for all agents. The homogeneous reward setting, however, fails to capture agent-specific preferences/limitations. In many real-world applications, the agents represent different clusters of users with specific preferences, or users in different geographical locations with different costs/limits to access the arm set. In such settings, the reward of each arm might be different for different agents.
We refer to Section~\ref{app:app} for a detailed explanation of various application scenarios.


This paper introduces a multi-agent multi-armed bandit problem with heterogeneous rewards (\MATOBHR). In \MATOBHR, the reward observed by an agent consists of two components representing arm- and agent-specific terms. Specifically, when agent \(i\in\mathcal{M}\) pulls arm \(k\in\mathcal{K}\),
the observed reward is \(X_t^\brai(k)=X_{t,\text{arm}}(k) + X_{t,\text{agent}}^\brai(k)\), where \(X_{t,\text{arm}}(k)\) is the arm-specific reward with bounded mean \(\mu(k)\in(0,b)\) (where \(b\) is a positive constant) and \(X_{t,\text{agent}}^\brai(k)\) is the agent-specific reward with mean \(\nu^\brai(k)\). We denote \(\omega^\brai(k) \coloneqq \mu(k) + \nu^\brai(k)\) as the reward mean of this pull.
In \MATOBHR, we assume both \(X_{t,\text{arm}}(k)\) and \(X_{t,\text{agent}}^\brai(k)\) are stochastic and independent. The arm-specific reward mean \(\mu(k)\) is not known to agents, and each agent \(i\) only privately knows its own agent-specific mean values \(\nu^\brai(k), \forall k \in \mathcal{K} \). Further, in the \MATOBHR setting, the agents can broadcast the observed values of the arm-specific term in the total reward (by subtracting the agent-specific reward mean from the observed reward, i.e., \(X_t^\brai(k) - \nu^\brai(k)\)) at no cost. We note that one may consider other settings for \MATOBHR, e.g., known vs. unknown and homogeneous vs. heterogeneous assumptions for the agent-specific reward. We refer to Appendix~\ref{app:hr-four-cases} for a detailed discussion and the connection of each setting to the prior literature.
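
As a short sanity check on this broadcasting step (our own illustration of the model as stated above, not an additional assumption), linearity of expectation gives
\[
    \mathbb{E}\left[X_t^\brai(k) - \nu^\brai(k)\right]
    = \mathbb{E}\left[X_{t,\text{arm}}(k)\right] + \mathbb{E}\left[X_{t,\text{agent}}^\brai(k)\right] - \nu^\brai(k)
    = \mu(k) + \nu^\brai(k) - \nu^\brai(k)
    = \mu(k).
\]
Hence every broadcast value is an unbiased sample of the arm-specific mean \(\mu(k)\), regardless of which agent produced it, and any other agent \(j\in\mathcal{M}\) can combine an estimate of \(\mu(k)\) with its privately known \(\nu^{(j)}(k)\) to estimate its own reward mean \(\omega^{(j)}(k) = \mu(k) + \nu^{(j)}(k)\).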




% \todo{Some of this illustations are repeated in Section~\ref{sec:comparison-AC-MA2B}. Is that ok?}

% \mo{add one paragraph here and talk about Jackie and Lin's paper and explain how your model is an extension of those and include the motivation for those.}



In \MATOBHR, the reward heterogeneity of agents creates a counterintuitive opportunity for \emph{free exploration} of a subset of arms. With heterogeneous rewards among agents, there might be no globally optimal arm. In other words, agents may have different \emph{local} optimal arms, i.e., the arm with the largest reward mean may differ from agent to agent, so characterizing the regret of the agents becomes more complicated. However, the existence of multiple local optimal arms offers a surprising opportunity: a cooperative learning algorithm can let each agent explore its own local optimal arm for free (without regret cost), share these free observations with the other agents, and thereby significantly reduce the total regret of all agents.

% \rev{We note that these local optimal arms are unknown in advance.
% In searching these arms, agents may have already pulled these arms with cost a large number of times before realizing that these arms can be freely explored, which misses the advantage of free exploration. Addressing the issue requires a well-designed cooperative learning algorithm. 

While the idea of free exploration is intuitive, designing a cooperative bandit algorithm that effectively implements this idea is nontrivial. The main challenge is that the local optimal arms are unknown in advance to the bandit agents. Hence, an algorithm should be designed to economically identify the local optimal arms and assign them to agents that can freely explore them and prevent other agents from pulling these arms (with cost).


We note that \MATOBHR can be viewed as an extension of two recent models in the bandit literature: action-constrained multi-agent multi-armed bandits (\ACMAB)~\citep{yang2022distributed} and grouped \(K\)-armed bandits~\citep{baek2021fair}.
% These models are almost equivalent to each other except for minor differences in how their action constraints arrive. 
% We next illustrate how \MATOBHR covers \ACMAB as a special case.  In \ACMAB, each agent \(i\in\mathcal{M}\) only pulls from a subset of arms \(\mathcal{K}^\brai\subset \mathcal{K}\) and its goal is to find the local optimal arm in $\mathcal{K}^\brai$.
% In \MATOBHR, one can set agent \(i\)'s specific reward \(\nu^\brai(k)\) for arm \(k\) to be \(0\) if \(k\in\mathcal{K}^\brai\), and \(-b\) if \(k\not\in\mathcal{K}^\brai\), where \(b>0\) and \(\mu(k) \in (0,b)\) for all arm \(k\).
% So that arms in \(\mathcal{K}^\brai\) have negative reward means for agent \(i\) and the agent would never pull arms with \(\nu^\brai(k) = -b\), which is equivalent to only having access to arms in the constrained arm set \(\mathcal{K}^\brai\) (see Remark~\ref{rmk:local-arm-set} for a formal reduction).
% Consequently, our \MATOBHR model is an extension to both papers and their motivations are also ours.
The idea of free exploration is applicable to both models~\citep{yang2022distributed,baek2021fair}; however, neither work explicitly utilizes free exploration in its algorithm design, so neither achieves the optimal performance that free exploration makes possible. A detailed discussion of both models, their connection to \MATOBHR, and the significance of our results with respect to them is given in Section~\ref{sec:comparison-AC-MA2B}.

% \mo{add one sentence and mention Lin and Jackie did not explicitly consider the possibility of free exp in their work. and refer to 1.2 for more details.}




It is worth noting that the high-level idea of free exploration has been leveraged in some other bandit settings in the literature~\citep{chen2018incentivizing,shi2021almost}. However, these works considered the problem of incentivizing exploration; specifically, they considered a principal who aims to learn the global bandit model and offers bonuses to agents to explore on the principal's behalf. In these settings, \citet{chen2018incentivizing} and \citet{shi2021almost} studied free exploration in the sense that the principal pays no cost, rather than free exploration arising from cooperation among agents. Hence, these works are in clear contrast to the idea of free exploration in \MATOBHR introduced in this paper.
A comprehensive comparison to related works is presented in Appendix~\ref{sec:related-works}.
%     As far as we know, this paper proposes the first algorithm to utilize free exploration among multi-agent cooperation.


\begin{table*}[!t]
    \caption{A simple example with three agents and three arms (\(b>\mu(1) > \mu(2) > \mu(3) > 0\)). The entries of the left table show the reward mean of each arm for each agent, e.g., $\omega^{(1)}(1)=\mu(1)$ or $\omega^{(3)}(2)=\mu(2)-b<0$. Arms 1, 2, and 3 are the local optimal arms of agents 1, 2, and 3, respectively. On the right-hand side, denoting \(\Delta(i,j) = \mu(i) - \mu(j)\), the regret of our work is compared with a classic non-cooperative algorithm~\citep{auer2002using} and the works of~\cite{yang2022distributed} and~\cite{baek2021fair} as two special cases of \MATOBHR.}
    % \vspace{-3mm}
    \label{tab:simple-example}
    \begin{tabular}{|c|c|c|c|}
        \hline
                    & Arm \(1\)  & Arm \(2\)  & Arm \(3\)  \\\hline\hline
        Agent \(1\) & \(\mu(1)\) & \(\mu(2)\) & \(\mu(3)\) \\\hline
        Agent \(2\) & \(<0\)     & \(\mu(2)\) & \(\mu(3)\) \\\hline
        Agent \(3\) & \(<0\)     & \(<0\)     & \(\mu(3)\) \\\hline
    \end{tabular}
    \quad
    \begin{tabular}{|l||l|}
        \hline
        \texttt{UCB}~\citep{auer2002using}          & \(O\left( \left( \frac{1}{\Delta(1,2)}
        + \frac{1}{\Delta(1,3)} + \frac{1}{\Delta(2,3)} \right)\log T \right) \)             \\\hline
        \texttt{CO-UCB}~\citep{yang2022distributed} & \(O\left( \left( \frac{1}{\Delta(1,2)}
        + \frac{1}{\Delta(2,3)} \right)\log T \right) \)                                     \\\hline
        \texttt{KL-UCB}~\citep{baek2021fair}        & \(O\left(\log \log T \right) \)        \\\hline
        \texttt{FreeExp}~(our work)                 & \(O(1)\)                               \\\hline
    \end{tabular}
\end{table*}

\subsection{Contributions}
In this paper, we first present the \MATOBHR model and highlight its real-world applications. Then, we propose \FreeExp, a cooperative algorithm designed to enable free exploration in the learning process.
Finally, we characterize a regret lower bound that explicitly captures the impact of free exploration on \MATOBHR, and show that the regret of \FreeExp matches the regret lower bound up to a constant factor.
The contributions of this work are:
% The details of our contribution are as follows.

\noindent
{\bf Modeling and practical relevance of \MATOBHR: }
We present the \MATOBHR model in Section~\ref{sec:model} and justify its practical relevance by highlighting several application scenarios in online advertising, wireless networks, and cloud and edge resource allocation. We also introduce a new definition for the suboptimality gap in \MATOBHR as a key parameter to explicitly characterize the impact of free exploration in the regret analysis.

\noindent{\bf Algorithm design: } In Section~\ref{sec:algorithm}, we present \FreeExp, a cooperative learning algorithm that tackles \MATOBHR and implements the idea of free exploration. The high-level idea of \FreeExp is that agents judiciously reduce the selection of arms that are likely to be local optimal for other agents. Instead, through cooperation, these agents still obtain observations of those arms from others at no regret cost.
% Toward this, an agent may choose to broadcast the empirically optimal arm to other agents periodically, stopping other agents pulling this arm and saving the cost.
In doing so, free exploration of some arms becomes possible and the cooperative bandit algorithm achieves significant improvement in regret.
A key technique in \FreeExp is to perform periodic pulls of the empirical local optimal arms (i.e., the arm with the highest empirical mean) while balancing between exploration and exploitation, which guarantees that the empirical optimal arm is indeed the ground truth local optimal arm in most time slots.
% In our proposed algorithm,one agent enjoys the free exploration by avoiding other agents' empirical optimal arms, exploits its own empirical optimal arm, and explores other arms with high uncertainty (high KL-UCB indexes).

\noindent{\bf Regret analysis: } \rev{In contrast to the common regret analysis in multi-agent bandits, where only the pulled arm matters regardless of which agent pulls it, in \MATOBHR we have to address a unique technical challenge: the regret cost of pulling an arm depends not only on which arm is pulled, but also on which agent pulls it.} In Section~\ref{sec:analysis}, we tackle this challenge and derive a regret lower bound for \MATOBHR that echoes the importance of recognizing free exploration:
arms that can be freely explored only cause constant regret, instead of the usual logarithmic regret in \texttt{MA2B}.
We derive the regret upper bound of the \FreeExp algorithm which matches the regret lower bound up to a constant factor.
    {Deriving this result requires new analysis techniques (see the proof sketch of Theorem~\ref{thm:free-exp-upper-bound} in Section~\ref{sec:analysis} for details).}
The tightness of the regret upper and lower bounds shows that free exploration is intrinsic to \MATOBHR and that \FreeExp is near-optimal. A surprising observation is that in the special case where every arm is locally optimal for at least one agent (plausible when $M\ge K$), \FreeExp achieves $O(1)$ regret.


\noindent{\bf Numerical results: } In Section~\ref{sec:simulations}, we report numerical experiments comparing our algorithm to several baselines.


\subsection{Technical Comparison to the Prior Work}
\label{sec:comparison-AC-MA2B}




In this section, we highlight our contribution in leveraging free exploration by applying our algorithm to the action-constrained \MATOB problem (\ACMAB) which was recently studied by~\citet{yang2022distributed}.
In \ACMAB, each agent \(i\in\mathcal{M}\) only pulls from a subset of arms \(\mathcal{K}^\brai\subset \mathcal{K}\) and its goal is to find the local optimal arm in $\mathcal{K}^\brai$.
\ACMAB can be regarded as a special case of \MATOBHR in which agent \(i\)'s specific reward \(\nu^\brai(k)\) for arm \(k\) is \(0\) if \(k\in\mathcal{K}^\brai\), and \(-b\) if \(k\not\in\mathcal{K}^\brai\), where \(b>0\) and \(\mu(k) \in (0,b)\) for every arm \(k\) (see Remark~\ref{rmk:local-arm-set} for a formal definition).
Since agent \(i\) knows its agent-specific reward means, it never pulls arms with \(\nu^\brai(k) = -b\), which is equivalent to only having access to arms in the constrained arm set \(\mathcal{K}^\brai\).
We provide a simple example in Table~\ref{tab:simple-example} to illustrate the benefit of free exploration which substantially improves regret as compared to the classic non-cooperative algorithms and the cooperative approach in~\citet{yang2022distributed} as a special case.
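
To make the reduction concrete, consider how the entries of Table~\ref{tab:simple-example} arise, assuming (as our reading of the table) the constrained arm sets \(\mathcal{K}^{(1)}=\{1,2,3\}\), \(\mathcal{K}^{(2)}=\{2,3\}\), and \(\mathcal{K}^{(3)}=\{3\}\). The reduction assigns \(\nu^{(2)}(1) = -b\) and \(\nu^{(2)}(2) = 0\), so that
\[
    \omega^{(2)}(1) = \mu(1) + \nu^{(2)}(1) = \mu(1) - b < 0
    \quad\text{and}\quad
    \omega^{(2)}(2) = \mu(2) + \nu^{(2)}(2) = \mu(2) > 0,
\]
where the first inequality uses \(\mu(1) < b\). Every excluded arm thus has a negative reward mean and every accessible arm a positive one, so agent \(2\)'s local optimal arm is the arm in \(\mathcal{K}^{(2)}\) with the largest \(\mu(k)\), namely arm \(2\). The same argument yields arm \(1\) for agent \(1\) and arm \(3\) for agent \(3\).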


Next, we present the theoretical improvement.
Recall that the non-cooperative optimal total regret of classic MAB~\citep{lai1985asymptotically} for all agents in \(\mathcal{M}\) is
\[O\left(\sum_{i\in\mathcal{M}}\sum_{k\in\mathcal{K}^\brai\setminus\{k_*^\brai\}}\frac{\Delta^\brai(k)\log T}{\kl(\mu(k), \mu(k)+\Delta^\brai(k))}\right),\]
where the suboptimality gap \(\Delta^\brai(k)\coloneqq \mu(k_*^\brai) - \mu(k)\) is
the difference of reward means between
agent \(i\)'s optimal arm \(k_*^\brai\) and arm \(k\),
and \(\kl(a,b)\) is the KL-divergence between two Gaussian distributions
with means \(a\) and \(b\) and the same variance (defined later). To improve total regret through cooperation, \citet{yang2022distributed} proposed cooperative extensions to classic learning algorithms, e.g., \texttt{UCB}~\citep{auer2002using}, which improved the total regret to
\begin{equation}
    \label{eq:reg_ac}
    O\left(\sum_{k\in \cup_i (\mathcal{K}^\brai\setminus\{k_*^\brai\})}
    \frac{\bar{\Delta}(k)\log T}{\kl(\mu(k), \mu(k) + \bar{\Delta}(k))}\right),
\end{equation}
where \(\bar{\Delta}(k)\) denotes the smallest reward mean gap of arm \(k\) compared to the local optimal arms \emph{(excluding arm \(k\))} among agents having access to arm \(k\).

The regret of applying \FreeExp to \ACMAB is
\begin{equation}
    \label{eq:reg_free_exp}
    O\left(\sum_{k\in  \cup_i\mathcal{K}^\brai\setminus \cup_i\{k_*^\brai\}}
    \frac{\bar{\Delta}(k)\log T}{\kl(\mu(k), \mu(k) + \bar{\Delta}(k))}\right).
\end{equation}
The improvement of our result lies in the summation range. Specifically, the summation range \(\cup_i\mathcal{K}^\brai\setminus \cup_i\{k_*^\brai\}\) in \eqref{eq:reg_free_exp} is a \emph{subset} of \eqref{eq:reg_ac}'s
\(\cup_i (\mathcal{K}^\brai\setminus\{k_*^\brai\})\).
The summation range in \eqref{eq:reg_free_exp} excludes the regret impact of arms in \(\cup_i\{k_*^\brai\}\), i.e., arms that are optimal to at least one agent; these arms are freely explored.
% the set \(\cup_i\{k_*^\brai\}\), which includes the any arm that is suboptimal in at least one agent.
In contrast, the regret of \citet{yang2022distributed} in \eqref{eq:reg_ac} is over \(\cup_i (\mathcal{K}^\brai\setminus\{k_*^\brai\})\), which counts some arms that are optimal for some agents (and can be freely explored). We note that this improvement can be substantial.
In particular, when all arms in \(\mathcal{K}\) are locally optimal for some agents, the regret upper bound in \eqref{eq:reg_free_exp} is \(O(1)\), as in the simple example in Table~\ref{tab:simple-example}. This implies that capturing the benefit of free exploration requires the development of a completely new cooperative algorithm, as explained in Section~\ref{sec:algorithm}.
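
As a worked illustration of the two summation ranges (our own unpacking of Table~\ref{tab:simple-example}, assuming the constrained arm sets \(\mathcal{K}^{(1)}=\{1,2,3\}\), \(\mathcal{K}^{(2)}=\{2,3\}\), \(\mathcal{K}^{(3)}=\{3\}\), so that \(k_*^{(i)} = i\) for \(i=1,2,3\)), we have
\[
    \bigcup_i \left(\mathcal{K}^\brai\setminus\{k_*^\brai\}\right) = \{2,3\}\cup\{3\}\cup\emptyset = \{2,3\},
\]
\[
    \bigcup_i\mathcal{K}^\brai\setminus \bigcup_i\{k_*^\brai\} = \{1,2,3\}\setminus\{1,2,3\} = \emptyset.
\]
By the definition of \(\bar{\Delta}(k)\), in this example \(\bar{\Delta}(2) = \Delta(1,2)\) and \(\bar{\Delta}(3) = \min\{\Delta(1,3),\Delta(2,3)\} = \Delta(2,3)\). If, purely for illustration, rewards are Gaussian with a common variance \(\sigma^2\) (a symbol introduced only here), then \(\kl(a,b) = (a-b)^2/(2\sigma^2)\) and each summand of \eqref{eq:reg_ac} equals \(2\sigma^2\log T/\bar{\Delta}(k)\). The bound in \eqref{eq:reg_ac} therefore reads \(O\left(\left(\frac{1}{\Delta(1,2)} + \frac{1}{\Delta(2,3)}\right)\log T\right)\), whereas the sum in \eqref{eq:reg_free_exp} is empty and contributes no \(\log T\) term, consistent with the entries of Table~\ref{tab:simple-example}.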


The grouped \(K\)-armed bandits model proposed by~\citet{baek2021fair} is almost equivalent to \ACMAB~\citep{yang2022distributed} except for minor differences in how their actions are constrained: the grouped bandits' action constraint depends on the arrived group, while that of \ACMAB is associated with the agents.
Therefore, the grouped bandits model
can also be regarded as a special case of our \MATOBHR model.
\citet{baek2021fair} proved that the \texttt{KL-UCB} algorithm~\citep{cappe2013kullback} addresses their grouped bandits model with the following regret guarantee: \[
    \limsup_{T\to\infty} \frac{\ERT}{\log T} \le \!\!\!\!\!\!\sum_{k\in  \cup_i\mathcal{K}^\brai\setminus \cup_i\{k_*^\brai\}} \!\!\!\!\!
    \frac{\bar{\Delta}(k)}{\kl(\mu(k), \mu(k) + \bar{\Delta}(k))}.
\]
We emphasize that the above bound of \citet{baek2021fair} is asymptotic (i.e., for \(T\to \infty\)),
while \FreeExp's regret bound is non-asymptotic (i.e., it holds for any time \(T\); see Eq.~\eqref{eq:finite-time-regret-upper-bound} of Theorem~\ref{thm:free-exp-upper-bound}),
and the two differ substantially in how they handle the regret of free arms (see Remark~\ref{rmk:free-arm-constant-regret} for details).
Here, we use the toy example in Table~\ref{tab:simple-example} to illustrate the difference; this generalizes to any case in which all arms are free arms.
In this example, \texttt{FreeExp} attains \(O(1)\) regret, while the regret of \texttt{KL-UCB} is \(o(\log T)\) (more specifically, \(O(\log \log T)\))~\citep{baek2021fair}.
In Section~\ref{sec:simulations}, we conduct numerical comparisons to corroborate the advantage of \texttt{FreeExp} over \texttt{KL-UCB}.
Also, we emphasize that our regret upper bound is proved for the \texttt{MA2B-HR} model
which is more general than~\citet{baek2021fair}'s grouped bandits model.



