% !TeX root = ..\freeExp.tex

\section{Related Works}
\label{sec:related-works}


The work most relevant to ours is~\citet{yang2022distributed}, which considers a special case of heterogeneous rewards with known agent-specific rewards. In Section~\ref{sec:comparison-AC-MA2B}, we discussed in detail how the \MATOBHR model covers the \texttt{AC-MA2B} model studied by~\citet{yang2022distributed} as a special case. We provide a tighter regret lower bound and a more efficient algorithm than those given by~\cite{yang2022distributed} because we discover the \emph{free exploration} mechanism, of which \cite{yang2022distributed} was not aware. Further, \cite{yang2022distributed} additionally considers a setting with asynchronous action frequencies, which our algorithm (with minor modifications) can address as well. We omit this extension in our paper and focus on the heterogeneous arm set setting in order to clearly present the free exploration mechanism and its importance in improving the regret.


The \emph{free exploration} mechanism in cooperation among agents is not meaningful when agent-specific rewards are unknown, as discussed in Appendix~\ref{app:hr-four-cases}, and hence this setting is not the focus of this paper. Nevertheless, many works in the \texttt{MA2B} literature study heterogeneous rewards with unknown agent-specific rewards~\citep{hossain2021fair, bistritz2021game,mehrabian2020practical,shi2021heterogeneous, zhu2021federated,shi2021federated,shi2021federatedpersonal, chen2018incentivizing, shi2021almost}; these works are intellectually and practically interesting under various settings and goals.
Among these works,~\cite{bistritz2021game,mehrabian2020practical,shi2021heterogeneous} consider the collision model, where agents who pull the same arm at the same time collide and receive zero reward.
On the other hand,~\cite{zhu2021federated,shi2021federated,shi2021federatedpersonal} study the federated learning framework, where the central server aims to learn the global bandit model from the information that agents learn from their local bandit models. It is worth noting that the term ``free exploration'' is also used by~\citet{chen2018incentivizing} and \citet{shi2021almost}, who study the problem of incentivizing exploration in multi-armed bandits. Specifically, \cite{chen2018incentivizing, shi2021almost} consider a principal who aims to learn the global bandit model and offers bonuses to agents to explore on the principal's behalf. \cite{chen2018incentivizing, shi2021almost} study ``free exploration'' with regard to the principal's cost, while we study \emph{free exploration} in cooperation among agents. Hence, leveraging the idea of free exploration in a cooperative multi-agent bandit setting is what distinguishes this work from the prior literature on heterogeneous multi-agent bandits.
The term ``free exploration'' also differs from ``exploration-free'', a term recently proposed in contextual bandits~\cite{bastani2021mostly}: their algorithms do not need to deliberately explore arms, whereas our algorithm explores arms without cost.
Besides, \citet{jiang2023multi} also consider a multi-agent bandit model with agent-dependent rewards. The key difference between their work and ours is that their agent-dependent reward mean is perturbed by a zero-mean Gaussian, while ours is perturbed by a non-zero-mean Gaussian. Hence, unlike ours, their model does not provide a chance for free exploration.


The homogeneous arm reward setting~\citep{landgren2016distributed,martinez2019decentralized, szorenyi2013gossip, buccapatnam2015information}, in which an arm generates rewards for all agents from the exact same distribution, is the most extensively studied model in the \texttt{MA2B} literature. It is worth noting that the models of~\citet{yang2021cooperative} and~\citet{chawla2020gossiping}, though they may seem close to the \texttt{AC-MA2B} model studied by~\citet{yang2022distributed} at first glance, essentially fall into the category with homogeneous agent-specific rewards (see Table~\ref{tab:hr-four-cases}). Specifically, \cite{yang2021cooperative} considers the heterogeneity of arms in terms of their feedback rather than their rewards. Therefore, in the model of~\cite{yang2021cooperative}, the reward of an arm is essentially the same for each agent, and the optimal arm is the same for all agents; hence, there is no room for free exploration. Similarly, there exists a single optimal arm for all agents in the model of \cite{chawla2020gossiping}; hence, \cite{chawla2020gossiping} lets agents update their arm sets, which initially contain different arms, with the goal of eventually containing this optimal arm.


Besides, stochastic rewards with heavy tails~\citep{dubey2020cooperative} and non-stochastic rewards~\citep{bar2019individual, cesa2016delay} have also been studied in the \texttt{MA2B} literature.
Apart from the various ways of modeling arm rewards or arm sets and the assumptions placed on them, many other variations of \texttt{MA2B} have also been studied in the literature. For example,~\cite{kolla2018collaborative, szorenyi2013gossip, chawla2020gossiping, landgren2016distributed, buccapatnam2015information, martinez2019decentralized, bistritz2020cooperative, madhushani2021one, chakraborty2017coordinated, cesa2016delay, hillel2013distributed, dubey2020cooperative, yang2021cooperative, yang2022distributed, sankararaman2019social, feraud2019decentralized} deal with decentralized learning scenarios where agents communicate with each other to improve their performance, while~\cite{shi2021federatedpersonal,mehrabian2020practical,shi2021heterogeneous, shi2021federated, wang2019distributed, wang2020optimal, bar2019individual, chakraborty2017coordinated, dubey2020cooperative} consider models with central servers or leaders that can coordinate the learning process. Many different communication schemes have also been considered in the literature, such as immediate broadcasting~\citep{buccapatnam2015information, yang2021cooperative, yang2022distributed}, peer-to-peer protocols~\citep{szorenyi2013gossip}, and gossip-style communication~\citep{martinez2019decentralized,chawla2020gossiping}.

% We thank the reviewer for suggesting these meta bandits works. 
Lastly, our model bears a similarity to meta bandits~\citep{kveton2021meta,wan2021metadata}: we assume that all agents share the same arm-specific reward distributions but have different agent-specific rewards, while meta bandits assume that all bandit instances are drawn from a common prior (function) but are different realizations.
However, there is a key difference between our model and meta bandits: \emph{The agent-specific reward means in our model are known and given, while the reward mean realizations of meta bandits are unknown and randomly drawn from the prior.} Therefore, meta bandit algorithms, which learn the common prior via multiple instances' random realizations, do not fit our case (because our agent-specific reward means are given and fixed) and thus cannot be applied to our model.


