
\section{Additional Discussion on Model}
\subsection{Other Possible Scenarios of \MATOBHR}
\label{app:hr-four-cases}
\begin{table*}[htb]
    \caption{Different possible scenarios of the \MATOBHR model based on agent-specific reward values.
        % Case (1) can reduce to the case that all agents play the same bandit game. This setting has been studied in previous \MATOB works, e.g., in~\cite{landgren2016distributed,martinez2019decentralized}.
        %     Case (2) reduces to the case that each agent plays a different bandit game. Therefore, no cooperation can happen. Sections~\ref{sec:sys_model} and~\ref{sec:not} presents Case (3) where agents can enjoy both free exploration and cooperation. 
    }
    \vspace{-3mm}
    \label{tab:hr-four-cases}
    \centering
    % \footnotesize
    \begin{tabular}{|c|c|c|}
        \hline
        \(\nu^\brai(k)\) & Homogeneous (\(\nu^\brai(k) = \nu^\braj(k),\,\forall i,j \in \mathcal{M}\))
                         & Heterogeneous                                                              \\\hline\hline
        Unknown          & \multirow{2}{*}{(1)  The majority of prior work on \texttt{MA2B}}
                         & (2)  No useful information to cooperate
        \\\cline{1-1}\cline{3-3}
        Known            &
                         & (3) This work (Section~\ref{sec:model})                                    \\\hline
    \end{tabular}
\end{table*}

While the focus of this paper is on \MATOBHR with known and heterogeneous agent-specific rewards, one can imagine other settings of this model, as outlined in Table~\ref{tab:hr-four-cases}.
% With different settings of agent-specific reward mean \(\nu^\brai(k)\),
% \MATOBHR can also cover a known \texttt{MA2B} model that has been studied and
% a \texttt{MA2B} model similar to known ones.
In particular, the agent-specific reward mean \(\nu^\brai(k)\) can be
(i) homogeneous or heterogeneous across agents
(i.e., \(\nu^\brai(k) = \nu^\braj(k),\,\forall i,j\in\mathcal{M}\) or not), and
(ii) known or unknown to the agents.

These two dimensions yield the three scenarios in Table~\ref{tab:hr-four-cases}, since the homogeneous case is treated as a single scenario regardless of whether the means are known. In scenario (1), the agent-specific reward means \(\nu^\brai(k)\) are identical across all agents, and \MATOBHR reduces to the setting in which all agents play the same bandit game, which has been studied previously, e.g., in~\cite{landgren2016distributed,martinez2019decentralized}.
In scenario (2), the agent-specific reward means are heterogeneous and unknown. In this case, cooperation among agents becomes impossible, since each agent is essentially solving a different bandit problem and has no useful information to share with the others.
Scenario (3), with heterogeneous and known agent-specific reward means, is the focus of this paper.

\subsection{The Difference between \MATOBHR and Contextual Bandits}\label{subapp:contextual-bandit-model}

Although the contextual bandit model, e.g.,~\citet{li2010contextual,slivkins2011contextual}, can capture the reward heterogeneity of agents,
it cannot express the advantage of free exploration as clearly as \MATOBHR.
To model reward heterogeneity in that framework, one must associate each agent (and its heterogeneous rewards) with a context.
However, some contexts (agents) that allow certain arms to be explored freely may arrive rarely, so the corresponding arms cannot be explored freely in most rounds.
For example, in the adversarial arrival setting, these contexts may arrive only a few times, and in the stochastic setting, they may arrive with a very small probability, e.g., \(1/T\). In contrast, the model in this paper allows agents to sample their local optimal arms and thus leaves room for free exploration. In addition, as we explain in the next section, several application scenarios can be captured by \MATOBHR and its special cases~\cite{yang2022distributed,baek2021fair}.
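
To make the stochastic-arrival example concrete, consider a back-of-the-envelope calculation. As an illustration only, assume that \(T\) denotes the time horizon, that contexts arrive independently across the \(T\) rounds, and that a given context \(c\) arrives in each round with probability \(1/T\). The symbols \(c\) and \(N_c\), the number of rounds in which \(c\) arrives, are hypothetical notation introduced only for this illustration. Under these assumptions,
\[
    \mathbb{E}[N_c] = T \cdot \frac{1}{T} = 1,
    \qquad
    \Pr(N_c = 0) = \Bigl(1 - \frac{1}{T}\Bigr)^{T} \le e^{-1} \approx 0.37,
\]
so the arms that are free under context \(c\) can be explored for free only about once in expectation, and, for large \(T\), not at all with probability close to \(0.37\).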
% We leave the extension 
