%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}
\usepackage{algorithm,algorithmic}

\usepackage{graphicx}
%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{subfig}

\usepackage{amssymb}
\usepackage{amsthm}
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

% \usepackage{xr}
% \makeatletter
% \newcommand*{\addFileDependency}[1]{% argument=file name and extension
%   \typeout{(#1)}
%   \@addtofilelist{#1}
%   \IfFileExists{#1}{}{\typeout{No file #1.}}
% }
% \makeatother

% \newcommand*{\myexternaldocument}[1]{%
%     \externaldocument{#1}%
%     \addFileDependency{#1.tex}%
%     \addFileDependency{#1.aux}%
% }

%\myexternaldocument{uai2023-supplement.tex}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{A Near-optimal High-probability Swap-regret Upper Bound for
Multi-agent Bandits in Unknown General-sum Games}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
%\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2023 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Zhiming Huang}
\author[1]{Jianping Pan}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    University of Victoria\\
    BC, Canada
}
  
  \begin{document}
\maketitle

\begin{abstract}
In this paper, we study a multi-agent bandit problem in an unknown general-sum game repeated for a number of rounds~(i.e., learning in a black-box game with bandit feedback), where a set of agents have no information about the underlying game structure and cannot observe each other's actions and rewards. In each round, each agent needs to play an arm~(i.e., action) from a (possibly different) arm set~(i.e., action set), and \emph{only} receives the reward of the \emph{played} arm that is affected by other agents' actions. The objective of each agent is to minimize her own cumulative swap regret, where the swap regret is a generic performance measure for online learning algorithms.
We are the first to give a near-optimal high-probability swap-regret upper bound based on a refined martingale analysis for the exponential-weighting-based algorithms with the implicit exploration technique, which can further bound the expected swap regret instead of the pseudo-regret studied in the literature. 
It is also guaranteed that correlated equilibria can be achieved in a polynomial number of rounds if the algorithm is played by all agents.
Furthermore, we conduct numerical experiments to verify the performance of the studied algorithm.
\end{abstract}

\section{Introduction}\label{sec:intro}
The \emph{multi-armed bandit~(MAB)} is a theoretical model for online learning problems. The name comes from imagining a gambler who must play one of the arms of a slot machine in each round. When an arm is played, the gambler receives a random reward. The objective of the gambler is to accumulate as much reward as possible within $T$ rounds. As the gambler does not know a priori which arm returns the highest reward, the gambler faces a dilemma in each round between playing the currently best arm~(i.e., exploitation) and playing other arms to learn more about their rewards~(i.e., exploration).

To adapt to more complex scenarios in reality, many variants of MABs have been proposed. In this paper, we study a variant called \emph{multi-agent bandits in an unknown general-sum game~(MAB-UG)}, motivated by many real-world problems such as end-to-end congestion control in computer networks. In this case, each host has no information about others and needs to choose a transmission rate, hoping to maximize its throughput without congesting the network. Another example is the medium access control in wireless communications, where a set of devices need to access a shared communication channel to send packets in each time slot. 

The MAB-UG setting can be referred to as the black-box game studied in~\citet{nax2016learning}, where a set of agents $\mathcal{N}$, each associated with $K_n$ (possibly different) arms~(i.e., actions), are playing an unknown general-sum game repeated for $T$ rounds. All agents have no information about the structure of the underlying game and cannot observe each other's actions and rewards. In each round, each agent needs to play an arm $a_n^t$ from an arm set $A_n$, and observes the corresponding reward/loss.

The only information is the observed reward/loss for their own played arm in each round. Thus, each agent is facing a non-stochastic multi-armed bandit problem with adaptive~(i.e., non-oblivious) adversaries.
The objective for each agent is to accumulate as many rewards as possible, while the empirical joint distribution of all agents' actions reaches an $\epsilon$-correlated equilibrium~\citep{aumann1974subjectivity}, a concept more general than the well-known Nash equilibrium, within $T$ rounds.
Intuitively, an $\epsilon$-correlated equilibrium is
a state in which the expected incentive~(e.g., the reward difference) for each agent to deviate from a suggested action is no more than $\epsilon\geq 0$, where the expectation is taken with respect to the joint distribution of all agents' actions.

As each agent has very limited knowledge about the environment and can only learn from the rewards of the played arm that is affected by others' actions in each round, the algorithm to address MAB-UG must be carefully designed to balance the tradeoff between exploration and exploitation. 
The performance of an algorithm is usually measured by \emph{regret}. The most oft-used definition of regret in the bandit literature is called the \emph{external regret}~\citep{cesa2006prediction,lattimore2020bandit}, which measures the performance loss of an algorithm against a set of competitors always playing a fixed action. However, minimizing the external regret is not enough for MAB-UG, as another objective is to achieve the $\epsilon$-correlated equilibrium. Fortunately, it is proved in~\citet{hart2000simple,cesa2006prediction} that if every agent plays an algorithm that minimizes \emph{internal regret}, then the empirical joint distribution of actions converges to an $\epsilon$-correlated equilibrium. The internal regret is defined to be the performance loss for an algorithm that plays arm $a$ instead of playing another arm $a^\prime$. In this paper, we study a stronger regret notion called \emph{swap regret} introduced by~\citet{blum2007external}, which is a generalization of the above two regrets, comparing the performance of a learning algorithm against a larger set of competitors. The swap regret uses swap functions $F$ that take the arms played by an algorithm as input and output the arms to be compared. Thus, by changing the swap functions, the swap regret can boil down to external regret and internal regret.

The swap regret has been extensively studied in terms of \emph{pseudo-regret}~(or \emph{weak regret})~\citep{blum2007external,stoltz2005incomplete,ito2020tight}, i.e., $\max\limits_{F} \mathbf{E}\left[\sum\limits_{t=1}^T \sum\limits_{a\in A_n} \mathbf{1}[a_n^t = a] r_{a,F(a)} \right]$, where $r_{a,F(a)}$ is the instantaneous swap regret with arm $a$ and swap function $F$, and \emph{conditionally expected swap regret}~\citep{jin2022v}, i.e., $\max\limits_{F} \sum\limits_{t=1}^T \sum\limits_{a\in A_n} p_a^t r_{a,F(a)} $, which also bounds the pseudo-regret by taking expectation on the randomness of algorithms. However, bounding the above regret can only guarantee the expected swap regret (i.e., $\mathbf{E}\left[\max\limits_{F} \sum\limits_{t=1}^T \sum\limits_{a\in A_n} \mathbf{1}[a_n^t = a] r_{a,F(a)} \right]$) is bounded when adversaries are not adaptive~(i.e., oblivious)~\citep{audibert2010regret}, but each agent in MAB-UG is facing other agents as adaptive adversaries. Thus, a more meaningful but challenging bound is on the instantaneous swap regret~(i.e., $\max\limits_{F} \sum\limits_{t=1}^T \sum\limits_{a\in A_n} \mathbf{1}[a_n^t = a] r_{a,F(a)} $) for any sequence of actions and rewards, which is helpful not only to equilibrium convergence but also to the bound of the expected swap regret with respect to all agents' randomness.
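
To make the relation between the pseudo-regret and the expected swap regret explicit, the following comparison may help. It is only the standard fact that a maximum of expectations is at most the expectation of the maximum, written in the notation above; it is not a new result:
\begin{equation*}
\begin{aligned}
&\max_{F} \mathbf{E}\left[\sum_{t=1}^T \sum_{a\in A_n} \mathbf{1}[a_n^t = a]\, r_{a,F(a)} \right] \\
&\qquad\leq \mathbf{E}\left[\max_{F} \sum_{t=1}^T \sum_{a\in A_n} \mathbf{1}[a_n^t = a]\, r_{a,F(a)} \right].
\end{aligned}
\end{equation*}
Hence, a bound on the expected swap regret implies the same bound on the pseudo-regret, whereas the converse does not hold in general.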

The main contribution of our work is to give an instantaneous swap regret analysis for the exponential-weighting-based algorithms called \emph{learning for correlated equilibrium with implicit exploration~(LCE-IX)}. LCE-IX is based on the swap-regret-minimizing framework proposed by~\cite{blum2007external}, and the main idea is to call $K_n$ exponential-weighting-based algorithms with the Implicit eXploration~(IX) technique~\citep{kocak2014efficient,neu2015explore} as subroutines.  Then, the probability of selecting an arm is obtained by the Markov steady-state distribution of the Markov process among $K_n$ subroutines, and the reward/loss is proportionally fed to the subroutines for updates.

However, the existing concentration inequality for the IX technique cannot be simply applied to the analysis of swap regret. The main difficulties are twofold. First, the swap regret is only equivalent to the sum of the external regret for subroutine algorithms in expectation. In this sense, the existing concentration inequality for IX can only give a high-probability bound on the conditionally expected swap regret.  When we analyze the instantaneous swap regret, we cannot convert it directly to the sum of the instantaneous external regret for subroutine algorithms.  Second, the reward/loss of each arm is a result of all agents' actions, which is not determined at the beginning of each round as in the single-agent bandit setting (see more discussions in Sec.~\ref{sec:problem}).

To address this problem, we prove a novel general-form concentration inequality between the IX loss estimator and the swapped loss based on a refined martingale analysis by treating the $K_n$ subroutine algorithms as a whole.  Based on this concentration inequality, we show that with probability at least $1-\delta$ for $\delta \in (0,1)$, the instantaneous swap regret is bounded in $O(K_n \sqrt{T\log(K_n/\delta)})$ for each agent $n\in \mathcal{N}$.

Furthermore, by integrating the tails of this high-probability bound for the instantaneous swap regret, we show the expected swap-regret bound is in $O(K_n \sqrt{T\log(K_n)})$ with respect to all agents' randomness. The above swap-regret bounds are near-optimal with an $O(\sqrt{K_n})$ gap from the swap-regret lower bound by \cite{ito2020tight} for a related model. 
It is also guaranteed that LCE-IX can converge to $\epsilon$-correlated equilibria for unknown general-sum games in a polynomial number of rounds if the algorithm is played by all agents. 
Numerical experiments verify the performance of LCE-IX.
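
The tail-integration step mentioned above can be sketched as follows. This is only an illustration under assumptions that are not taken from the formal statements: we suppose that the high-probability bound has the form $R \leq cK_n\sqrt{T\log(K_n/\delta)}$ for a hypothetical absolute constant $c>0$, where $R$ denotes the instantaneous swap regret, and that it holds for every $\delta\in(0,1)$ with the same algorithm parameters. Using $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ and letting $W := \max\left\{0,\, R - cK_n\sqrt{T\log K_n}\right\}/\left(cK_n\sqrt{T}\right)$, the assumed bound gives $\Pr\left[W > \sqrt{\log(1/\delta)}\right]\leq\delta$, i.e., $\Pr[W>x]\leq e^{-x^2}$ for all $x>0$. Therefore,
\begin{equation*}
\begin{aligned}
\mathbf{E}[R] &\leq cK_n\sqrt{T\log K_n} + cK_n\sqrt{T}\int_0^\infty \Pr[W>x]\,\mathrm{d}x \\
&\leq cK_n\sqrt{T\log K_n} + \frac{\sqrt{\pi}}{2}\,cK_n\sqrt{T},
\end{aligned}
\end{equation*}
which is $O(K_n\sqrt{T\log K_n})$ whenever $K_n\geq 2$.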
%Furthermore, we verify the effectiveness of LCE in the wireless medium access-motivated experiments.

The rest of the paper is organized as follows. In Sec.~\ref{sec:relatedworks}, we review the works that are most related to MAB-UG. The problem settings are described in Sec.~\ref{sec:problem}. The LCE-IX algorithm is proposed in Sec.~\ref{sec:Algorithm}, with analytical results presented in Sec.~\ref{sec:results}. 
The experiment results are shown, compared and discussed in Sec.~\ref{sec:experiments}.
Sec.~\ref{sec:conclusion} concludes the paper. 
The detailed proofs of the swap-regret upper bound are deferred to the Appendix in the supplementary materials.


\section{Related Works}\label{sec:relatedworks}
\textbf{Multi-agent bandits:} 
Multi-agent bandits consider a group of agents participating in decision making, and aim to improve learning efficiency through collaborations. The works about multi-agent bandits are mainly focused on improving rewards by communication~\citep{buccapatnam2015information,chakraborty2017coordinated,kolla2018collaborative,vial2021robust}, identifying the best arm to avoid collision~\citep{bubeck2020non,liu2010distributed,hillel2013distributed,szorenyi2013gossip,jamieson2014best}, and voting for playing arms~\citep{dubey2020cooperative}. All the above bandit settings assume the arm set for each agent is identical, and the reward for an agent does not depend on the actions of other agents, or just follows a simple collision model. MAB-UG considers (possibly) varied arm sets for different agents and more general competitions among agents.


\textbf{Learning in games:}
The history of learning in games can be traced back to the fictitious play for the two-player zero-sum games~\citep{brown1949some,robinson1951iterative}. Nevertheless, such a fictitious play requires that the decisions of opponents can be observed, and thus it cannot be applied to the unknown games where the agents can only observe their own outcomes (or rewards). To address the challenges of unknown games, online learning has been introduced by
many works for specific games such as potential games~\citep{coucheney2015penalty,cohen2017learning,pmlr-v139-bielawski21a,pmlr-v139-mguni21a}, and mean-field games~\citep{pmlr-v139-min21a,pmlr-v139-wang21j,pmlr-v139-xie21g}. 
% Markov games~\cite{tian2021online,pmlr-v139-qiu21b}. 
However, the above solutions for specific games depend on corresponding properties (e.g., potential functions for potential games), and thus cannot be easily extended to the general-sum games. Thus, we focus on the learning in the unknown general-sum games~(i.e., black-box games~\citep{nax2016learning}), which is a basic case of learning in general-sum Markov games~\citep{littman1994markov}.

Regarding the unknown general-sum games, there are mainly two lines of research depending on the observability of rewards. If the reward of an action can be observed regardless of whether it is played or not, we call it the \emph{full-information} model~\citep{cesa2006prediction}, and if only the reward of a played action can be observed, then it is the \emph{partial-information} model (i.e., \emph{bandit} feedback). Recent years have witnessed steady progress in learning general-sum games in the full-information model~\citep{krichene2015online,palaiopanos2017multiplicative,chen2020hedging,daskalakis2021near,anagnostides2022near,farinanear}. However, the results for the full-information model cannot be easily extended to the partial-information model, as less information is observed in each round, which makes the partial information model more challenging. The first work that addressed the unknown general-sum games with bandit feedback is \citet{auer2002nonstochastic}, where an exponential-weighting-based technique is proposed to minimize the external regret. However, it is typically one of the goals in general games to search for correlated equilibria, and it is
shown in~\citet{cesa2006prediction} that only minimizing external regret cannot achieve this goal. 

\begin{table}[h]
\centering
\caption{Swap-regret bounds for \emph{exponential-weighting}-based algorithms with bandit feedback}\label{tab:1}
\resizebox{1\columnwidth}{!}{%
\begin{tabular}{llll}
\toprule
Upper bound & Computational cost & Regret notion & Lower bound \\ \midrule
$O\left(\sqrt{T K_n^{3} \log (K_n)}\right)$ {\citep{blum2007external}} & poly-time & pseudo-regret & $\Omega\left(\sqrt{T K_n }\right)$ {\citep{blum2007external}} \\
$O\left(\sqrt{T K_n^{2} \log (K_n)}\right)$ {\citep{stoltz2005incomplete}} & exp-time & pseudo-regret & \\
$O\left(\sqrt{T K_n^2 \log (K_n)}\right)$ {\citep{ito2020tight}} & poly-time & pseudo-regret & $\Omega\left(\sqrt{T K_n \log (K_n)}\right)$ {\citep{ito2020tight}} \\
$O\left(\sqrt{T K_n^2 \log (K_n/\delta)}\right)$ {\citep{jin2022v}} & poly-time & conditionally expected regret & \\
$O\left(\sqrt{T K_n^2 \log (K_n/\delta)}\right)$ (our work, Theorem~\ref{thm:regret3}) & poly-time & instantaneous regret & \\
$O\left(\sqrt{T K_n^2 \log (K_n)}\right)$ (our work, Theorem~\ref{thm:expected}) & poly-time & expected regret & \\ \bottomrule
\end{tabular}%
}
\end{table}

\cite{blum2007external} generalized the notion of external and internal regrets to the swap regret, and proposed a polynomial-time swap-regret-minimizing framework based on $K_n$ external-regret-minimizing subalgorithms, where $K_n$ is the number of arms.
They proved that if the external pseudo-regret of each subalgorithm can be represented by a concave function $r(T)$, where $T$ is the time horizon, and if the dependency on $K_n$ is ignored in $r(T)$, the swap pseudo-regret of their proposed algorithm is $K_n \cdot r(T)$. Therefore, as each exponential-weighting subalgorithm has an external-pseudo-regret bound of $r(T) = O(\sqrt{TK_n\log (K_n)})$~\citep{auer2002nonstochastic}, the analysis of \cite{blum2007external} gives a pseudo-regret bound of $O(K_n\sqrt{TK_n\log (K_n)})$ for their proposed algorithm.

Later, this bound was improved by \cite{stoltz2005incomplete} to $O(K_n\sqrt{T\log (K_n)})$, but with exponential computational complexity. On the other hand, \cite{ito2020tight} improved the upper bound for the swap pseudo-regret to $K_n \cdot r(T/{K_n})$ with a polynomial-time algorithm by adding another layer of randomness to the original framework~\citep{blum2007external}, where in each round only one subroutine is selected according to the calculated stationary distribution of the Markov chain. The selected subroutine selects an arm, and the reward is \emph{entirely} fed to this subroutine for updates.
The modified framework gives a pseudo-regret of $O(\sqrt{TK_n^2\log (K_n)})$ for the exponential-weighting-based subalgorithms.\footnote{The swap regret for the modified framework by \cite{ito2020tight} can be tighter if mirror descent algorithms~\citep{zimmert2019optimal} are used. However, in this paper, we only discuss the swap-regret bound for the exponential-weighting-based algorithms.} It was also proved by~\cite{ito2020tight} that the lower bound for swap regret is $\Omega\left(\sqrt{T K_n \log (K_n)}\right)$, which is tight in the full-information but not partial-information models.
Recently, \cite{jin2022v} proved a high-probability bound of $O(K_n\sqrt{T\log(K_n/\delta)})$ for the conditionally expected swap regret, which can bound the pseudo-regret by integrating the tails. Table~\ref{tab:1} gives a summary of the swap-regret bounds for exponential-weighting-based algorithms with bandit feedback.

In contrast, we are the first to prove a high-probability bound of $O(K_n\sqrt{T\log(K_n/\delta)})$ for the instantaneous swap regret and to bound the expected swap regret by~$O(K_n\sqrt{T\log(K_n)})$ with respect to all agents' randomness. These bounds are near-optimal: they are within an $O(\sqrt{K_n})$ factor of the swap-regret lower bound of \citet{ito2020tight}, a lower bound that is tight for full-information models.
 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% Problem Setting
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Problem Formulation}\label{sec:problem}
\subsection{The MAB-UG Model}
In MAB-UG, the reward of each agent's action is affected by the actions of other agents, and each agent has no prior knowledge about the environment, such as the number of agents, the reward of each action, and the actions of other agents. A simple example of MAB-UG with two agents and two arms for each agent is shown in Fig.~\ref{fig:model}, where in the current round, Agent 1 plays arm $a$ and only observes a normalized reward of $0.8$, and Agent 2 plays arm $c$ and only observes a normalized reward of $0.2$. Neither agent has any information about the arm played by the other agent or about the rewards of the arms that are not played.
%denoted by a tuple $(\mathcal{N},\mathcal{A}, \mathcal{U})$. 
\begin{figure}[h]
\centering
    \includegraphics[width = 0.7\columnwidth]{Figures/modelexample.pdf}
    \caption{An example of MAB-UG with two agents and two arms for each agent.}
    \label{fig:model}
\end{figure}

Formally, let $\mathcal{N}:=\{1,\ldots,N\}$ be the set of all agents, where each agent $n\in \mathcal{N}$ is associated with a finite set of arms~(i.e., actions) $A_n$ of size $K_n$. The arm sets of different agents are not required to be identical.
Let $\mathcal{A}:=\prod\limits_{n\in \mathcal{N}} A_n$ be the joint action space~(i.e., the Cartesian product of all agents' arm sets), and let $\mathbb{A}\in \mathcal{A}$ be an action profile~(i.e., a vector of all agents' actions).
The reward for agent $n$ playing arm $a_n^t \in A_n$ in round $t$ is determined by a function $u_n: \mathcal{A} \rightarrow [0,1]$, which maps the actions of all agents to agent $n$'s reward $u_n(a_n^t;\mathbb{A}^t_{-n})$.\footnote{$(a_n^t;\mathbb{A}_{-n}^t)$ is shorthand for $\mathbb{A}^t:=(a_1^t,\ldots,a_n^t,\ldots,a_N^t)$ that highlights agent $n$'s action $a_n^t$ against the other agents' actions $\mathbb{A}_{-n}^t$.} Note that our algorithm and analyses also work for a time-varying reward function $u_n^t$. In addition, $u_n^t$ can be determined in either an oblivious way or a non-oblivious~(i.e., adaptive) way, corresponding to the oblivious adversary or the non-oblivious adversary in single-agent bandits. In the oblivious case, $\{u_n^t\}_{t>0}$ is chosen at the beginning of the game, while in the non-oblivious case, each $u_n^t$ may depend on all agents' actions in the past.

One of the main differences between multi-agent bandits and single-agent bandits is the measurability of the rewards. In single-agent bandits, regardless of whether $u_n^t$ is determined obliviously or non-obliviously, the reward of each arm in round $t$ is determined at the beginning of that round, before the agent plays an action. However, in multi-agent bandits, as the reward of each arm for each agent depends on the other agents' actions, the reward of each arm in a round cannot be determined until all agents have played an action in that round.

Let $\mathcal{U}:=\{u_1,\ldots,u_N\}$ be the set of reward functions for the $N$ agents. Note that neither $\mathcal{N}$ nor $\mathcal{U}$ is known a priori to any agent; each agent $n$ only knows in advance her own set of arms~$A_n$.

In each round $t=1, \ldots, T$, each agent $n\in\mathcal{N}$ can use a \emph{mixed} strategy to play an arm $a_n^t\in A_n$ according to a probability distribution over arms $P_n^t:=\{p_a^t:\forall a \in A_n\}$, i.e., play arm $a\in A_n$ with probability $p_a^t$. 
Then, each agent $n$ can only observe her own instantaneous reward $X_n^t:= u_n(a_n^t;\mathbb{A}_{-n}^t)$.\footnote{For the convenience of algorithm description and analysis, we sometimes use an equivalent notion called the instantaneous loss, i.e., $1-X_n^t$, and denote by $y_{n,a}^t:= 1-u_n(a;\mathbb{A}_{-n}^t)$ the instantaneous loss if agent $n$ plays $a\in A_n$.} Neither the actions nor the number of other agents can be observed. The objective of each agent is to accumulate as many rewards as possible over $T$ rounds.

\subsection{Problem Formulation}
As each agent has little knowledge about the environment, it is inevitable for each agent to suffer a \emph{regret}, i.e., the loss of rewards for not playing the optimal arm in hindsight that returns the highest cumulative rewards. In bandit problems, the problem of maximizing the cumulative reward is commonly converted to the problem of minimizing the regret.
The notion of regret has many forms. The most oft-used regret in the bandit literature is the \emph{external regret}~\citep{cesa2006prediction}. Let $\mathbf{1}[a_n^t = a]$ be the indicator function that returns $1$ if $a$ is the played arm in round $t$ and $0$ otherwise. The external regret $R_{n}^{\rm ext} (T)$ for agent $n$ compares the cumulative reward of a learning algorithm with that of the best competitor that always plays a fixed arm up to round $T$, and is defined as follows:
\begin{equation*}
\resizebox{1\hsize}{!}{$
\begin{aligned}
    R_{n}^{\rm ext} (T) &:= \max_{a^\prime \in A_n} \sum_{t=1}^T u_n (a^\prime ;\mathbb{A}_{-n}^t) - \sum_{t=1}^T \sum_{a \in A_n} \mathbf{1}[a_n^t = a ]  u_n (a ;\mathbb{A}_{-n}^t).
\end{aligned}
$}
\end{equation*}
However, minimizing the external regret alone cannot guarantee that the agents' plays reach an equilibrium. Therefore, we need a strictly stronger notion of regret, namely the \emph{internal regret}, which compares the actions of an agent in a pairwise manner:
\begin{equation}\label{eq:intern}
    R_{n}^{\rm int} (T)  :=  \max_{a,a^\prime \in A_n}  \sum_{t=1}^T r_{(a,a^\prime),n}^t,
\end{equation}
where
\[
r_{(a,a^\prime),n}^t := \mathbf{1}[a_n^t = a] \left(u_n(a^\prime ;\mathbb{A}_{-n}^t) - u_n(a;\mathbb{A}_{-n}^t)\right)
\]
is the instantaneous regret for agent $n$ of having played arm $a$ instead of arm $a^\prime$ in round $t$.
As proved in~\citet{hart2000simple,cesa2006prediction}, if each agent minimizes her internal regret, the empirical joint distribution of the agents' plays converges to an $\epsilon$-correlated equilibrium, which is defined as follows.
\begin{definition}
Let $\mathbf{P}$ be a joint probability distribution over $\mathcal{A}$. We say $\mathbf{P}$ is an $\epsilon$-correlated equilibrium if the expected incentive for each agent $n$ to deviate from action $a$ to any other action $a^\prime \in A_n$ is no more than $\epsilon\geq 0$, i.e., $\forall n\in\mathcal{N}$ and $\forall a,a^\prime\in A_n$, we have
\begin{equation}\label{eq:correlated}
    \sum_{\mathbb{A}_{-n}:\,(a;\mathbb{A}_{-n})\in \mathcal{A}}\mathbf{P}((a;\mathbb{A}_{-n}))\left( u_n(a^\prime;\mathbb{A}_{-n}) - u_n(a;\mathbb{A}_{-n}) \right) \leq \epsilon.
\end{equation}
\end{definition}
Note that $\mathbf{P}$ is a joint distribution and not necessarily a product distribution, which is the difference between the correlated equilibrium and the Nash equilibrium. When $\epsilon = 0 $, $\mathbf{P}$ is a correlated equilibrium, which is more general than the well-known Nash equilibrium, as the correlated equilibrium does not require independence among actions.
To give an intuition about the $\epsilon$-correlated equilibrium, consider a case of congestion control in computer networks where a \emph{mediator} (e.g., a router or switch) draws an action profile from $\mathbf{P}$ and privately recommends each action (e.g., the packet sending rate) to the corresponding host. If no host has an incentive of more than $\epsilon$ to choose a different action, provided that other hosts follow the mediator's recommendation, then $\mathbf{P}$ is an $\epsilon$-correlated equilibrium.
Our objective is to achieve an $\epsilon$-correlated equilibrium without a mediator by minimizing the internal regret for each agent.

In this paper, we consider a more general notion of regret, called the \emph{swap regret}~\citep{blum2007external}, which unifies the external regret and the internal regret in the same framework through a swap function $F: A_n \rightarrow A_n$ that takes $a\in A_n$ as input and outputs $a^\prime \in A_n$. Let $\mathcal{F}$ be a finite set of such swap functions. Then,
the instantaneous swap regret for agent $n$ with $\mathcal{F}$ up to round $T$ is defined as follows:
\begin{equation}\label{eq:swap}
\resizebox{1\hsize}{!}{$
\begin{aligned}
    &R_{n}^{\rm swa} (T,\mathcal{F}) =\max_{F\in \mathcal{F}}\sum_{t=1}^T \sum_{a \in A_n}  \mathbf{1}[a_n^t = a]  \left( u_n(F(a);\mathbb{A}_{-n}^t) -  u_n(a;\mathbb{A}_{-n}^t) \right).
\end{aligned}
$}
\end{equation}
We can reduce the swap regret to the external regret by letting $\mathcal{F}$ be the set of $K_n$ constant functions $F_a$, $a\in A_n$, where $F_a(a^{\prime\prime}) = a$ for all $a^{\prime\prime}\in A_n$. Similarly, the internal regret is obtained by letting $\mathcal{F}$ be the set of $K_n(K_n-1)$ functions $F_{(a,a^\prime)}$, one for each ordered pair of distinct arms $a,a^\prime \in A_n$, with $F_{(a,a^\prime)}(a) = a^\prime$ and $F_{(a,a^\prime)}(a^{\prime\prime}) = a^{\prime\prime}$ for any other $a^{\prime\prime}\in A_n$. Thus, by minimizing the swap regret of a learning algorithm for the general $\mathcal{F}$ containing all $K_n^{K_n}$ possible mappings $F$, we can show that the learning algorithm has a bounded performance gap from a broader range of competitors. We can also minimize the internal and external regrets at the same time, and achieve the $\epsilon$-correlated equilibrium for all agents.
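
As a concrete illustration of these reductions, consider a small example of our own construction with three hypothetical arms, $A_n=\{a,b,c\}$, so that $K_n=3$. Writing a swap function as the triple $\left(F(a),F(b),F(c)\right)$, the external regret uses the $K_n=3$ constant functions
\begin{equation*}
F_a=(a,a,a),\quad F_b=(b,b,b),\quad F_c=(c,c,c),
\end{equation*}
while the internal regret uses the $K_n(K_n-1)=6$ functions that change exactly one arm:
\begin{equation*}
\begin{aligned}
F_{(a,b)}&=(b,b,c), & F_{(a,c)}&=(c,b,c), & F_{(b,a)}&=(a,a,c),\\
F_{(b,c)}&=(a,c,c), & F_{(c,a)}&=(a,b,a), & F_{(c,b)}&=(a,b,b).
\end{aligned}
\end{equation*}
The general swap regret ranges over all $3^3=27$ mappings, which also include functions such as the cyclic swap $(b,c,a)$ that belong to neither of the two sets above.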
% The motivations for us to study the swap regret are twofold. First, minimizing the swap regret with any $\mathcal{F}$ can minimize the internal and external regrets at the same time. Second, it was proved in \cite{blum2007external} that if each agent involved in a game plays a strategy that can minimize the swap regret, then the empirical distribution of the joint actions played by all agents converges to 

%Therefore, minimizing the swap regret can give us a better result such that $\epsilon$-correlated equilibrium can be achieved, which means all agents 
%Therefore, we convert the problem of maximizing the cumulative rewards to the problem of minimizing the swap regret for each agent. The challenges to addressing the problem are that each agent has no prior knowledge of the environment and each agent can only observe the rewards of their own played arms. Thus, each agent faces a dilemma between exploration and exploitation. The exploration means each agent needs to play each arm for more information, and the exploitation means that each agent needs to play the currently-known best arm to gain more rewards. 

\section{The LCE-IX Algorithm}\label{sec:Algorithm}

The LCE-IX algorithm adopts the swap-regret-minimizing framework introduced by~\cite{blum2007external} and calls $K_n$ Exp3-IX algorithms~\citep{kocak2014efficient,neu2015explore} as subroutines. Each subroutine maintains a meta-distribution, and the action selection probability is calculated from the meta-distributions. The observed reward or loss will be assigned proportionally to each subroutine for updating the meta-distributions.

For each agent $n$, we define a meta-distribution $Q_{a}^t:=\{q_{a,a^\prime}^t:\forall a^\prime \in A_n\}$ for each arm $a\in A_n$ such that $q_{a,a^\prime}^t \in [0,1]$ and $\sum\limits_{a^\prime \in A_n} q_{a,a^\prime}^t =1$. Denote by $\mathbb{Q}_n^t:=[Q_{a}^t]_{a\in A_n}$ the $K_n \times K_n$ matrix whose rows are the meta-distributions $Q_{a}^t$. Then, we determine the sample distribution $P_n^t$ by solving the following fixed-point equation:
\begin{equation}\label{eq:pcal}
    P_n^t = P_n^t \mathbb{Q}_n^t,
\end{equation}
where $P_n^t$ is a row vector of $p_a^t, \forall a\in A_n$ and $\sum\limits_{a\in A_n} p_a^t = 1$. 
That is, for each $a\in A_n$, we have $p_a^t = \sum\limits_{a^\prime \in A_n} p_{a^\prime}^t q_{a^\prime,a}^t$, which is similar to the calculation of the stationary distribution of a Markov process with the transition matrix being $\mathbb{Q}_n^t$.
The intuition behind (\ref{eq:pcal}) is to make the probability of playing arm $a^\prime \in A_n$ directly according to $P_n^t$ be equivalent to the probability of first sampling any arm $a \sim P_n^t$ and  then playing $a^\prime$ according to $Q_{a}^t$.
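
As a small illustration of (\ref{eq:pcal}), consider a hypothetical agent with $K_n = 2$ arms whose meta-distributions are $Q_{1}^t = (1-u, u)$ and $Q_{2}^t = (v, 1-v)$ for some $u,v \in (0,1)$; the symbols $u$ and $v$ are introduced only for this example. Together with $p_1^t + p_2^t = 1$, the first coordinate of (\ref{eq:pcal}) gives
\begin{equation*}
\begin{aligned}
    p_1^t &= (1-u)\, p_1^t + v\, p_2^t
    \;\Longrightarrow\; u\, p_1^t = v\, p_2^t \\
    &\Longrightarrow\; P_n^t = \left(\frac{v}{u+v},\ \frac{u}{u+v}\right).
\end{aligned}
\end{equation*}
In this example, arm $1$ is sampled more often the more mass the meta-distribution of arm $2$ places on it, and vice versa.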

The suffix `IX' of LCE-IX stands for implicit exploration, which is justified by the $\gamma_t$-biased reward estimator defined as follows. 
Denote by $Y_{a,a^\prime}^t := \frac{\mathbf{1}[a_n^t = a^\prime] p_{a}^t q_{a,a^\prime}^t}{p_{a^\prime}^t} (1-X_n^t)$ the loss of arm $a^\prime$ observed by subroutine $a$. 
Let $\gamma_t$ be a non-negative and non-increasing parameter over time $t$. We then define the $\gamma_t$-biased estimated loss for $Y_{a,a^\prime}^t$ as follows:
\begin{equation*}
    \hat{Y}_{a,a^\prime}^t:= \frac{Y_{a,a^\prime}^t}{q^t_{a,a^\prime}+\gamma_t}.
\end{equation*}
The bias factor $\gamma_t$ was introduced by \cite{kocak2014efficient,neu2015explore}; it smooths the meta-distributions so that arms with low rewards in the past can still be chosen occasionally for exploration. 
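
To make the role of $\gamma_t$ concrete, the following sketch computes the conditional mean of the estimator. It assumes the shorthand $\ell_{a^\prime}^t$ (our notation for this illustration only) for the expected loss $1-X_n^t$ given the history before round $t$ and given $a_n^t = a^\prime$, and writes $\mathbf{E}_t[\cdot]$ for the expectation conditioned on that history. Since $a_n^t \sim P_n^t$, the indicator $\mathbf{1}[a_n^t = a^\prime]$ equals one with probability $p_{a^\prime}^t$, so
\begin{equation*}
\begin{aligned}
    \mathbf{E}_t\big[\hat{Y}_{a,a^\prime}^t\big]
    &= p_{a^\prime}^t \cdot \frac{p_{a}^t q_{a,a^\prime}^t}{p_{a^\prime}^t} \cdot \frac{\ell_{a^\prime}^t}{q_{a,a^\prime}^t+\gamma_t} \\
    &= \frac{q_{a,a^\prime}^t}{q_{a,a^\prime}^t+\gamma_t}\, p_{a}^t\, \ell_{a^\prime}^t
    \;\leq\; p_{a}^t\, \ell_{a^\prime}^t .
\end{aligned}
\end{equation*}
With $\gamma_t = 0$ the inequality is an equality, whereas any $\gamma_t > 0$ shrinks the estimate, and the shrinkage is strongest for entries with small $q_{a,a^\prime}^t$.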


In addition, we consider the situation where $T$ may not be known a priori. We therefore use a non-increasing learning rate $\eta_t$ and, with $\hat{L}_{a,a^\prime}^t := \sum_{s=1}^{t} \hat{Y}_{a,a^\prime}^s$ denoting the cumulative estimated loss, update each meta-distribution as follows:
\begin{equation}\label{update-meta22}
    q_{a,a^\prime}^{t+1} = \frac{\exp{(-\eta_{t+1} \hat{L}_{a,a^\prime}^t)}}{\sum\limits_{a^{\prime\prime} \in A_n}\exp{(-\eta_{t+1} \hat{L}_{a,a^{\prime\prime}}^t)}}.
\end{equation}
With these modifications, we obtain the LCE-IX algorithm described in Alg.~\ref{exp3IX}.

\begin{algorithm}[h]
\caption{The LCE-IX algorithm}\label{exp3IX}
\begin{algorithmic}[1]
\STATE {\bfseries Input:}{ $n,A_n,\eta_t,\gamma_t$}
\STATE // Initialization 
\STATE Set $q_{a,a^\prime}^1 = \frac{1}{K_n}$ and $\hat{L}_{a,a^\prime}^0 = 0, \forall a, a^\prime \in A_n$
\FOR{$t=1,\ldots,T$}
\STATE // Compute the sample distribution, play arms and observe rewards

\STATE Calculate $P_n^t$ based on (\ref{eq:pcal})

\STATE Play an arm $a_n^t \sim P_n^t$
\STATE Observe reward $X_n^t$

\STATE // Update each meta-distribution
\FOR{$a \in A_n$}
\STATE $Y_{a,a^\prime}^t := \frac{\mathbf{1}[a_n^t = a^\prime] p_{a}^t q_{a,a^\prime}^t}{p_{a^\prime}^t} (1-X_n^t) , \forall a^\prime \in A_n$
\STATE $\hat{Y}_{a,a^\prime}^t:= \frac{Y_{a,a^\prime}^t}{q^t_{a,a^\prime}+\gamma_t}, \forall a^\prime \in A_n$
%\State $X_{a,a_n^t} = \frac{p_{a}^t q_{a,a_n^t}^t}{p_{a_n^t}^t} X_n^t$
\STATE $\hat{L}_{a,a^\prime}^{t} = \hat{L}_{a,a^\prime}^{t-1} + {\hat{Y}_{a,a^\prime}^t},  \forall a^\prime \in A_n$ 
\STATE Calculate $Q_{a}^{t+1}$ based on (\ref{update-meta22})
\ENDFOR
\ENDFOR
%\EndProcedure
\end{algorithmic}
\end{algorithm}

\section{Analytical Results for LCE-IX}\label{sec:results}
\subsection{Regret Bound}
As the regret analysis is for each individual agent $n\in \mathcal{N}$, without confusion, we drop the subscript $n$ in some notations for brevity.
Let $\mathcal{G}_{t}$ denote the $\sigma$-algebra generated by the history of all agents up to round $t$, i.e., $\mathcal{G}_{t}:= \sigma \left(\{a_n^1,r_n^1,\ldots,a_n^{t},r_n^{t}\}_{n \in \mathcal{N}} \right)$. Denote by $\tilde{Y}_{a,a^\prime}^t:=\mathbf{1}[a_n^t = a] y_{a^\prime}^t$ the swapped loss from $a$ to $a^\prime$, where $y_{a^\prime}^t:= 1 - X_n^t$.
We first state a novel concentration bound for the $\gamma_t$-biased loss estimator used in LCE-IX, which shows that the cumulative gap between the biased loss estimator $\hat{Y}_{a,a^\prime}^t$ and the swapped loss $\tilde{Y}_{a,a^\prime}^t$ for each agent $n\in \mathcal{N}$ is bounded with high probability.
\begin{lemma}\label{lm2}
Let $\delta \in (0,1)$ and let $\beta_{a,a^\prime}^t$ be nonnegative and non-increasing (over time $t$)
$\mathcal{G}_{t-1}$-measurable random variables~(i.e., given $\mathcal{G}_{t-1}$, $\beta_{a,a^\prime}^t$ is determined) satisfying $\beta_{a,a^\prime}^t\leq 2\gamma_t$ for all pairs $a, a^\prime \in A_n$.
Then, with probability at least $1-\delta$, the following inequality holds:
\begin{equation}\label{eq:concerntration3}
   \sum_{t=1}^T \sum\limits_{a \in A_n} \sum_{a^\prime \in A_n} \beta_{a,a^\prime}^t \left(\hat{Y}_{a,a^\prime}^t - \tilde{Y}_{a,a^\prime}^t \right) \leq \log (\frac{1}{\delta}) \ .
\end{equation}
\end{lemma}
\begin{proof}[Proof Sketch]
We only give a proof sketch here; the detailed proof can be found in Appendix A of the supplementary materials.
First, we construct a sequence of random variables $\{Z_t\}_{t\geq 0}$, where $Z_t  := \exp\left\{{\sum\limits_{s=1}^t \sum\limits_{a \in A_n} \sum\limits_{a^\prime \in A_n} \beta_{a,a^\prime}^s \left(\hat{Y}_{a,a^\prime}^s - \tilde{Y}_{a,a^\prime}^s \right) }\right\}$ for $t>0$ and $Z_0 = 1$, and then prove that $\{Z_t\}_{t\geq 0}$ is a supermartingale with respect to the filtration $\{\mathcal{G}_t\}_{t\geq 0}$, i.e., $\mathbf{E}\left[Z_t | \mathcal{G}_{t-1} \right] \leq Z_{t-1}$. 
Finally, since $\mathbf{E}[Z_T] \leq Z_0 = 1$, the lemma follows from Markov's inequality applied to $Z_T$.
\end{proof}
The proof of Lemma~\ref{lm2} refines the IX concentration bound studied in \cite{neu2015explore}. The original approach in \cite{neu2015explore} is designed for the external regret, whereas the swap regret is equivalent to the sum of the external regrets of the subroutine algorithms only in expectation. Therefore, we cannot simply adapt their concentration bound to analyze the instantaneous swap regret. 
%will introduce an extra $K_n$ to our bound, as we have $K_n$ meta-distributions and their concentration bound works on each meta-distribution. 
In addition, in the original concentration bound, the probability is taken with respect to only one agent's randomness, which is not suitable for the MAB-UG setting, as the reward/loss for each agent is dependent on all agents' actions. 
%The bound in Theorem~\ref{thm:regret} is for a fixed $\beta$ over time.   
%In addition, the bound in \cite{neu2015explore} requires $\beta_t\leq 2\gamma_t$, while the requirement is relaxed in our bound where $\beta_t\leq g_t\gamma_t$ for any positive $g_t$. 
%The proof technique is also different. In this chapter, we use the martingale-based Azuma's inequality, while the Markov inequality is used in~\cite{neu2015explore}.
To address these issues, Lemma~\ref{lm2} considers the $K_n$ subroutines as a whole and establishes a supermartingale property for the gap between the summed IX loss estimators of all meta-distributions and the general swapped loss, with respect to all agents' randomness. The following lemma is a direct consequence of Lemma~\ref{lm2} and is essential for the swap-regret analysis.



\begin{lemma}\label{crl:1}
Let $\delta \in (0,1)$. With probability at least $1-\delta$, the following inequalities hold simultaneously:
\begin{equation}\label{eq:lm1}
       \sum\limits_{t=1}^T \sum\limits_{a \in A_n} \sum\limits_{a^\prime \in A_n} {\eta_t}\left(\hat{Y}_{a,a^\prime}^t - \tilde{Y}_{a,a^\prime}^t \right) \leq  \log (\frac{1}{\delta}),
\end{equation}
and for any $F \in \mathcal{F}$,
\begin{equation}\label{eq:lm2}
       \sum\limits_{t=1}^T \sum\limits_{a \in A_n} \left(\hat{Y}_{a,F(a)}^t - \tilde{Y}_{a,F(a)}^t \right) \leq \frac{1}{2\gamma_T} \log (\frac{K_n^{K_n}}{\delta}).
\end{equation}
\end{lemma}
\begin{proof}
    (\ref{eq:lm1}) is obtained by invoking Lemma~\ref{lm2} with $\beta_{a,a^\prime}^t := \eta_t$ for all $a,a^\prime \in A_n$.
    (\ref{eq:lm2}) is obtained by invoking Lemma~\ref{lm2} with $\beta_{a,a^\prime}^t := 2\gamma_T \mathbf{1}[a^\prime = F(a)]$ for all $a, a^\prime \in A_n$ and applying the union bound over the at most $|\mathcal{F}| = K_n^{K_n}$ swap functions $F \in \mathcal{F}$.
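
    For completeness, the two invocations can be sketched as follows, assuming $\gamma_t = \eta_t/2$ as in Theorem~\ref{thm:regret3}; the symbol $\delta^\prime$ is introduced only for this sketch. For the first inequality, $\eta_t = 2\gamma_t$ is non-increasing and satisfies $\beta_{a,a^\prime}^t \leq 2\gamma_t$ with equality. For the second inequality, fix $F \in \mathcal{F}$; the choice $2\gamma_T \mathbf{1}[a^\prime = F(a)]$ is constant in $t$ and at most $2\gamma_t$ for all $t \leq T$ because $\gamma_t$ is non-increasing. Lemma~\ref{lm2} with confidence level $\delta^\prime$ then gives
    \begin{equation*}
        2\gamma_T \sum\limits_{t=1}^T \sum\limits_{a \in A_n} \left(\hat{Y}_{a,F(a)}^t - \tilde{Y}_{a,F(a)}^t \right) \leq \log \Big(\frac{1}{\delta^\prime}\Big).
    \end{equation*}
    Dividing by $2\gamma_T$ and taking $\delta^\prime = \delta / K_n^{K_n}$, so that the union bound over all $F$ costs at most $\delta$, yields the right-hand side $\frac{1}{2\gamma_T} \log (\frac{K_n^{K_n}}{\delta})$.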
\end{proof}
% The swap regret defined in~(\ref{eq:swap}) for each agent playing Alg.~\ref{exp3} is bounded by the following theorem:

The swap regret defined in (\ref{eq:swap}) for each agent $n\in \mathcal{N}$ playing the LCE-IX algorithm is bounded by the following theorem. 
\begin{theorem}\label{thm:regret3}
Let $\delta \in (0,1)$, $\eta_t = \sqrt{\frac{\log(K_n)}{t}}$, and $\gamma_t = \eta_{t} /2$. With probability at least $1-\delta$, the instantaneous swap regret of the LCE-IX algorithm over $T$ rounds is bounded as follows:
\begin{equation}\label{eq:thm1}
\resizebox{1.0\hsize}{!}{$
    R_n^{\rm swap}(T,\mathcal{F}) \leq 
         4K_n \sqrt{T \log (K_n)} +  \left(1+ K_n\sqrt{\frac{T}{\log (K_n)}}\right) \log (\frac{1}{\delta}).
$}
\end{equation}
When $\eta_t = \sqrt{\frac{\log(K_n) + \log(K_n /\delta)}{t}}$ and $\gamma_t = \eta_t/2$, the above bound becomes
\begin{equation}\label{eq:thm11}
\resizebox{1.0\hsize}{!}{$
    R_n^{\rm swap}(T,\mathcal{F}) \leq 
         3K_n \sqrt{T (\log(K_n) + \log(K_n/\delta))} +  \log (\frac{1}{\delta}).
$}
\end{equation}
\end{theorem}
\begin{proof}[Proof Sketch]
The regret defined in (\ref{eq:swap}) can be rewritten in the loss form and can be decomposed as follows:
\begin{equation*}
\resizebox{1.0\hsize}{!}{$
    \begin{aligned}
        R_n^{\rm swap}(T,\mathcal{F}) &\leq \underbrace{\sum_{a\in A_n} ({{L}_{a}^T - \hat{L}_{a}^T})}_{=:\rm (a)}  + \underbrace{\sum_{a\in A_n}({\hat{L}_{a}^T  - \hat{L}_{a,F(a)}^T})}_{=: \rm (b)}
        + \underbrace{\sum_{a\in A_n}(\hat{L}_{a,F(a)}^T - \tilde{L}_{a,F(a)}^T)}_{=:\rm (c)},
    \end{aligned}
$}
\end{equation*}
%\end{equation*}
where ${L}_{a}^T:=\sum\limits_{t=1}^T \sum\limits_{a^\prime \in A_n} Y_{a,a^\prime}^t$ and $\hat{L}_a^T:= \sum\limits_{t=1}^T \sum\limits_{a^\prime \in A_n}q_{a,a^\prime}^t \hat{Y}_{a,a^\prime}^t$ are the cumulative instantaneous and estimated losses allocated to meta-distribution $Q_{a}^t$ over $T$ rounds, respectively, and $\tilde{L}_{a,a^\prime}^T := \sum\limits_{t=1}^T \tilde{Y}_{a,a^\prime}^t$ is the cumulative swapped loss.

Then, (a) is bounded by  $\sum\limits_{a \in A_n}\sum\limits_{t=1}^T  \gamma_t \sum\limits_{a^\prime \in A_n} \hat{Y}_{a,a^\prime}^t$, which follows directly from the definition of $\hat{Y}_{a,a^\prime}^t$,
and (b) is bounded by $\frac{K_n\log(K_n)}{\eta_T} +  \sum\limits_{t=1}^T \frac{\eta_t}{2} \sum\limits_{a \in A_n} \sum\limits_{a^\prime \in A_n}  \hat{Y}_{a,a^\prime}^t$ by a refined analysis of the exponential-weighting technique.
Finally, invoking Lemma~\ref{crl:1} bounds (c) and the term $\sum\limits_{t=1}^T  (\gamma_t + \frac{\eta_t}{2}) \sum\limits_{a \in A_n}\sum\limits_{a^\prime \in A_n} \hat{Y}_{a,a^\prime}^t$. The detailed proof can be found in Appendix B of the supplementary materials.
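
The bound on (a) can be checked in one line from the definitions. For every $a, a^\prime \in A_n$ and every round $t$,
\begin{equation*}
\begin{aligned}
    Y_{a,a^\prime}^t - q_{a,a^\prime}^t \hat{Y}_{a,a^\prime}^t
    &= Y_{a,a^\prime}^t \left(1 - \frac{q_{a,a^\prime}^t}{q_{a,a^\prime}^t + \gamma_t}\right) \\
    &= \gamma_t \frac{Y_{a,a^\prime}^t}{q_{a,a^\prime}^t + \gamma_t}
    = \gamma_t \hat{Y}_{a,a^\prime}^t ,
\end{aligned}
\end{equation*}
and summing over $a^\prime$, $t$, and $a$ gives the stated expression for (a), which under these definitions holds with equality.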
%As $\sum\limits_{a \in A_n}\sum\limits_{a^\prime \in A_n} \hat{Y}_{a,a^\prime}^t \leq K_n$, the theorem follows by substituting with $\eta_t$ and $\gamma_t$.

\end{proof}
Note that Theorem~\ref{thm:regret3} does not require all agents to play LCE-IX at the same time. The value of $\eta_t$ for the bound in (\ref{eq:thm1}) is independent of $\delta$, which means that the bound holds for every $\delta \in (0,1)$ under the same parameter setting. On the other hand, the high-probability bound in (\ref{eq:thm11}) is tighter, but it requires the algorithm to fix a confidence level $\delta$ in order to tune its parameters. The former bound, however, is useful for deriving an expected swap-regret bound, as shown in the following corollary.
%If considering the time-averaged swap regret, LCE-IX has an upper bound for the time-averaged swap regret of $O(K_n\sqrt{\frac{ \log(K_n))}{T}}$. This means if playing LCE-IX for a long time, the time-averaged regret will converge to $0$, i.e., the gap between LCE-IX and the optimal algorithm in hindsight will disappear over time. In addition, the dependence on $\delta$ can be improved to $\sqrt{\log\frac{1}{\delta}}$ for certain fixed $\delta$.

\begin{corollary}\label{thm:expected}
    With $\eta_t = \sqrt{\frac{\log (K_n)}{t}}$ and $\gamma_t = \eta_t/2$, the expected swap regret is bounded as follows:
    \begin{equation*}
        \mathbf{E}[R_n^{\rm swap} (T, \mathcal{F})] \leq 4K_n \sqrt{T\log(K_n)} + K_n\sqrt{\frac{T}{\log (K_n)}} + 1.
    \end{equation*}
\end{corollary}
\begin{proof}
    Let $W:= \frac{R_n^{\rm swap} (T, \mathcal{F}) - 4K_n  \sqrt{T\log(K_n)}}{1+K_n\sqrt{\frac{T}{\log (K_n)}}}$. By (\ref{eq:thm1}), we have that $\Pr(W>\log (\frac{1}{\delta}))\leq \delta$ for every $\delta \in (0,1)$. Then, integrating the tail gives $\mathbf{E}[W] \leq \int_{0}^{1} \frac{1}{\delta} \Pr(W>\log (\frac{1}{\delta})) d \delta \leq 1$.
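
    In more detail, the tail integration is the change of variables $u = \log(\frac{1}{\delta})$, i.e., $du = -\frac{1}{\delta} d\delta$, where the auxiliary variable $u$ is introduced only for this step:
    \begin{equation*}
    \begin{aligned}
        \mathbf{E}[W] &\leq \mathbf{E}[\max\{W,0\}] = \int_{0}^{\infty} \Pr(W>u)\, du \\
        &= \int_{0}^{1} \frac{1}{\delta} \Pr\Big(W>\log \Big(\frac{1}{\delta}\Big)\Big)\, d \delta
        \leq \int_{0}^{1} \frac{1}{\delta}\cdot \delta \, d\delta = 1 .
    \end{aligned}
    \end{equation*}
    Multiplying by $1+K_n\sqrt{\frac{T}{\log (K_n)}}$ and adding $4K_n \sqrt{T\log(K_n)}$ recovers the bound in the corollary.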
\end{proof}
The expected swap regret is upper bounded by $O(K_n \sqrt{T \log (K_n)})$, which we claim is near-optimal because it is within a factor of $O(\sqrt{K_n})$ of the lower bound of $\Omega(\sqrt{K_n T \log(K_n)})$ proved in \cite{ito2020tight}. Note, however, that this lower bound is tight for the full-information model but may not be tight under bandit feedback.

\subsection{Convergence to Correlated Equilibria}
If every agent involved in the game plays the LCE-IX algorithm at the same time, the following theorem guarantees that the empirical distribution $\hat{\mathbf{P}}^T$ of the joint actions converges to an $\epsilon$-correlated equilibrium.

\begin{theorem}\label{thm:correlatedequalibrium3}
If every agent $n\in \mathcal{N}$ plays the LCE-IX algorithm for $T$ rounds, then the empirical distribution of the joint actions played by all agents $\hat{\mathbf{P}}^T$ is an $\epsilon$-correlated equilibrium with probability at least $1-\delta$, where $\epsilon= O(\max\limits_{n\in\mathcal{N}}K_n\sqrt{\frac{\log (K_n N/\delta)}{T}})$. When $T \rightarrow \infty$, the empirical distribution of the joint actions converges to the set of correlated equilibria almost surely.
\end{theorem}
\begin{proof}
Let $\delta^\prime >0$. Since the internal regret is a special case of the swap regret, (\ref{eq:thm11}) implies that, with probability at least $1-\delta^\prime$, $R_n^{\rm int} (T) \leq 3K_n \sqrt{T (\log(K_n) + \log(K_n/\delta^\prime))} +  \log \frac{1}{\delta^\prime}$ for agent $n$. By using the union bound over all the $N$ agents and letting $\delta^\prime = \delta/ N$, we have that with probability at least $1-\delta$: $     \sum\limits_{\mathbb{A}:a_n = a} \hat{\mathbf{P}}^T(\mathbb{A})\left(r_{n}\left(a^{\prime} ;{\mathbb{A}_{-n}}\right)-r_{n}\left(\mathbb{A}\right)\right)
    =  \frac{1}{T}R_n^{\rm int} (T)  \leq O(\max\limits_{n\in\mathcal{N}}K_n\sqrt{\frac{\log (K_n N/\delta)}{T}})$. When $T \rightarrow \infty$, by the Borel-Cantelli lemma, we have $\limsup\limits_{T \rightarrow \infty} \frac{1}{T} R_n^{\rm int} (T) \leq 0 $ almost surely, which indicates that the empirical distribution of joint actions converges to the set of correlated equilibria.


% Assume we find an $\epsilon$-correlated equilibrium for a game with $N$ agents in $T$ rounds. Then, we have
% \begin{equation*}yT
% \begin{aligned}
%     \epsilon T =  O(\max_{n\in \mathcal{N}} \sqrt{TK_n\log (K_n N/\delta)}),
% \end{aligned} 
% \end{equation*}
% which gives $T= O(\max\limits_{n\in \mathcal{N}} \frac{K_n \log(K_n N/\delta)}{\epsilon^2})$.
\end{proof}
Solving the equation $\epsilon= O(\max\limits_{n\in\mathcal{N}}K_n\sqrt{\frac{\log (K_n N/\delta)}{T}})$ for $T$ implies that the empirical joint distribution $\hat{\mathbf{P}}^T$ of all agents meets the definition of an $\epsilon$-correlated equilibrium for the unknown game after $T = \Omega(\max\limits_{n\in \mathcal{N}} \frac{K_n^2 \log(K_n N/\delta)}{\epsilon^2})$ rounds, i.e., the equilibrium is achieved.
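
To spell out this step, write the guarantee as $\epsilon \leq C \max\limits_{n\in\mathcal{N}} K_n\sqrt{\frac{\log (K_n N/\delta)}{T}}$ for an unspecified absolute constant $C>0$, a symbol introduced here only for illustration. Squaring both sides and rearranging shows that a target accuracy $\epsilon$ is guaranteed once
\begin{equation*}
    T \geq C^2 \max\limits_{n\in \mathcal{N}} \frac{K_n^2 \log(K_n N/\delta)}{\epsilon^2} .
\end{equation*}
In particular, halving $\epsilon$ quadruples the number of rounds required by this bound, while the dependence on the number of agents $N$ and on $1/\delta$ is only logarithmic.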


\subsection{Time and Space Complexity}
In each round, each agent first needs to calculate $P_n^t$ based on (\ref{eq:pcal}), which amounts to computing the stationary distribution of the Markov process defined by $\mathbb{Q}_n^t$ and can be done in $O(K_n^2)$ time for $K_n$ states~\citep{feinberg1987method}. Then, each of the $K_n$ meta-distributions needs $O(K_n)$ time to be updated for $K_n$ arms. Therefore, the per-round time complexity of LCE-IX is $O(K_n^2)$.
Regarding the space complexity, we need to maintain $K_n$ meta-distributions for the LCE-IX algorithm, and each meta-distribution requires $O(K_n)$ space for $K_n$ arms, so the space complexity is $O(K_n^2)$. 



\section{Numerical Experiments}\label{sec:experiments}
In this section, we compare LCE-IX with LCE~(i.e., $\gamma_t = 0$) to show the effectiveness of the IX technique. We also compare with BM-Opt-Hedge~\citep{chen2020hedging}, a recent algorithm that uses full-information feedback. All results are averaged over $100$ independent experiments.

We study a wireless medium access game between two wireless devices~(i.e., two agents), where the two devices are \emph{hidden} from each other~(i.e., neither device can observe the other) and try to access one unknown channel in each round.  
Each device has two options in each round, wait for the next round~(W) or access in the current round~(A).
If a device chooses action W, it will receive a reward of $0$. 

If a device chooses to access (A), the device has an energy cost of $0.2$. When only one device successfully accesses the medium, then this device will receive a reward of $0.8$. If both devices choose action A, then there is a collision and the rewards for both devices are $-0.2$ due to the wasted transmission energy. The reward matrix~(unknown to the agents) is shown in Table~\ref{table:rewardmatrix1}.


We assume that the devices do not adopt the RTS/CTS mechanism, an oft-used technique to solve the hidden terminal problem, as it introduces new problems among its control messages~\citep{sobrinho2005rts} and the game model still applies to the RTS/CTS messages themselves. 
Thus, it is quite challenging to improve the received rewards~(i.e., the successful access to the channel) for both devices in a distributed way, as \emph{both wireless devices are hidden terminals to each other so that they cannot observe the actions and rewards of each other and they do not know the total number of devices.} 

\begin{table}[h]
\centering
\caption{The reward matrix for the medium access game}\label{table:rewardmatrix1}
\begin{tabular}{l|cc} 
\hline
  & W     & A      \\ 
\hline
W & $(0,0)$ & $(0,0.8)$  \\
A & $(0.8,0)$ & $(-0.2,-0.2)$  \\
\hline
\end{tabular}
\end{table}

As the swap regret is a generic performance measure, different swap functions can lead to different regret definitions. In this experiment, we show two metrics that reflect two different regret definitions. The first metric is the time-averaged reward, the gap of which from the optimal actions in hindsight reflects the external regret of an online learning algorithm.  The other metric is the convergence to the $\epsilon$-correlated equilibrium. Whether or not an $\epsilon$-correlated equilibrium can be reached reflects whether an online learning algorithm can minimize its internal regret.


\subsection{Time-averaged Reward}
To save space, we only show the time-averaged reward for all the considered algorithms, as it carries the same information as the cumulative regrets or rewards. For example, if an algorithm has a higher time-averaged reward~(i.e., closer to the maximum reward), then it also has a higher cumulative reward~(i.e., a lower cumulative regret). 

To compare with a benchmark, we consider an adaptive access technique~(denoted by {\tt Ada}) with the prior knowledge about the number of all the hidden devices,
%that is oft-used by current WiFi devices
which randomly accesses a channel with an initial probability $\frac{1}{2}$ for two devices. If a device fails, the access probability of that device is reduced by half; otherwise, the device uses the initial probability in the next round.
{\tt Ada} in our experiments can achieve better performance than the distributed coordination function of IEEE 802.11 used by current WiFi devices in real-world scenarios, as {\tt Ada} in our experiments can set an appropriate initial probability to achieve high throughput.
Therefore, {\tt Ada} can be a good benchmark with partial prior knowledge to show the effectiveness of the swap-regret-minimizing algorithms. 

In addition, the maximum time-averaged reward~(denoted by {\tt Opt}) of $0.4$ can be achieved by a mediator~(e.g., a wireless access point) with full prior knowledge that, in each round, asks either Agent 1 to play W and Agent 2 to play A, or Agent 1 to play A and Agent 2 to play W. We show that LCE-IX can approach {\tt Opt} quickly in a distributed fashion over time.
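
As a quick check of this value, suppose, as an illustrative assumption, that the mediator recommends (W,A) in a fraction $\lambda \in [0,1]$ of the rounds and (A,W) in the remaining rounds; the symbol $\lambda$ is introduced only for this example. By Table~\ref{table:rewardmatrix1}, the time-averaged rewards of Agent 1 and Agent 2 are
\begin{equation*}
    \lambda \cdot 0 + (1-\lambda)\cdot 0.8
    \quad \text{and} \quad
    \lambda \cdot 0.8 + (1-\lambda)\cdot 0 ,
\end{equation*}
respectively. Their sum is $0.8$ for every $\lambda$, and the even split $\lambda = \frac{1}{2}$ gives each agent a time-averaged reward of $0.4$.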

The time-averaged rewards of both agents over $1\times 10^4$ rounds are shown in Fig.~\ref{fig:time-averagedresult2}.  As we can see, LCE-IX outperforms both LCE and {\tt Ada}($\frac{1}{2}$) in terms of faster convergence to {\tt Opt}. This shows the effectiveness of the $\gamma_t$-biased estimator in smoothing the reward estimation so that the low-reward arm can still be explored occasionally. We can also see that BM-Opt-Hedge converges the fastest, but we note that BM-Opt-Hedge uses full-information feedback. 


\begin{figure}[h]
\centering
\subfloat [Agent 1] {
\includegraphics[width = 0.65\columnwidth]{Figures/time_averaged_agent1.eps}
}

\subfloat [Agent 2] {
\includegraphics[width = 0.65\columnwidth]{Figures/time_averaged_agent2.eps}
}
\caption{The time-averaged reward for both agents.}\label{fig:time-averagedresult2}
\end{figure}


\begin{figure}[h]
\centering
\subfloat [LCE]{\label{fig:empdistlce} 
\includegraphics[width = 0.65\columnwidth]{Figures/empirical_probability_distribution.eps}
}

\subfloat [LCE-IX] {\label{fig:empdist}
\includegraphics[width = 0.65\columnwidth]{Figures/empirical_probability_distribution2.eps}
}

\subfloat [BM-Opt-Hedge] {\label{fig:empdist2}
\includegraphics[width = 0.65\columnwidth]{Figures/empirical_probability_distribution3.eps}
}
\caption{The empirical distribution of joint actions by two agents in $T$ rounds.}
\end{figure}
\subsection{Convergence to the $\epsilon$-Correlated Equilibrium}
The convergence of the empirical distribution of joint actions played by the two agents in $T$ rounds is shown in Fig.~\ref{fig:empdist}, where (W,W) means that both agents play action W, (W,A) means that Agent 1 plays W and Agent 2 plays A, and so on.
We take the result of LCE-IX to explain the convergence to the correlated equilibrium.
The final results in Fig.~\ref{fig:empdist} are $\hat{\mathbf{P}}^T(W,W) = 0.0088$, $\hat{\mathbf{P}}^T(W,A) = 0.4501$, $\hat{\mathbf{P}}^T(A,W) = 0.4501$, and $\hat{\mathbf{P}}^T(A,A) = 0.091$.
We can do a simple calculation to verify that this empirical distribution is a correlated equilibrium~($\epsilon =0$). For example, the expected incentive for Agent 1 to switch from W to A is
\begin{equation*}
\begin{aligned}
&\hat{\mathbf{P}}^T(W,W) \cdot u_1(A,W) + \hat{\mathbf{P}}^T(W,A) \cdot u_1(A,A) \\
&\quad - \left(\hat{\mathbf{P}}^T(W,W) \cdot u_1(W,W) + \hat{\mathbf{P}}^T(W,A) \cdot u_1(W,A)\right) \\
&= -0.08298<0,
\end{aligned}
\end{equation*}
showing that Agent 1 has no incentive to switch from W to A when both agents follow the joint distribution $\mathbf{\hat{P}}^T$.
In the same way, we can verify the empirical joint distribution $\mathbf{\hat{P}}^T$ is a correlated equilibrium for both agents. 
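
For reference, the remaining three deviations can be checked with the reported values of $\hat{\mathbf{P}}^T$ and the rewards in Table~\ref{table:rewardmatrix1}, where $u_2$ denotes the reward of Agent 2 by analogy with $u_1$. For Agent 1 switching from A to W,
\begin{equation*}
\begin{aligned}
&\hat{\mathbf{P}}^T(A,W) \left(u_1(W,W) - u_1(A,W)\right) \\
&\quad + \hat{\mathbf{P}}^T(A,A) \left(u_1(W,A) - u_1(A,A)\right) \\
&= 0.4501 \cdot (-0.8) + 0.091 \cdot 0.2 = -0.34188 < 0 .
\end{aligned}
\end{equation*}
For Agent 2, switching from W to A yields $0.0088 \cdot 0.8 + 0.4501 \cdot (-0.2) = -0.08298 < 0$, and switching from A to W yields $0.4501 \cdot (-0.8) + 0.091 \cdot 0.2 = -0.34188 < 0$. The values coincide with those of Agent 1 because both the reward matrix and the reported distribution are symmetric in the two agents.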

Fig.~\ref{fig:empdist} shows that LCE-IX converges faster than LCE to a correlated equilibrium, as the empirical probabilities of the optimal action pairs $(A,W)$ and $(W,A)$ increase faster than those under LCE. This again shows the effectiveness of the $\gamma_t$-biased estimator in controlling the variance of the reward estimation. 

On the other hand, with full-information feedback, BM-Opt-Hedge achieves a faster convergence rate than LCE and LCE-IX. An interesting direction for future work is to study whether the techniques of BM-Opt-Hedge can be applied to the bandit-feedback model to speed up the convergence rate.



\section{Conclusion}\label{sec:conclusion}
In this paper, we provided a high-probability bound for the instantaneous swap regret of the LCE-IX algorithm that holds with respect to the randomness of all agents' actions and that further yields a bound on the expected swap regret.
Furthermore, we conducted numerical experiments to verify the performance of LCE-IX.

Regarding future work, we will study the swap regret bounds for mirror descent algorithms, and aim to close the gap between the upper bound and the lower bound for swap regret. 

\begin{acknowledgements}
We want to thank Prof. Nishant Mehta for his careful reading of the paper and pointing out an issue in Lemma~\ref{lm2}, which has now been fixed.
\end{acknowledgements}
% References
\bibliography{uai2023-template}
\end{document}
