%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{lipsum}		% Can be removed after putting your text content
\usepackage{graphicx}

\usepackage{setspace}
\usepackage{fullpage,graphicx,psfrag,amsmath,amsfonts,verbatim,tabularx,multirow,amssymb}
\usepackage{color,soul}
%\usepackage[small,bf]{caption}
\usepackage{placeins}
\usepackage{tikz}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}% http://ctan.org/pkg/algorithmicx
\usetikzlibrary{bayesnet}
\usetikzlibrary{arrows}
\usepackage{color}
\usepackage[justification=centering]{caption}
\usepackage{subcaption}
\newcommand{\multilinecomment}[1]{}
\usetikzlibrary{backgrounds}
\usepackage{amsthm}
\usepackage{hyperref}
\usepackage{natbib}      % Allows you to use BibTeX
 %\bibliographystyle{plain}
%\doublespacing

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\newcount\Comments  % 0 suppresses notes to selves in text
\Comments=0 % TODO: set to 0 for final version

\definecolor{darkgreen}{rgb}{0,0.5,0}
% \kibitz{color}{comment} inserts a colored comment in the text
\newcommand{\kibitz}[2]{\ifnum\Comments=1\textcolor{#1}{#2}\fi}
% add yourself here:
\newcommand{\ambuj}[1]{\kibitz{darkgreen}{[AT: #1]}}
\newcommand{\adigi}[1]{\kibitz{blue}{[AD: #1]}}

\newcommand{\remaintransient}{{H}}
\newcommand{\slackmaximin}{\eta_{m}}
\newcommand{\slackexpl}{\eta_{e}}
%\newcommand{\transienceN}{\frac{2\mathbb{E}(\tau_{\text{Re}})}{\delta}}
\newcommand{\transienceN}{K}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{definition}{Definition}
\newtheorem{assumption}{Assumption}
\newtheorem{exmp}{Example}[section]
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\makeatletter
\newtheorem*{rep@theorem}{\rep@title}
\newcommand{\newreptheorem}[2]{%
\newenvironment{rep#1}[1]{%
 \def\rep@title{#2 \ref{##1}}%
 \begin{rep@theorem}}%
 {\end{rep@theorem}}}
\makeatother
\newreptheorem{theorem}{Theorem}
\newreptheorem{lemma}{Lemma}
\newenvironment{sketch}{\paragraph{\normalfont \textit{Proof Sketch.}}}{\hfill$\square$}

\newcommand{\centered}[1]{\begin{tabular}{l} #1 \end{tabular}}

\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
\DeclarePairedDelimiter{\floor}{\lfloor}{\rfloor}
\newtheorem{remark}[theorem]{Remark}

\usepackage{xr-hyper}
\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}

\title{Balancing Adaptability and Non-exploitability in Repeated Games}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<adigi@umich.edu>?Subject=Your UAI 2022 paper}{Anthony~DiGiovanni}{}}
\author[1]{Ambuj~Tewari}
% Add affiliations after the authors
\affil[1]{%
    Department of Statistics\\
    University of Michigan\\
    Ann Arbor, MI, USA
}
  
  \begin{document}
\maketitle

\begin{abstract}
  We study the problem of adaptability in repeated games: simultaneously guaranteeing low regret for several classes of opponents.
  %guaranteeing low regret in repeated games against an opponent that may be in any of several classes.
  %To this objective
We add the constraint that our algorithm is non-exploitable, in that the opponent lacks an incentive to use an algorithm against which we cannot achieve rewards exceeding some ``fair'' value.
Our solution is an expert algorithm (LAFF),
which searches within a set of sub-algorithms that are optimal for each opponent class,
and
punishes evidence of exploitation by switching to a
policy that enforces a fair solution.
%uses a punishment policy upon detecting evidence of exploitation by the opponent.
With benchmarks that depend on the opponent class, we first show that LAFF has sublinear regret uniformly over 
these classes.
%the possible opponents, 
Second, we show that LAFF discourages exploitation, 
because exploitative opponents have linear regret.
%by
%except exploitative ones, for which we guarantee that the opponent has linear regret. 
To our knowledge, this work is the first to provide guarantees for both regret and non-exploitability in multi-agent learning.
\end{abstract}

\section{Introduction}\label{sec:intro}

%Repeated general-sum matrix games
General-sum repeated games
represent interactions between agents aiming to maximize their respective reward functions, with the possibility of compromise over conflicting goals. Although these games are simple to describe, achieving high rewards in them is a challenging learning problem due to the complex space of 
%opponents an agent may face.
possible opponents.
%in a population.
Both the behavior of a given opponent 
%over the course of 
throughout
a game, and that opponent's choice of learning algorithm, may depend on one's own algorithm.
\citet{C20} 
%has argued, 
argues,
based on empirical studies of repeated game tournaments, that a successful agent must achieve two goals. First, it must optimize its actions with respect to its beliefs about the opponent. Second, it should act such that
the opponent forms beliefs
motivating a response that is beneficial to the agent.
%whose corresponding best response is maximally profitable to the agent.
%that motivate a profitable response for the agent.
%and such that its intentions are clear to the opponent.

In particular, multi-agent reinforcement learning (MARL) features the following tradeoff: how to adapt to a variety of 
%other agents one's algorithm might face, 
potential opponents,
while also actively shaping other agents' models of
%itself
oneself
such that they respond with cooperation, rather than exploitation.
If
%a learning algorithm
an agent
commits to a
%small set of policies
fixed policy
to ``lead'' the other player's best response \citep{LS01}, it may perform arbitrarily poorly against players that do not converge to such a response. This motivates the design of adaptive algorithms that try to lead,
%other agents,
but can
%fall back
retreat
to a ``Follower'' (best response) approach if doing so gives greater rewards \citep{PS05, ICML10-chakraborty}.
An effective algorithm in this class is S++ \citep{C14}, which, 
%because it uses a 
due to its
Follower sub-algorithm, has the drawback that it is exploitable\textemdash that is, it rewards agents insisting on unfair bargains (``bully'' strategies) 
%\citep{LS01, CO18, SLRC21}.
\citep{CO18, SLRC21}.

A simple motivating example of Follower exploitability is the game of Chicken (Figure \ref{fig:chicken}),
between players Row and Column. 
%If 
Suppose Column knows
Row
will take
%takes
the apparently optimal action 1 %against a column player that
if Column %repeats
repeats action 2.
%, column 
Column
will then want to use the Leader strategy of committing to action 2 to gain the highest reward. Row thus only gets reward 0.25, and if Column has truly committed, an attempt by Row to dissuade this strategy by taking action 2 would give both players reward 0.
A cooperative outcome, e.g., alternating between the off-diagonal cells, could be achieved if Row's learning algorithm were designed to \textit{publicly disincentivize} commitments
%to action 2.
to the exploitative Leader strategy.

\begin{figure}[ht]
    \centering
    \begin{tabular}{|c|c|}
\hline
    0.5, 0.5 & 0.25, 1  \\
\hline
    1, 0.25 & 0, 0\\
\hline
\end{tabular}
    \caption{Reward bimatrix for Chicken.}
    \label{fig:chicken}
    %\Description{Bimatrix of rewards for the game of Chicken. From top-left to bottom-right: (2, 2), (1, 4), (4, 1), and (0, 0).}
\end{figure}

MARL research has largely neglected the latter half of the adaptability vs. non-exploitability tradeoff.
%,
%which is a problem of incentives for algorithm choice.
Existing algorithms are either evaluated solely by 
%optimality of
their
rewards \textit{conditional} on given opponents \citep{PS05, C14}, or, when the evaluation criterion does account for the incentives of algorithm selection,
the pool of competitor algorithms typically excludes bully strategies \citep{CG10}.
Previous MARL algorithms addressing the adaptability half of the tradeoff lack finite-time guarantees on rewards.
We aim to provide a theoretically grounded algorithm for repeated games that is both adaptable, by using Leader and Follower sub-algorithms, and non-exploitable.
%added part
%\textbf{
More broadly,
this paper addresses a challenge of interest in several
areas of machine learning:
designing algorithms that account for how the distribution of data to which they are applied may change depending on which algorithm is chosen.
%}
%(e.g., in our context, choosing an exploitable algorithm encourages other agents to use algorithms that are exploitative).}

\paragraph{\textbf{Related work}} 
%\adigi{to do: reorg into (1) adapt but expl, (2) non-expl but not adapt, and (3) discussion of both (learning eq)}
Previous algorithms for repeated games have
combined Leader and Follower modules,
%Algorithms combining Leader and Follower modules have been previously developed for repeated games,
aiming for
%all
the following guarantees: worst-case safety, best response to players with bounded memory, and convergence in self-play to Pareto efficiency, i.e., an outcome in which no player can do better without the other doing worse \citep{PS04}.
%added part
%\textbf{
Like ours,
these algorithms aim for adaptability,
but they do not have regret guarantees --- the desired
properties are only
shown to hold asymptotically.
%}
%\adigi{Like us, these works aim for adaptability, but they do not
%have regret guarantees --- only asymptotics.}
Manipulator \citep{PS05} achieves these properties by starting with a fixed strategy
%maximizing
that maximizes the user's rewards conditional on the opponent using a best response, and switching to
%model-based
reinforcement learning (RL) with a safety override if
%the fixed
that
strategy does not yield its target rewards.
Related to the self-play guarantee,
we prove a more general property of Pareto efficiency against effective RL algorithms (see Section \ref{sec:sub:rgclass}).
%Our approach resembles Manipulator by testing sub-algorithms sequentially.
Like Manipulator, our approach tests
sub-algorithms sequentially.
S++ \citep{C14} 
%and its extension to episodic stochastic games, Mega-S++ \cite{C15}, have
has empirically strong performance
%with respect to
on
the guarantees above.
%Unlike ours, its Leader sub-algorithms do not use randomization, and so the target solutions are not necessarily optimized for desirable bargaining properties (see Section \ref{bargtheory}). Further,
However,
neither of these algorithms guarantees non-exploitability.
%, and a tournament study \citep{CO18} finds that S++ can be ``bullied,'' that is, converges to a best response to unfair opponents.

%added part
%\textbf{
Although to our knowledge
no previous works have proven non-exploitability in our sense,
several algorithms are designed to
achieve ``fair'' Pareto efficiency
in self-play without using
Follower approaches that would be exploitable.
%}
\citet{LS05}'s algorithm for 
%constructing
%offline
computation of
Nash equilibria, like our Leader sub-algorithms, enforces a Pareto efficient outcome 
%of the game
by punishing deviations.
%added part
%\textbf{
If an agent played this equilibrium, which satisfies properties of symmetry similar to
the outcome our Egalitarian Leader sub-algorithm aims for, it would be non-exploitable.
%}
However, committing to this equilibrium
precludes
%prevents an agent from
learning a best response to fixed strategies that offer higher rewards than the cooperative solution, or exploiting adaptive players, which our Conditional Follower and Bully Leader sub-algorithms achieve, respectively.
In two-player bandit problems where the reward bimatrix must be learned, UCRG \citep{TD20} has
%$\mathcal{O}(T^{2/3}\log(T))$ 
near-optimal
regret in self-play with respect to the egalitarian bargaining solution
(Section \ref{bargtheory}).
%;
%added part
%\textbf{
%that is, UCRG \textit{learns} a policy
%with properties analogous to 
%\citet{LS05}'s equilibrium.
%}
%, where $T$ is the number of time steps in the game.
%Though it is an innovative synthesis of online learning and bargaining,
However, it cannot provably cooperate with 
%other agents besides
agents other than
itself, learn best responses, or exploit adaptive players.
%...
%Several game-theoretic principles used by our algorithm,
%such as public randomization, enforceability, and the egalitarian bargaining solution (EBS),
%have been studied outside the RL context.
%\citet{MS06} discuss a formal model in which players correlate their actions by conditioning on a public random signal.
%Our algorithm uses these signals to achieve Pareto efficient outcomes that would be impossible with independent play.

Our objectives of adaptability and non-exploitability are inspired by work on learning equilibrium \citep{BT04, fcl, CR21}, a solution concept in which players' \textit{learning algorithms} are in a Nash equilibrium, beyond merely the equilibrium of an individual game itself.
This objective accounts for the dependence of the problems faced by multi-agent learning algorithms on the design of such algorithms. 

\paragraph{\textbf{Contributions}} We propose an algorithm (LAFF) that, to our knowledge, is the first proven to have both strong performance against different classes of players in repeated games and a guarantee of non-exploitability, formalized in Section \ref{sec:sub:regretdef}. Specifically, these classes consist of stationary
%bully 
algorithms (``Bounded Memory''), unpredictable adversaries (``Adversarial''), and adaptive RL agents (``Follower''). 
%The modular design of LAFF
LAFF's modular design
allows for extensions to a broader variety of opponent classes in future work. We propose regret metrics appropriate for games against Followers, based on the goal of Pareto efficiency. Our method of proof of adaptability and non-exploitability is novel, applying ``optimistic'' principles at two levels. First, LAFF starts with the sub-algorithm (or \textit{expert}) that would give the highest expected rewards 
%conditional on the opponent being 
if the opponent were
in that expert's target class (``potential''), then proceeds through experts in descending order of 
%this
potential.
%expected reward.
Second, LAFF chooses whether to switch experts by comparing the potential 
%expected reward
of the active expert with its empirical average reward plus a slack term, which decreases with the time for which the expert is used.
For non-exploitability and regret against Followers, we use the properties of an enforceable bargaining solution (see Section \ref{bargtheory}) to upper-bound the other player's rewards.

\section{Preliminaries}\label{sec:prelim}

We study a special class of Markov games: repeated games with a bounded memory state representation \citep{PS05} and public randomization.

\subsection{Setup and Opponent Classification}\label{sec:sub:rgclass}

\noindent Consider a repeated game over $T$ time steps, defined for players $i=1,2$ by action spaces $\mathcal{A}^{(i)}$,
%reward functions $r^{(1)}$ and $r^{(2)}$,
reward matrices $\mathbf{R}^{(i)}$,
and a fixed player memory length $K \in \mathbb{N}$. Here, all $\mathbf{R}^{(i)}(a^{(1)}, a^{(2)}) \in [0,1]$ are known by both players.
At time~$t$ the following random variables are drawn: $S_t$ for state, $A_t^{(i)}$ for actions, and $R_t^{(i)} = \mathbf{R}^{(i)}(A_t^{(1)},A_t^{(2)})$ for rewards.
A state space $\mathcal{S} := (\mathcal{A}^{(1)})^K \times (\mathcal{A}^{(2)})^K \times \{0, 1\}^{2K+2}$, and transition probabilities $\mathcal{P}(s'|s,a^{(1)},a^{(2)})$ between states, are induced by two features:
(1) the tuple of both players' last $K$ actions, and (2) the tuple of the last $K$ and current outcome of a randomization signal, for each player. (See Section 2.1.2 of \citet{MS06}.)
Thus, players condition their 
%choices of 
actions
on their memory of the last $K$ time steps,
and a signal that permits
correlated action choices.

Formally, let $(w^{(1)}_t, w^{(2)}_t) \in [0, 1]^2$ be weights chosen by the respective players at time $t$,\footnote{We restrict to cases where players commit to a fixed weight, so the effective action space is finite. See the Appendix for details.} and draw $X_t \sim \text{Unif}[0,1]$ independent of all other random variables in the game. Then, letting $y_t^{(i)}$ be the realized value of $Y_t^{(i)} := \mathbb{I}[X_t < w^{(i)}_t]$, the second feature at time $t$ is
%given by 
$(y_{t-K}^{(1)},\dots,y_t^{(1)},y_{t-K}^{(2)},\dots,y_t^{(2)})$.
This allows the players to correlate actions through the public signal $X_t$, even if one player unilaterally generates the signal.
%added part
%\textbf{
For instance, in Chicken (Figure \ref{fig:chicken}),
players could flip a fair coin ($w^{(1)}_t = w^{(2)}_t = 0.5$)
at
each time step
and play the pair of actions
leading to the top-right cell
when it comes up heads, otherwise
play the bottom-left cell.
%}
In this framework, at each time step each player has a choice of both a weight $w_t^{(i)}$ and policy $\pi^{(i)}_t: \mathcal{S} \to \Delta^{|\mathcal{A}^{(i)}|}$, a mapping from states to distributions over actions.
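
To make the correlation device concrete, the following worked calculation may help. It is an illustration only, using the Chicken rewards from Figure \ref{fig:chicken} and the fair-coin weights above, and it assumes that actions 1 and 2 index the rows (top to bottom) and columns (left to right) of the bimatrix. If both players follow the signal, the expected per-step rewards are
\begin{align*}
    \mathbb{E}[R_t^{(1)}] &= 0.5\,\mathbf{R}^{(1)}(1,2) + 0.5\,\mathbf{R}^{(1)}(2,1) \\
    &= 0.5(0.25) + 0.5(1) = 0.625, \\
    \mathbb{E}[R_t^{(2)}] &= 0.5\,\mathbf{R}^{(2)}(1,2) + 0.5\,\mathbf{R}^{(2)}(2,1) \\
    &= 0.5(1) + 0.5(0.25) = 0.625.
\end{align*}
Both values exceed the $0.5$ that each player receives from the top-left cell.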

Given a fixed policy of player 2, a repeated game is a
%special case of a
Markov decision process (MDP) given by
$(\mathcal{S}, \mathcal{A}^{(1)}, r, p)$
as follows.
Let $a^{(i)}(s)$ be the last action of player $i$
that defines state $s$.
Here, $r: \mathcal{S} \times \mathcal{A}^{(1)} \to [0,1]$ is
%given by 
$r(s, a) = \mathbf{R}^{(1)}(a^{(1)}(s), a^{(2)}(s))$,
%induced by $\mathbf{R}^{(1)}$
%and the state space partly defined by the last joint action,
and $p:\mathcal{S} \times \mathcal{A}^{(1)} \times \mathcal{S} \to [0,1]$ is
%given by
$p(s'|s,a) = \sum_{a^{(2)}} \mathcal{P}(s'|s,a,a^{(2)}) \pi^{(2)}(a^{(2)}|s)$.
%induced by $\mathcal{P}$ and player 2's policy.
A policy is called Markov if it is conditioned only on the current state.
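
As a consistency check on this construction, one can verify that the induced kernel $p$ is a probability distribution over next states. The derivation below assumes only that $\mathcal{P}(\cdot|s,a,a^{(2)})$ and $\pi^{(2)}(\cdot|s)$ are themselves probability distributions:
\begin{align*}
    \sum_{s' \in \mathcal{S}} p(s'|s,a) &= \sum_{a^{(2)}} \pi^{(2)}(a^{(2)}|s) \sum_{s' \in \mathcal{S}} \mathcal{P}(s'|s,a,a^{(2)}) \\
    &= \sum_{a^{(2)}} \pi^{(2)}(a^{(2)}|s) = 1.
\end{align*}
In this sense, player 2's randomization is absorbed into the transition dynamics, and player 1 faces a single-agent problem.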

The problem faced by our learner, player 1, depends on which of the following classes player 2's algorithm is in:
\begin{enumerate}
    \item \textit{Bounded Memory}: (i) Player 2 uses a constant $w^{(2)}$, reported at the start of the game; (ii) $\pi^{(2)}$ is Markov
    %(conditioned only on the current state)
    and does not depend on time or player 1's signals $w^{(1)}_t$ or $y_t^{(1)}$; and (iii) for all $s, a^{(2)}$ we have $\pi^{(2)}(a^{(2)}|s) > 0$.\footnote{This relatively strong condition is needed for a concentration result in our analysis, ruling out cases where players remain in a transient state for an unknown time. We need to know the exit time from the transient states to compute the quantity $\overline{r}_{i,\tau}^{(2)}$ used by one of our experts. Section \ref{sec:experiments} shows strong results against a Bounded Memory player (FTFT) for which this condition does not hold.}
    \item \textit{Adversarial}: Player 2 selects actions according to any arbitrary distribution, which may depend on the history of play and on player 1's policy at each time step.
    \item \textit{Follower}: A Follower learns a best response when player 1 is ``eventually stationary'' (formalizing the follower concept in \citet{LS01}), and when the value of that best response meets player 2's standard of fairness. For some fairness threshold $V^{(2)} \geq 0$ (depending on the game), player 2's algorithm has the following properties.
    Suppose that after time $T_0$, player 1 always plays a Bounded Memory algorithm (without condition (iii)), which induces an MDP of finite diameter $D$ where player 2's optimal average reward is at least $V^{(2)}$.
    Then with probability at least $1-\delta$, player 2's regret up to time $T$ (see Section \ref{sec:sub:regretdef}) is bounded by $C_1T_0 + C_2D(SAT\log(T/\delta))^{1/2}$ for constants $C_1, C_2$, where $S$ and $A$ denote the numbers of states and actions in this MDP.
\end{enumerate}

A repeated game against a Bounded Memory player is equivalent to a communicating MDP \citep{puterman}.
%We consider this class because
%Leader strategies, a subset of Bounded Memory, are prevalent in repeated games.
%of the prevalence of Leader strategies in repeated games, a subset of Bounded Memory.
%and are not always exploitative, thus it may be useful to follow them.
A Follower formalizes an agent that models \textit{our} agent as an MDP (Leader), and the regret bound in our definition is of a standard form for RL algorithms \citep{optQ}. 
%\adigi{maybe note resemblance to $r$-insistent strategies in Myerson 1991} 
Many MARL algorithms take this approach at least partly \citep{PS05, ICML10-chakraborty, CG10}, hence this is a reasonable class to consider.
%\adigi{Compare Followers who have a nontrivial fairness
%threshold $V^{(2)}$ with
%the ``$r$-insistent'' strategies in \citet{insistent},
%which accept offers in bargaining games if
%and only if their offered reward exceeds
%some threshold.}
%added part
%\textbf{
For example,
\citet{LS05}'s algorithm,
which plays a 
certain
sequence of actions
%that on average produce a joint
%payoff called the Nash bargaining solution \citep{NBS}
and punishes deviations from that sequence,
is Bounded Memory ---
this algorithm does not change its policy
in response to the other player,
but its policy conditions on past actions.
A standard RL algorithm,
which would learn the sequence played by \citet{LS05}'s algorithm
and converge to
an optimal policy against it,
and which is a component of more complex repeated games
algorithms like Manipulator and S++,
is an example of a Follower.
%}

As discussed in \citet{C20},
%and our Table \ref{tab:algo_list},
a large proportion of top-performing algorithms are Bounded Memory (Leaders) or Followers, or switch between the two.
%added part
%\textbf{
These classes
illustrate fundamental
approaches to multi-agent learning
(thus, likely opponents
that our algorithm would face):
either an agent behaves consistently, trying to shape the learning opponent's behavior (Bounded Memory), or 
the agent changes policies in a process of learning how the opponent behaves and computing an optimal response to that opponent, possibly subject to fairness standards as it tries to avoid exploitation (Follower).
The Adversarial class accounts for opponent behavior between these two extremes, which is difficult to learn in generality but against which a
worst-case guarantee
can still be achieved.
%}
We thus restrict to guarantees against formalizations of these classes.
Bounds against a wider variety of opponents would be less theoretically tractable, insofar as finding the optimal strategy against one class interferes with performance against another.
(For example, \citet{PS05} note that in the repeated Prisoner's Dilemma,
it is
impossible for an algorithm to guarantee the best
response to an opponent
that may play either grim trigger
%added part
---
%\textbf{
``defect if and only if either
player defected last round''
%}
---
or ``always cooperate.'')
Extending to other opponent classes is an important direction for future work.
%See Table \ref{tab:algo_list} for examples of algorithms in our target classes.


\subsection{Background on Bargaining Theory}\label{bargtheory}

\noindent To define appropriate 
%notions of regret
optimality criteria
for these opponent classes and construct corresponding experts, we use several concepts from bargaining theory.
%added part
%\textbf{
We also illustrate these
concepts in the game of Chicken
from the introduction
(Example \ref{example:barg_concepts}).
%}
Define the \textit{security values} 
%$\mu_{\textsc{S}}^{(1)} := \max_\mathbf{v} \min_\mathbf{w} \mathbf{v}^\intercal \mathbf{R}^{(1)} \mathbf{w}$ and $\mu_{\textsc{S}}^{(2)} := \max_\mathbf{w} \min_\mathbf{v} \mathbf{v}^\intercal \mathbf{R}^{(2)} \mathbf{w}$,
%Define the \textit{security values} 
$\mu_{\textsc{S}}^{(i)} := \max_{\mathbf{v}_i} \min_{\mathbf{v}_{-i}} \mathbf{v}_1^\intercal \mathbf{R}^{(i)} \mathbf{v}_2$,
%added part
%\textbf{
i.e., the rewards that each player can guarantee
regardless of their opponent's actions,
%}
with player 1's maximin strategy as $\mathbf{v}^{(1)}_{\textsc{M}} = \argmax_{\mathbf{v}_1} \min_{\mathbf{v}_2} \mathbf{v}_1^\intercal \mathbf{R}^{(1)} \mathbf{v}_2$.
Let $\mathcal{G} := \{(\mathbf{R}^{(1)}(i,j),$ $\mathbf{R}^{(2)}(i,j)) \ | \ i \in \mathcal{A}^{(1)}, j \in \mathcal{A}^{(2)}\}$ be
%added part
%\textbf{
the set of reward pairs achievable
by pure actions in the game.
%}
%The \textit{feasible} reward pairs are the convex combinations of rewards achievable by pure joint actions, $\text{Conv}(\mathcal{G})$. %\textit{Individually rational} reward pairs are those that give both players at least their security values.
An important set of rewards in the computation of enforceable bargaining solutions is the convex polytope $\mathcal{U} := \text{Conv}(\mathcal{G}) \cap \{(u_1, u_2) \ | \ u_1 \geq \mu_{\textsc{S}}^{(1)}, u_2 \geq \mu_{\textsc{S}}^{(2)}\}$,
%added part
%\textbf{
i.e., the set of reward pairs that are achievable by randomizing over joint actions and give each player at least their security value.
%}
One reward pair satisfying several desirable properties is the egalitarian bargaining solution (EBS) \citep{TD20}, given by $(\mu_{\textsc{E}}^{(1)},  \mu_{\textsc{E}}^{(2)}) := \argmax_{(u_1, u_2) \in \mathcal{U}} \min_{i=1,2}\{u_i - \mu_{\textsc{S}}^{(i)}\}$.
%One reward pair satisfying several desirable properties is the egalitarian bargaining solution (EBS) \cite{TD20}, given by $(\mu_{\textsc{E}}^{(1)},  \mu_{\textsc{E}}^{(2)}) := \max_{(u_1, u_2) \in \mathcal{U}} \min\{u_1 - \mu_{\textsc{S}}^{(1)}, u_2 - \mu_{\textsc{S}}^{(2)}\}$.

The reward pairs over which we search for optimal benchmark values,
%in regret definitions
described in Section \ref{sec:sub:regretdef},
are subject to the following constraint of enforceability. To our knowledge, this definition, including the formalization of enforceability for finite punishment lengths, has not been provided in previous work on non-discounted games. However, see Definition 2.5.1 in \citet{MS06} for the discounted case.

\begin{definition}
\label{def:enf}
%Let $\mathcal{X}$ be a set of joint actions 
%and $\boldsymbol{\alpha} \in \Delta^{|\mathcal{X}|-1}$ be a probability vector over $\mathcal{X}$. Let $\mathbf{R}^{(i)}(\mathcal{X})$ be the vector of player $i$'s rewards from each point in $\mathcal{X}$, and 
Let $(u_1, u_2) \in \mathcal{U}$ be a convex combination
of the reward pairs of some set of joint actions $\mathcal{X}$.
Let $r(\mathcal{X}) := \max_{(x_1,x_2) \in \mathcal{X}} \{\max_{j \neq x_2} \mathbf{R}^{(2)}(x_1,j) - \mathbf{R}^{(2)}(x_1,x_2)\}$
be player 2's deviation profit.
%A reward pair $(\boldsymbol{\alpha}^\intercal \mathbf{R}^{(1)}(\mathcal{X}), \boldsymbol{\alpha}^\intercal \mathbf{R}^{(2)}(\mathcal{X})) \in \mathcal{U}$
Then $(u_1, u_2)$ is \textbf{$\epsilon$-enforceable}, relative to a memory length $K$ and $\epsilon > 0$, if:
\begin{align*}
    %K\boldsymbol{\alpha}^\intercal \mathbf{R}^{(2)}(\mathcal{X}) &\geq K\mu_{\textsc{S}}^{(2)} + r(\mathcal{X}) + \epsilon.
    Ku_2 &\geq K\mu_{\textsc{S}}^{(2)} + r(\mathcal{X}) + \epsilon.
\end{align*}
\end{definition}

Intuitively, if player 2 does not deviate from player 1's desired action sequence, player 2 receives
%$\boldsymbol{\alpha}^\intercal \mathbf{R}^{(2)}(\mathcal{X})$
$u_2$
on average
%in expectation
for each of $K$ steps. If player 2 deviates, gaining at most $r(\mathcal{X})$ profit, player 1 may punish by holding player 2 to its security value for $K$ steps. We call the total sequence reward ``enforceable'' if 
%it is at least $\epsilon$ more than the total deviation reward.
it exceeds the total deviation reward
by at least $\epsilon$.
Let $\mathcal{U}(\epsilon)$ be the set of $\epsilon$-enforceable rewards in $\mathcal{U}$. Then, the feasible region $\mathcal{U}(\epsilon)$, 
%that the EBS is computed over 
used to compute an enforceable version of the EBS,
shrinks with increasing~$\epsilon$ and decreasing~$K$.
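
To see this dependence explicitly, the constraint in Definition \ref{def:enf} can be divided by $K$ (a direct rearrangement, with no additional assumptions):
\begin{align*}
    u_2 &\geq \mu_{\textsc{S}}^{(2)} + \frac{r(\mathcal{X}) + \epsilon}{K}.
\end{align*}
The premium over the security value that player 2 must receive per step is therefore increasing in $\epsilon$ and, whenever $r(\mathcal{X}) + \epsilon > 0$, decreasing in $K$. When $r(\mathcal{X}) + \epsilon \leq 0$, the constraint is already implied by $u_2 \geq \mu_{\textsc{S}}^{(2)}$ and does not remove any point of $\mathcal{U}$.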

The $\epsilon$-enforceable EBS, which we will use to design one of the Leader experts, is found by solving the optimization problem from Section 3.2.4 of \citet{TD20} under the constraint in Definition \ref{def:enf}.
A similar procedure, applied to the objective of maximizing only player 1's reward, gives the Bully solution for the second
Leader expert.
We provide details on these solutions in the Appendix.

\begin{exmp}
\label{example:barg_concepts}
In Chicken (Figure \ref{fig:chicken}),
both players' security value is 0.25, guaranteed by playing action 1.
The EBS is given by 50\% weight on the top-right action pair, and 50\% on the bottom-left, giving both players $0.625$.
If player~1 plays its half of either action pair in the EBS, player 2 does worse by deviating
(by a margin of at least 0.25), so no punishment
is necessary to enforce
the EBS.
Thus the EBS is enforceable for any $K$ and $\epsilon < 0.375K + 0.25$.
\end{exmp}
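
The threshold on $\epsilon$ in Example \ref{example:barg_concepts} can be recovered from Definition \ref{def:enf} as follows. This is a worked calculation that takes $\mathcal{X} = \{(1,2), (2,1)\}$, the two off-diagonal joint actions of Figure \ref{fig:chicken}, and assumes that actions 1 and 2 index the rows and columns in order. Player 2's deviation profit is
\begin{align*}
    r(\mathcal{X}) &= \max\{\mathbf{R}^{(2)}(1,1) - \mathbf{R}^{(2)}(1,2), \\
    &\qquad\quad\ \mathbf{R}^{(2)}(2,2) - \mathbf{R}^{(2)}(2,1)\} \\
    &= \max\{0.5 - 1,\ 0 - 0.25\} = -0.25.
\end{align*}
Substituting $u_2 = 0.625$ and $\mu_{\textsc{S}}^{(2)} = 0.25$ into the enforceability constraint gives
\begin{align*}
    0.625K \geq 0.25K - 0.25 + \epsilon
    \iff \epsilon \leq 0.375K + 0.25.
\end{align*}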

\subsection{Objectives}\label{sec:sub:regretdef}

\noindent The metric of regret, which we aim to minimize, varies based on the class of player 2 our algorithm faces. For a player 2 algorithm $\mathfrak{B}$, regret with respect to a benchmark $\mu(\mathfrak{B})$ is $\mathcal{R}(T) := T\mu(\mathfrak{B}) - \sum_{t=1}^T R_t^{(1)}$.
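
Equivalently, by a direct rearrangement of this definition, player 1's average reward satisfies
\begin{align*}
    \frac{1}{T}\sum_{t=1}^T R_t^{(1)} &= \mu(\mathfrak{B}) - \frac{\mathcal{R}(T)}{T},
\end{align*}
so a sublinear bound $\mathcal{R}(T) = o(T)$ means that the average reward falls short of the benchmark by an amount that vanishes as $T$ grows.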

\paragraph{\textbf{Bounded Memory}} By condition (iii) for Bounded Memory, player 2 induces a communicating MDP.
Let $\Pi$ be the set of time-independent deterministic Markov policies. Then the state-independent optimal average reward is $\mu_*^{(1)} := \max_{\pi^{(1)} \in \Pi} \lim_{t \to \infty} \frac{1}{t} \mathbb{E}_{ \pi^{(1)}}(\sum_{i=0}^t R_i^{(1)}|S_0)$. Here, $\mu(\mathfrak{B}) = \mu_*^{(1)}$.
%\begin{align*}
%    \mathcal{R}_{BM}(T) &:= T\mu_*^{(1)} - \sum_{t=1}^T R_t^{(1)}.
%\end{align*}

\paragraph{\textbf{Adversarial}} Against an Adversarial player, an appropriate benchmark is the greatest expected value that player 1 can guarantee, no matter player 2's actions. This is player~1's security value: $\mu(\mathfrak{B}) = \mu_{\textsc{S}}^{(1)}$. Note the distinction from \textit{external regret} used in adversarial bandits and MDPs.
While the problem is trivial if player 2 is known to be Adversarial, since one can always play the maximin strategy, our challenge is to maintain low Adversarial regret without losing guarantees on other regret measures. This corresponds to \textit{safety} in multi-agent learning \citep{PS04}.

\paragraph{\textbf{Follower}} The concept of regret against a Follower is more complex.
Player 2's sequence of policies can vary significantly based on
player 1's actions. 
Evaluating our algorithm 
%with respect to 
by
the maximum average reward in hindsight would have to account for this counterfactual dependence \citep{C14}.
However, by considering enforceability, we can define benchmarks by lower bounds on this maximum,
%subject to the constraint given
constrained
by the Follower's fairness value $V^{(2)}$.
We consider two cases depending on $V^{(2)}$, focusing for simplicity on the extremes where the Follower either accepts nothing less than the EBS or accepts any enforceable bargain. In principle, our framework could be extended to other values of $V^{(2)}$.

First, the EBS
%\footnote{Other bargaining solutions such as Kalai–Smorodinsky and Nash, derived from different axioms, have been proposed. In principle our approach is compatible with these alternatives. We consider the EBS for consistency with \citet{TD20}, which combines bargaining and regret minimization perspectives.}
is Pareto efficient, meaning 
%that there is no pair of rewards in $\mathcal{U}$ such that at least 
%one player cannot receive strictly more than their EBS value without the other receiving strictly less than theirs.
player 1 cannot receive more than $\mu_{\textsc{E}}^{(1)}$ without player 2 receiving less than $\mu_{\textsc{E}}^{(2)}$.
%To the extent that
When
the EBS can be enforced
%it is possible to enforce the EBS
with a fixed policy, $\mu_{\textsc{E}}^{(1)}$ is thus an appropriate 
%regret 
benchmark if the fairness threshold $V^{(2)}$ is player 2's part of the EBS pair.
%, since we cannot achieve greater than $\mu_{\textsc{E}}^{(1)}$ without player 2 receiving less than $\mu_{\textsc{E}}^{(2)}$.
The EBS is not always enforceable for finite $K$, however. 
In this case, 
%we seek
the enforceable version of the
EBS is 
the maximizer
$(\mu_{\textsc{E},\epsilon}^{(1)}, \mu_{\textsc{E},\epsilon}^{(2)})$ of the objective $f(u_1, u_2) = \min_{i=1,2}\{u_i - \mu_{\textsc{S}}^{(i)}\}$ in $\mathcal{U}(\epsilon)$ for some $\epsilon > 0$. 
For this first case,
we therefore consider $V^{(2)} = \mu_{\textsc{E},\epsilon}^{(2)}$, where player 2 follows conditionally. If $\mathcal{U}(\epsilon)$ is empty, $(\mu_{\textsc{E},\epsilon}^{(1)}, \mu_{\textsc{E},\epsilon}^{(2)}) := (\mu_{\textsc{S}}^{(1)}, \mu_{\textsc{S}}^{(2)})$. We set $\mu(\mathfrak{B}) = \mu_{\textsc{E},\epsilon}^{(1)}$.

The second case is $V^{(2)} = 0$, i.e., player 2 follows unconditionally. Here, we compute the maximizer over $\mathcal{U}(\epsilon)$ of $f(u_1, u_2) = u_1$.
Let $(\mu_{\textsc{B},\epsilon}^{(1)}, \mu_{\textsc{B},\epsilon}^{(2)})$ be the solution to this optimization problem (the \textit{Bully values}), or $(\mu_{\textsc{B},\epsilon}^{(1)}, \mu_{\textsc{B},\epsilon}^{(2)}) := (\mu_{\textsc{S}}^{(1)}, \mu_{\textsc{S}}^{(2)})$ if no solution exists. We define $\mu(\mathfrak{B}) = \mu_{\textsc{B},\epsilon}^{(1)}$.

While these regret metrics 
%address the adaptability half of our motivating tradeoff,
provide standards for
adaptability,
we must also formalize non-exploitability. 
%In contrast to the sense of exploitability used by \citet{GS15},
We seek a guarantee on an algorithm's performance against its best response.
%, rather than the weaker standard of worst-case performance.
%Analogous to the problem discussed in our definition of Follower regret,
It is unclear how to characterize the best response to an algorithm capable of adapting to several opponent classes. Given this, we focus on a tractable and practically relevant subproblem: guaranteeing that the best response to our algorithm is not a ``bully'' in the sense discussed in the introduction, which is the most common exploitative strategy in the MARL literature \citep{PS05,LS01,Press10409,LS05}.
Even this weaker guarantee is absent from previous work, and we show numerically in Section \ref{sec:experiments} that this suffices for our algorithm to be in learning equilibrium with itself
(see Section \ref{sec:intro}) in a pool of top-performing algorithms.

\begin{definition}
Let player 2 be Bounded Memory, 
%$\mu_*^{(1)}$ be the optimal reward against player 2, 
and $\mu_{\textsc{M}}^{(1)}$ and $\mu_{\textsc{M}}^{(2)}$ be the expected rewards for players 1 and 2 
when player 1 uses $\mathbf{v}^{(1)}_{\textsc{M}}$ and
player 2 uses $\pi^{(2)}$.
%in the Markov reward process induced by player 1's $\mathbf{v}^{(1)}_{\textsc{M}}$ and $\pi^{(2)}$. 
An algorithm $\mathfrak{A}$ is
%\textbf{$\slackexpl$-non-exploitable with respect to $V^{(1)}$} 
\textbf{$(V^{(1)},\slackexpl)$-non-exploitable} 
if, whenever
%both
$\mu_*^{(1)} < V^{(1)} - \slackexpl$ and $\mu_{\textsc{M}}^{(2)} > \mu_{\textsc{E},\epsilon}^{(2)}$, for all $c > 0$ player 2's regret with respect to $\mu_{\textsc{E},\epsilon}^{(2)} + c$ against $\mathfrak{A}$ is $\Omega(T)$.
\end{definition}

Our algorithm is exploitable if player 2 can profit
(do better than $\mu_{\textsc{E},\epsilon}^{(2)}$)
%by using 
from
a policy against which we cannot achieve close to
some value corresponding to a standard of fairness.
%(This value is a hyperparameter chosen by the user.)
The hyperparameter $V^{(1)}$ tunes the tradeoff
between exploitability and
flexibility to various opponents.
Player 2 does \textit{not} profit from
exploitation if they incur linear regret.
%We restrict to Bounded Memory in this definition because fixed policies (often called Stackelberg leaders) are the most common exploitative strategies in MARL literature \citep{PS05, LS01,Press10409, MS06, LS05}.

\begin{exmp}
In Chicken (Figure \ref{fig:chicken}), let $V^{(1)} = 0.625$ (i.e., the EBS value), and consider the following strategies: (a) always play action 2, (b) always play the opponent's last action,
and (c) play the best response to the empirical distribution of the opponent's past actions. Strategy (a) is Bounded Memory and exploitative. Thus, we argue that an effective algorithm should avoid playing the ``best response'' of action 1, instead discouraging the use of this strategy by, e.g., consistently playing the EBS (see Egalitarian Leader in the next section). Strategy (b) is also Bounded Memory, but it is not exploitative, since one can achieve at least $V^{(1)}$ against this player on average. Our algorithm should therefore learn the best response to (b). Strategy (c) is a Follower with $V^{(2)} = 0$, so our algorithm should converge to consistently playing action 2 against (c), achieving the Bully value.
\end{exmp}

\section{Lead and Follow Fairly (LAFF)}\label{sec:ergalgo}

We apply an expert algorithm to a set of experts designed for our target classes. Expert algorithms 
use an active expert to choose an action at a given time,
%choose an action at a given step recommended by an active expert from some set,
and switch active experts based on their relative performance \citep{C14}.
LAFF switches experts sequentially, going to the next expert in a predefined sequence only
if the rewards obtained by its active expert fall short of the current target value.
Some of the experts are also designed to guarantee non-exploitability.

\subsection{Description of Experts}

\noindent %Each expert will take as input an epoch length $H$, used by LAFF. Experts for Bounded Memory and Adversarial opponents also take $(V^{(1)}, \mu_{\textsc{E},\epsilon}^{(2)})$, the fair values chosen for players 1 and 2, and $y$, a slack variable for the Adversarial regret guarantee.
LAFF uses an active expert for an epoch of length $H$ before checking whether to switch. Let $\tau$ be the time elapsed since LAFF started using the current instance of the active expert (at time $t_i + 1$), and define $\overline{r}^{(1)}_{i,\tau} := \frac{1}{\tau} \sum_{t=t_i + 1}^{t_i + \tau} R^{(1)}_t$ and $\overline{r}^{(2)}_{i,\tau} := \frac{1}{\tau-K} \sum_{t=t_i + K + 1}^{t_i + \tau} R^{(2)}_t$. See Figure \ref{flowchart} for a summary of algorithmic elements that these experts depend on.
%In this section, we consider an arbitrary epoch indexed by $i$, letting $t_i$ be the start time of that epoch.

\begin{figure}
\centering
  \tikz{
% nodes
 \node[obs, xshift=-3.5cm] (f) {$\phi_F$}; %
 \node[obs, xshift=-2cm] (e) {$\phi_E$}; %
 \node[obs, xshift=-0.5cm] (m) {$\phi_M$}; %
 \node[obs, xshift=1cm] (b) {$\phi_B$}; %
 \node[latent, rectangle, above=of f, yshift=-0.5cm] (q) {Q-learning};
 \node[latent, rectangle, above=of e, yshift=-0.5cm, xshift=0.75cm] (v) {$\mathbf{v}^{(1)}_{\textsc{M}}$};
 \node[latent, rectangle, above=of e, yshift=-0.5cm, xshift=2.25cm] (p) {$\mathbf{v}^{(1)}_P$};
 %\draw (0,1.69) circle(.36cm);
% plate
% \plate [inner sep=.25cm,yshift=.2cm] {plate1} {(x)(y)(z)(a)(s)} {}; %
% edges
 \edge {q} {f}
 \edge {v} {e,m,b}
 \edge {p} {e,b}
 \edge {e} {f,m}
 }
\caption{Algorithmic components (white) of LAFF's experts (gray). An arrow from one node to another means that the former is used to compute the latter's output.}
%\Description{Flowchart of LAFF's experts' dependencies on algorithmic components. The arrows between nodes are as follows: from Q-learning to Conditional Follower; from maximin strategy to Egalitarian Leader, Conditional Maximin, and Bully Leader; from punishment strategy to Egalitarian Leader and Bully Leader; and from Egalitarian Leader to Conditional Follower and Conditional Maximin.}
\label{flowchart}
\end{figure}

\paragraph{\textbf{Conditional Follower ($\phi_F$)}}
Recall the
benchmarks $\mu_{\textsc{B},\epsilon}^{(1)}$,
$\mu_{\textsc{E},\epsilon}^{(1)}$, and
$\mu_{\textsc{S}}^{(1)}$ from
Section \ref{sec:sub:regretdef}.
To handle cases where
%optimal rewards
$\mu_*^{(1)}$
against a Bounded Memory player 2
lies
between these values,
LAFF uses $\phi_F$ multiple times in the sequence (called ``instances''). This expert starts off equivalent to Optimistic Q-learning \citep{optQ}, whose regret bound
(in an MDP with $S$ states and $A$ actions)
with probability at least $1-\delta$ is $\mathcal{R}_{Q}(\tau, \delta) = \mathcal{O}((SA\log(\frac{\tau}{\delta}))^{1/3}\tau^{2/3})$. After each \textit{subepoch} of length $H^{1/2}$, if $\overline{r}^{(1)}_{i,\tau} < V^{(1)} - \frac{\mathcal{R}_{Q}(\tau, \delta/T)}{\tau}$, this expert switches to the Egalitarian Leader $\phi_E$ (below) for as long as \textit{any} instance of $\phi_F$ is used. Otherwise, it uses Optimistic Q-learning for the next subepoch.
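
To see how quickly this exploitation test tightens, one can substitute the confidence level $\delta/T$ into the bound stated above. This is a direct calculation with constants suppressed, not an additional result:
\begin{align*}
    \frac{\mathcal{R}_{Q}(\tau, \delta/T)}{\tau}
    = \mathcal{O}\left(\frac{(SA\log(\tau T/\delta))^{1/3}\,\tau^{2/3}}{\tau}\right)
    = \mathcal{O}\left(\left(\frac{SA\log(\tau T/\delta)}{\tau}\right)^{1/3}\right).
\end{align*}
The slack below $V^{(1)}$ therefore shrinks at the rate $\tau^{-1/3}$ up to logarithmic factors: the longer $\phi_F$ runs, the closer its average reward must stay to $V^{(1)}$ to avoid triggering $\phi_E$.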

\paragraph{\textbf{Conditional Maximin ($\phi_M$)}}
%First, using any standard zero-sum Nash equilibrium solver, we compute $\mathbf{v}^{(1)}_{\textsc{M}}$.
Initially, $\phi_M$ uses the policy $\pi^{(1)}(\cdot|s) = \mathbf{v}^{(1)}_{\textsc{M}}$ for all $s$. Let $\slackmaximin > 0$ be a slack variable, chosen based on the class of Adversarial players considered in Theorem \ref{hedge}. After each subepoch, if $\overline{r}^{(2)}_{i,\tau} > \mu_{\textsc{E},\epsilon}^{(2)} - \slackmaximin + \sqrt{\frac{\log(T/\delta)}{2(\tau-K)}}$, this expert switches to $\phi_E$ for the rest of the game. Otherwise, it uses $\mathbf{v}^{(1)}_{\textsc{M}}$ for the next subepoch.
%As above, $\phi_M$ proceeds in two phases. First, using any standard zero-sum Nash equilibrium solver, we compute $\mathbf{v}^{(1)}_{\textsc{M}}$. The policy used for the first $h = o(H)$ %$\frac{H}{2}$
%time steps is $\pi^{(1)}(\cdot|s) = \mathbf{v}^{(1)}_{\textsc{M}}$ for all $s$, the maximin policy. We then compute $\overline{r}^{(1)}_{M,i} := \frac{1}{h-K} \sum_{t=t_i+K}^{t_i + h - 1} R^{(1)}_t$ and $\overline{r}^{(2)}_{M,i}  := \frac{1}{h-K} \sum_{t=t_i+K}^{t_i + h - 1} R^{(2)}_t$. If $\overline{r}^{(1)}_{M,i} < V^{(1)} - \frac{z}{2}$ and $\overline{r}^{(2)}_{M,i} > \mu_{\textsc{E},\epsilon}^{(2)} + \frac{\slackmaximin}{2}$, this expert uses $\phi_E$ %\mathbf{v}^{(1)}_P
%for the remainder of the epoch; otherwise, it continues using $\mathbf{v}^{(1)}_{\textsc{M}}$.

\paragraph{\textbf{Egalitarian Leader ($\phi_E$)}} If there is no enforceable EBS, let $\phi_E \equiv \mathbf{v}^{(1)}_{\textsc{M}}$.
Otherwise, let the EBS action pairs be denoted $(a_{\textsc{E}}^{(1)}(y), a_{\textsc{E}}^{(2)}(y))$ for $y=0,1$,
and the weight on the first action pair
be $\alpha_{\textsc{E}}$.
While $\epsilon$-enforceability requires that a punishment of length $K$ is sufficient to make a reward pair player 2's best response, this length may not be \textit{necessary}.
We therefore consider the least harsh punishment (if any) needed to enforce the EBS, that is, the value $K' \leq K$ satisfying $K' = \max\Big\{0, \Big \lceil \frac{r(\{(a_{\textsc{E}}^{(1)}(0), a_{\textsc{E}}^{(2)}(0)), (a_{\textsc{E}}^{(1)}(1), a_{\textsc{E}}^{(2)}(1))\}) + \epsilon}{\mu_{\textsc{E},\epsilon}^{(2)} - \mu_{\textsc{S}}^{(2)}} \Big \rceil \Big\}$.
%If the numerator is negative, that is, player 2 has no $\epsilon$-profitable deviation, $K' = 0$ is possible.
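
As an illustration, consider Chicken with the values from Example \ref{example:barg_concepts}, where $\mu_{\textsc{E},\epsilon}^{(2)} - \mu_{\textsc{S}}^{(2)} = 0.625 - 0.25 = 0.375$. This sketch assumes that $r(\cdot)$ evaluates to $-0.25$ there, i.e., the smallest margin by which deviating hurts player 2 according to that example. The two values of $\epsilon$ are arbitrary choices made for illustration:
\begin{align*}
    \epsilon = 0.1: &\quad K' = \max\Big\{0, \Big\lceil \frac{-0.25 + 0.1}{0.375} \Big\rceil\Big\} = \max\{0, \lceil -0.4 \rceil\} = 0, \\
    \epsilon = 0.4: &\quad K' = \max\Big\{0, \Big\lceil \frac{-0.25 + 0.4}{0.375} \Big\rceil\Big\} = \max\{0, \lceil 0.4 \rceil\} = 1.
\end{align*}
Under these assumptions, no punishment is needed for the smaller margin, while a single punishment step suffices for the larger one.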

Let $\mathbf{v}^{(1)}_P := \argmin_{\mathbf{v}_1} \max_{\mathbf{v}_2} \mathbf{v}_1^\intercal\mathbf{R}^{(2)}\mathbf{v}_2$, player 1's punishment strategy.
%added part
%\textbf{
Recall that policies in our framework are conditioned on binary signals $Y_t^{(i)}$,
whose distributions are determined
by players' reported weights $w_t^{(i)}$.
%}
Then, for the first ${K'}$ time steps, with the realized value $y_{t}^{(1)}$ of the signal given by $w_t^{(1)} = \alpha_{\textsc{E}}$ for all $t$, $\phi_E$ plays $a_{\textsc{E}}^{(1)}(y_{t}^{(1)})$. (This
%forgiving start
ensures that, if LAFF switches to $\phi_E$ mid-game, player 2 is not punished for
having played actions other than the EBS
before LAFF started signaling enforcement of the EBS.) Afterwards, $\phi_E$ uses the following stationary policy. If, for any of the past $K'$ time steps, player 2 has played $A^{(2)}_t \neq a_{\textsc{E}}^{(2)}(y_t^{(2)})$
--- i.e., deviated from the EBS ---
the distribution over actions for that state is $\mathbf{v}^{(1)}_P$. Otherwise, $a_{\textsc{E}}^{(1)}(y_t^{(1)})$ is played.
%\adigi{should this maybe be a one-shot NE instead? but then might miscoordinate anyway}

\paragraph{\textbf{Bully Leader ($\phi_B$)}} This expert is defined like $\phi_E$, but using the Bully solution from Section \ref{bargtheory}
(maximizing the selfish objective).
If there is no enforceable Bully solution, let $\phi_B \equiv \mathbf{v}^{(1)}_{\textsc{M}}$. Otherwise, let the solution be given by the action pairs $(a_{\textsc{B}}^{(1)}(y), a_{\textsc{B}}^{(2)}(y))$ for $y=0,1$ and the weight $\alpha_{\textsc{B}}$, and define $\phi_B$ just as $\phi_E$ for this solution.

\subsection{Algorithm}

We design the selection of experts by LAFF (Algorithm \ref{followfirst}) such that, for any of our target classes, LAFF eventually
%identifies and
commits to the optimal expert against player 2 in a sequence $\{\phi_j\}_j$.
Over an epoch, the active expert is executed,
and we update this expert's average rewards
since it was made active (line \ref{record}). Afterwards, LAFF switches to the next expert in the schedule if and only if it rejects the hypothesis that the current expert's expected value exceeds its corresponding target $\mu_j$ (line \ref{baselinecheck}).
The false positive rate of this hypothesis test is controlled by a function $\mathcal{B}$, which
decreases with 
%the square root of time elapsed since the last switch
$\sqrt{\tau}$.
%(line \ref{tauup}).
We define $\mathcal{B}$ in the proof of Lemma \ref{followregret} (see Appendix).
\multilinecomment{
The false positive rate of this hypothesis test is controlled by a function $\mathcal{B}$ of the time elapsed since the last switch (line \ref{tauup}), defined:
%derived from the regret bound of our Conditional Follower as follows:
\begin{align*}
    \xi(\epsilon, r) &:= \begin{cases}
    \frac{\epsilon}{2K'},& \text{if } r \geq 0\\
    \frac{\epsilon + r}{2K'},& \text{if } -\epsilon < r < 0\\
    -r,              & \text{otherwise},
\end{cases} \\
    \mathcal{B}(\tau) &:= \frac{1}{\tau} \cdot \frac{K'\xi(\epsilon, r(\mathcal{X})) + C_1T_0 + K'+1}{\xi(\epsilon, r(\mathcal{X}))} \\
    %&+ \frac{1}{\tau^{1/2}}\left(\frac{C_2\mathcal{R}_{Q}(\tau, \frac{\delta}{T})}{\tau^{1/2}\xi(\epsilon, r(\mathcal{X}))} + \frac{3 + \xi(\epsilon, r(\mathcal{X}))}{\xi(\epsilon, r(\mathcal{X}))}\sqrt{\frac{\log(\frac{T}{\delta})}{2}}\right). \\
    &+ \frac{1}{\tau} \cdot \frac{C_2\mathcal{R}_{Q}(\tau, \frac{\delta}{T}) + (3 + \xi(\epsilon, r(\mathcal{X})))\sqrt{\frac{\tau \log(\frac{T}{\delta})}{2}}}{\xi(\epsilon, r(\mathcal{X}))}.
\end{align*}
Where $\mathcal{X} = \mathcal{X}_{\textsc{B}} := \{(a_{\textsc{B}}^{(1)}(y), a_{\textsc{B}}^{(2)}(y))\}_{y=0,1}$ for expert index $j \leq 2$, $\mathcal{X} = \mathcal{X}_{\textsc{E}} := \{(a_{\textsc{E}}^{(1)}(y), a_{\textsc{E}}^{(2)}(y))\}_{y=0,1}$ for $j > 2$, and $\delta > 0$ is some confidence level.
}
Because $\mu_{\textsc{B},\epsilon}^{(1)} \geq \mu_{\textsc{E},\epsilon}^{(1)} \geq \mu_{\textsc{S}}^{(1)}$, and the optimal reward $\mu_*^{(1)}$ against a Bounded Memory player may be greater than $\mu_{\textsc{B},\epsilon}^{(1)}$ or lie between these values, $\{\phi_j\}_{j}$ orders the experts by the optimal average reward they could achieve against the corresponding player 2 class (line \ref{initline}).

\section{Analysis}

%The regret guarantee of Algorithm \ref{followfirst} follows by case-by-case analysis of each expert in $\Phi$ against the corresponding player 2, and by showing that LAFF commits to an expert if and only if doing so provides low regret with respect to player 2's class.
%We show that LAFF's hypothesis tests and the Egalitarian Leader ensure non-exploitability.
We will now show that LAFF meets our key criteria of adaptability
and non-exploitability.
%Proofs of lemmas and the detailed proof of Theorem \ref{hedge} are in the Appendix.
See Appendix for proofs of lemmas and the detailed proof of Theorem \ref{hedge}.
Lemma \ref{followregret} shows that
%by the construction of the Leader experts,
with high probability player 2's rewards against $\phi_E$ are not much greater than the EBS
(thus non-exploitability is feasible),
and player 1's rewards against a Follower are near the target when the correct Leader is used.

\begin{algorithm}
\caption{Lead and Follow Fairly (LAFF)}\label{followfirst}
\begin{algorithmic}[1]
\State \textbf{Init} target schedule $\{\mu_j\}_j = \{\mu_{\textsc{B},\epsilon}^{(1)}, \mu_{\textsc{B},\epsilon}^{(1)},\mu_{\textsc{E},\epsilon}^{(1)},\mu_{\textsc{E},\epsilon}^{(1)},$ $\mu_{\textsc{S}}^{(1)}\}$, expert schedule $\{\phi_j\}_j = \{\phi_F, \phi_B, \phi_F, \phi_E,$ $\phi_F, \phi_M\}$, expert index $j = 1$, $\tau = 0$, $R_\tau = 0$%, epoch length $H$
\label{initline}
\For{$i=1,2,\dots,\ceil{T/H}$}
    \For{$t=(i-1)H + 1,\dots,\min\{iH, T\}$}
        \State Run expert $\phi_j$
        \State $R_\tau \leftarrow R_\tau + \mathbf{R}^{(1)}(A^{(1)}_t, A^{(2)}_t)$ \label{record}
    \EndFor
    \State $\tau \leftarrow \tau + H$ \label{tauup}
    \If{$j < |\{\phi_j\}_j|$ and $\frac{R_\tau}{\tau} < \mu_j - \mathcal{B}(\tau)$} \label{baselinecheck}
        \State $j \leftarrow j +1$, $\tau \leftarrow 0$, $R_\tau \leftarrow 0$
    \EndIf
\EndFor
\end{algorithmic}
\end{algorithm}


%\multilinecomment{
\begin{lemma}
\label{followregret}
\textbf{(Reward Bounds When LAFF Leads)} 
If player 1 uses $\phi_E$ over a sequence of length $\tau+K'$ starting at time $t^*+1$, then 
with probability at least $1- \frac{3\delta}{T}$:
%Let $t^*+1$ be the start time in a sequence of length $\tau+K'$. If player 1 uses $\phi_E$ over this sequence, then with probability at least $1- \frac{3\delta}{T}$:
\begin{align*}
    &\sum_{t=t^*+K'+1}^{t^* + K' + \tau} R^{(2)}_t \leq K' + 1 + \tau\mu_{\textsc{E},\epsilon}^{(2)} + 3\sqrt{\textstyle{\frac{1}{2}}\tau\log(\frac{T}{\delta})}.
\end{align*}
If player 2 is a Follower with $V^{(2)} = 0$, and player 1 uses $\phi_B$, then with probability at least $1- \frac{5\delta}{T}$, we have $\overline{r}^{(1)}_{i,\tau} \geq \mu_{\textsc{B},\epsilon}^{(1)} - \mathcal{B}(\tau)$.
If $V^{(2)} = \mu_{\textsc{E},\epsilon}^{(2)}$, and player 1 uses $\phi_E$, then with probability at least $1- \frac{5\delta}{T}$, we have $\overline{r}^{(1)}_{i,\tau} \geq \mu_{\textsc{E},\epsilon}^{(1)} - \mathcal{B}(\tau)$.
\end{lemma}
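
Dividing the first bound of Lemma \ref{followregret} by $\tau$ makes the per-step guarantee explicit. This is a direct rearrangement of the stated inequality:
\begin{align*}
    \frac{1}{\tau}\sum_{t=t^*+K'+1}^{t^* + K' + \tau} R^{(2)}_t
    \leq \mu_{\textsc{E},\epsilon}^{(2)} + \frac{K' + 1}{\tau} + 3\sqrt{\frac{\log(T/\delta)}{2\tau}}.
\end{align*}
In particular, if $\tau \geq cT$ for some constant $c \in (0,1]$ (a symbol introduced here only for illustration) and $K'$ is treated as a constant, player 2's average reward exceeds $\mu_{\textsc{E},\epsilon}^{(2)}$ by at most $\mathcal{O}\big(\sqrt{\log(T/\delta)/T}\big)$.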

Lemma \ref{conditional_experts} guarantees that with high probability, LAFF follows or uses the maximin strategy against non-exploitative players, and punishes exploitative players.

\begin{lemma}
\label{conditional_experts}
\textbf{(False Positive and Negative Control of Exploitation Test)} Consider a sequence of $k$ epochs each of length $H$.
%against a Bounded Memory player 2.
Let $m^*_{F}$ or $m^*_{M}$ be, respectively, the index of the \textit{subepoch} within this sequence at the start of which $\phi_F$ or $\phi_M$ switches to punishing with $\phi_E$, if at all (if not, let $m^*_{F}$ or $m^*_{M} = \infty$). Let $\slackexpl \geq \frac{2\mathcal{R}_{Q}(H/2, \delta/T)}{H} + \sqrt{\frac{2S^2A\log(c_0/\delta)}{c_1H}}$, where $c_0, c_1$ are defined as in Theorem 5.1 of \citet{MT05}, and $\slackmaximin \geq \sqrt{\frac{\log(T/\delta)}{2(H/2-K)}} + \sqrt{\frac{64e\log(N_q/\delta^2)}{(1-\lambda)(H/2-K)}}$, where $\lambda$ and $N_q$ are constants with respect to time defined in Lemma 4 (see Appendix).

Then, suppose player 2 is Bounded Memory, and $\phi_F$ is used. If $\mu_*^{(1)} < V^{(1)} - \slackexpl$, then with probability at least $1-\delta$, $m^*_{F} \leq \ceil{\frac{H^{1/2}}{2}}$. If $\mu_*^{(1)} \geq V^{(1)}$, then with probability at most $\frac{kH^{1/2}\delta}{T}$, $m^*_{F} < \infty$. If $\phi_M$ is used, and $\mu_{\textsc{M}}^{(2)} > \mu_{\textsc{E},\epsilon}^{(2)}$, then with probability at least $1-\delta$, $m^*_{M} \leq \ceil{\frac{H^{1/2}}{2}}$.

Suppose player 2 is Adversarial, with a sequence of action distributions $\{\pi^{(2)}_t\}$ such that, for any $M \geq H^{1/2} - K$ and $i$, $\frac{1}{M} \sum_{t=i+1}^{i+M} {\mathbf{v}^{(1)}_{\textsc{M}}}^\intercal \mathbf{R}^{(2)} \pi^{(2)}_t \leq \mu_{\textsc{E},\epsilon}^{(2)} - \slackmaximin$. Then, if $\phi_M$ is used, with probability at most $\frac{kH^{1/2}\delta}{T}$, $m^*_{M} < \infty$.
\end{lemma}
%}

%added part
%\textbf{
Our main result, Theorem \ref{hedge}, states that 1)
against each of our target classes,
LAFF achieves a regret bound of the same order
as Optimistic Q-learning
in single-agent MDPs \citep{optQ},
and 2) LAFF satisfies non-exploitability.
%}

\begin{theorem}
\label{hedge}
Let $\mathcal{C}$ be the set of player 2 algorithms that are any of the following:
\begin{itemize}
    \item Adversarial, with a sequence of action distributions $\{\pi^{(2)}_t\}$ such that $\frac{1}{M} \sum_{t=i+1}^{i+M} {\mathbf{v}^{(1)}_{\textsc{M}}}^\intercal \mathbf{R}^{(2)} \pi^{(2)}_t \leq \mu_{\textsc{E},\epsilon}^{(2)} - \slackmaximin$ for any $M \geq T^{1/4}$ and $i$,
    \item Follower, with $V^{(2)} \in \{0, \mu_{\textsc{E},\epsilon}^{(2)}\}$, or
    \item Bounded Memory, with 
    %a policy $\pi^{(2)}$ such that
    %in the induced MDP,
    $\mu_*^{(1)} \geq V^{(1)}$.
\end{itemize}
Let $\slackmaximin$ and $\slackexpl$ satisfy the conditions of Lemma \ref{conditional_experts}.
Then, with probability at least $1-5\delta$, LAFF satisfies:
\begin{align*}
    \max_{\mathcal{C}} \mathcal{R}(T) &= \mathcal{O}(\mathcal{R}_{Q}(T, \delta/T)).
\end{align*}
Further, with probability at least $1-6\delta$, LAFF is 
%$\slackexpl$-non-exploitable
$(V^{(1)},\slackexpl)$-non-exploitable 
when there exists an enforceable EBS.
\end{theorem}

If there is no enforceable EBS, $\mu_{\textsc{E},\epsilon}^{(2)} = \mu_{\textsc{S}}^{(2)}$ and so we cannot guarantee player 2 does worse than $\mu_{\textsc{E},\epsilon}^{(2)}$ in expectation.
The class of Adversarial players for which Theorem \ref{hedge} holds is technically restrictive. However, non-exploitability requires that for each strategy (expert) used by our algorithm that could be exploited, including Conditional Maximin, we exclude from our target class some subset of opponents. That is, we cannot guarantee low Adversarial regret against players who receive more than the EBS value against maximin, because such players may exploit us.

\begin{sketch}
For each opponent class, we need to show that with high probability LAFF does not lock in to
%(that is, fail to switch from)
a suboptimal expert for that class. If LAFF locks in to an expert for which the corresponding target value $\mu_j$ is \textit{greater} than the opponent's benchmark $\mu(\mathfrak{B})$, this implies LAFF consistently receives rewards such that ``regret'' with respect to $\mu_j$ grows like $\mathcal{R}_Q$, by design of $\mathcal{B}(\tau)$. But since the benchmark is less than $\mu_j$, the true regret is also bounded as desired.
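
To spell out this step, suppose LAFF never leaves expert $j$ (with $j$ not the last index), so that the test on line \ref{baselinecheck} of Algorithm \ref{followfirst} fails at every epoch boundary. With $\tau$ and $R_\tau$ as in Algorithm \ref{followfirst}, the following sketch considers only epoch boundaries and ignores rewards collected before the last switch:
\begin{align*}
    \frac{R_\tau}{\tau} \geq \mu_j - \mathcal{B}(\tau)
    \;\Longrightarrow\;
    \tau\mu(\mathfrak{B}) - R_\tau \leq \tau\mu_j - R_\tau \leq \tau\mathcal{B}(\tau),
\end{align*}
where the first inequality on the right uses $\mu(\mathfrak{B}) < \mu_j$. The regret accumulated since the last switch is thus at most $\tau\mathcal{B}(\tau)$, which is of the order of $\mathcal{R}_Q$ when $\mathcal{B}$ is designed as described above.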

We therefore only need to consider the cases of $\mu_j \leq \mu(\mathfrak{B})$. First, we know that each expert achieves at most $\mathcal{R}_Q$ regret against its target opponent class, by, respectively: the definitions of $\mathcal{R}_Q$ (for non-exploitative Bounded Memory) and maximin (for Adversarial), and Lemma \ref{followregret} (for Followers). %For the former cases, 
Lemma \ref{conditional_experts} ensures with high probability that $\phi_F$ and $\phi_M$ do not switch to $\phi_E$ when not exploited, so they inherit the desired regret bounds.

Then, we need only show that once LAFF reaches the expert
whose target class matches the opponent
%for which the opponent is in its target class
(thus guaranteeing low regret using that expert), with high probability LAFF does not switch. 
%This necessarily holds, because 
But
if using the corresponding expert gives LAFF low regret with respect to $\mu(\mathfrak{B}) \geq \mu_j$, then its rewards are sufficiently high that the condition for switching experts (line \ref{baselinecheck} of Algorithm \ref{followfirst}) never holds. The first claim of the theorem follows.

\begin{figure*}[ht]
    \centering
    \begin{tabular}{ccc}
       \ \ Unconditional Follower (Q-Learning) & \ \ \ \ \ \ \ \ Conditional Follower (LAFF) & \ \ \ \ \ \ \ Bounded Memory (FTFT) \\
       \includegraphics[width=5.3cm]{figures/1.png} & \includegraphics[width=5.3cm]{figures/2.png} & \includegraphics[width=5.3cm]{figures/3.png} \\
    \end{tabular}
    \begin{tabular}{cc}
        \ \ \ \ \ \ \ \ \ Adversarial (Manipulator) & \ \  \ \ \ \ \ \  Exploitative (Bully) \\
        \includegraphics[width=5.3cm]{figures/4.png} & \includegraphics[width=5.3cm]{figures/5.png} \\
    \end{tabular}
    \caption{The first four plots show LAFF's average regret, in each of 11 games detailed in the Appendix, for the following opponents: Unconditional Follower (Q-Learning), Conditional Follower (LAFF), Bounded Memory (FTFT), Adversarial (Manipulator). The last plot shows the regret of an Exploitative (Bully) algorithm against LAFF.}
    \label{fig:regrets}
    %\Description{Five plots of cumulative regret curves for each of 11 games listed in the Appendix. The x-axis is time, from 0 to 200,000. The y-axis is cumulative regret (for player 1 in the first four plots, and for player 2 in the last). In the first and third plots, all but three games' regret curves rise for less than 10,000 time steps, then remain approximately constant. The remaining curves increase linearly for the whole time horizon. In the second plot, all regret curves become constant after no more than 50,000 time steps, with most flattening after at most 10,000 steps. In the fourth plot, one curve rises linearly, while the others are constant at approximately zero or decrease approximately linearly. In the last plot, all curves increase linearly, with varying slopes; one curve is approximately constant at zero, with a very small slope.}
\end{figure*}

To show non-exploitability, suppose LAFF locks in to the first instance of $\phi_F$. By Lemma \ref{conditional_experts}, $\phi_F$ detects evidence of exploitation sufficiently early that the time remaining in the game is linear in $T$. After detecting exploitation, $\phi_F$ plays the same policy as $\phi_E$. But by Lemma \ref{followregret}, against this policy player 2 cannot guarantee an average reward greater than $\mu_{\textsc{E},\epsilon}^{(2)}$ plus a term that vanishes at a rate $T^{-1/2}$ (up to logarithmic factors). The second claim of the theorem follows for the other possible locked-in experts as well by considering two facts. First, whenever $\phi_E$ or $\phi_B$ is used, Lemma \ref{followregret} again bounds player 2's rewards, since by Pareto efficiency of the EBS player 2's rewards from the Bully solution cannot exceed $\mu_{\textsc{E},\epsilon}^{(2)}$. Second, if LAFF reaches $\phi_M$, again Lemma \ref{conditional_experts} ensures sufficiently fast detection of exploitation with high probability.
\end{sketch}

\section{Numerical Experiments}\label{sec:experiments}

Code for the experiments in this section is available on
GitHub.\footnote{\url{https://github.com/digiovannia/ad_expl}}
We evaluate LAFF by three empirical metrics. First,
we find LAFF's empirical regret against one
algorithm from each target class.
%LAFF plays games against one algorithm from each target class, to check that the regret bounds hold in practice.
Second, LAFF and a set of top-performing repeated games algorithms compete in a round-robin tournament. For each algorithm, we find its rewards against its best response algorithm in this set,
and check if it is in a learning equilibrium by applying a Nash equilibrium solver \citep{Knight2018} to the matrices of empirical rewards for algorithm pairs.
%\footnote{We use Nashpy for Python \citep{Knight2018}.}
These criteria evaluate exploitability: more exploitable algorithms have lower rewards against algorithms that optimize against them,
and an exploitable algorithm cannot be in equilibrium with itself unless the fairness threshold $V^{(1)}$ is low. Finally, we perform a replicator dynamic simulation \citep{CO18}. Each generation, the algorithms' fitness values are computed as averages of the round-robin scores weighted by the distribution of the population of algorithms. Then, the population distribution is updated in proportion to fitness. This evaluates how well a given algorithm performs when the distribution of its opponents is determined by those algorithms' own performance. 
Exploitability is thus implicitly penalized by accounting for opponents' incentives.
Details on the implementation of these experiments are in the Appendix. We set $V^{(1)} = \mu_{\textsc{E},\epsilon}^{(1)}$.
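
One common way to write such an update is the discrete-time replicator equation below. This is only a sketch: the symbols $p_g(a)$ (population share of algorithm $a$ in generation $g$), $M(a,b)$ (round-robin score of $a$ against $b$), and $f_g(a)$ (fitness of $a$) are notation introduced here for illustration, and the implementation described in the Appendix may differ in its details.
\begin{align*}
    f_g(a) &= \sum_{b} p_g(b)\, M(a,b), \\
    p_{g+1}(a) &= \frac{p_g(a)\, f_g(a)}{\sum_{a'} p_g(a')\, f_g(a')}.
\end{align*}
Under this form, with positive scores, an algorithm's share grows exactly when its fitness exceeds the population average $\sum_{a'} p_g(a')\, f_g(a')$.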

Our set of competitors to LAFF consists of Bounded Memory (Bully, Forgiving Generalized Tit-for-Tat or FTFT), Follower (M-Qubed, Q-Learning, Fictitious Play), and expert (Manipulator, S++) algorithms. See Appendix for details and sources.
%added part
%\textbf{
We chose these algorithms because, first, they performed
well
in a repeated games tournament \citep{CO18},
and second,
they cover our opponent classes.
S++ and Manipulator do not fall cleanly into any of those classes, but they are the closest comparisons in previous literature to LAFF, since they
adapt to a variety of opponents by switching between Leader and Follower experts.
%}

%These were,
%The chosen algorithms in Table \ref{tab:algo_list} were,
%with one exception, top performers in a recent tournament study \citep{CO18}.
%We include FTFT as an example of a Bounded Memory algorithm that, in some games, can have $\mu_*^{(1)} > \mu_{\textsc{B},\epsilon}^{(1)}$ or $\mu_*^{(1)} > \mu_{\textsc{E},\epsilon}^{(1)}$. While rather exploitable, FTFT can avoid cycles of mutual punishment to which Leader strategies are prone, and was highly successful in a Prisoner's Dilemma tournament \citep{SPtournament}.
%Although Q-learning was outperformed by model-based RL in \citet{CO18}, we found the opposite trend in preliminary experiments, so we include the former.
To ensure sufficient diversity of test games, we choose games based on the taxonomy of Figure 1 in \citet{topology}. Six game families
are categorized by the structures of their Nash equilibria.
We use two games from each family, one with symmetric rewards and one with asymmetric rewards, except Cyclic, which has no symmetric games, for a total of 11 games (see Appendix).

\begin{table*}
\centering
\caption{Rewards of algorithm pairs, averaged over games and trials (pure learning equilibria are highlighted in bold text, and each algorithm's reward against its best response is in blue)}
\label{tab:emp_matrix}
\begin{tabular}{ccccccccc}
    \toprule
    & S++ & Manipulator & M-Qubed & Bully & Q-Learning & LAFF & FTFT & FP \\
    \midrule
    S++ & 0.75, 0.76 & 0.73, 0.80 & \textcolor{blue}{0.73}, 0.81 & 0.65, 0.77 & 0.82, 0.76 & 0.71, 0.80 & 0.70, 0.68 & 0.72, 0.55 \\
    Manipulator & 0.87, 0.68 & 0.76, 0.71 & 0.77, 0.65 & \textcolor{blue}{0.65}, 0.77 & 0.89, 0.67 & 0.70, 0.65 & 0.71, 0.60 & 0.76, 0.55 \\
    M-Qubed & 0.88, \textcolor{blue}{0.68} & 0.68, 0.68 & 0.80, 0.74 & \textcolor{blue}{0.65}, 0.80 & 0.79, 0.75 & 0.76, 0.73 & 0.78, 0.65 & 0.62, 0.56 \\
    Bully & 0.86, 0.61 & 0.83, \textcolor{blue}{0.60} & 0.85, \textcolor{blue}{0.61} & 0.48, 0.44 & \textbf{\textcolor{blue}{0.91}, \textcolor{blue}{0.63}} & 0.61, 0.49 & 0.72, 0.55 & 0.76, \textcolor{blue}{0.56} \\
    Q-Learning & 0.82, 0.77 & 0.73, 0.83 & 0.79,  0.67& \textbf{\textcolor{blue}{0.68}, \textcolor{blue}{0.85}} & 0.83, 0.74 & 0.71, 0.84 & 0.81, \textcolor{blue}{0.67} & 0.64, 0.56 \\
    LAFF & 0.87, 0.65 & 0.71, 0.66 & 0.74, 0.72 & 0.55, 0.61 & 0.90, 0.66 & \textbf{\textcolor{blue}{0.77}, \textcolor{blue}{0.74}} & 0.80, 0.70 & 0.75, 0.57 \\
    FTFT & 0.64, 0.70 & 0.49, 0.71 & 0.59, 0.76& 0.60, 0.71 & 0.59, 0.78 & \textcolor{blue}{0.61}, 0.78 & 0.80, 0.75 & 0.46, 0.72 \\
    FP & 0.70, 0.73 & \textcolor{blue}{0.66}, 0.74 & 0.66, 0.55 & 0.63, 0.73 & 0.69, 0.57 & 0.61, 0.71 & 0.71, 0.60 & 0.68, 0.55 \\
    \bottomrule
\end{tabular}
\end{table*}

%\subsection{Regret Bounds}

\paragraph{\textbf{Regret Bounds}} Figure \ref{fig:regrets} shows LAFF's regret, averaged over 50 trials, in games against an algorithm from each target class, and the regret of an exploitative Bounded Memory algorithm against LAFF.
We chose Manipulator as ``Adversarial'' because it does not play the EBS and is not a pure Leader or Follower.
However, in the symmetric Unfair game, the empirical rewards indicate that Manipulator attempts to exploit LAFF,
so LAFF punishes Manipulator at the expense of the Adversarial regret guarantee.
From the plot evaluating player 2's regret, we also exclude four games where player 2's Bully solution equals the EBS, since in these cases $\mu_*^{(1)} \geq V^{(1)}$
(player 1 is not exploited by playing the optimal policy).
In most games, LAFF's regret
%is eventually sublinear,
eventually plateaus,
while the exploitative player has linear regret, showing that LAFF is non-exploitable.
In three games, 
%however,
LAFF has linear regret against an Unconditional Follower and a non-exploitative Bounded Memory player. This may be due to the practical difficulty of choosing hyperparameters for the tests that decide when to switch to the next expert. These tests depend on unknown quantities, so for our experiments we tuned $\mathcal{B}(\tau)$ on a training set of four games that are not among the 11 games used for these results
(see Appendix).
Longer time horizons may be required for
%performance to fully match the bounds given in Theorem \ref{hedge}.
the conditions on $\slackexpl$ in Lemma \ref{conditional_experts} to hold.
We used a horizon of $T=2 \cdot 10^5$, which is on
approximately the same scale as experiments in other works on repeated games \citep{CG10, LS05, C14}.
%, and because S++, LAFF's main competitor, does relatively worse in longer horizons than this \citep{C14}.

\begin{figure}[ht]
    \centering
    %\includegraphics[width=7cm]{figures/rep_dyn.png}
    \includegraphics[width=7cm]{figures/rep.png}
    \caption{Replicator dynamic results, where the bold curves are average population shares and shaded regions are plus and minus one standard deviation.}
    \label{fig:rd_results}
    %\Description{Plot of the population share of each of the eight evaluated algorithms across 500 generations, with shaded error bands for each population share curve. Over the first 100 generations, the shares of LAFF and Manipulator both increase, while the others decrease. FTFT and FP in particular decline to 0 percent quickly. After about 100 generations, Manipulator declines as well while LAFF continues to increase, rising to about 100 percent by the 300th generation.}
\end{figure}


\paragraph{\textbf{Round Robin}} Table \ref{tab:emp_matrix} shows the average rewards of each algorithm pair across the 11 games and 50 trials,
which provide an empirical bimatrix for the \textit{learning game}, i.e., a meta-game in which users choose algorithms to deploy across different repeated games.
An algorithm's reward against its best response (highlighted in blue) measures how much it bullies when possible and avoids exploitation.
Both as player 1 and player 2, LAFF is second by this metric, behind Bully. We also highlight the pure strategy Nash equilibria of this learning game (in bold), noting that LAFF is in a learning equilibrium with itself. Unfortunately, the pairing in which Q-Learning follows Bully is also an equilibrium. Thus there is an equilibrium selection problem, e.g., both users might choose Bully and receive very low rewards. However, in practice it may be easier for users to coordinate on both using LAFF, because there is no conflict over choosing which side is the Leader (Bully) versus the Follower (Q-Learning).
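
To make the highlighting in Table \ref{tab:emp_matrix} concrete, write $(a_{ij}, b_{ij})$ for the entry in row $i$ and column $j$. This notation is introduced only for illustration, and it assumes that the first number in each cell is the row algorithm's reward. The column algorithm's best response to row $i$ is $j^*(i) \in \arg\max_j b_{ij}$, and the blue value in row $i$ is $a_{i j^*(i)}$; the blue value in each column is defined symmetrically. A pair $(i,j)$ is a pure learning equilibrium when
\begin{equation*}
    a_{ij} \geq a_{i'j} \ \text{ for all } i' \quad \text{and} \quad b_{ij} \geq b_{ij'} \ \text{ for all } j'.
\end{equation*}
For example, in the LAFF row the second entries are $0.65, 0.66, 0.72, 0.61, 0.66, 0.74, 0.70, 0.57$, maximized in the LAFF column at $0.74$. In the LAFF column the first entries are $0.71, 0.70, 0.76, 0.61, 0.71, 0.77, 0.61, 0.61$, maximized in the LAFF row at $0.77$. Neither side therefore gains by deviating from (LAFF, LAFF).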

%\adigi{interesting that Q pairs so well with itself, indeed a Pareto improvement over LAFF pair although it's of course less stable bc not an equilibrium; optimistic initialization a la M-Qubed really seems to help, even though in theory there's nothing guaranteeing that Q learners will cooperate with each other}

\paragraph{\textbf{Replicator Dynamic}}
On average over 1000 runs,
%of the replicator dynamic,
LAFF converges to 100\% of the population in the pool of algorithms (Figure \ref{fig:rd_results}), based on fitness computed as the \textit{minimum} of an algorithm's average rewards over the set of games as player 1 and as player 2. This metric matches the motivation
for the EBS: algorithm users will not know \textit{a priori} which of the two ``sides'' of the game they will be in. Thus, they may prefer their algorithm to cooperate with itself (maximize an egalitarian objective), instead of bullying its copy in hopes of being on the side of the bully.
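
One reading of this fitness computation can be sketched as follows, with notation introduced only for illustration and assuming that each role-specific average is weighted by the current population shares. Let $x_t(j)$ be the share of algorithm $j$ in generation $t$, and let $a_{ij}$ and $b_{ij}$ be the average rewards of players 1 and 2, respectively, when algorithm $i$ plays as player 1 against algorithm $j$ as player 2. The fitness of algorithm $i$ is then
\begin{equation*}
    f_t(i) = \min\Bigl\{ \sum_{j} x_t(j)\, a_{ij},\ \sum_{j} x_t(j)\, b_{ji} \Bigr\},
\end{equation*}
where the first term is the expected reward of $i$ as player 1 and the second is its expected reward as player 2. Under this form, an algorithm that does well only on one side of the game receives low fitness.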

%\adigi{how sensitive are evo results to initial population? Should we expect this diversity of algos, or just a few niches? Unsure if we want to use space on this point}

\section{Discussion}

When choosing algorithms for multi-agent interactions, users
must trade off robustness to the variety
of algorithms they might face against the need to avoid giving other users incentives to exploit them \citep{SLRC21}.
We have presented an algorithm for repeated games that balances these desiderata.
Both properties can facilitate cooperation between learning agents, while still allowing them to accept generous offers.
If LAFF faces an agent who ``follows'' fair, Pareto efficient bargaining proposals, the Egalitarian Leader leads them to a mutual benefit over their security values.
If the other agent's fairness standard is different, the Conditional Follower can follow this alternative proposal using RL if it is not exploitative;
otherwise, the exploitation penalty encourages the other player to be more cooperative.
Against exploitable agents, the Bully Leader can benefit from a more self-interested bargain.
Finally, if the other player is unwilling to cooperate at all but is not exploitative, Conditional Maximin ensures safety. In future work, more experts can be added
%to the schedule
based on agent classes that we have neglected. For example, while LAFF includes Leader experts only for the extreme cases in which player 2 has a high or minimal fairness standard, one could add Leaders for other bargaining solutions.
%, preserving the descending order of values in the expert schedule.

The biggest limitations of our approach are the restrictive assumptions required for our non-exploitability criterion, and the strictness of this criterion. The margin $\slackexpl$ is small only for sufficiently large time horizons,
hence the linear regret in some of our experiments. Though LAFF successfully punishes players against whom it receives less than fair rewards, this is only strategically necessary when such players \textit{benefit} from playing this way (genuine ``exploitation'').
%To only punish in that case, one would need to show that an exploitative player's empirical rewards would not coincidentally drop low enough to avoid detection by this modified hypothesis test.
It may not be practically necessary
to modify the experts to not punish
when the opponent also does worse,
because an opponent would not have an incentive to lead with a Pareto inefficient policy.
%added part
%\textbf{
Finally, we note that 
our approach is not intended to provide the optimal balance of the adaptability-exploitability tradeoff;
in particular, keeping a fixed fairness threshold
may not be ideal if it
prevents an algorithm from
cooperating with algorithms
that follow other intuitively ``fair'' standards \citep{SLRC21}.
%}
%such that \textit{both} players receive worse rewards than the EBS.

\begin{contributions} % will be removed in pdf for initial submission,
                      % so you can already fill it to test with the
                      % ‘accepted’ class option
    Both authors conceived and carried out the research project jointly. A.D.~wrote the paper and code for
    numerical experiments. A.T.~helped edit the paper.
\end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    A.D.~acknowledges the support of a grant
    from the Center on Long-Term Risk Fund.
\end{acknowledgements}

\bibliography{digiovanni_111}

\end{document}
