
%\PassOptionsToPackage{colorlinks=true,allcolors=black}{hyperref}
\PassOptionsToPackage{colorlinks=true,allcolors=gray}{hyperref}

%\documentclass{uai2023} % for initial submission
 \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

\usepackage{multicol}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{xspace}
\usepackage{siunitx}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)
\usepackage[noend]{algpseudocode} % [OLIVIER] probably conflicts with algorithm2e
\usepackage[ruled,vlined,linesnumbered]{algorithm2e}
\newcommand\mycommfont[1]{\scriptsize\ttfamily\textcolor{blue}{#1}}
\SetCommentSty{mycommfont}
\SetKwProg{Fct}{Fct}{}{}
%% Self-defined macros
\usepackage{amsfonts}
\usepackage{amsthm,mathtools}
\allowdisplaybreaks
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{proposition}{Proposition}
%\newtheorem{proof}{Proof}
\newtheorem{definition}{Definition}
\newtheorem{example}{Example}
\newtheorem{problem}{Problem}
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\newcommand{\eqdef}     {\stackrel{{\textrm{\rm\tiny def}}}{=}}
\newcommand{\qdef}     {\stackrel{{\textrm{\rm\tiny ?}}}{=}}
\def\ie{{\em i.e.}\xspace}
\def\eg{{\em e.g.}\xspace}
\def\cf{{\em cf.}\xspace}
\def\wrt{{w.r.t.}\xspace}
\def\reals{{\mathbb R}}

% for tables 
\def\InfJesp{IJ}
\def\Rand{R}
\newcommand{\stdv}[1]{{\scriptstyle \pm #1}}
%%%

\def\cS{{\cal S}}
\def\cA{{\cal A}}
\def\cB{{\cal B}}
\def\cI{{\cal I}}
\def\cZ{{\cal Z}}
\def\cE{{\cal E}}
\def\va{{\mathbf a}}
\def\vo{{\mathbf o}}
\def\vn{{\mathbf n}}
\def\nNI{\#ni} % useful def ?
\def\fsc{\mathit{fsc}} % useful def ?
\def\FSC{\mathcal{FSC}} % useful def ?
\newcommand{\infJESP}[1][]{Inf-JESP{#1}\xspace}
\usepackage[noabbrev]{cleveref}
\DeclareRobustCommand{\abbrevcrefs}{%
\Crefname{appendix}{App.}{Apps.}%
\Crefname{section}{Sec.}{Secs.}%
\Crefname{equation}{Eq.}{Eqs.}%
\Crefname{figure}{Fig.}{Figs.}%
\Crefname{algorithm}{Alg.}{Algs.}%
\Crefname{tabular}{Tab.}{Tabs.}%
\Crefname{lemma}{Lem.}{Lems.}%
\Crefname{corollary}{Cor.}{Cors.}%
\Crefname{theorem}{Thm.}{Thms.}%
\Crefname{proposition}{Prop.}{Props.}%
\Crefname{line}{L.}{Ls.}%
%\Crefname{postulate}{Post.}{Posts.}%
%
\crefname{appendix}{app.}{apps.}%
\crefname{section}{sec.}{secs.}%
\crefname{equation}{eq.}{eqs.}%
\crefname{figure}{fig.}{figs.}%
\crefname{algorithm}{alg.}{algs.}%
\crefname{tabular}{tab.}{tabs.}%
\crefname{lemma}{lem.}{lems.}%
\crefname{corollary}{cor.}{cors.}%
\crefname{theorem}{thm.}{thms.}%
\crefname{proposition}{prop.}{props.}%
\crefname{line}{l.}{ls.}%
%\crefname{postulate}{post.}{posts.}%
}
\DeclareRobustCommand{\cshref}[1]{{\abbrevcrefs\cref{#1}}}
\DeclareRobustCommand{\Cshref}[1]{{\abbrevcrefs\Cref{#1}}}
\mathchardef\mhyphen="2D
\crefalias{AlgoLine}{line}%
\newcommand\linesrefAnd[2]{lines~\ref{#1} and \ref{#2}}
\newcommand\linesref[2]{lines~\ref{#1}--\ref{#2}}
\newcommand\Linesref[2]{Lines~\ref{#1}--\ref{#2}}

\usepackage[normalem]{ulem}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Commandes
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{xcolor}
\newcommand{\Vincent}[1]{\textcolor{blue}{\textbf{[VT]}#1\textbf{[/VT]}}}
\newcommand{\vincent}[1]{\Vincent{#1}}
\newcommand{\vincentReplace}[2]{\sout{#1} \Vincent{"#2"}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\Olivier}[1]{\textcolor{teal}{\textbf{[ob]} \em #1}}
\newcommand{\olivierlight}[1]{\textcolor{teal}{#1}}
\newcommand{\olivier}[1]{\Olivier{#1}}
\newcommand{\olivierReplace}[2]{\sout{#1} \Olivier{#2}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\Francis}[1]{\textcolor{orange}{\textbf{[fc]} #1 \textbf{[/fc]}}}
\newcommand{\francislight}[1]{\textcolor{orange}{#1}}
\newcommand{\francis}[1]{\Francis{#1}}
\newcommand{\francisReplace}[2]{\sout{#1} \Francis{#2}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Yang
\newcommand{\Yang}[1]{\textcolor{brown}{\textbf{[YY]}#1\textbf{[/YY]}}}
\newcommand{\yang}[1]{\Yang{#1}}
\newcommand{\YangReplace}[2]{\sout{#1} \Yang{"#2"}}

\renewcommand{\topfraction}{0.98}	% max fraction of floats at top
\renewcommand{\bottomfraction}{0.97}	% max fraction of floats at bottom
% Parameters for TEXT pages (not float pages):
\setcounter{topnumber}{2}
\setcounter{bottomnumber}{2}
\setcounter{totalnumber}{4}     % 2 may work better
\setcounter{dbltopnumber}{2}    % for 2-column pages
\renewcommand{\dbltopfraction}{0.97}	% fit big float above 2-col. text
\renewcommand{\textfraction}{0.01}	% allow minimal text w. figs
% Parameters for FLOAT pages (not text pages):
\renewcommand{\floatpagefraction}{0.97}	% require fuller float pages
% N.B.: floatpagefraction MUST be less than topfraction !!
\renewcommand{\dblfloatpagefraction}{0.97}	% require fuller float pages


\title{  Monte-Carlo Search for an Equilibrium in Dec-POMDPs}
% \sout{Solving Large and Infinite-Horizon Dec-POMDPs}
%   \\
%   \olivier{Simulation-based Dec-POMDP Planning (?) \\
%   Monte-Carlo Search for an Equilibrium in Dec-POMDPs (?) [\dots]}}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors

\author[1]{Yang You}
\author[1]{Vincent Thomas}
\author[1]{Francis Colas}
\author[1]{Olivier Buffet}
% Add affiliations after the authors
\affil[1]{%
  Université de Lorraine, INRIA, CNRS, LORIA,  
  Nancy, France
}

%\affil[2]{%
 %   Second Affiliation\\
 %   Address\\
 %   …
%}
%\affil[3]{%
 %   Another Affiliation\\
  %  Address\\
 %   …
 % }

  \begin{document}
\maketitle

% \begin{abstract}
%   In this paper, we present a new algorithm for solving Dec-POMDPs called MC-JESP.
%   %
%   MC-JESP is a Monte Carlo extension of the previous work \infJESP which solves infinite-horizon Dec-POMDPs.
%   %
%   MC-JESP needs only a generative model of the Dec-POMDP compared with \infJESP, thus it enables to scale up to large problems.
%   %
%   Experiment with benchmarks shows that MC-JESP is competitive compared with exisiting algorithms for Dec-POMDPs, even better than many offline methods using explicit models.
% \end{abstract}

\begin{abstract}
  Decentralized partially observable Markov decision processes (Dec-POMDPs) formalize the problem of designing individual controllers for a group of collaborative agents under stochastic dynamics and partial observability.
  %
  Seeking a global optimum is difficult (NEXP-complete), but seeking a Nash equilibrium (each agent's policy being a best response to the other agents' policies) is more accessible, and has allowed addressing infinite-horizon problems with solutions in the form of finite-state controllers (FSCs).
  %
  In this paper, we show that this approach can be adapted to cases where only a generative model (a simulator) of the Dec-POMDP is available.
  %
  This requires relying on a simulation-based POMDP solver to construct an agent's FSC node by node.
  %
  A related process is used to heuristically derive initial FSCs. % that help converge to better Nash equilibria.
  %
  Experiments on benchmarks show that the resulting algorithm, MC-JESP, is competitive with existing Dec-POMDP solvers, and even outperforms many offline methods that use explicit models.
\end{abstract}

\section{Introduction}\label{sec:intro}


% \olivier{Here are some notes to try to better motivate the approach (i.e., how I would present things, knowing that a number of points are already mentioned), underlining what may currently be missing:
%   \begin{itemize}
%   \item \uline{We would like to solve infinite-horizon Dec-POMDPs using a domain simulator.}
%   \item In a number of domains, searching for Nash equilibria is \uline{a good means to find near-globally optimal solutions} \cite{JESP,InfJESP}, while ``only'' relying on solving POMDPs.
%   \item This has been achieved for infinite-horizon problems using FSCs, which are a convenient representation for never-ending (/infinite-horizon) policies, but \uline{also can make for interpretable policies if their size is limited/reasonable}  \cite{Amato2007UAI,KumMosZil-icaps16,InfJESP,???}.
%   \item Here, we show how to derive FSC solutions by searching for Nash equilibria using simulation-based planning methods.
%   \item This requires in particular being able to build an agent's FSC using a Monte-Carlo POMDP planner.
%   \end{itemize}
%   }

%\sout{The decentralized Partially Observable Markov Decision Process (Dec-POMDP) is a standard framework for collaborative tasks with multiple agents.}
%
The framework of Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) allows modeling collaborative multi-agent systems, the objective being to equip the agents with individual policies that maximize some common performance criterion.
%
% In Dec-POMDPs, each agent only receives local observations, and the goal is to derive a joint policy for all agents that maximize the expected return.
%
However, solving Dec-POMDPs is challenging since the environment evolves according to all agents' actions,
%
and each agent selects its action based only on its local action-observation history.
%
To ensure finding a global optimum, one thus needs to reason about all individual policies together.
%
As a consequence of this interdependency, solving even a finite-horizon Dec-POMDP has been proven to be NEXP-complete \citep{Bernstein02}, and solving an infinite-horizon Dec-POMDP is undecidable \citep{MADANI20035, 10.5555/2967142}.
%

\citet{JESP} propose an alternative approach called JESP (Joint Equilibrium-Based Search for Policies), which avoids this interdependency in the solving process by searching for a Nash equilibrium, \ie, a joint policy in which each agent's policy is a best response to the other agents' policies.
%
JESP iterates over the agents.
%
In each iteration, it builds agent $i$'s best-response policy while keeping the other agents' policies fixed.
%
%\olivier{Don't start all your sentence with a connective. Move connectives around. $\to$}
%
A Nash equilibrium is therefore reached when no further improvement is possible.
%
In JESP, each agent's policy is represented as a tree, which restricts its use to finite-horizon problems.
%
To overcome this limitation, \infJESP (Infinite-Horizon JESP) \citep{InfJESP} extends JESP to infinite-horizon Dec-POMDPs by representing each agent policy as a finite-state controller (FSC).
%
Two advantages of \infJESP are that
\begin{enumerate*}
\item it often achieves near-global optima despite only searching for local ones, and
\item its FSCs make for interpretable policies if their size is reasonable.
\end{enumerate*}
%
However, both methods require an explicit Dec-POMDP model which details the exact environment dynamics.
%

In this paper, we propose a new algorithm called MC-JESP (Monte-Carlo Joint Equilibrium-based Search for Policies),
% which corresponds to the third type of method that finds Nash equilibrium solutions.
%
which is a simulation-based version of \infJESP.
%
%\sout{It computes each agent's best-response FSC considering other agents' fixed policies and iterates until convergence.}
%\olivier{$\gets$ This is nothing new.}
%
In each iteration, MC-JESP builds an agent's FSC node by node using a Monte-Carlo (POMDP) planner relying on a black-box Dec-POMDP simulator, along with the other agents' FSCs.
%
Experiments show that MC-JESP is competitive with state-of-the-art infinite-horizon Dec-POMDP solvers based either on exact or generative models.
%even compared with FB-HSVI \citep{DibAmaBufCha-jair16} (which gives $\epsilon$-optimal solutions using explicit models).
%

The paper is organized as follows.
%
\Cshref{sec:related_work} discusses related work on solving Dec-POMDPs.
%
\Cshref{sec:background} gives background about Dec-POMDPs, POMDPs, FSCs, and \infJESP.
%
Our contributions are presented in \Cshref{sec:MCJESP}, and experiments with comparisons to state-of-the-art Dec-POMDP solvers in \Cshref{sec:exp}.
%
Finally, we conclude this work in \Cshref{sec:conclusion} and discuss future perspectives.



\section{Related Work}
\label{sec:related_work}

Recently, there has been significant progress in infinite-horizon Dec-POMDP planning, and state-of-the-art methods fall into three main types.
%
The first type of methods estimates the best parameters of finite-state controllers (FSCs) of each agent \citep{AmaBerZil-jaamas10}, and addresses Dec-POMDPs as an inference problem via
Expectation-Maximization methods \citep{PajPel-ijcai11,PajPel-nips11, kumar2012, KumZilTou-jair15}.
%
The second type consists in transforming the Dec-POMDP problem into a Markov decision process with a state space of sufficient statistics \citep{MacIsb-nips13, dibangoye2014error, DibAmaBufCha-jair16}.
%
The third type searches for Nash equilibrium solutions, \ie, each agent's policy being a best response to the other agents' policies \citep{JESP,BerHanZil-ijcai05, InfJESP}.
%
% \vincent{added Nair et al.}

However, for large problems or real applications,  it may be challenging to represent the system's dynamics explicitly.
%
Often, only a black-box simulator (also called a generative model) is available.
%
%Therefore, the planning methods mentioned previously cannot be directly applied since an explicit Dec-POMDP model is required.
%
Although the previously mentioned algorithms, which require explicit models, cannot be applied directly, most state-of-the-art simulation-based methods are inspired by them.
%
For example,
%
\citet{Wu2013} propose using Monte-Carlo Expectation Maximization (MCEM) to estimate the parameters of agents' FSCs with generative models.
%
\citet{Liu2015} improve MCEM by constructing agent FSCs using the stick-breaking prior and allowing a variable FSC size.
%\olivier{You can talk about the ``number of nodes of the FSC'' or the ``size of FSC (/the FSC size)'', but {\bf not} the ``FSC node size''.}
%
% \vincent{stick-breaking ? to be verified. $\to$ checked, seems ok !}
%
On the other hand, similar to FB-HSVI \citep{DibAmaBufCha-jair16} (which uses explicit models), the simulation-based method oSARSA \citep{pmlr-v80-dibangoye18a} focuses on recasting Dec-POMDPs into occupancy-state MDPs, where each occupancy state is a sufficient statistic.



Last but not least, some multi-agent reinforcement learning (MARL) algorithms also address Dec-POMDPs with black-box simulators.
%
However, most of them \citep{VDN2017, rashid2018qmix, son2019qtran, rashid2020weighted} %\sout{avoided testing on well-known Dec-POMDP benchmarks \mbox{\citep{Seuken2007ImprovedMD, amato2009achieving}} in order to compare with the existing optimal or sub-optimal planning methods.}
have not been evaluated on classical Dec-POMDP benchmarks \citep{Seuken2007ImprovedMD, amato2009achieving}.
%
Only a few MARL algorithms conducted experiments on such domains but were limited to finite-horizon settings \citep{lee2019improved}, or failed to obtain state-of-the-art results \citep{KRAEMER201682}.
%


\section{Background}
\label{sec:background}

\subsection{Dec-POMDP}


%\olivier{One sentence to introduce Dec-POMDPs (what they allow modeling).}

The problem of finding optimal collaborative behaviors for a group of
agents under stochastic dynamics and partial observability is
typically formalized as a {\em decentralized partially observable
  Markov decision process} (Dec-POMDP).


\begin{definition}
  A {\em Dec-POMDP} with $|\cI|$ agents is represented as a tuple $M \equiv \langle \cI, \cS, \cA, \Omega, T, O, R, b_0, H, \gamma \rangle$, where:
  % \begin{itemize}
  % \item
  $\cI = \{1, \dots, |\cI|\}$ is a finite set of {\em agents};
  % \item
  $\cS$ is a finite set of {\em states};
  % \item
  $\cA = \bigtimes_i \cA^i$ is the finite set of joint actions, with %
  $\cA^i$ the set of agent $i$'s {\em actions}; %
  % \item
  $\Omega = \bigtimes_i \Omega^i$ is the finite set of joint
  observations, with %
  $\Omega^i$ the set of agent $i$'s {\em observations}; %
  % \item
  $T: \cS \times \cA \times \cS \to \reals$ is the {\em transition
    function}, with %
  $T(s,\va,s')$ the probability of transitioning from $s$ to $s'$ if $\va$ is
  performed;
  % \item
  $O: \cA \times \cS \times \Omega \to \reals$ is the
  {\em observation function}, with %
  $O(\va,s',\vo)$ the probability of observing $\vo$ if $\va$ is performed and
  the next state is $s'$;
  % \item
  $R: \cS \times \cA \to \mathbb{R}$ is the {\em reward function},
  with %
  $R(s,\va)$ the immediate reward for executing $\va$ in $s$;
  % \item
  $b_0$ is the {\em initial probability distribution} over states;
  % \item
  $H \in \mathbb{N} \cup \{\infty\}$ is the (possibly infinite) {\em time horizon};
  % \item
  $\gamma \in [0,1)$ is the {\em discount factor} applied to future rewards.
  % \end{itemize}
\end{definition}

Agent $i$'s {\em policy} $\pi^i$ maps its possible
action-observation histories to actions.
%
The objective is then to find a joint policy
$\pi=\langle \pi^1, \dots, \pi^{|\cI|} \rangle$ that maximizes the expected discounted return from $b_0$:
\begin{align*}
  V^{\pi}_H(b_0)
  & \eqdef \mathbb{E}\left[ \sum_{t=0}^{H-1} \gamma^{t} R(S_t, A_t) \mid S_0 \sim b_0, \pi \right].
\end{align*}
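%
For intuition, note that, $\cS$ and $\cA$ being finite, rewards are bounded in absolute value by $R_{\max} \eqdef \max_{s,\va} |R(s,\va)|$ (a symbol introduced here only for illustration).
%
Since $\gamma \in [0,1)$, the criterion thus remains finite even when $H=\infty$:
\begin{align*}
  \left| V^{\pi}_\infty(b_0) \right|
  & \leq \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  = \frac{R_{\max}}{1-\gamma}.
\end{align*}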

However, we often do not know the exact transition, observation, and reward functions for large problems or real-world applications,
%
% In this case, Dec-POMDP solvers usually rely on a generative model (black-box simulator) $G$:
% %
%  \begin{align*}
% 	s',\vo, r \gets G(s, \va).
% \end{align*}
% %
% The inputs of $G$ are the current state $s$ and joint action $\va$ of all agents.
% %
% Then, it outputs the next state $s'$, the observations $\vo$ for all agents, and the instant reward $r$.
% %
% %
but may rely on a generative model (black-box simulator) $G$, which, given a state-action pair $\langle s,\va \rangle$, samples a triplet $\langle s',\vo,r \rangle$.
%

\subsection{POMDP}
%
%In this work, we solve Dec-POMDPs by iteratively computing each agent's best-response policy given the other agents' fixed policies,
%
%what corresponds to solving a single-agent partially observable Markov decision problem (POMDP) in each iteration, \ie, the particular case of a single-agent Dec-POMDP ($\cI=\{1\}$), which will allow us dropping the agent index $i$.
%
%
In this work, we will consider one agent $i$ at a time,
%
% \vincent{added 'at a time'}
%
and thus end up solving a single-agent partially observable Markov decision problem (POMDP) in each iteration, \ie, the particular case of a single-agent Dec-POMDP ($\cI=\{1\}$).
% , which will allow us dropping the agent index $i$.
%
In a POMDP, an optimal policy $\pi^*$ exists whose input is the belief state
$b$, \ie, the probability distribution over states given the current action-observation history.
%
For finite $H$, the optimal value function (which allows deriving
$\pi^*$) is recursively defined as:
\begin{align*}
  V^*_h(b)
  & = \max_a \left[r(b, a) + \gamma \sum_{o} Pr(o \mid b, a) V^*_{h-1}(b^{a,o}) \right],
\end{align*}
where %
(i) $r(b,a) = \sum_s b(s)\cdot r(s,a)$; %
(ii) $Pr(o \mid b, a)$ depends on the dynamics; and %
(iii) $b^{a,o}$ is the belief updated upon performing $a$ and
perceiving $o$.
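%
As a reminder, and assuming single-agent transition and observation functions $T$ and $O$ with the same argument order as in the Dec-POMDP definition, the last two quantities can be sketched through the usual Bayesian update:
\begin{align*}
  Pr(o \mid b, a)
  & = \sum_{s'} O(a,s',o) \sum_{s} T(s,a,s') \, b(s), \\
  b^{a,o}(s')
  & = \frac{O(a,s',o) \sum_{s} T(s,a,s') \, b(s)}{Pr(o \mid b, a)},
\end{align*}
the latter being defined only when $Pr(o \mid b, a) > 0$.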
%
% \sout{
% For finite $H$, $V^*_H$ is known to be piece-wise linear and convex
% (PWLC) in $b$.
% %
% For infinite $H$, $V^* (=V^*_\infty)$ can be
% approximated by an upper envelope of hyperplanes---called
% $\alpha$-vectors $\alpha \in \Gamma$.
% }
% \olivier{$\gets$ Do we really need to talk about $\alpha$-vectors in this paper?!
%   See next subsection.}
%

\subsection{Finite State Controllers}
\label{sec:FSC_def}



In POMDPs as in Dec-POMDPs, solution policies can also be sought in
the form of {\em finite state controllers} (FSC) (also called {\em
  policy graphs} \citep{MeuKimKaeCas-uai99}), \ie, automata whose
transitions from one internal state to the next depend on the received
observations and generate the actions to be performed.
%

\begin{definition}
  For some POMDP sets $\cA$ and $\Omega$, %
  a (deterministic) {\em FSC} is specified by a tuple
  $\fsc \equiv \langle N, \eta, \psi \rangle$, where:
  \begin{itemize}
  \item $N$ is a finite set of nodes, %
    with $n_0$ the start node; %
  % \item \sout{$\eta: N \times \Omega \times N \to \reals$ is the node transition function; %
  %   $\eta(n,o,n')$ is the probability of moving from node $n$ to $n'$ if $o'$ is observed; %
  %   the notation $n'=\eta(n,o)$ is also used when this transition is deterministic;}
  \item $\eta: N \times \Omega \to N $ is the node transition function; %
    $n'=\eta(n,o)$ is the next node after observing $o$ in node $n$;
  % \item \sout{$\psi: N \times \cA \to \reals$ is the action-selection function of the FSC; %
  %   $\psi(n,a)$ is the probability to choose action $a \in \cA$ in node $n$; %
  %   the notation $a=\psi(n)$ is also used when this function is deterministic.}
  \item $\psi: N \to \cA $ is the action-selection function of the FSC; %
    $a=\psi(n)$ is the action triggered when in node $n$.
  \end{itemize}
\end{definition}
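
To illustrate this definition, executing a deterministic FSC from its start node can be sketched as the following recursion, where $n_t$, $a_t$, and $o_{t+1}$ (time-indexed symbols used here only for illustration) denote the node occupied, the action performed, and the observation then received at time step $t$:
\begin{align*}
  a_t & = \psi(n_t), &
  n_{t+1} & = \eta(n_t, o_{t+1}), &
  t & = 0, 1, 2, \dots
\end{align*}
The agent thus only needs to memorize its current node rather than its full action-observation history.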

% \olivier{The reminder of this subsection allows explaining (later on) how we can accurately evaluate our solution FSCs using the exact model (which we have in the experiments).
%   %
%   I would remove this part, as well as anything about $\alpha$-vectors in sections talking about contributions, but explain in the experiments that, because we know the exact models of our benchmark problems, we can return exact evaluations of our solution FSCs.
% }

% \sout{
% An FSC's value function is the solution of the
% following system of linear equations, with one $\alpha$-vector per
% node $n$ \mbox{\citep{Hansen-nips97}}:
% }
% %
% \begin{align*}
%   \label{eqn:PolicyEvaluation}
%   \alpha_{s}^{n} & = R(s, a_n)+\gamma \sum_{s', o}T(s, a_n, s')O(a_n,s',o) \alpha_{s'}^{\eta(n, o)},
% \end{align*}
% %
% \sout{
% where $a_n\eqdef \psi(n)$.
% %
% A value estimation solution can be found using the fixed point theorem,
% %
% \ie, an iterative process that stopped when the Bellman residual (the
% largest change in value) is less than a threshold $\epsilon$, so that
% the estimation error is less than $ \epsilon \over {1-\gamma}$.
% }

\subsection{Solving Dec-POMDPs by finding Nash Equilibria (Infinite-Horizon JESP)}
\label{sec:infJESP}

%This work aims to solve Dec-POMDPs by searching for Nash equilibrium solutions.
%
%Specifically, we extend the JESP (Joint Equilibrium based Search for Policies) scheme \citep{JESP} to infinite-horizon and generative model settings.
%%
%In JESP, each agent's policy is represented using a policy tree, and JESP iterates over each agent and tries to improve agent $i$'s policy, considering that other agents' policies are fixed.
%
%Therefore, a Nash equilibrium solution is obtained when no possible improvement can be made using JESP.
%
%However, JESP is limited to solving finite-horizon Dec-POMDPs since it uses policy tree representations.
%
%To that end, \infJESP (Infinite-Horizon JESP) \citep{InfJESP} extends JESP to infinite horizons by using finite-state controller (FSC) representations for each agent's policy.
\infJESP (Infinite-Horizon JESP) \citep{InfJESP} is an infinite-horizon Dec-POMDP solver, which is based on \citeauthor{JESP}'s JESP [\citeyear{JESP}], but replaces the policy tree representation by a finite-state controller (FSC) for each agent's policy.
%
This modification allows solving infinite-horizon problems rather than finite-horizon ones, and %
may help scale up to larger problems.
%
%\sout{More specifically, in \infJESP, each iteration of optimization will formalize a best-response POMDP for agent $i$ with an extended state space $e^t \in \cE$, \ie, containing:}
%
More specifically, in \infJESP (\Cref{alg:JESP_main}), each iteration derives (\cref{codes:BuildModel_infJESP}) the explicit model of a (best-response) POMDP for agent $i$ by fixing the other agents' FSCs (index ``$-i$'') and using an extended state space $e^t \in \cE$, \ie, containing:
%
\begin{itemize}
 \item $s^t$, the current state of the Dec-POMDP problem,
 \item $\vn_{-i}^{t} \equiv \langle n_j^{t} \rangle_{j \neq i}$, the current nodes of other agents, and
 \item $\tilde o_{i}^{t} $, agent $i$'s current observation.
\end{itemize}
%
Denoting $\eta_{-i}(\vn_{-i}^{t}, \vo^{t+1}_{-i}) = \langle \eta_j(n_j^{t}, o_j^{t+1}) \rangle_{j \neq i} $ and $\psi_{-i}(\vn_{-i}^{t}) = \langle \psi_j(n_j^{t}) \rangle_{j \neq i} $,
this leads to a valid POMDP with the following dynamics:\footnote{Note: \citet{InfJESP} provide formulas for stochastic FSCs.}
%
\begin{align*}
  &  T_e(e^t, a^t_i, e^{t+1})  = Pr(e^{t+1} \mid e^t, a^t_i) \\
  & \quad = \sum_{\vo^{t+1}_{-i}}T(s^t, \langle  \psi_{-i}(\vn_{-i}^{t}), a^t_{i} \rangle, s^{t+1})
  \cdot \mathbf{1}_{ \vn_{-i}^{t+1} = \eta_{-i}(\vn_{-i}^{t}, \vo^{t+1}_{-i}) } \\
  & \qquad \cdot O(\langle \psi_{-i}(\vn_{-i}^{t}), a^t_{i} \rangle, s^{t+1}, \langle \vo^{t+1}_{-i}, \tilde o^{t+1}_{i} \rangle),
  \\ %
  & O_e(a^t_i, e^{t+1}, o^{t+1}_i)
  = Pr( o^{t+1}_i \mid a^t_i, e^{t+1}) \\
  & \quad = Pr( o^{t+1}_i \mid a^t_i, \langle s^{t+1}, \vn^{t+1}_{-i}, \tilde o^{t+1}_i \rangle) %
  = \mathbf{1}_{o^{t+1}_i = \tilde  o^{t+1}_i},
  \\
  & r_e(e^t, a^t_i) =  R(s^t, \langle \psi_{-i}(\vn_{-i}^{t}), a^t_{i} \rangle).
\end{align*}
%
%
Then, \infJESP solves this explicit POMDP for agent $i$ using an $\epsilon$-optimal offline POMDP solver (SARSOP \citep{sarsop}) and derives an FSC $\fsc'_i$ that approximates the solution policy (\cf \cref{codes:getFSC}, which does not distinguish between these two steps).
%
$\fsc_i'$ is then evaluated (\cref{codes:PolicyEval_infJESP}) and retained only if it improves on $i$'s previous FSC, $\fsc_i$, so that \infJESP stops when an approximate Nash equilibrium is obtained, which is detected using a counter $\nNI$.
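%
As an illustrative sketch of such an evaluation (not necessarily the exact procedure used), assume a deterministic FSC $\fsc'_i = \langle N_i, \eta_i, \psi_i \rangle$ with start node $n_0$, and write $V(e,n)$ (a notation introduced here) for the value of being in extended state $e$ while $i$'s controller is in node $n$.
%
Because $O_e$ is deterministic, these values satisfy the linear system
\begin{align*}
  V(e,n)
  & = r_e(e, \psi_i(n)) + \gamma \sum_{e'} T_e(e, \psi_i(n), e') \, V\!\left(e', \eta_i(n, \tilde o'_i)\right),
\end{align*}
where $\tilde o'_i$ is the observation component of $e'$, and the FSC's value from the best-response POMDP's initial belief $b^0_{BR}$ is $v = \sum_{e} b^0_{BR}(e) \, V(e, n_0)$.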


\begin{algorithm}
  \caption{Inf-JESP's Local Search}
  \label{alg:JESP_main}
  %
  \DontPrintSemicolon
  %
  \SetInd{.3em}{.6em}
 %\scalefont{.9}

  \SetKwFunction{LocalSearch}{{\bf LocalSearch}}
  \SetKwFunction{ComputeFSC}{{\bf ComputeFSC}}
  \SetKwFunction{getBRpomdpModel}{{\bf getBRpomdp}}
  \SetKwFunction{Eval}{{\bf Eval}}

  [Input:] %
  $b^0$: initial belief $\mid$ %
  $M$: Dec-POMDP model $\mid$ %
  \linebreak %
  $\fsc$: initial FSCs \; %
   % $\epsilon$: error gap $\mid$ %
  \Fct{\LocalSearch{$b^0, M, \fsc$} }{
    $v_{bestL} \gets \Eval(\fsc, b^0, M)$ \;
    $\nNI \gets 0$   \tcp*[h]{\#(iterations w/o improvement)} \;
    $i\gets 1$  \tcp*[h]{Id of current agent} \;
    \Repeat( \tcp*[h]{Cycle over agents} ){
      $\nNI=|\cI|$ \tcp*[h]{No improvement in last cycle.}
    }{
      %\uwave{$G_{BR}, b^0_{BR} \gets \BuildBRGenerativeModel(G, b^0, \fsc_{-i})$} \label{codes:BuildExtendedG_MCJESP} \;
      %\uwave{$\fsc'_i \gets $\ComputeFSC{$b^0_{BR}$ $\mid$ $G_{BR}$, $K$ }} \label{codes:learnFSC} \;
      %\uwave{$v \gets \EvalSim(\fsc'_i,  b^0_{BR} \mid  G_{BR})$} \label{codes:PolicyEval_MCJESP} \;
      $b^0_{BR}, M_{BR} \gets \getBRpomdpModel(b^0, M, \fsc_{-i})$ \label{codes:BuildModel_infJESP} \;
      $\fsc'_i \gets $\ComputeFSC{$b^0_{BR}, M_{BR}$ } \label{codes:getFSC} \;
      $v \gets \Eval(\fsc'_i,  b^0_{BR},  M_{BR})$ \label{codes:PolicyEval_infJESP} \;
      \uIf(\tcp*[h]{Keep new FSC if better}){$v > v_{bestL}$}{
        $\fsc_i \gets \fsc'_i$\;
        $v_{bestL} \gets v$\;
        $\nNI \gets 0$\;
      }
      \Else(\tcp*[h]{increment $\nNI$}){
        $\nNI \gets \nNI+1$\;
      }
      % $i \gets (i+1) \mod |\cI|$\;
      $i \gets (i \mod |\cI|) + 1$\;
    }%({\scriptsize \hfill \tt{// looped over agents w/o improving}})
    \Return{$\langle \fsc, v_{bestL} \rangle$}
  }
\end{algorithm}

In practice (see \citet{InfJESP} or \Cshref{sec:results}), this search for an equilibrium often finds near-global optima when combined either with random restarts or with initial FSCs obtained by relaxing the Dec-POMDP.
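%
For reference, the targeted equilibrium can be stated as follows, writing $\pi^{-i}$ (a notation introduced here) for the policies of all agents but $i$: a joint policy $\pi$ is a Nash equilibrium if no agent can improve the common value by deviating unilaterally, \ie,
\begin{align*}
  \forall i \in \cI, \ \forall \pi'^{i}, \quad
  V^{\langle \pi'^{i}, \pi^{-i} \rangle}_H(b_0)
  & \leq V^{\pi}_H(b_0).
\end{align*}
Since all agents share the same reward, any globally optimal joint policy satisfies this condition, whereas the converse does not hold in general, which is why restarts or informed initializations matter.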
%


\section{Monte-Carlo JESP}
\label{sec:MCJESP}

%MC-JESP finds Nash equilibrium solutions through an iterative optimization process, whose main algorithm is presented in \Cref{sec:Main_Algo}.
%
%In each iteration, MC-JESP constructs a \textit{best-response generative model} for agent $i$ using the Dec-POMDP simulator and other agents' policies (see \Cref{sec:BR_generative_model}).
%
%Then, \Cref{sec:Compute_FSC} describes how to build agent $i$'s FSC policy with its best-response model.
%
%A heuristic initialization method for MC-JESP is provided in \Cref{sec:MCJESP_init}.

Like \infJESP, we aim to find infinite-horizon Nash equilibrium solutions by iteratively building each agent's best response to the other agents' fixed policies.
%%
We thus stick to representing policies as FSCs, and to using the same algorithmic scheme for the local search as presented in \Cshref{alg:JESP_main}.

This requires relying on the same best-response POMDPs, \ie, in particular with the same extended state, denoted $s^t_{e} = \langle s^t, \vn_{-i}^{t}, o_{i}^{t}  \rangle$ in the following (and $e^t$ in \Cshref{sec:infJESP}).
%
However, lacking an explicit Dec-POMDP model, we cannot derive explicit POMDP models.
%
%A challenge for MC-JESP is how to build such a best response model if only a black-box Dec-POMDP simulator is available.
%
To address this issue, in MC-JESP, we propose an alternative approach that relies on \textit{best-response generative POMDP models}  (noted $G_{BR}$) derived from the Dec-POMDP simulator and other agents' FSCs.
%
% In the following content, we call this model ``$G_{BR}$'' for simplicity.
In the following, we discuss %
how to build such models, how to derive solution FSCs, %
how to obtain initial heuristic FSCs, and %
what the properties of the resulting approach are.
% \vincent{added 'what are the'.}

\subsection{Best-Response Generative Model}
\label{sec:BR_generative_model}

A generative POMDP model $G_{BR}$ for agent $i$ has to sample the next extended state $s^{t+1}_e$, observation $o^{t+1}_i$, and reward $r^{t+1}$, given a current extended state $s^t_e$ and action $a^t_i$.
%
As illustrated in \Cref{fig:BRGenerativeModel} and as detailed in \Cshref{alg:MCJESP_Extended_G}, this can be achieved by relying only on the Dec-POMDP simulator and the other agents' FSCs.
%
The algorithm first decomposes the extended state, and gets other agents' actions $\va^t_{-i}$ according to action-selection functions $\psi_{-i} \equiv \langle \psi_j \rangle_{j \neq i}$  (\cref{line:decompose_extended_state,line:GetOthersAction}).
%
Then, in \cref{line:MC_JESP_step_of_G}, the joint action $ \langle a^t_{i}, \va^t_{-i} \rangle$ is passed to the Dec-POMDP simulator $G$, which outputs the next state $s^{t+1}$, joint observation $\langle o^{t+1}_i, \vo^{t+1}_{-i}  \rangle$ and instant reward $r^{t+1}$.
%
With the other agents' observations $\vo^{t+1}_{-i}$, \cref{line: GetOthersNextNode} computes their next nodes $\vn^{t+1}_{-i} \equiv \langle n^{t+1}_j \rangle_{j \neq i}$.
%
In the end, we build the next extended state $s^{t+1}_e$ and return the results (\cref{line:built_next_ste,line:MCJESP_return}).
%
In this algorithm, stochasticity exists only in the Dec-POMDP simulator $G$, the FSC functions $\psi$ and $\eta$ being deterministic.
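%
Under the assumption that the simulator $G$ samples from some underlying (unknown) joint distribution, which we write here as $Pr(s', \vo \mid s, \va)$ purely for illustration, the dynamics implicitly sampled by $G_{BR}$ can be sketched as
\begin{align*}
  & Pr(s^{t+1}_e, o^{t+1}_i \mid s^t_e, a^t_i) \\
  & \quad = \sum_{\vo_{-i}} Pr\bigl(s^{t+1}, \langle o^{t+1}_i, \vo_{-i} \rangle \mid s^t, \langle a^t_i, \psi_{-i}(\vn^t_{-i}) \rangle\bigr) \\
  & \qquad \cdot \mathbf{1}\bigl[\vn^{t+1}_{-i} = \eta_{-i}(\vn^t_{-i}, \vo_{-i})\bigr],
\end{align*}
where $s^{t+1}_e = \langle s^{t+1}, \vn^{t+1}_{-i}, o^{t+1}_i \rangle$ and $\mathbf{1}[\cdot]$ denotes the indicator function.
%
MC-JESP never evaluates this expression; it only draws samples from it through \Cshref{alg:MCJESP_Extended_G}.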

%\olivier{Point out what is stochastic (only the DecPOMDP model?) vs deterministic (both functions of the FSC models?) ?}

\begin{figure}%[H]
     \centering
     \includegraphics[width=1.0 \linewidth]{images/MC_JESP_BestResponseG.pdf}
%
     \caption{Structure of the best-response POMDP generative model $G_{BR}$,
       %
       with inputs and outputs represented as:
       %
       blue arrows for the Dec-POMDP simulator $G$;
       %
       green arrows for the other agents' ($-i$) FSCs; and
       %
       black arrows for $G_{BR}$.
       %
     }
  \label{fig:BRGenerativeModel}
%  \yang{too small fonts, extra margins, arrange the structure of the figure!!}
\end{figure}



\newcommand{\tmp}[1]{
%  \makebox[0pt][l]{#1}\phantom{$s^{t+1}, \vo^{t+1}, r^{t+1}$}
  \phantom{$\langle s^t, \vn^t_{-i}, o^t_i \rangle$}\makebox[0pt][r]{#1}
}

\begin{algorithm}
  \caption{$i$'s Best-Response Generative Model} \label{alg:MCJESP_Extended_G}
  \DontPrintSemicolon
  %
  \SetInd{.3em}{.6em}

%  \SetKwFunction{ProcessAction}{{\bf doAction}}
  \SetKwFunction{ProcessAction}{{\bf getNxtBeliefs}}


  {[Input:]} %
  $s^t_e$: extended state $\mid$ %
  $a^t_i$: agent $i$'s action \; %$\mid$ \linebreak %
  {[Parameters:]} %
  $G$: Dec-POMDP simulator $\mid$ \linebreak %
  % $\langle  N_{-i}, \psi_{-i}, \eta_{-i} \rangle$: other agents' FSCs \\
  $\fsc_{-i} \equiv \langle  N_{-i}, \psi_{-i}, \eta_{-i} \rangle$: other agents' FSCs \\
  % $ fsc_{\neq_i} \equiv \langle  fsc_j \rangle_{j \neq i}$: other agents' FSCs \\
  % \Fct{\bf $G_{BR}(s^t_e,a^t_i,G, \fsc_{\neq_i})$}{
  %   $\langle s^t, \vn^t_{-i}, o^t_i \rangle \gets s^t_{e} $ \label{line:decompose_extended_state} \;
  %   $\va^t_{-i} \gets \psi_{-i}(\vn^t_{-i})$   \label{line:GetOthersAction} \;
  %   $s^{t+1}, \vo^{t+1}, r^{t+1} \gets G(s^t, \va^t )$ \label{line:MC_JESP_step_of_G}  \;
  %   $\vn^{t+1}_{-i} \gets \eta(\vn^t_{-i}, \vo^{t+1}_{-i})$ \label{line: GetOthersNextNode} \;
  %   $s^{t+1}_{e} \gets \langle s^{t+1}, \vn^{t+1}_{-i}, o^{t+1}_i  \rangle$ \label{line:built_next_ste}  \;
  %   \textbf{return} $s^{t+1}_{e}, o^{t+1}_i , r^{t+1}$ \label{line:MCJESP_return} \Comment{return step results}
  % }
  \Fct{\bf $G_{BR}(s^t_e,a^t_i, [G, \fsc_{-i}])$}{
    \tmp{$\langle s^t, \vn^t_{-i}, o^t_i \rangle$} $\gets s^t_{e} $ \label{line:decompose_extended_state}
    \tcp*[f]{extract $s^t_e$'s 3 components} \;
    \tmp{$\va^t_{-i}$} $\gets \psi_{-i}(\vn^t_{-i})$   \label{line:GetOthersAction}
    \tcp*[f]{get action from FSC} \;
    $s^{t+1}, \vo^{t+1}, r^{t+1} \gets G(s^t, \va^t )$ \label{line:MC_JESP_step_of_G}
    \tcp*[f]{sample transition} \;
    \tmp{$\vn^{t+1}_{-i}$} $\gets \eta_{-i}(\vn^t_{-i}, \vo^{t+1}_{-i})$ \label{line: GetOthersNextNode}
    \tcp*[f]{evolve FSC} \;
    \tmp{$s^{t+1}_{e}$} $\gets \langle s^{t+1}, \vn^{t+1}_{-i}, o^{t+1}_i  \rangle$ \label{line:built_next_ste}
    \tcp*[f]{build $s^{t+1}_e$} \;
    \textbf{return} $s^{t+1}_{e}, o^{t+1}_i , r^{t+1}$ \label{line:MCJESP_return} \tcp*[f]{return step results}
  }
\end{algorithm}



\subsection{Computing Agent $i$'s FSC using Monte-Carlo methods}
\label{sec:Compute_FSC}

In the previous section, we showed how to build the best-response generative model $G_{BR}$ for agent $i$ given the other agents' fixed FSCs.
%
However, state-of-the-art point-based POMDP solvers (see, \eg, \citealp{PBVI,Smith_2004_HSVI,sarsop}), which \infJESP relies on, require exact models and thus cannot be used in MC-JESP.
%
%\footnote{Point-based solvers \citep{PBVI,Smith_2004_HSVI,sarsop, bai2010monte} approximate a POMDP's value function by reachable belief points. However, they also suffer that the number of needed sampled points grows exponentially with the increase of state size.}
%
%Monte Carlo Value Iteration (MCVI) \citep{bai2010monte} is a point-based solver using generative models, but it is unable to bound the solution size for solving large POMDPs.
%
We thus rely on a simulation-based solver, namely POMCP \citep{NIPS2010_edfbe1af}, which is an online algorithm, \ie, it focuses on returning the best action for the current belief.
%
Therefore, the question is how to use a simulator ($G_{BR}$) and an online simulation-based solver to obtain agent $i$'s FSC.
%
To answer it, we propose an algorithm that
%
\begin{enumerate*}
\item uses this Monte-Carlo planner (POMCP) to compute the best action for a  given FSC node, which is labeled by a unique belief; and
\item expands reachable beliefs, \ie, creates new FSC nodes using computed actions to gradually build a complete FSC.
\end{enumerate*}
% use the Monte-Carlo approach (POMCP) to compute the best action at each belief with $G_{BR-POMDP}$ and insert them into an FSC one by one.
%
Moreover, to control the computational cost, we explicitly bound the FSC size with a given parameter $N_{max\mhyphen\fsc} \in \mathbb{N}$.


In the proposed algorithm, each FSC node is attached to
\begin{enumerate*}
\item an approximate belief $b$ (with at least $N_{min\mhyphen part}$ particles),
\item a preferred action $a_i$, and
\item a weight $w$ that estimates the probability of reaching that node at least once during execution.
\end{enumerate*}
%
%
% \vincent{maybe add a sentence to explain what is the aim of weights $w$ (not just a definition).
% %
% Something like this at the end of 3rd item ?
% ', so that nodes with the highest impact on the value function will be selected and developped first'
% or
% 'These weights will help to guide the search and
% to develop first the nodes that have the highest impact on the value function.'
% ?
% }
%
As detailed in \Cref{alg:MCJESP_BuildFSC}, this information is first gathered for the initial belief $b^0_{BR}$ by calling
POMCP to get agent $i$'s best action $a^0_i$ in \cref{alg:MCJESP_initial_best_action}. %
%then \ProcessAction, which samples many transitions until each feasible observation (according to the samples) is attached to a sufficient number of particles $N_{min\mhyphen part}$ (\cref{alg:MCJESP_ProcessActionStop}),
%
%and returns the set of feasible observations $\Omega^{t+1}_i$ and the set of induced beliefs $B^{t+1}_i$.
%(each belief is a collection of particles).
%
\Cref{alg:MCJESP_start_node} then creates a start node $n_0$ with $b^0_{BR}$, $a^0_i$, and a weight $w = 1$.
%
This start node is added to the FSC under construction $(N)$ and an open list $(L)$.
%

Then, while $L$ is not empty, its node $n$ with the largest weight $w$ is popped (\cref{alg:MCJESP_PopNode}), so as to first develop the nodes that may have the highest impact on the value at the root.
%
Expanding it requires mapping each observation $o_i$ feasible when performing $n.a_i$ from $n.b$ to a particle set.
%
This is achieved through sampling by \ProcessAction
until each feasible $o_i$ (according to the samples) is attached to at least $N_{min\mhyphen part}$ particles (\cref{alg:MCJESP_ProcessActionStop});
%
the function then returns a set $\Omega'_i$ of feasible observations and a mapping $B'_i$ from these observations to particle sets.
%(each belief is a collection of particles).
%
Then, for each individual observation $o_i$, the algorithm needs to create a transition to an appropriate node, which may already exist
%
% \vincent{I replaced \sout{exists} by exist}
%
or needs to be created, as explained in the following.
%
If $o_i$ is deemed infeasible when performing $n.a_i$ in $n.b$ ($o_i \not\in \Omega'_i$), then a self-loop is added (\cref{alg:MCJESP_selfloop}).\footnote{Note that $o_i$ could become feasible due to future changes in other agents' FSCs.}
%
%
Otherwise, %
\cref{alg:MCJESP_BeliefUpdate} gets the belief $b'_{BR}$ attached to $o_i$, and %
\cref{alg:MCJESP_NewWeight} computes an associated weight $w'$. %
% \cref{alg:MCJESP_NewBestAction} computes agent $i$'s best action $a'_i$ given its new belief $b'_{BR}$ using POMCP.
%
If %
(i) a belief $\epsilon$-close to $b'_{BR}$ (in 1-norm) exists in $N$, or %
(ii) the FSC has reached its size limit $N_{max\mhyphen\fsc}$ (\cref{alg:MCJESP_CheckNewNode}), %
then we take as next node $n'$ the one in the FSC minimizing $\|b'_{BR} - n'.b\|_1$ and update its weight (\linesref{alg:MCJESP_FindClosestNode}{alg:MCJESP_UpdateClosestNodeWeight}).
%
Otherwise, a new node $n'$ is created using an action selected by POMCP (\linesref{alg:MCJESP_NewBestAction}{alg:MCJESP_BuildNewNode}), and %
added to both $N$ and $L$ (\linesref{alg:MCJESP_BuildNewNode}{alg:MCJESP_addNewNodeToG}).
%
In \cref{alg:MCJESP_transitionNewNode}, whatever the origin of $n'$, an edge $n \to n'$ is created in the FSC with a label $o_i$.
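%
As an illustration of the weight computed in \cref{alg:MCJESP_NewWeight}, and reading $|B'_i|$ as the total number of particles collected over all observations (our interpretation), the update amounts to
\begin{align*}
  w' & = \widehat{Pr}(o_i \mid n.b, n.a_i) \cdot n.w, \\
  \widehat{Pr}(o_i \mid n.b, n.a_i) & = \frac{|B'_i[o_i]|}{\sum_{o \in \Omega'_i} |B'_i[o]|}.
\end{align*}
For instance, with hypothetical numbers, if $n.w = 0.5$ and $250$ out of $1000$ sampled transitions yield $o_i$, then $w' = 0.25 \cdot 0.5 = 0.125$.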

Note that, for a fixed $N_{max\mhyphen\fsc}$ value, a small $\epsilon$ may prevent the FSC from representing long trajectories, while a large $\epsilon$ may induce excessive node merging.
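%
One possible way to make the 1-norm distance used for merging concrete, assuming (this is not specified above) that a particle set $b$ is interpreted as the empirical distribution $\hat{b}(e) = \frac{1}{|b|} \sum_{p \in b} \mathbf{1}[p = e]$ over extended states $e$, is
\begin{align*}
  \| b - b' \|_1 = \sum_{e} \bigl| \hat{b}(e) - \hat{b}'(e) \bigr| \in [0, 2].
\end{align*}
Under this reading, $\epsilon = 0$ merges at most identical empirical distributions, while any $\epsilon > 2$ merges every new belief into an existing node.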





\begin{algorithm}
  \caption{Compute agent $i$'s FSC}
  \label{alg:MCJESP_BuildFSC}
  %
  \DontPrintSemicolon
  \SetInd{.3em}{.6em}

  {[Input:]} %
  $b^0_{BR}$: $G_{BR}$'s initial (extended) belief state $\mid$ %
  \linebreak %
  $G_{BR}$: best response generative model for agent $i$ \; %
  {[Parameters:]}  %
  $N_{max\mhyphen\fsc}$: max FSC size for agent $i$  $ \mid$ %
  \linebreak %
  $N_{min\mhyphen part}$: min number of particles in each belief $\mid$ %
  \linebreak %
  $\epsilon$: min. distance between beliefs  \;
  \Fct{\bf ComputeFSC($b^0_{BR}, G_{BR}$)}{ % $N_{max\mhyphen\fsc}, N_{min\mhyphen part}$)}{
  $a^0_i \gets POMCP(b^0_{BR}, G_{BR})$ \label{alg:MCJESP_initial_best_action} \;
  $n_0 \gets node(b^0_{BR}, a^0_i, w=1)$   \label{alg:MCJESP_start_node} \;
  % $n_0.w \gets 1 $ \;
  $N \gets  \{n_0\}$  \tcp*[f]{init FSC \& open list} \;
  $L[w] \gets n_0 $  \tcp*[f]{(sorted by $\searrow$ weight)}\;
  \While{$|L| > 0$ }{
    % $L.sort()$ \label{alg:MCJESP_sortG} \;
    $n \gets L.popfront()$  \label{alg:MCJESP_PopNode}  \;
   % $\langle b_{BR}, a_i \rangle \gets n $ \;
    $\Omega'_i, B'_i  \gets \ProcessAction(n.b, G_{BR}, n.a_i)$ \;
    \For(\tcp*[h]{For each obs. of $i$:}){$ o_i \in \Omega_i$}{
      \uIf(\tcp*[h]{$o_i$ unexpected:}){$ o_i \not\in \Omega'_i $}{
        $\eta(n, o_i) \gets n$ \label{alg:MCJESP_selfloop}
        \tcp*[h]{add self-loop}\;
      }
      \Else(\tcp*[h]{Else: create next node}){
        $b'_{BR} \gets B'_{i}[o_i]$ \label{alg:MCJESP_BeliefUpdate} \;
        $w' \gets \frac{|B'_{i}[o_i]|}{|B'_{i}|} \cdot n.w$  \label{alg:MCJESP_NewWeight} \;
        \uIf(\; \tcp*[h]{Similar node exists in FSC or FSC full?}\;
        \tcp*[h]{Yes: Merge with closest node in FSC}){
          $( b'_{BR} \in N(\epsilon) ) \vee ( |N| = N_{max\mhyphen \fsc} ) $ \label{alg:MCJESP_CheckNewNode} }{
          $n' \gets N.findClosest(b'_{BR})$ \label{alg:MCJESP_FindClosestNode}  \;
          $n'.w \gets   n'.w  + w' $ \label{alg:MCJESP_UpdateClosestNodeWeight} \;
        }
      \Else(\tcp*[h]{No: Add new node}) {
          $a'_i \gets POMCP(b'_{BR}, G_{BR} )$ \label{alg:MCJESP_NewBestAction} \;
          $n' \gets node(b'_{BR}, a'_i, w')$  \label{alg:MCJESP_BuildNewNode}  \;
          $N \gets N \cup \{n'\}$\;
          $L[w'] \gets n' $ \label{alg:MCJESP_addNewNodeToG}\;
        }

        $\eta(n, o_i) \gets n'$ \tcp*[h]{Add transition to FSC.} \label{alg:MCJESP_transitionNewNode} \;
      }
    }
  }
}

  \Fct{\ProcessAction{$b^t_{BR}, G_{BR}, a^t_i$}}{
       $\Omega^{t+1}_i \gets \emptyset $ \;
       $B^{t+1}_i \gets \emptyset$ \;
       \Repeat{Timeout() $\vee$ $(MinBeliefParticles(B^{t+1}_i ) \geq  N_{min\mhyphen part})$ }{ \label{alg:MCJESP_ProcessActionStop}
       	$e^t \sim b^t_{BR}$ \;
	$\langle e^{t+1}, o^{t+1}_i, r^{t+1}  \rangle \sim G_{BR}(e^t, a^t_i) $ \;
	\If{$ o^{t+1}_i \notin \Omega^{t+1}_i $}{
		 $\Omega^{t+1}_i \gets \Omega^{t+1}_i \cup \{  o^{t+1}_i \} $ \;
	}
	$B^{t+1}_i[o^{t+1}_i] \gets B^{t+1}_i[o^{t+1}_i]  \cup \{ e^{t+1} \} $ \;
	%\If{$MinBeliefParticles(B^{t+1} ) >  N_{min\mhyphen part}$}{ \label{alg:MCJESP_ProcessActionStop}
	%	break \;
	%}
       }
       \Return{$\Omega^{t+1}_i$, $B^{t+1}_i$}
     }

% \end{multicols}

\end{algorithm}



% \subsection{Main Algorithm}
% \label{sec:Main_Algo}

% MC-JESP's main procedure is presented in \Cshref{alg:MCJESP_main}.
% %
% To control the computational cost, in each iteration, the size of the computed FSC is bounded by a parameter  $N_{max\mhyphen \fsc} \in \mathbb{N}^*$.
% %
% The local search starts with $|\cI|$ initial $N_{max\mhyphen \fsc}$-node FSCs in $\fsc\equiv \langle \fsc_i \rangle_{i=1}^{|\cI|}$.
% %
% It then loops over each agent to improve an agent $i$'s policy by finding a $N_{max\mhyphen \fsc}$-nodes FSC that is the best-response to the current fixed FSCs $\fsc_{-i}$ of other $|\cI| - 1$ agents.
% %
% More specifically:
% %
% \begin{itemize}
% \item in \cref{codes:BuildExtendedG_MCJESP}, MC-JESP constructs a best-response generative model $G_{BR}$ for agent $i$ in each iteration;
% %
% \item MC-JESP uses a Monte-Carlo method (POMCP) to build agent $i$'s FSC node by node (\cref{codes:learnFSC});
% %
% \item since no explicit models are available, simulations are conducted in \cref{codes:PolicyEval_MCJESP} to estimate the value obtained using agent $i$'s policy $\fsc'_i$ under initial distribution $b^0_{BR}$ with simulator $G_{BR}$.
% %
% \end{itemize}
% %
%\olivier{What do you mean by ``On the other hand''? Do you mean that you will now talk about similarities between Inf-JESP and MC-JESP? If so, this should be more explicit, \eg, writing ``as in Inf-JESP'' somewhere.}
% %
% As the FSC size is bounded in MC-JESP, and each iteration improves the value monotonically (assuming perfect evaluation of FSCs).
% %
% Therefore, the search will necessarily terminate in a finite time.
% %
% In the end, the final solution may be close to a Nash equilibrium, given that POMCP is an approximate online solver.



% \begin{algorithm}
%   \caption{Monte-Carlo JESP}
%   \label{alg:MCJESP_main}
%   %
%   \DontPrintSemicolon
%   %
%   \SetInd{.3em}{.6em}
%  %\scalefont{.9}

%   \SetKwFunction{LocalSearch}{{\bf LocalSearch}}
%   \SetKwFunction{ComputeFSC}{{\bf ComputeFSC}}
%   \SetKwFunction{BuildBRGenerativeModel}{{\bf BuildBRGenerativeModel}}
%   \SetKwFunction{getBRGenerativeModel}{{\bf getBRGenModel}}
%   \SetKwFunction{EvalSim}{{\bf EvalSim}}

%   [Input:] %
%   $b^0$: initial belief $\mid$ %
%   $G$: Dec-POMDP simulator $\mid$ %
%   $\fsc$: initial FSCs \; %
%    % $\epsilon$: error gap $\mid$ %
%   \Fct{\LocalSearch{$b_0, G, \fsc$} }{
%     $v_{bestL} \gets eval(\fsc)$ \;
%     $\nNI \gets 0$   \tcp*[h]{\#(iterations w/o improvement)} \;
%     $i\gets 1$  \tcp*[h]{Id of current agent} \;
%     \Repeat(\tcp*[h]{Cycle over agents}){$\nNI=|\cI|$ \tcp*[h]{No improvement in last cycle.}}{
%       %\uwave{$G_{BR}, b^0_{BR} \gets \BuildBRGenerativeModel(G, b^0, \fsc_{-i})$} \label{codes:BuildExtendedG_MCJESP} \;
%       %\uwave{$\fsc'_i \gets $\ComputeFSC{$b^0_{BR}$ $\mid$ $G_{BR}$, $K$ }} \label{codes:learnFSC} \;
%       %\uwave{$v \gets \EvalSim(\fsc'_i,  b^0_{BR} \mid  G_{BR})$} \label{codes:PolicyEval_MCJESP} \;
%       $G_{BR}, b^0_{BR} \gets \getBRGenerativeModel(b^0, G, \fsc_{-i})$ \label{codes:BuildExtendedG_MCJESP} \;
%       $\fsc'_i \gets $\ComputeFSC{$b^0_{BR}, G_{BR}$ } \label{codes:learnFSC} \;
%       $v \gets \EvalSim(\fsc'_i,  b^0_{BR},  G_{BR})$ \label{codes:PolicyEval_MCJESP} \;
%       \uIf(\tcp*[h]{Keep new FSC if better}){$v > v_{bestL}$}{
%         $\fsc_i \gets \fsc'_i$\;
%         $v_{bestL} \gets v$\;
%         $\nNI \gets 0$\;
%       }
%       \Else(\tcp*[h]{increment $\nNI$}){
%         $\nNI \gets \nNI+1$\;
%       }
%       % $i \gets (i+1) \mod |\cI|$\;
%       $i \gets (i \mod |\cI|) + 1$\;
%     }%({\scriptsize \hfill \tt{// looped over agents w/o improving}})
%     \Return{$\langle \fsc, v_{bestL} \rangle$}
%   }
% \end{algorithm}




\subsection{Heuristic Initialization}
\label{sec:MCJESP_init}

%
Although MC-JESP monotonically improves the value of the joint policy at each iteration, random initializations often lead to poor local optima.
%
To benefit from a heuristic initialization that allows finding good solutions quickly and reliably,
%
we adapt \infJESP[]'s heuristics as we adapted the computation of an agent's FSC in the previous section: using particle sets as beliefs, calling a simulation-based solver, and bounding the number of nodes.
%
In addition, to derive the next belief, we marginalize over the other agents' possible joint observations $\vo_{-i}$, rather than reasoning on them separately as \cite{InfJESP} did (\eg, considering only the most probable one).

% \hrule

% %\paragraph{Heuristic Initialization:}
% %\label{sec:MCJESP_heuristic}
% %
% %
% Although MC-JESP monotonically improves the value of the joint policy at each iteration, random initializations may often lead to poor local optima.
% %
% We would therefore like to develop a heuristic initialization method that allows finding good solutions quickly and reliably.
% %

This heuristic initialization assumes that
\begin{enumerate*}
\item agent $i$'s decisions are made as if all agents were sharing their observations, thus acting as a single agent; %
  this means making decisions (picking joint actions $\va$) by solving a Multi-agent POMDP (MPOMDP)  \citep{Pynadath-jair02} relaxation of the original Dec-POMDP, %
  which can be done here with an (online) simulation-based POMDP solver; and % such as POMCP; and
\item agent $i$'s belief $b$ over the hidden state is updated assuming %
  \begin{enumerate*}
  \item that the other agents ($-i$) also act according to the computed MPOMDP policy at $b$, but %
  \item using only $i$'s observation, $o_i$, while marginalizing out the others' observations ($\vo_{-i}$, which are not known to $i$ at execution time).
  \end{enumerate*}
\end{enumerate*}
%
% Our heuristic initialization assumes shared observations in the Dec-POMDP between all agents.
% %
% This induces a relaxation of the problem as a Multi-agent POMDP (MPOMDP) \citep{Pynadath-jair02}.
% %
% Then, we extract individual FSCs using an MPOMDP solver. %, and those FSCs will be used to initialize MC-JESP.
% %
% More specifically, to extract an initial FSC for agent $i$, we assume that all agents share the same belief $b$ and act according to the same joint action $a$.
% %
% Then, the belief is updated by agent $i$ by marginalizing agents $-i$'s possible observations to ignore them.
% %
%
This \textit{one-sided-observation belief update} is computed as follows:
\begin{align*}
  %\label{eq:OneSideObservationBeliefUpdate}
  & b^{\va,o_i}(s') \eqdef Pr(s' | o_i, \va, b)
  = \frac{Pr( s', o_i | \va, b)}{Pr( o_i | \va, b)} \\
%  & = \frac{\sum_{s}\sum_{o_{-i}} Pr( s', s, \langle o_i, o_{-i} \rangle, \langle a_i, a_{-i} \rangle, b) }{\sum_{s'}\sum_{s}\sum_{o_{-i}} Pr( s', s, \langle o_i, o_{-i} \rangle, \langle a_i, a_{-i} \rangle, b)} \\
 % & = \frac{\sum_{o_{-i}} Pr( \langle o_i, o_{-i} \rangle | s', \langle a_i, a_{-i} \rangle) \sum_{s}Pr(s' | s, \langle a_i, a_{-i} \rangle) b(s)}{\sum_{s'}\sum_{s} \sum_{o_{-i}} Pr( \langle o_i, o_{-i} \rangle | s', \langle a_i, a_{-i} \rangle)Pr(s' | s, \langle a_i, a_{-i} \rangle) b(s)} \\
%  & = \frac{\sum_{o_{-i}} O( \langle o_i, o_{-i} \rangle | \langle a_i, a_{-i} \rangle, s') \sum_{s} T( s, \langle a_i, a_{-i} \rangle, s') b(s) }{ \sum_{s'}\sum_{s} \sum_{o_{-i}} O( \langle o_i, o_{-i} \rangle | \langle a_i, a_{-i} \rangle, s')  T(s, \langle a_i, a_{-i} \rangle, s') b(s)} .
%%%  & \qquad = \frac{\sum_{s, \vo_{-i}} Pr( s', s, \langle o_i, \vo_{-i} \rangle, \va, b) }{\sum_{s', s, \vo_{-i}} Pr( s', s, \langle o_i, \vo_{-i} \rangle, \va, b)} \\
  & \qquad = \frac{\sum_{\vo_{-i}} O( \langle o_i, \vo_{-i} \rangle | \va, s') \sum_{s} T( s, \va, s') b(s) }{ \sum_{s', \vo_{-i}} O( \langle o_i, \vo_{-i} \rangle | \va, s') \sum_s  T(s, \va, s') b(s)} .
\end{align*}
%

Following this idea, we derive
the FSC heuristic initialization process for agent $i$ detailed in
\Cshref{alg:MCJESP_heuristic} which,
%
as shown in red, differs from \Cshref{alg:MCJESP_BuildFSC} in two aspects:
%
\begin{itemize}
\item the Dec-POMDP simulator $G$ is used as an MPOMDP simulator for POMCP to get the best joint action with a given belief (\linesrefAnd{alg:MCJESP_heuristic_initial_best_action}{alg:MCJESP_heuristic_NewBestAction}); and
  \item in \cref{alg:MCJESP_heuristic_process_action}'s \ProcessAction function, the next estimated beliefs are obtained by repeatedly sampling transitions using the computed joint action $\langle n.a_i, n.\va_{-i} \rangle$ (and Dec-POMDP simulator $G$), and collecting particle sets for each encountered individual observation $o_i$, ignoring $\vo_{-i}$, which is equivalent to a marginalization.
% \item instead of a standard belief update, we use a one-sided-observation belief update
% %
%   method within \cref{alg:MCJESP_heuristic_process_action}'s \ProcessAction function, which repeatedly samples transitions using the computed joint action (and the original Dec-POMDP simulator) and collects particles for each encountered agent $i$'s observation.
%
\end{itemize}
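%
In particle form, and under the assumption that $K$ transitions are sampled with $s_k \sim b$ and $\langle s'_k, \vo_k, r_k \rangle \sim G(s_k, \va)$ (the index $k$ and the count $K$ are our notation, introduced for illustration), this marginalization corresponds to the estimate
\begin{align*}
  \hat{b}^{\va,o_i}(s') = \frac{\sum_{k=1}^{K} \mathbf{1}[s'_k = s' \wedge o_{i,k} = o_i]}{\sum_{k=1}^{K} \mathbf{1}[o_{i,k} = o_i]},
\end{align*}
where $o_{i,k}$ is agent $i$'s component of $\vo_k$.
%
The other components of $\vo_k$ never appear in this ratio, which is how the other agents' observations are summed out.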

% \begin{algorithm*}
%   \caption{Build a heuristic FSC for agent $i$}
%   \label{alg:MCJESP_heuristic}
%   %
%   \DontPrintSemicolon
%   \SetInd{.3em}{.6em}


% \begin{multicols}{2}


%   {[Input:]}
%   $b^0$: initial state distribution $\mid$
%   %\linebreak %
%   $i$: agent index \;
%   {[Parameters:]}
%   $G$: Dec-POMDP simulator $\mid$
%   \linebreak %
%   $N_{max\mhyphen \fsc}$: max. FSC size \; %for agent $i$ \;


% $ \textcolor{red}{\langle a^{0}_{i}, \va^{0}_{-i} \rangle   \gets POMCP(b^0, G)} $ \label{alg:MCJESP_heuristic_initial_best_action} \;
% $  \textcolor{red}{B^{1}_i, \Omega^1_i  \gets \ProcessAction(b^{0}, G, \langle a^{0}_{i}, \va^{0}_{-i} \rangle)}$ \label{alg:MCJESP_heuristic_process_action_start} \;
%   $n_0 \gets node(b^0, a^0_i, B^1_i, \Omega^1_i, w=1)$   \label{alg:MCJESP_heuristic_start_node}  \;
%   $N \gets  \{n_0\}$  \tcp*[f]{init FSC \& open list} \;
%   $L[w] \gets n_0 $ \;
%   \While(\tcp*[f]{loop over open nodes}){$|L| > 0$ }{
%         $L.sort()$ \label{alg:MCJESP_heuristic_sortG} \;
%         $n \gets L.popfront()$  \label{alg:MCJESP_heuristic_PopNode} \;
%         $\langle b, a \rangle \gets n $ \;
%         \For(\tcp*[h]{For each obs. of $i$:}){$ o_i \in \Omega_i$}{
%           \uIf(\tcp*[h]{$o_i$ unexpected: add self-loop}){$ o_i \not\in n.\Omega_i $}
%           {
%             $\eta(n, o_i) \gets n$ \label{alg:MCJESP_heuristic_selfloop}
%           }
%           \Else(\tcp*[h]{Else: create next node:}){
%             $b' \gets n.B'_{i}(o_i)$ \;
%             $w' \gets n.w * \frac{|B'_{i}(o_i)|}{|B'_{i}|}$ \;
%             $ \textcolor{red}{\langle a'_{i}, \va'_{-i} \rangle   \gets POMCP(b', G)} $ \label{alg:MCJESP_heuristic_NewBestAction} \;
%             %
%             \uIf(\; \tcp*[h]{Similar node exists in FSC or FSC full?}\;\tcp*[h]{No: Add new node}){
%               $( b' \notin N(\epsilon) ) \wedge ( |N| < N_{max\mhyphen \fsc} ) $ \label{alg:MCJESP_heuristic_CheckNewNode}
%             }{
%               $ \textcolor{red}{ B^{''}_i, \Omega^{''}_i  \gets \ProcessAction(b', G, \langle a'_{i}, \va'_{-i} \rangle )}$ \label{alg:MCJESP_heuristic_process_action} \;
%               $n' \gets node(b', a'_i, B^{''}_i, \Omega^{''}_i, w)$  \label{alg:MCJESP_heuristic_BuildNewNode}  \;
%               $N \gets N \cup \{n'\}$\;
%               $L[w] \gets n' $ \label{alg:MCJESP_heuristic_addNewNodeToG}\;
%             }
%             \Else(\tcp*[h]{Yes: Merge with closest node in FSC}){
%               $n' \gets N.find(b')$ \label{alg:MCJESP_heuristic_FindExsitNode}  \;
%               $n'.w \gets   n'.w  + w' $ \label{alg:MCJESP_heuristic_UpdateExisitNodeWeight} \;
%             }

%             $\eta(n, o_i) \gets n'$ \label{alg:MCJESP_heuristic_transitionNewNode} \;
%           }
%         }
% }

% \end{algorithm}



\begin{algorithm}
  \caption{Build a heuristic FSC for agent $i$}
  \label{alg:MCJESP_heuristic}
  %
  \DontPrintSemicolon
  \SetInd{.3em}{.6em}

  {[Input:]}
  $b^0$: initial state distribution $\mid$
  %\linebreak %
  $i$: agent index  \; %\;
  {[Parameters:]}
  $G$: Dec-POMDP simulator $\mid$ %
  \linebreak %
  $N_{max\mhyphen \fsc}$: max. FSC size $\mid$ %
  $\epsilon$: min. distance between beliefs \; %for agent $i$ \;


$ \textcolor{red}{\langle a^{0}_{i}, \va^{0}_{-i} \rangle   \gets POMCP(b^0, G)} $ \label{alg:MCJESP_heuristic_initial_best_action} \;
  $n_0 \gets node(b^0, a^0_i, \va^{0}_{-i}, w=1)$   \label{alg:MCJESP_heuristic_start_node}  \;
  $N \gets  \{n_0\}$  \tcp*[f]{init FSC \& open list} \;
  $L[w] \gets n_0 $ \;
  \While(\tcp*[f]{loop over open nodes}){$|L| > 0$ }{
        $L.sort()$ \label{alg:MCJESP_heuristic_sortG} \;
        $n \gets L.popfront()$  \label{alg:MCJESP_heuristic_PopNode} \;
        %$\langle b, a \rangle \gets n $ \;
        $ \textcolor{red}{ \Omega'_i, B'_i  \gets \ProcessAction(n.b, G, \langle n.a_{i}, n.\va_{-i} \rangle )}$ \label{alg:MCJESP_heuristic_process_action} \;
        \For(\tcp*[h]{For each obs. of $i$:}){$ o_i \in \Omega_i$}{
          \uIf(\tcp*[h]{$o_i$ unexpected: add self-loop}){$ o_i \not\in \Omega'_i $}
          {
            $\eta(n, o_i) \gets n$ \label{alg:MCJESP_heuristic_selfloop}
          }
          \Else(\tcp*[h]{Else: create next node:}){
            $b' \gets B'_{i}[o_i]$ \;
            $w' \gets n.w \cdot \frac{|B'_{i}[o_i]|}{|B'_{i}|}$ \;
            %$ \textcolor{red}{\langle a'_{i}, \va'_{-i} \rangle   \gets POMCP(b', G)} $ \label{alg:MCJESP_heuristic_NewBestAction} \;
            %
            %\uIf(\; \tcp*[h]{Similar node exists in FSC or FSC full?}\;\tcp*[h]{No: Add new node}){
             % $( b' \notin N(\epsilon) ) \wedge ( |N| < N_{max\mhyphen \fsc} ) $ \label{alg:MCJESP_heuristic_CheckNewNode}
            %}{
              %$n' \gets node(b', a'_i, \va'_{-i}, w)$  \label{alg:MCJESP_heuristic_BuildNewNode}  \;
             % $N \gets N \cup \{n'\}$\;
             % $L[w] \gets n' $ \label{alg:MCJESP_heuristic_addNewNodeToG}\;
           % }
           % \Else(\tcp*[h]{Yes: Merge with closest node in FSC}){
             % $n' \gets N.find(b')$ \label{alg:MCJESP_heuristic_FindExsitNode}  \;
             % $n'.w \gets   n'.w  + w' $ \label{alg:MCJESP_heuristic_UpdateExisitNodeWeight} \;
           % }
               \uIf(\; \tcp*[h]{Similar node exists in FSC or FSC full?}\;
              \tcp*[h]{Yes: Merge with closest node in FSC}){
          $( b' \in N(\epsilon) ) \vee ( |N| = N_{max\mhyphen \fsc} ) $ \label{alg:MCJESP_heuristic_CheckNewNode} }{
          $n' \gets N.findClosest(b')$ \label{alg:MCJESP_heuristic_FindClosestNode}  \;
          $n'.w \gets   n'.w  + w' $ \label{alg:MCJESP_heuristic_UpdateClosestNodeWeight} \;
        }
      \Else(\tcp*[h]{No: Add new node}) {
           $ \textcolor{red}{\langle a'_{i}, \va'_{-i} \rangle   \gets POMCP(b', G)} $ \label{alg:MCJESP_heuristic_NewBestAction} \;
              $n' \gets node(b', a'_i, \va'_{-i}, w')$  \label{alg:MCJESP_heuristic_BuildNewNode}  \;
              $N \gets N \cup \{n'\}$\;
              $L[w'] \gets n' $ \label{alg:MCJESP_heuristic_addNewNodeToG}\;
        }



            $\eta(n, o_i) \gets n'$ \label{alg:MCJESP_heuristic_transitionNewNode} \;
          }
        }
}

%\end{multicols}

\end{algorithm}


\subsection{Observations}

% {\scriptsize
% With an increasing time budget, POMCP asymptotically converges to optimal decisions.
% %
% By %
% (i) increasing POMCP's time budget to infinity, %
% (ii) increasing $N_{min\mhyphen part.}$ to infinity, and %
% (iii) setting $\epsilon=0$, %
% each iteration of the local search would thus return the best response among the subset of FSCs of size $N_{max\mhyphen \fsc}$ that the FSC extraction process can produce.
% %
% \vincent{As discussed, I do not think this is true by default. The fact that you merge FSC nodes after $N_{max\mhyphen \fsc}$ nodes might prevent the algorithm from finding the best FSC among the subset of FSCs of size $N_{max\mhyphen \fsc}$.}
% %
% As a consequence, assuming also an exact evaluation of FSCs, the local search would be guaranteed to find a Nash equilibrium (in this subset of FSCs).
% %
% Increasing $N_{max\mhyphen \fsc}$ would then allow better representing POMCP's optimal policy and converging to ideal Nash equilibria.
% }

With an increasing time budget, POMCP asymptotically converges to optimal decisions.
%
By %
(i) increasing POMCP's time budget to infinity, %
(ii) increasing $N_{min\mhyphen part}$ and $N_{max\mhyphen \fsc}$ to infinity, % (using exact belief approximations and unbounded FSCs), %
and % 
(iii) setting $\epsilon=0$, % (only merging identical beliefs),
each iteration of the local search would thus return the best-response (possibly infinite) FSC.
%
As a consequence, assuming also an exact evaluation of FSCs, the local search would be guaranteed to find a Nash equilibrium.

In practice, we only obtain approximate Nash equilibria.
%
Also, due to randomization in POMCP and in FSC evaluations through simulations, restarts of the full process lead to different search trajectories, but always stop in finitely many iterations with probability 1.
%
The next section looks at the results obtained in practice through experiments.




\section{Experiments}
\label{sec:exp}

We evaluate our contributions on five state-of-the-art Dec-POMDP benchmarks (\cf \url{http://masplan.org/problem_domains}):
%
% \begin{itemize}
% \item
Decentralized Tiger %(Dec-Tiger)
\citep{JESP},
% \item
Recycling Robots %(Recycling)
\citep{Amato2007UAI},
% \item
Meeting in a $3 \times 3$ grid %(Grid $3 \times 3$)
\citep{amato2009incremental},
% \item
Cooperative Box Pushing % (Box-Pushing)
\citep{Seuken2007ImprovedMD},
% \item
and Mars Rover \citep{amato2009achieving}.
%\end{itemize*}
%
We compare MC-JESP with state-of-the-art Dec-POMDP solvers relying on either:
%
%\begin{itemize*}
%\item
{\em explicit models} (which benefit from more information): FB-HSVI \citep{DibAmaBufCha-jair16}, Peri \citep{PajPel-nips11}, PeriEM \citep{PajPel-nips11}, PBVI-BB \citep{MacIsb-nips13}, and \infJESP; or
% \item
{\em generative models}: MCEM \citep{Wu2013}, Dec-SBPR \citep{Liu2015}, and oSARSA \citep{pmlr-v80-dibangoye18a}.
%\end{itemize*}
%
%\uwave{Although MC-JESP is a simulation-based algorithm, we still compared it with the methods which rely on explicit models to show MC-JESP's power.}

For MC-JESP,
\begin{enumerate*}
\item POMCP is used as our simulation-based POMDP planner with a timeout of $1$\,s;
  %
\item we consider three different maximum FSC sizes: 10, 30, and 50;
%  \olivier{What about $\epsilon$ and the number of simulations in evaluations? Is the following ok?}
  %
\item the threshold distance between beliefs is set to $\epsilon=0.1$; and
  %
\item FSC evaluations (\cref{codes:PolicyEval_infJESP} of \Cshref{alg:JESP_main}) use $10^6$ simulations that stop when $\gamma^t < 10^{-4}$.
\end{enumerate*}
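%
As a rough check of the stopping rule for evaluation simulations, with the discount factor $\gamma = 0.9$ used in all domains (see \Cref{Table:MCJESP_BenchmarksResults}), $\gamma^t < 10^{-4}$ first holds at
\begin{align*}
  t = \left\lceil \frac{\ln(10^{-4})}{\ln(0.9)} \right\rceil = \lceil 87.4 \rceil = 88,
\end{align*}
so each simulation runs for roughly $88$ steps.
%
Assuming rewards bounded in absolute value by some $R_{max}$ (a symbol introduced here for illustration only), the discounted return ignored by this truncation is at most $\gamma^{88} R_{max} / (1 - \gamma) < 10^{-3} R_{max}$.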
%
To report the values of MC-JESP and \infJESP, we exploit the fact that the exact model of each benchmark problem is available, and use \citeauthor{Hansen-nips97}'s [\citeyear{Hansen-nips97}] policy evaluation for FSCs applied to a best-response POMDP.
%
%\yang{MC-JESP's exact evaluation is more complicated than Inf-JESP. 1: Compute All agents' FSCs 2: Use $Dec-POMDP$ + $FSC_{-i}$ = $POMDP_i$ 3: Exact evaluation using $POMDP_i$  and $fsc_i$}.
The experiments with MC-JESP were conducted on a laptop with a 2.3\,GHz i9 CPU.
%
The source code is available at \url{https://gitlab.inria.fr/anr-fcw/mcjesp}.



\subsection{Comparison with state-of-the-art algorithms}
\label{sec:results}

% \begin{adjustbox}{
%     center=\linewidth,
%     float=table,
%     caption = {Comparison of different algorithms in terms of final FSC size, number of iterations, time, and value, on five infinite-horizon benchmark problems with $\gamma = 0.9$ for all domains.},
%     label=Table:MCJESP_BenchmarksResults,
% %    max height= 0.5\textheight,
%     }
\begin{table}
  \centering
  \caption{Comparison of different algorithms in terms of final FSC size, number of iterations, time, and value, on 5 infinite-horizon benchmarks with $\gamma = 0.9$ for all domains.
%
The solvers are listed in decreasing order of value.
    %\newline
  }
  \label{Table:MCJESP_BenchmarksResults}
  \resizebox{\linewidth}{!}{%
   \begin{tabular}{
  S
  c
  S
  S[round-mode = places, round-precision = 0, table-format=2.0]
  S[round-mode = places, round-precision = 2, table-format=2.2]
  }
  \toprule
  {Alg.} & {FSC size} & {Iterations} & {Time (s)} & {Value}\\
  \midrule
  \multicolumn{5}{c}{ {DecTiger} {($|\cI|=2, |\cS|=2, |\cA^{i}|=3, |\cZ^{i}|=2 $)}} \\ 
  \midrule
  {FB-HSVI*} & & & 153.7 & 13.45 \\
    {Peri*} & & & 220 & 13.45  \\
    {INF-JESP*} & {$ 6 \times 6 $} & 27 & 213 & 13.44 \\
    \textbf{MC-JESP(M-{$20$})} & {$ 30 \times 30 $} & 5 & 5620  & 13.44 \\
  {PeriEM*} & & & 6450 & 9.42 \\
   {oSARSA} & & &  & -0.2  \\
  \textbf{MC-JESP(M-{$1_{20}$})} & {$ 24 \times 25 $} & 4 & 281 & -2.33 \\
  {Dec-SBPR} & & & 96 & -18.63  \\
  {MCEM} & & & 20 & -32.31  \\

  \midrule
  \multicolumn{5}{c}{{Recycling} {($|\cI|=2, |\cS|=4, |\cA^{i}|=3, |\cZ^{i}|=2 $)}} \\
  \midrule
  {FB-HSVI*} & & & 2.6 & 31.929 \\
  \textbf{MC-JESP(M-{$20$})} & {$ 17 \times 17 $} & 3 & 5260 & 31.92 \\
  {Peri*} & & & 77 & 31.84  \\
  {PeriEM*} & & & 272 & 31.80 \\
  {INF-JESP*} & {$ 2 \times 2 $} & 3 & 3.1 & 31.62 \\
    {Dec-SBPR} & & & 147 & 31.26  \\
  \textbf{MC-JESP(M-{$1_{20}$})}& {$ 19 \times 20 $} & 4 & 263 & 30.74 \\

  \midrule
  \multicolumn{5}{c}{{Grid3*3} {($|\cI|=2, |\cS|=81, |\cA^{i}|=5, |\cZ^{i}|=9 $)}} \\
  \midrule
  {INF-JESP*} & {$ 8 \times 10 $} & 3 & 2 & 5.81 \\
  \textbf{MC-JESP(M-{$20$})} & {$ 50 \times 50 $} & 4 &  11900 & 5.81 \\
  {FB-HSVI*} & & & 67 & 5.802 \\
  \textbf{MC-JESP(M-{$1_{20}$})}& {$ 50 \times 50 $} & 6 & 595 & 5.80 \\
  {Peri*} & &  & 9714 & 4.64 \\
  \midrule
  \multicolumn{5}{c}{{Box-pushing} {($|\cI|=2, |\cS|=100, |\cA^{i}|=4, |\cZ^{i}|=5 $)}} \\
  \midrule
  {FB-HSVI*} & &  & 1715.1 & 224.43 \\
    \textbf{MC-JESP(M-{$20$})} & {$ 50 \times 50 $} & 6 & 9740  & 223.84 \\
      \textbf{MC-JESP(M-{$1_{20}$})}& {$ 50 \times 50 $} & 5 & 487 & 220.94 \\
   {INF-JESP*} & {$ 250 \times 408 $} & 6 & 963 & 220.25 \\
  {Peri*} & & & 5675 & 148.65 \\
     {oSARSA} & & &  & 144.57  \\
  {PeriEM*} & & & 7164 & 106.65 \\
      {Dec-SBPR} & & & 290 & 77.65  \\

  \midrule
  \multicolumn{5}{c}{{Mars Rover} {($|\cI|=2, |\cS|=256, |\cA^{i}|=6, |\cZ^{i}|=8 $)}} \\
  \midrule
  {FB-HSVI*} & & & 74.31 & 26.94 \\
     {INF-JESP*} & {$ 125 \times 183 $} & 6 & 122 & 26.91\\
    \textbf{MC-JESP(M-{$20$})} & {$ 50 \times 50 $} & 5 & 6980 & 26.45 \\
  \textbf{MC-JESP(M-{$1_{20}$})}& {$ 50 \times 50 $} & 3 & 349 & 25.89 \\
  {Peri*} & & & 6088 & 24.13 \\
    {Dec-SBPR} & & & 1286 & 20.62  \\
  {PeriEM*} & & & 7132 & 18.13 \\ 
  \bottomrule
\end{tabular}
  }
\end{table}
%\end{adjustbox}

\Cref{Table:MCJESP_BenchmarksResults} presents the results for the 5 problems, the solvers being ordered from best to worst value. %obtained on the five standard Dec-POMDPs we used as benchmark.
%
Among $x$ restarts of MC-JESP, MC-JESP(M-$x$) reports the best value and MC-JESP(M-$1_x$) the average value (to assess the benefit of restarting).
%
For \infJESP, we report the best values among its 3 possible initializations \citep{InfJESP}.
%
For MC-JESP, we report the best values over the 3 possible max. FSC sizes.
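%
To make the two MC-JESP conventions explicit (a sketch in ad hoc notation, where $V_k$ denotes the value returned by the $k$-th of $x$ restarts, a symbol introduced here for illustration only), the reported values can be written as
\begin{align*}
  V_{\text{M-}x} &= \max_{k \in \{1, \dots, x\}} V_k, &
  V_{\text{M-}1_x} &= \frac{1}{x} \sum_{k=1}^{x} V_k .
\end{align*}
The running times in \Cref{Table:MCJESP_BenchmarksResults} appear consistent with this reading: each MC-JESP(M-$20$) time is $20$ times the corresponding MC-JESP(M-$1_{20}$) time (\eg, $20 \times 281 = 5620$\,s on DecTiger), which suggests that the former is the total time over the 20 restarts and the latter the average time per restart.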
%
%For MC-JESP(M-$x$), we kept the highest value among $x$ restarts, and then report the average of these value over the various runs in MC-JESP(M-$1_x$).
%
The columns provide:
%
\begin{itemize*}
%
\item (\textit{Alg.}) the different algorithms at hand, with a $*$ marking those that rely on an explicit model;
%
\item (\textit{FSC size}) the final FSC size  (for \infJESP[s] and MC-JESP);
  %\olivier{Indicate earlier (when giving implementation details) that, for MC-JESP, you bound the FSC sizes to 50.}
%
\item (\textit{Iterations}) the number of iterations required to converge (for \infJESP[s] and MC-JESP);
%
\item (\textit{Time}) the running time;
%
\item (\textit{Value}) the final value (lower bounds for \infJESP[s] and MC-JESP, the true value being at most 0.01 higher).
%
\end{itemize*}
%
In terms of final value achieved, MC-JESP(M-$1_{20}$) finds good solutions in all cases except DecTiger (which is a small but difficult coordination problem), and MC-JESP(M-$20$) obtains results very close to the near-optimal solutions of FB-HSVI (which relies on an explicit Dec-POMDP model) for all benchmark problems.
%
MC-JESP sometimes even achieves better results than explicit model-based algorithms (\eg, Peri and PeriEM on Box-pushing and Mars Rover).
%
Also, it dominates the other simulation-based methods on the large problems (Box-pushing and Mars Rover), where they obtain significantly lower values.

%
However, compared with the explicit model-based algorithms, MC-JESP requires more solving time.
%
For example, in large problems such as Mars Rover, a single run of MC-JESP takes $349$\,s on average, while \infJESP takes $122$\,s.
%
This is not surprising since MC-JESP only uses a black-box simulator.
%
% and, with restarts, MC-JESP can give good solutions within an acceptable time. %
A key question is how to determine whether restarting can be beneficial.





\subsection{A closer look at MC-JESP's behavior}
%\olivier{Same comment as for corresponding subsubsection for Inf-JESP.}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Figures with results
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}%[H]
  \def\scale{0.95}
       \centering
%        \olivier{{\bf FONT TOO SMALL (digits)!}}
     %     Also,:
     %     \begin{itemize}
     %     %\item Remove ``Avg ValueFSC / num iteration'' above graphs. This is redundant with X and Y labels.
     %     \item Remove ``with 0.1 and 0.9 quartiles'' (which is wrong, since you should write ``percentiles'', and is already (and correctly written) in the caption.
     %     % \item Set colors to Red / Green / Blue instead of Red / Orange / Blue.
     %     %  [So, just replace orange by green.]
     %     \end{itemize}
     % }

       \includegraphics[width=\scale\columnwidth]{images/MCJESP_result_figures/DecTiger/DecTiger_byIter_ValueFSC_error_histo.pdf}
       \hfill
       \includegraphics[width=\scale\columnwidth]{images/MCJESP_result_figures/Grid/Grid_byIter_ValueFSC_error_histo.pdf}

       \includegraphics[width=\scale\columnwidth]{images/MCJESP_result_figures/Recycling/Recycling_byIter_ValueFSC_error_histo.pdf}
       \hfill
       \includegraphics[width=\scale\columnwidth]{images/MCJESP_result_figures/Box-Pushing/Box-Pushing_byIter_ValueFSC_error_histo.pdf}

       \includegraphics[width=\scale\columnwidth]{images/MCJESP_result_figures/Mars/Mars_byIter_ValueFSC_error_histo.pdf}
       \hfill
       \begin{minipage}[b]{\columnwidth}
       \caption{%
         %
         Values of the joint policy for the Dec-Tiger, Grid, Recycling, Box-Pushing, and Mars Rover problems (from left to right and top to bottom).
         %
         The left part of each figure presents the evolution (during a
         run) of the value of the joint policy at each iteration of MC-JESP(M-$1_{20}$) (average with 10th and 90th percentiles) for maximum FSC sizes of 10, 30, and 50.
       %
       The dashed line represents FB-HSVI's final value.
       %
       The right part presents the value distribution after convergence of MC-JESP(M-$1_{20}$).
     }
     \label{Figure:MCJESP_ValueIterationAndFinal}
     \end{minipage}
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\begin{figure}%[H]
       \centering
       % \olivier{
       %   \begin{itemize}
       %   \item Font too small (digits along axes).
       %   % \item Remove white margins outside the graphs (also in other figures), e.g., using a shell command like {\tt pdfcop} .
       %   \item Try to make the graph more rectangular ???
       %   \end{itemize}}

       \includegraphics[width=.85\linewidth]{images/MCJESP_result_figures/ImpactTimeout.pdf}
       \caption{%
         %
         Values of the joint policy for the Dec-Tiger problem for different POMCP timeout values.
         % %
         % The x-axis indicates the different POMCP timeout used when building the FSC node in the MC-JESP algorithm.
         % %
         % The y-axis represents the final value obtained through the MC-JESP method.
     }
     \label{Figure:MCJESP_ImpactTimeout}
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We study MC-JESP's performance with three different maximum FSC sizes in \Cshref{Figure:MCJESP_ValueIterationAndFinal} (red for 10, green for 30, and blue for 50).
%
% \Cref{Figure:MCJESP_ValueIterationAndFinal} (right parts)
The right parts present the distribution of the final values of MC-JESP over 20 restarts.
%
In the five problems at hand, MC-JESP with a maximum FSC size of 50 (blue) yields distributions more concentrated on good values than the other sizes, and most values are close to those of FB-HSVI (and thus near-optimal).
%
These distributions show that few restarts are needed to reach good solutions with high probability, provided that the maximum FSC size is large enough. % for the given problem. %
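%
As a back-of-the-envelope illustration (a sketch under assumptions not verified here: restarts are independent, and a single run reaches a good solution with some probability $p$, a hypothetical quantity that could be estimated from the right-hand histograms), the probability that at least one of $x$ restarts succeeds is
\begin{equation*}
  1 - (1-p)^{x},
\end{equation*}
so that a success probability of at least $1-\delta$ requires $x \geq \ln \delta / \ln (1-p)$ restarts.
%
For instance, with the purely illustrative values $p = 0.5$ and $\delta = 0.01$, $x = 7$ restarts suffice, since $0.5^{7} \approx 0.008$.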
%

%
The left parts of \Cshref{Figure:MCJESP_ValueIterationAndFinal} present the evolution of the values during each iteration of MC-JESP with the three maximum FSC sizes.
%
The average is computed over all runs, even if they have already converged.
%
This figure first shows that the value of MC-JESP's joint policy increases monotonically during each run, and that most runs converge to good local optima in a few iterations.
%
Second, we observe that, for large problems (Box-Pushing and Mars Rover), the value already drops significantly at the first iteration when the FSC size limit decreases from 50 to 10.
%
This indicates that, for large problems, we must give large enough FSC size limits, while this is not necessary for small problems.

Last but not least, in Dec-Tiger, although some restarts of MC-JESP end with optimal values, we observe that the average value is still relatively low compared with FB-HSVI.
%
Therefore, we conducted another experiment to investigate the impact of different POMCP timeouts (note that there is a fixed timeout of $1$\,s for the experiments illustrated in \Cref{Figure:MCJESP_ValueIterationAndFinal}).
%
To that end, we limit the FSC size in each iteration to at most 50 nodes, and we test MC-JESP with five POMCP timeouts (1\,s, 5\,s, 10\,s, 20\,s, and 30\,s).
The distribution of final values is shown in \Cshref{Figure:MCJESP_ImpactTimeout}.
%
We observe that the average value increases and the variability shrinks when we give more time to POMCP.
%
However, this also indicates that, with a larger time budget, there is a lower chance of obtaining ``lucky'' good values.
%

%\subsection{Supplementary Experiments with a Continuous-State Dec-POMDP}

\section{Conclusion}
\label{sec:conclusion}

In this work, building on \infJESP, we propose a novel infinite-horizon Dec-POMDP solver called MC-JESP, which only requires a black-box Dec-POMDP simulator and returns FSCs, \ie, representations that can make for interpretable policies.
%
We describe how to obtain a best-response generative model (the simulator of the POMDP faced by some agent $i$ assuming known FSCs for other agents), and the process to extract an FSC for each agent.
%
We also provide a heuristic initialization method for MC-JESP.

Our experiments show that MC-JESP preserves \infJESP[]'s competitive results (though at the cost of an increased computation time), performing better than many explicit model-based algorithms, and outperforming other simulation-based algorithms in most cases.
%
Because it seeks Nash equilibria, this approach could better scale up to large problems than approaches directly seeking global optima.

% Through experiments, we prove that MC-JESP has competitive results, and even performs better than many explicit model-based algorithms.
% %
% We believe that the good performance may be due to the "best-response iteration" itself but is not linked to the completeness of the model (explicit or generative model).
% %
% This would explain why MC-JESP and \infJESP share the same good performance.
% %


% \olivier{Future Work: %
%   ``Safely'' compare FSCs $\fsc_i$ and $\fsc'_i$ while minimizing computational costs through hypothesis testing; %
%   Try to re-use POMCP trees from one node to the next, or to initialize \ProcessAction (warning about memory usage); %
%   If using large FSCs, use space partitioning (\eg, $k$-d trees \citep{10.1145/361002.361007} or cover trees \citep{beygelzimer2006cover}) to speed up the search for nearest nodes. %
% }

Several improvements of MC-JESP could be envisioned, such as:
\begin{enumerate*}
\item robustly comparing FSCs $\fsc_i$ and $\fsc'_i$, while minimizing computation time through hypothesis testing; %
\item if using large FSCs, using space partitioning (\eg, $k$-d trees \citep{10.1145/361002.361007} or cover trees \citep{beygelzimer2006cover}) to speed up the search for nearest nodes; and %
\item
% \vincent{\sout{maybe}}
%
re-using POMCP trees from one node to the next, or to initialize \ProcessAction, although doing so may significantly increase memory usage. %
\end{enumerate*}

Also, preliminary experiments show that MC-JESP works on a continuous-state meet-in-a-grid problem, the main issue being to replace the distance between sets of discrete particles (\ie, just comparing two vectors representing discrete distributions)
by a distance between sets of continuous particles (which requires taking the distance between states into account).
%
% % 
% Also, preliminary experiments show that MC-JESP works on a continuous-state meet-in-a-grid problem by essentially adapting the distance between particle sets.
% %
% \vincent{I did not understand what the last part of the sentence ('adapting the distance between particle sets') meant. Maybe, detailing this with one sentence would make things clearer.}
%
As future work, we plan to extend MC-JESP to problems with continuous actions and observations.
%
This would require not only relying on algorithms such as \citeauthor{sunberg2018online}'s POMCPOW [\citeyear{sunberg2018online}], but also, more importantly, deriving FSCs that can handle continuous observations.
%



% References
\bibliography{you_212}
\end{document}

%============================================================

% [Olivier] One line to tell emacs to use french/american/\dots spelling:
% Local IspellDict: american
