\section{Experimental Evaluation}\label{sec:expts}

% \section{Implementation and Evaluation}\label{sec:expts}

% \marta{Moved the first two items to near the example. \cite{MEA-EB-PK-AL:20} define strategies (Defn 6) as 'memoryless, incomplete information and non-uniform setting' }

% \ruitodo{Yes, they can consider any strategy because what they did is to verify a given strategy, but what we are doing is exploring the strategy space and finding a good one in the game sense. This is more tricky and importing incomplete information will be problematic for equilibria analysis.}

% \ruitodo{
% The main differences between our VCAS model and what is used in Akitunde' paper include:
% \begin{itemize}
%     \item We separate the local states and the environment state by introducing the beliefs as private states, while their paper replicates the climbrates and puts them into both local states and the environment state;
%     \item We consider a stochastic model due to probabilistic local transition functions, while their model is deterministic;
%     \item We embed reward structures into the agents and compute the equilibrium strategies. Supposing that each agent adopts a given protocol function (they call it strategy), they consider a reachability property on safety, which can be taken as a zero-sum case of our model.
% \end{itemize}
% }

We have implemented a prototype version of our FSI method (\algoref{alg:FSI}).
This uses components from PRISM-games 3.0 \citep{MK-GN-DP-GS:20-2},
which supports discrete CSGs without perception.
In particular, we use its SMT-based and linear programming methods for synthesising SWNE and SWCE of CSGs
to initialise the vector of equilibria values in line 1 of Algorithm~\ref{alg:FSI}.
Its support for two-player finite-horizon equilibria~\citep{KNPS19}
also gives an equivalent version of the GBI algorithm (\algoref{alg-BI-SWNE}).
%to demonstrate its applicability and benefits of equilibria strategies.
%

The optimisation problems for computing SW-SPNE and SW-SPCE values for states are solved using Gurobi.
%We solve the optimisation problems for computing SW-SPNE and SW-SPCE values for states using %Gurobi, %\cite{gurobi}
%{\color{red} reducing nonlinear programs to quadratic through the use of additional variables.}
%\gabrielrev{It is worth mentioning that, in order to circumvent a restriction imposed by Gurobi, which does not allow the construction of terms that include multiplication of more than two variables, further variables and constraints need to be added to the nonlinear program, which is modified in a straightforward manner.} 
% In order to improve the scalability,
% our implementation considers the variant of FSI over regions. 
In order to improve the scalability of FSI,
our implementation considers a reduced set of histories by:
(i) limiting the information that the players have access to at each state to be the values of the variables in that state plus \emph{time}, i.e., how many transitions have been made up until that point; and
(ii) constructing histories not over states, but \emph{regions} of states which are independent from a decision-making standpoint. 
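
As a sketch of reduction (i), in notation that is ours rather than taken from the formal development, a history ending in state $s_k$ after $k$ transitions is identified with the pair of its last state and its length:
\[
  s_0 \, \alpha_0 \, s_1 \, \alpha_1 \cdots \alpha_{k-1} \, s_k \;\longmapsto\; (s_k, k), \qquad 0 \le k \le K .
\]
Under this assumption, two histories that reach the same state after the same number of transitions are not distinguished. For a horizon $K$ and a finite set $S$ of reachable states (or regions), at most $|S|\cdot(K+1)$ abstract histories then need to be stored, rather than one value per path.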

%\marta{here I would just state the main differences in the encoding with Prism implementation}

%{\color{blue} 
% To deal with the assumption of perfect recall, requiring that the agents would need to differentiate between states %not only by the atomic propositions associated to them but also 
% also by the choices that were made up until any point in the game, the implementation differs from that described in Section \ref{sec:approx_algo} in that 
% 	(i) we limit the information that the players have access to at any given state to be the values of the variables in that state plus \emph{time}, that is, how many transitions have been made up until that point, and
% 	(ii) instead of considering individual paths, we consider \emph{regions} of the state space, which are independent from a decision-making standpoint. %{\color{blue} This adaptation was necessary in order to alleviate the strain posed by dealing with perfect recall, given players would need to differentiate between the states %not only by the atomic propositions associated to them but also 
% %also by the choices that were made up until any point in the game.} 
% %
% This compromise was necessary due to the difficulty of keeping track of all previous histories, be that at a state level, which would greatly increase the model's representation, or at the model checking level, which would require explicitly storing separate values for all states in all the subgames considered.

% \gabrielrev{Another alternative was to monolithically encode the whole game as one optimisation problem similarly to what is done in \cite{KDJ-MJK:19,MEA-EB-PK-AL:20}, which clearly would not scale considering the added complexity of computing equilibria values. Our prototype implementation balances these two alternatives by building a path whose length is limited by a given time horizon, and then expanding it into a region by considering all the states that could reach the last state of that path. Although in the worst case this approach would mean encoding the whole model at once, in the case of having one unique state to which all histories in the model would eventually converge, we show in the automated parking example described in sequence that it is possible to show improvement for equilibria strategies and values while considering a much smaller portion of the state space.}

Our evaluation employs two case studies:
the first is used to show the applicability of our equilibria improvement algorithm,
and the second to demonstrate the usefulness of equilibria properties for analysing NS-CSGs.
An overview is provided below, with more detail given in the appendix.

%to demonstrate its applicability and benefits of equilibria strategies.

% They considered the problems of optimizing social welfare, computing Nash equilibrium and price of anarchy with complete information. As for incomplete information, they assumed that the players have no knowledge about the other players' distances to the slots. For each agent, a prior distribution about the locations of other agents was introduced. Then, decisions are made based on this prior distribution.


\startpara{Automated Parking} We first formulate a dynamic vehicle parking problem as an NS-CSG
(a static assignment game is considered in, e.g.,~\citep{DA-OW-BX-BD-JL:11}).
%
There are two players (vehicles) targeting two parking slots in a $5\times 4$ grid,
shown in Fig.~\ref{fig:parking} (target cells are green, forbidden cells are red, black arrows show traffic rules).
We consider two reward structures. One minimises time, while
the other extends the first by giving a bonus to player 2 for visiting a designated cell (in yellow).
%
This is a discrete-state model in which percepts identify agent locations precisely.
We use it to compare the equilibria algorithms for two different time horizons $K=8$ and $K=6$.
%\rui{we can briefly state here the steps two players take per time, and the reward -1 per time, because this can help people to understand the reward sum stated in the results}
In this model, both vehicles receive a reward of $-1$ for each move, and vehicle 2 receives a reward of $5.5$ when visiting the bonus cell. Vehicles 1 and 2 travel two grid cells and one grid cell per move, respectively.

%\marta{I think we can save space by referring simply to Algorithm 1 and 2 without 'FSI' or 'Gneralised'}
We first consider Nash equilibria. For the first reward structure, our FSI algorithm %(\algoref{alg:FSI}) 
%and Generalized BI via Recursive SWE (\algoref{alg-BI-SWNE}), 
and the GBI algorithm%(\algoref{alg-BI-SWNE})
, which only considers local SWNE values, both return the SW-SPNE strategy shown in Fig.~\ref{fig:parking}~(top-left), with reward sum $-5.0$. For the second reward structure, FSI finds a new SW-SPNE strategy, shown in Fig.~\ref{fig:parking}~(top-right), with reward sum $-4.5$ and hence higher social welfare, while 
%Generalized BI via Recursive SWE 
GBI still returns the strategy on the left, which is not an SW-SPNE in this case.
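
The reward sums above can be checked with a simple accounting, which is a sketch under our own assumptions: each vehicle accrues $-1$ per move only until it parks or the horizon is reached, and the bonus of $5.5$ is collected at most once. Writing $m_1$ and $m_2$ for the (hypothetical) numbers of moves made by the two vehicles and $b\in\{0,1\}$ for whether vehicle 2 visits the bonus cell, the reward sum is
\[
  R \;=\; -(m_1+m_2) \;+\; 5.5\,b .
\]
Under the first reward structure there is no bonus, so $b=0$ and $R=-5.0$ corresponds to five moves in total. Under the second, a non-integer sum such as $R=-4.5$ can only arise with $b=1$, which gives $m_1+m_2=10$. In this reading, the bonus outweighs the five additional moves, for a net gain of $0.5$ over the strategy with reward sum $-5.0$.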

%Similarly, we compare the FSI and Generalized BI algorithms when computing correlated equilibria. 
With correlated equilibria, 
for $K=8$ both algorithms produce the same strategy, shown in Fig.~\ref{fig:parking}~(bottom-right), for which the reward sum is $-1.5$. We then reduce the time horizon to $K=6$. For this case, in the strategy constructed by the 
%Generalized BI 
GBI algorithm, shown in Fig.~\ref{fig:parking}~(bottom-left), vehicle 2 is instructed to move left in order to get the bonus, while vehicle 1 is instructed to park in the closest spot. However, given the shorter horizon, vehicle 2 does not have enough time to park in the remaining spot and the overall reward sum is $-2.5$. The possible final positions for vehicle 2 are indicated by the blue stars. By contrast, in the strategy synthesised by the FSI algorithm, both vehicles park and the reward sum is higher. Table~\ref{tab:parking} shows statistics for the models constructed and the time for equilibria computation.

%\input{figures/tex/parking_figure.tex}
%\input{figures/tex/parking_figure_ce.tex}
\input{figures/tex/parking_figure_updated.tex}

\input{figures/tex/parking_results.tex}

% \ruitodo{I was thinking of setting the private state empty for the car-parking example, if you haven't added the private state into the model you built before. We can claim that the car-parking example is mainly used for validating the computation of SPE. The VCAS example is the real NS-CSG on which we will run our algorithm.}


% For environment state $s_E=(x_1,\dots,x_n,y_1,\dots,y_m)\in S_E$ and joint action $\alpha=(a_1,\dots,a_n)\in A$, then $\delta_E(s_E,\alpha)=(x_1',\dots,x_n',y_1',\dots,y_m')$ where $y_j'=y_j$ for all $j\in M$, and for all $i\in N$, if $a_i\neq\perp$, then $x_i'=x_i+a_iv_i\Delta\tau$ and if $a_i=\perp$, then $x_i'=x_i$, and $v_i\in\mathbb{R}_{>0}$ is vehicle $i$'s constant speed and $\Delta\tau$ is the time step.

% For each $i\in N$, we adopt the reward structure $r_i=(r_i^A,r_i^S)$, where for $s_i\in S_i$ and $\alpha\in A$,
% \begin{itemize}
%     \item if $a_i\neq\perp$ and $
%     \bar{x}_i\neq \bar{y}_j$ for all $j\in M$: $r_i^S(s_i)=-1$, and if there exists an $j\in M$ such that $\bar{x}_{i}+a_i=\bar{y}_{j}$ and $\bar{x}_k+a_k\neq\bar{y}_j$ for all $k\neq i$, then $r_i^A(s_i,\alpha)=10$; if $\bar{x}_i+a_i\neq\bar{y}_j$ for all $j\in M$ and $\bar{x}_i+a_i\neq \bar{x}_k+a_k$ for all $k\neq i$, then $r_i^A(s_i,\alpha)=0$; if there exists a $k\in N$ such that $\bar{x}_i+a_i= \bar{x}_k+a_k$, then $r_i^A(s_i,\alpha)=-20$;
    
%     \item if $a_i\neq\perp$ and $
%     \bar{x}_i=\bar{y}_j$ for some $j\in M$: if $\bar{x}_i\neq \bar{x}_k$ for all $k\neq i$, then $r_i^S(s_i)=10$ and $r_i^A(s_i,\alpha)=0$; otherwise, $r_i^S(s_i)=-20$ and $r_i^A(s_i,\alpha)=0$;
    
%     \item if $a_i=\perp$ and $
%     \bar{x}_i\neq \bar{y}_j$ for all $j\in M$, then $r_i^S(s_i)=-1$ and $r_i^A(s_i,\alpha)=0$;
    
%     \item if $a_i=\perp$ and $
%     \bar{x}_i=\bar{y}_j$ for some $j\in M$, then $r_i^S(s_i)=0$ and $r_i^A(s_i,\alpha)=0$.
% \end{itemize}

% \startpara{Two-Agent Aircraft Collision Avoidance Scenario} 

% We model the two-agent Vertical Collision Avoidance Scenario (VCAS) as an NS-CSG \cite{MEA-EB-PK-AL:20}. Each aircraft (ownship or intruder) is equipped with an NN-controlled collision avoidance system. Each second the system issues an advisory from which together with the current trust in the advisory, the pilot needs to make a decision about accelerations, aiming at avoiding a near mid-air collision (NMAC), a region where two aircraft are separated by less than $100$ ft vertically and $500$ ft horizontally. 

% We first define the environment as follows:
% \begin{itemize}
%     \item environment state $s_E=(h,\dot{h}_{\textup{own}},\dot{h}_{\textup{int}},\tau)$, where $h$ (ft) is the altitude of the intruder relative to the ownship, $\dot{h}_{\textup{own}}$ (ft/sec) is the vertical climbrate of the ownship, $\dot{h}_{\textup{int}}$ (ft/sec) is the vertical climbrate of the intruder, and $\tau$ (secs) is the time to loss of horizontal separation of two aircraft. The set of environment states is $S_E=[-3000,3000]\times[-2500,2500]\times[-2500,2500]\times[0,40]$;

%     \item dummy actions $A_E$ and protocal function $prot_E$.
% \end{itemize}
% Each agent $i\in \{\textup{own}, \textup{int} \}$ is defined as follows:
% \begin{itemize}
%     \item local state $s_i=(be_i,ad_i)$, where the belief level $be_i$ in previously-issued advisory is the private state and the advisory $ad_i$ is the percept. There are four belief levels $\{\text{4 (trust), 3 (weak trust), 2 (weak distrust), 1 (distrust)}\}$ and nine possible advisories \cite{MEA-EB-PK-AL:20-2}, so the set of local states is $S_i=[4]\times[9]$;
    
%     \item each action $a_i$ is an acceleration $\ddot{h}_i$: we set $A_i=\{0,\pm3.0, \pm 7.33, \pm 9.33, \pm 9.7, \pm11.7\}$;
    
    
%     \item observation function $obs_i$ computes an advisory according to the previous advisory $ad_i$ and updated environment state $s_E'$: $obs_i(ad_i,s_E')=\textup{argmax}(f_{ad_i}(h',\dot{h}_1',\dot{h}_2',\tau'))$, where $s_E'=(h',\dot{h}_1',\dot{h}_2',\tau')$, $\textup{argmax}:\mathbb{R}^9\to[9]$ returns the index of the largest component and $f_{ad_i}:\mathbb{R}^4\to\mathbb{R}^9$ is a function implemented via a feed-forward NN with four inputs, seven hidden layers of 45 nodes and nine outputs representing the score of each possible advisory. There are nine NNs $F=\{ f_i:\mathbb{R}^4\to\mathbb{R}^9  \,|\, i \in [9] \}$, each of which corresponds to an advisory;
    
%     \item for $s_i=(be_i,ad_i)\in S_i$ and $\alpha=(\ddot{h}_1,\ddot{h}_2,a_E)\in A$, the local transition function $\delta_i$ is defined as follows: if $ad_i$ is compliant with $\ddot{h}_1$ and $\ddot{h}_2$, then $be_i'=be_i+1$ with probability $1-\epsilon_0$ and $be_i'=be_i$ with probability $\epsilon_0$ if $be_i\leq3$, and $be_i'=be_i$ if $be_i=4$; if $ad_i$ is not compliant with $\ddot{h}_1$ and $\ddot{h}_2$, then $be_i'=be_i-1$ with probability $1-\epsilon_0$ and $be_i'=be_i$ with probability $\epsilon_0$ if $be_i\ge2$, and $be_i'=be_i$ if $be_i=1$. We choose $\epsilon_0=0.9$.
% \end{itemize}
% Now we define the transition function of the environment $\delta_E(s_E,\alpha)$: $h'=h+\Delta\tau(\dot{h}_1-\dot{h}_2)+0.5\Delta\tau^2(\ddot{h}_1-\ddot{h}_2)$, $\dot{h}_1'=\dot{h}_1+\ddot{h}_1\Delta\tau$, $\dot{h}_2'=\dot{h}_2+\ddot{h}_2\Delta\tau$ and $\tau'=\tau-\Delta\tau$ where $\Delta\tau=1$ is the time step. The reward structure $r_i=(r_i^A,r_i^S)$ is as follows: if $ad_i$ is compliant with $\alpha$ and $h$, then $r_i^A(s,\alpha)=2be_i$, and $be_i$ otherwise; $r_i^S(s)=\kappa h^2/(\tau+1)+be_i^2$. We here take $\kappa=0.1$.
%\\

%An alternative for local state and local transition function:

% \begin{itemize}
%     \item local state $s_i=(be_i, ad_i)$, where the belief level $be_i$ in the advisory system is the private state and the previous advisory $ad_i$ is the percept. There are four belief levels $\{\text{4 (trust), 3 (weak trust), 2 (weak distrust), 1 (distrust)}\}$ and nine possible advisories \cite{MEA-EB-PK-AL:20-2};
    
%     \item each action $a_i$ is an acceleration $\ddot{h}_i$: we set $A_i=\{0,\pm3.0, \pm 7.33, \pm 9.33, \pm 9.7, \pm11.7\}$. Each advisory will provide \rev{two compliant non-zero actions} for the agent to select from, except which the agent is also allowed to adopt zero acceleration;
    
%     \item $obs_i$ computes an advisory by the previous advisory $ad_i$ and environment state $s_E$: $ad_{\textup{own}}'=obs_{\textup{own}}(ad_{\textup{own}},s_E)=\textup{argmax}(f_{ad_{\textup{own}}}(h,\dot{h}_{\textup{own}},\dot{h}_{\textup{int}},\tau))$ and $ad_{\textup{int}}'=obs_{\textup{int}}(ad_{\textup{int}},s_E)=\textup{argmax}(f_{ad_{\textup{int}}}(-h,\dot{h}_{\textup{own}},\dot{h}_{\textup{int}},\tau))$, where $\textup{argmax}:\mathbb{R}^9\to[9]$ returns the index of the largest component and $f_{ad_i}:\mathbb{R}^4\to\mathbb{R}^9$ is a function implemented via a feed-forward NN with four inputs, seven hidden layers of 45 nodes and nine outputs representing the score of each possible advisory. There are nine NNs $F=\{ f_i:\mathbb{R}^4\to\mathbb{R}^9  \,|\, i \in [9] \}$, each of which corresponds to an advisory;
    
%     \item the local transition function $\delta_i$ computes a belief level according to the current belief level $be_i$, the updated advisory $ad_i'$ and the executed action $a_i$: if $a_i$ is compliant with $ad_i'$ (i.e., $a_i$ is non-zero), then $be_i'=be_i+1$ if $be_i\leq3$ and $be_i'=be_i$ if $be_i=4$; otherwise, $be_i'=be_i-1$ if $be_i\ge2$ and $be_i'=be_i$ if $be_i=1$.
% \end{itemize}
% The environment transition function $\delta_E(s_E,\alpha)$ is defined as: $h'=h-\Delta\tau(\dot{h}_{\textup{own}}-\dot{h}_{\textup{int}})-0.5\Delta\tau^2(\ddot{h}_{\textup{own}}-\ddot{h}_{\textup{int}})$, $\dot{h}_{\textup{own}}'=\dot{h}_{\textup{own}}+\ddot{h}_{\textup{own}}\Delta\tau$, $\dot{h}_{\textup{int}}'=\dot{h}_{\textup{int}}+\ddot{h}_{\textup{int}}\Delta\tau$ and $\tau'=\tau-\Delta\tau$ where $\Delta\tau=1$ is the time step.  

% The reward structure for the ownship is as follows: $r_{\textup{own}}^A(s,\alpha)=0$ and $r_{\textup{own}}^S(s)=|h|/h_{\max}+be_{\textup{own}}/4$, where $h_{\max}$ is the maximal absolute value of all altitudes in the generated game tree. We consider two reward structures for the intruder: $r_{\textup{int}}^A(s,\alpha)=0$, and for the zero-sum case, $r_{\textup{int}}^S(s)=-r_{\textup{own}}^S(s)$ and for the nonzero-sum case, $r_{\textup{int}}^S(s)=|h|/{h_{\max}}+be_{\textup{int}}/4$.

% \gabrieltodo{Would need to add $k$?}

\startpara{Two-Agent Aircraft Collision Avoidance Scenario} Second, we consider an NS-CSG model of the VCAS[2] system, as described earlier in Example~\ref{vcas-example}. We study its equilibria strategies, in contrast to the zero-sum (reachability) properties analysed in~\citep{MEA-EB-PK-AL:20}.
%
Fig.~\ref{fig:vcas_pos} plots the altitude $h$ for equilibria and zero-sum strategies when maximising $h$ for a given instant $k$. % in the execution of the model.
%
It can be seen that, with respect to the safety criterion established in~\citep{KDJ-MJK:19,MEA-EB-PK-AL:20},
i.e., avoiding a near mid-air collision,
equilibria strategies allow the two aircraft to reach a safe configuration within a shorter horizon, which would be missed by a zero-sum analysis.

% Fig. \ref{fig:vcas_pos} shows the values for the altitude $(h)$ for equilibria  and zero-sum strategies when seeking to maximise for different initial values of $\tau$. When computing these values, for equilibria properties, we consider two reward structures $r^S_{\textup{own}}(s) = r^S_{\textup{int}}(s) = h$ if $\tau=0$ and 0, otherwise. For the zero-sum case, the reward for the intruder is negated. In both cases, action rewards are set to 0 for all state-action pairs, i. e., $r^A_{\textup{own}}(s, \alpha) = r^A_{\textup{int}}(s, \alpha) = 0$, $\forall s \in S, \alpha \in A$.  It is possible to see that, with respect to the safety criterion established by the authors, equilibria strategies allow the two aircraft to reach a safety configuration within a shorter horizon, which would be missed by a strict zero-sum analysis.

% The transition function of the environment $\delta_E(s_E,\alpha)$ is defined as: $h'=h+\Delta\tau(\dot{h}_1-\dot{h}_2)+0.5\Delta\tau^2(\ddot{h}_1-\ddot{h}_2)$, $\dot{h}_1'=\dot{h}_1+\ddot{h}_1\Delta\tau$, $\dot{h}_2'=\dot{h}_2+\ddot{h}_2\Delta\tau$ and $\tau'=\tau-\Delta\tau$ where $\Delta\tau=1$ is the time step.  The reward structure for agent $1$ is as follows: $r_1^A=0$ and $r_1^S(s)=\kappa h^2/(\tau+1)+be_1^2$. We consider two reward structures for agent $2$: $r_2^A=0$ and for the cooperative case, $r_2^S(s)=\kappa h^2/(\tau+1)+be_2^2$ and for the competitive case $r_2^S(s)=-\kappa h^2/(\tau+1)+be_2^2$. We here take $\kappa=0.1$. 
% \input{figures/tex/vcas_strategy}

% \ruitodo{A reward structure which encourages the agent to save fuels when two aircraft are far away from each other: }

We also consider a second reward structure that incorporates the trust level and fuel consumption,
and we vary the agent uncertainty parameters $\epsilon_{i}$
(see the appendix for details).
In addition, we fix a different safety limit of $h=200$.
%
Table~\ref{tab:VCAS} shows the altitude and the number of violations (times that no advisory is taken) for the generated equilibria.
To give an indication of scalability and performance, we also include the
total number of states in the game unfolding and the time for model construction and algorithm execution for both NE and CE. For this example, both types of equilibria yield the same values for the properties considered.

Finally, we discuss equilibria strategies for different values of the uncertainty parameter $\epsilon_{\textup{own}}$. We find that the agents always comply with the advisory system for smaller initial values of $t$ (time until loss of horizontal separation),
since reaching safety then takes priority. Fig.~\ref{fig:vcas-strategy} (left) illustrates that following the advisories is the best strategy when safety and trust are the priority, as the trust levels $tr_{\text{own}}$ and $tr_{\text{int}}$ of the two agents never decrease from the initial score of $4$.
This changes, however, when both aircraft have a larger horizon to consider. The strategy in Fig.~\ref{fig:vcas-strategy} (right) shows a deviation from the advisory (denoted by value 0 for $a_{\text{own}}$ in state $s^2$). As a result, $tr_{\text{own}}$ drops to $3$ in $s^3$ with probability $0.9$,
fuel consumption is reduced, and the altitude approaches the safety limit of $200$.

% By examining the generated equilibria strategies (see Appendix,
% we see that  

% In Section~5   

% We also analyse two further aspects of the system,
% the full details of which are in \appxref{sec:appendix-c}.
% First, we use a second reward structure that also incorporates the trust level and consumption.

% Finally, we also considered for different values of the uncertainty parameter $\epsilon_{\textup{own}}$
% and found (see \appxref{sec:appendix-c} for details) that following the advisories is the best strategy when safety and trust are the priority.

%\marta{This is not in the right place and needs to be revised or moved; it is too early to comment on computational performance}

\startpara{Efficiency and scalability} For equilibria computation using %the algorithms described in \citep{MK-GN-DP-GS:21} and \citep{KNPS22}
GBI, which computes locally optimal equilibria, CE are generally considerably faster to compute than NE. This is because finding an optimal CE in a state can be reduced to solving a \emph{linear program}, while computing an optimal NE requires finding all solutions of a \emph{linear complementarity problem}. The same is not observed, however, when comparing the performance of FSI on the two types of equilibria:
in both cases we need to solve nonlinear programs, and the path-based encoding requires more constraints and variables for CE than for NE. %Furthermore, we highlight the fact that we incur a higher number of constraints and variable in order to encode CE as, in each state, we need to allocate a number corresponding to the product of the number of available actions for each player
%This happens because, in both cases, we are required to solve a nonlinear program. Furthermore, we highlight the fact that we incur a higher number of constraints and variables in order to encode CE as, in each state, we need to allocate a number corresponding to the product of the number of available actions for each player.
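
For reference, the linear program behind the first observation can be written as follows for a two-player normal-form game. This is the standard textbook formulation, with symbols chosen here for illustration rather than taken from the algorithms above: $A_i$ is the action set of player $i$, $u_i$ its utility function and $p$ a distribution over joint actions. A social-welfare-optimal CE is a solution of
\[
  \max_{p} \; \sum_{(a_1,a_2)\in A_1\times A_2} p(a_1,a_2)\,\bigl(u_1(a_1,a_2)+u_2(a_1,a_2)\bigr)
\]
subject to $p(a_1,a_2)\ge 0$ for all joint actions, $\sum_{(a_1,a_2)\in A_1\times A_2} p(a_1,a_2)=1$, and the incentive constraints
\[
  \sum_{a_2\in A_2} p(a_1,a_2)\,\bigl(u_1(a_1,a_2)-u_1(a_1',a_2)\bigr)\;\ge\;0
  \qquad \text{for all } a_1,a_1'\in A_1 ,
\]
together with the symmetric constraints for player 2. The program has $|A_1|\cdot|A_2|$ variables, one per joint action, and is linear because the utilities are constants. In a path-based encoding, by contrast, we would expect probabilities chosen in successive states to be multiplied together, so that products of variables appear; this reading of where the nonlinearity arises is our assumption.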

\input{figures/tex/vcas_pos}

\input{figures/tex/vcas_table}

\input{figures/tex/vcas_strategy}

