%\documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.



%%%%
%%%% Our packages
%%%%

\usepackage{mathtools}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{comment}
\usepackage{commath}  % for differentials
\usepackage{xspace} 
\usepackage[short, nocomma]{optidef}
\usepackage{tikz} 
\usetikzlibrary{decorations.markings}
\usepackage{thmtools}
\usepackage{thm-restate}
\usepackage{float}

\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\algnewcommand\Input{\item[\textbf{Input:}]}
\algnewcommand\Output{\item[\textbf{Output:}]}

\declaretheorem[name=Theorem,numberwithin=section]{theorem}
\declaretheorem[name=Lemma,numberwithin=section]{lemma}

\usepackage[nopostdot]{glossaries}
\usepackage[automake]{glossaries-extra}
\usepackage{glossary-longextra}
\setglossarystyle{long-name-desc}

%%%%
%%%% Our macros
%%%%

\input{notation}
\input{glossary}
\makeglossaries

% Copied from https://tex.stackexchange.com/questions/141570/sizing-for-given-that-symbol-vertical-bar
\makeatletter
\newcommand{\@giventhatstar}[2]{\left(#1\;\middle|\;#2\right)}
\newcommand{\@giventhatnostar}[3][]{#1(#2\;#1|\;#3#1)}
\newcommand{\giventhat}{\@ifstar\@giventhatstar\@giventhatnostar}
\makeatother
%

%\newtheorem{theorem}{Theorem}[section]
\newtheorem{corollary}{Corollary}[section]
%\newtheorem{lemma}{Lemma}[section]
\newtheorem{proposition}{Proposition}[section]

\theoremstyle{definition}
\newtheorem{definition}{Definition}[section]
\newtheorem{example}{Example}[section]
\newtheorem{remark}{Remark}[section]

\newcommand{\df}[1]{{\bf #1}}
\newcommand{\dfi}[1]{{\em #1}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\rg}{rg}
\DeclareMathOperator{\E}{E}

\newcommand{\BAn}{Barab\'{a}si-Albert\xspace}
\newcommand{\ERn}{Erd\H{o}s-R\'{e}nyi\xspace}
\newcommand{\WSn}{Watts-Strogatz\xspace}
%%%%
%%%%
%%%%

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)



\title{General Markov Model for Solving Patrolling Games}

\author[1,2]{Andrzej Nag\'orko}
\author[1,2]{Marcin Waniek}
\author[1]{Ma\l{}gorzata R\'og}
\author[1,2]{\\Micha\l{} Godziszewski}
\author[1]{Barbara Rosiak}
\author[1,2]{\href{mailto:<tomasz.michalak@ideas-ncbr.pl>}{Tomasz P. Michalak}{}}

\affil[1]{%
	IDEAS NCBR\\
	Warsaw, Poland
}
\affil[2]{%
	Faculty of Mathematics, Informatics and Mechanics\\
	University of Warsaw\\
	Warsaw, Poland
}



\begin{document}
\maketitle

\begin{abstract}
Safeguarding critical infrastructure has emerged as a global challenge. Effective mobile security forces are essential to address complex security concerns. A key challenge involves designing optimal patrolling strategies for mobile units. Two bodies of research have dealt with this problem: stochastic patrolling and partially observable stochastic games. Alas, the first approach makes overly far-reaching simplifying assumptions and the second one is computationally challenging. %To address this limitation, we propose a model that balances expressiveness with computational feasibility. 
The model proposed in this paper is inspired by partially observable stochastic games, and it enables comprehensive modeling of attacker-defender interactions while remaining computationally friendly. %By combining elements of both stochastic patrolling and partially observable stochastic games, our approach seeks to bridge the gap between theoretical models and practical applications, enhancing security in critical infrastructure defense.
With our robust SHIELD algorithm, we are able to find a defense strategy where the probability of capturing the attacker can be nearly doubled compared to the state of the art.
\end{abstract}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{sec:intro}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In an era of growing connectivity and technological interdependence, protecting critical infrastructure has become a worldwide challenge. Unfortunately, rising geopolitical instability has recently made maintaining a desirable level of security significantly more difficult~\citep{tamilselvan2024exploring}. While in recent decades threats consisted of crime, industrial espionage, or terrorism, nations now have to safeguard critical infrastructure from state-sponsored (hybrid-warfare) attacks, exemplified by incidents in the Baltic Sea~\citep{bueger2023critical}. This issue is exacerbated by advancing technology that allows for more sophisticated attacks in remote locations. This increases the sheer size of the areas that have to be protected. For instance, a single offshore wind farm in the Baltic Sea typically covers an area of about 100 km$^2$, with dozens of wind turbines, offshore transformer stations, and hundreds of kilometers of underwater cables. %including export power lines to various countries. %Currently, there are seventeen such farms in operation and several under constructions.
Last but not least, our understanding of what should constitute critical infrastructure has evolved and it is now much broader~\citep{pursiainen2023european}. As a result, limited security resources have to be spread even thinner.


%All the above issue make the task of ensuring complete security of critical infrastructure exceedingly difficult, if not impossible. 
An effective security force needs to be mobile, enabling it not only to detect an attack but also to promptly summon an appropriate response. In this context, one of the key issues is to design optimal patrolling strategies for mobile units. Unfortunately, under realistic assumptions this becomes a challenging game-theoretic problem. Deterministic routes are predictable; thus, they are likely to be exploited by an attacker.
Given this, the literature has focused on so-called stochastic patrolling, where the defender(s) randomize their behavior~\citep{basilico2022recent}. Time in the model is discretized into turns during which both players, the defender and the attacker, take actions. The attacker observes the moves of the defender and can attack any target at any turn, but the penetration takes a predefined number of turns. If the attack is detected within this time, it fails.
A recent work in this vein, \cite{john2023rosso}, studies the problem of patrolling San Francisco intersections by a police unit. The authors assume that the defender strategy is a standard Markov decision process and that this strategy, including the defender's current position, is known to the attacker.
Unfortunately, the stochastic patrolling literature introduces a plethora of simplifying assumptions that make its results difficult to apply in a realistic setting. For instance, while the capabilities of the defender and the attacker are in reality asymmetric, in the stochastic patrolling literature they usually make their decisions based on the same (or very similar) information.

This limitation can be to some extent addressed by employing partially observable stochastic games (POSGs)~\citep{AIJ2023}. Specifically, POSGs enable explicit modeling of the information asymmetry between the attacker and the defender, allowing one party to observe a smaller subset of the environment. Within this framework, both the attacker and the defender can perform multiple actions, modifying the state of the environment. Moreover, both players make their decisions based on the entire history of the confrontation, with a discount factor giving greater weight to current events. However, all this additional expressive power comes at the cost of a greater computational challenge, making POSGs difficult to apply in practice.


%\subsection{Our contributions}

In this work we present a model that combines methods from stochastic patrolling \citep{duan2019markov, duan2021stochastic, john2023rosso} based on Markov chain control with game-theoretic methods based both on POSGs \citep{AIJ2023} and on Stackelberg games, which were successfully applied in many real-world scenarios \citep{pita2009using, shieh2012protect}. For the new model, we construct an effective algorithm to compute an optimal strategy in a scenario where security incidents are scarce and there is little interaction between the defender and the attacker, but a successful attack is devastating. In particular, we consider an infinite event horizon with no discount factor, and the game value is based on a worst-case (instead of average) payoff.
%
Our main contributions are as follows.
\begin{enumerate}
\item Under natural finiteness assumptions, we show that there exists an optimal strategy for the defender that admits a hidden Markov model and we characterize the game payoff for such strategies (Theorem~\ref{thm:game value}).
\item We introduce the concept of memory of a hidden Markov model, which allows us to (almost) linearize a highly nonlinear formula for the game payoff (Theorem~\ref{thm:perspective}).
\item We introduce SHIELD, %(Security Heuristic for Intrusion Exposure and Location Defense)
 an algorithm based on linear programming that computes optimal defender strategies for strategy spaces with a fixed hidden Markov model structure (Section~\ref{sec:lp algorithm}).
\item We prove non-trivial upper bounds for all strategies that admit hidden Markov models (Theorem~\ref{thm:upper bound}).
\item We perform an extensive experimental evaluation of our approach, which includes the computation of a strategy with 19.3\% efficiency, compared with the 10.2\% found in~\cite{john2023rosso}, \emph{under some additional assumptions} about the attacker's behavior (Section~\ref{sec:experimental evaluation}).
\end{enumerate}

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{figures/gdynia_diagram}
    \caption{Stylized scenario of defending the port of Gdynia.}
    \label{fig:gdynia}
\end{figure}

As a running example, let us consider a USV (an unmanned surface vehicle) that is to patrol the port of Gdynia in Poland (see Figure~\ref{fig:gdynia}). While introducing the concepts and notation throughout the paper, we will build upon this example to give a better intuition behind the abstract terms.
\begin{example}
Figure~\ref{fig:gdynia} shows the map of the port of Gdynia in Poland. %, situated on the western shore of Gdańsk Bay.
In 2023 it ranked as the third busiest port in the Baltic Sea in terms of container cargo.% (\url{https://www.port.gdynia.pl/en/statistics/}).
 The port also features a passenger terminal and is adjacent to the Gdynia Naval Base. Possible routes for the USV are depicted on the map.
\end{example}


\section{The model}\label{sec:the model}

%We now formally introduce the key elements of our model.

\subsection{Paths in state and action spaces}\label{sec:preliminaries}

A \df{state and action space} is a directed graph $(V, E)$ with a set of vertices~$V$ that are called \df{states} and a set of edges $E \subset V \times V$ that are called \df{actions}. We allow self-loops.

A \df{path} in a state and action space $(V, E)$ is a sequence $(v_0, v_1, \ldots)$ of states such that $(v_i, v_{i+1}) \in E$. A path may be finite (this includes the \df{empty path}~$\epsilon$) or infinite. We let $V^\ast$ denote the set of all finite paths and $\mathcal{V}$ denote the set of all infinite paths in $(V, E)$.

For $p \in V^\ast$, let $|p|$ denote the \df{length} of~$p$ (measured as the length of the sequence of nodes, e.g., a path made of two edges has length $3$).
For $k \in \mathbb{N}$, we let $V^k \subset V^\ast$ denote the subset of all paths of length $k$.

For $p \in V^\ast$ and $q \in V^\ast \cup \mathcal{V}$ we let $pq = p \cdot q$ denote the \df{concatenation} of paths $p$ and~$q$.
% Observe that $pq$ may be not a path.
If $P \subset V^\ast$, $Q \subset V^\ast \cup \mathcal{V}$, then
\[
PQ = P \cdot Q = \left\{ pq \colon p \in P, q \in Q, pq \in V^\ast \cup \mathcal{V} \right\}.
\]
In particular, we let $V^k p = V^k \cdot p$ denote the set of all \emph{paths} that are concatenations of a path of length $k$ with a path $p \in V^\ast$. %, i.e., $V^k p =\{i \cdot p: i \in V^k, i \cdot p \in V^\ast \}$.
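
To illustrate the notation on a toy instance (the graph below is a hypothetical example chosen only for illustration), let $V = \{a, b, c\}$ and $E = \{(a,a), (a,b), (b,c), (c,a)\}$. The path $(a, b, c)$ consists of two edges and has length $3$. For $p = (a, b)$ and $q = (c, a)$, the concatenation $pq = (a, b, c, a)$ is a path because $(b, c) \in E$, whereas $(a, b) \cdot (a) = (a, b, a)$ is not a path because $(b, a) \notin E$. Consequently,
\[
V^1 \cdot (b, c) = \{ (a, b, c) \}, \qquad
V^2 \cdot (a) = \{ (a, a, a), (b, c, a), (c, a, a) \}.
\]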

A \df{shift operator} $\shift_{\mathcal{V}} \colon \mathcal{V} \to \mathcal{V}$ removes the first state from an infinite path $(v_0, v_1, v_2, \ldots)$, i.e.,
\[
\shift_{\mathcal{V}}((v_0, v_1, v_2, \ldots)) = (v_1, v_2, \ldots).
\]
If $\mathcal{V}$ is known from the context, then we write $\shift$ instead of $\shift_{\mathcal{V}}$.

\subsection{A general formulation}\label{sec:general formulation}
Let $\pspace$ denote a \df{physical state and action space}.
It is a directed graph over which the game is played. 
% We allow self-loops.
We do not make any assumptions about its structure.
Elements of~$\pstates$ are called \df{locations} and elements of~$\pactions$ are called \df{routes}.
Note that we will commonly use a single element of $\pstates$ to represent a location of multiple patrolling units, cf.~Section~\ref{sec:physical space}.

Let $\pstates^\ast$ be the set of finite paths in $\pspace$ which we call \df{histories}.
We let~$\epsilon \in \pstates^\ast$ denote the empty path.
Intuitively, a sequence in $\pstates^\ast$ encodes subsequent positions of the patrolling units  during surveillance. 
We will also interpret elements of $\pstates^\ast$ as branches in the defender's game tree.

Let $\psubshift$ be the set of infinite paths in $\pspace$ which we call \df{patrol schedules} (i.e., we consider the schedules to be extended indefinitely). A \df{defender strategy} is a probability measure $\strategy$ on~$\psubshift$. 
We consider $\psubshift$ to be a measurable space with 
a $\sigma$-algebra of measurable sets generated by the collection of \df{cones}:
\[
\cone{p} = \{ pq \in \psubshift \colon q \in \psubshift \} \text{ for } p \in \pstates^\ast.
\]
%where $pq = p \cdot q$ denotes the \dfi{concatenation} of paths $p$ and~$q$, for $p \in \pstates^\ast$ and $q \in \pstates^\ast \cup \psubshift$ (notice that in the definition of the cone we limit $q$ to the infinite paths).
In other words, the cone $\cone{p}$ of $p \in \pstates^\ast$ is the set of all infinite paths in $\psubshift$ that begin with $p$.

A defender strategy $\strategy$ determines how schedules are generated (cf.~Section~\ref{sec:behavioral strategy}), so we can think of it as specifying how likely the patrolling unit is to follow a given schedule.
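
As a short worked consequence of the definition (a standard measure-theoretic observation, stated here only as an illustration), note that every infinite path beginning with $p \in \pstates^\ast$ continues with exactly one next location. Hence the cone of $p$ splits into the pairwise disjoint cones of its one-step extensions:
\[
\cone{p} = \bigsqcup_{s \in \pstates} \cone{ps},
\qquad \text{and therefore} \qquad
\strategy(\cone{p}) = \sum_{s \in \pstates} \strategy(\cone{ps}),
\]
where we set $\cone{ps} = \emptyset$ whenever $ps$ is not a path. In particular, $\strategy(\cone{\epsilon}) = \strategy(\psubshift) = 1$, and extending a history can only decrease the measure of its cone.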

\begin{example}
In the case of the port of Gdynia, the physical state and action space $\pspace$ corresponds to the actual physical space of the port, with locations $\pstates$ representing different positions in the port area (nodes of the graph in Figure~\ref{fig:gdynia}), and routes $\pactions$ representing transitions between these positions (edges in Figure~\ref{fig:gdynia}). A history in $\pstates^\ast$ is then a finite path that the USV might take during a daily patrol (probably visiting the same locations multiple times), while $\psubshift$ is the set of infinite patrolling paths.
\end{example}

Let $\targets$ be a finite set of \df{attack plans} of the attacker. 
For the sake of generality, we do not impose any additional requirements on $\targets$ at this moment. It can be, e.g., the set of targets chosen by the attacker (together with the time-lengths of attacks on each target), or the set of paths that the attacker plans to take in order to reach chosen targets. 
%In what follows we will focus on the first case above, i.e.,  in the patrolling games instances of our model any attack plan is identified with attacking a given target $j$ for the time period $\tau_j$, but our framework allows for full generality in this respect.

A \df{payoff function} is a map $\payoff \colon \psubshift \times \targets \to \mathbb{R}$. For $j \in \targets$, we let $G_j = G(\cdot, j)$.
We interpret the value of $G_j(p)$ to be the payoff of the defender if the attacker executes an attack plan $j$ at time~$0$ against the patrol schedule $p \in \psubshift$.
%Let $\shift \colon \psubshift \to \psubshift$ denote the \dfi{shift} operator, i.e., the operator that removes the first state from an infinite path.
Let $n \in \mathbb{N}$ and let $\shift^n$ denote the composition of $\shift$ with itself $n$ times.
We let $G^n_j = G_j \circ \shift^n$, $G^n_j \colon \psubshift \to \mathbb{R}$.
We interpret $G^n_j(p)$ to be the defender payoff if the attacker executes attack plan $j \in \targets$ at time $n \in \mathbb{N}$ against the patrol schedule $p \in \psubshift$.
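
For instance, suppose (purely for illustration) that $\pstates$ contains locations $a$, $b$, $c$ and that the periodic sequence $p = (a, b, c, a, b, c, \ldots)$ is a patrol schedule in $\pspace$. Then
\[
\shift^2(p) = (c, a, b, c, \ldots),
\qquad
G^2_j(p) = G_j\bigl(\shift^2(p)\bigr) = G_j\bigl((c, a, b, c, \ldots)\bigr),
\]
that is, an attack launched at time~$2$ is evaluated against the part of the schedule that starts at the defender's position $p_2 = c$ at that time.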

%For $p \in \pstates^\ast$, let $|p|$ denote the length of~$p$ (measured as the length of the sequence of nodes, e.g., a path made of two edges has length $3$).
Let $\pstates^\ast_+ = \pstates^\ast \setminus \{ \epsilon \}$.
We define \df{the game value} for strategy $\strategy$ of the defender to be
\begin{align*}
\gamevalue(\strategy) &= \inf_{i \in\pstates^\ast_+} \min_{j \in \targets} \E\giventhat*{\payoff^{|i|-1}_j}{\cone{i}} \\
&= \inf_{i \in\pstates^\ast_+} \min_{j \in \targets} 
\frac 1{\strategy(\cone{i})} \int_{\cone{i}} \payoff^{|i|-1}_j \dif{\strategy}.
\end{align*}

Here a pair $(i, j) \in \pstates^\ast_+ \times \targets$ is an \df{attacker strategy}, with $i$ being an observation of the attacker (i.e., a sequence of physical states triggering the attack), and $j$ being the attack plan executed as a reaction to observing $i$. Following~\cite{john2023rosso}, we assume the attack begins at the moment when observation $i$ ends. Hence, the game value depends on the last state of history $i$, and we evaluate $G_j$ after discarding $|i|-1$ history states. Therefore, $G^{|i|-1}_j \colon \psubshift \to \mathbb{R}$ is a payoff function of the defender against the strategy $(i,j)$.
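
To see how this definition penalizes predictability, consider a toy instance (all specifics here are assumptions made for illustration): $\pstates = \targets = \{a, b\}$, every ordered pair of locations is a route, and $G_j(p) = 1$ if $p_1 = j$ and $G_j(p) = 0$ otherwise, i.e., an attack on $j$ is caught exactly when the defender is at $j$ one turn after the attack begins. If $\strategy$ draws each location independently and uniformly at random, then for every $i \in \pstates^\ast_+$ and $j \in \targets$
\[
\E\giventhat*{G^{|i|-1}_j}{\cone{i}} = \strategy\giventhat*{p_{|i|} = j}{\cone{i}} = \tfrac12,
\qquad \text{so} \qquad
\gamevalue(\strategy) = \tfrac12.
\]
In contrast, if $\strategy$ assigns probability $\tfrac12$ to each of the two alternating schedules $(a, b, a, b, \ldots)$ and $(b, a, b, a, \ldots)$, then the observation $i = (a)$ reveals that the next location is $b$. Hence $\E\giventhat*{G^{0}_a}{\cone{(a)}} = 0$ and, since $G \geq 0$, we get $\gamevalue(\strategy) = 0$ (taking the infimum over observations of positive probability).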

\begin{example}
    In the case of the port of Gdynia, the set of attack plans $\targets$ can simply be the set of docks in the port (the red rectangular nodes in Figure~\ref{fig:gdynia} are the locations of the docks, while the blue round nodes are the non-dock locations) if we assume that the attacker is able to directly reach each dock. However, the attack plans $\targets$ may also be paths in $\pspace$ if we assume that the attacker must traverse a path from the port entrance to a given dock. The payoff value $G^{\atime}_j(p)$ might express the probability that the attacker launching a strike against target $j$ at time $\atime$ is caught by the USV following a particular patrol schedule $p$. Similarly, the game value $\gamevalue(\strategy)$ would be the expected probability of apprehending the attacker by a USV whose schedule is generated by strategy $\strategy$, under the assumption that the attacker picks the observation $i$ that triggers the attack and the target $j$ optimally.
\end{example}



\subsection{The game dynamics}

Informally, the game defined in Section~\ref{sec:general formulation} is played between a dynamic defender and a static attacker.
The defender is \df{dynamic} in the sense that they play an extensive-form game (a game with a sequence of moves and incomplete information) over the game tree $\pstates^\ast$.
The attacker is \df{static} in the sense that they play a normal-form game (a game with a single move and complete information) by picking a single attack plan $j \in \targets$ to be executed when the defender reaches state $i \in \pstates^\ast_+$ in the game tree.
Note that this distinction is not precise:
  the set of attack plans $\targets$ may very well be a set of extensive-form attack strategies transformed into normal form, with the intricacies of player interactions hidden in the definition of the payoff function~$\payoff$.
Nonetheless, this distinction is important in practice: we assume that sets~$\pstates$ and~$\targets$ are not too large, so computations may be performed in a reasonable time.

The definition of the game value $\gamevalue(\strategy)$ accounts for the worst-case scenario for the defender, when the attacker attacks in the worst possible game tree state $i \in \pstates^\ast_+$.
That is, as is usual in security settings, the game value $\gamevalue(\strategy)$ is computed as the \dfi{Stackelberg equilibrium}, where the attacker picks their strategy with full knowledge of the defender's strategy $\strategy$.
Note that the infimum $\inf_{i \in \pstates^\ast_+}$ may not be attained for any $i \in \pstates^\ast_+$.
This is known as the \df{infinitely patient attacker problem} (cf.~\cite{ICAPS2014}).
%We will analyze this case in detail in Section~\ref{sec:upper bounds}.

The underlying theme of the paper is that the defender patrols some critical infrastructure that is hardly ever attacked.
Therefore, the attacker's actions and the payoff value are invisible to the defender until the end of the game: the goal is to prevent or deter the attack, and we consider the interaction between the defender and the attacker after the attacker is detected to be a separate sub-game that is modeled in the computation of the payoff $\payoff$.

\subsection{Finiteness assumptions}\label{sec:finiteness assumptions}

\subsubsection{History matching}

%In the paper we study strategies designed to beat attackers that base their attack decision on an observation in a fixed finite time horizon. 
The usual assumption in the setting of Stackelberg equilibrium is that the attacker knows the mixed strategy of the defender based on the observation of their past actions.
It is reasonable to ask: how does the attacker learn the defender's strategy?
A recent study by~\cite{lanctot2023population} has shown that, among many modern machine learning approaches in a similar (albeit simpler) setting, a technique called \df{history matching} was the most successful. 

Let $t \in \mathbb{N}$ be the current moment in time and let $i \in \pstates^{t+1}$. A \df{context of length~$k$} for some $k \leq t + 1$ is a sequence $i_{t-k+1}, i_{t-k+2}, \ldots, i_{t-1}, i_t$ of $k$ states: the $k-1$ most recent past states followed by the current state~$i_t$.
History matching tries to match a context to historical data: if $t$ is much greater than $k$, we may look for $t_0 < t$ such that $i_{t_0 - j} = i_{t - j}$ for $j = 0, 1, \ldots, k - 1$, i.e., a sequence of length $k$ from the past matching the last $k$ states.
The attacker wants to exploit the possibility that the defender will repeat their past actions if the context matches.
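
For example (a hypothetical observation sequence, given only to illustrate the indexing), let $t = 7$, $k = 2$, and
\[
i = (i_0, i_1, \ldots, i_7) = (a, b, c, a, b, a, a, b).
\]
The context of length $2$ is $(i_6, i_7) = (a, b)$. It matches the historical data at $t_0 = 1$ and at $t_0 = 4$, since $(i_0, i_1) = (i_3, i_4) = (a, b)$. The states that followed these matches are $i_2 = c$ and $i_5 = a$, so an attacker relying on empirical frequencies (one possible way of using the matches, assumed here) would estimate that the defender moves next to $c$ or to $a$ with probability $\tfrac12$ each.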

If the defender properly randomizes their actions, the attacker can only exploit statistical data gathered from observation of the past actions in the same context. 
Since the number of contexts of length $k$ in non-trivial cases grows exponentially with $k$ and since the attacker learns from observation of physical space (which takes time), 
the \df{paradigm of the paper} is to assume that the attacker bases their decision about an attack on an observation of a context of length $k$, for some fixed $k$. 
The notion of observations, on which the attacker bases their decision to execute an attack plan, can be generalized, and its important property is finiteness. 

Note that we treat the length $k$ of a context observation as an inherent attribute of the attacker's type, and we construct the defender's strategy $\strategy$ against an assumed value of $k$.
This is so because the attacker learns $\strategy$ from observations of the defender's actions.
Alas, this assumption is also a weakness of the defender.

\subsubsection{Actionable observations}

The discussion of history matching strategy motivates the following: let $\actionable \subset \pstates^\ast_+$ be a finite set that we call a set of \df{actionable observations} for the attacker.
We think of actionable observations as the ones that can trigger an attack decision: $i \in \pstates^\ast_+$ is actionable if, upon observing context $i$, the attacker may decide to attack, whereas upon observing a non-actionable context they never attack. % make motion to perform an attack plan. 
In other words, after $\theta$ time steps during which the attacker is waiting (absent from the physical state and action space), the attacker observes a sequence $i \in \pstates^\ast_+$, and if this sequence is actionable, the attacker acts conditionally, according to the chosen attack plan that is most beneficial for them. 

%An example of $I$ might be the set of sequences of states of length equal to 3. Then, after $\theta$ steps the attacker observes a sequence $i$ of 3 states in $L$, and makes an attack with a target plan $j$ dependent on $i$.

\begin{example}
    In the case of the port of Gdynia, an attacker might consider all observations of a given length $k$ as the set of actionable observations. It would correspond to an attacker who records the activities of the USV for a very long time and attempts to predict its route. For $k=1$ the attacker's reasoning would be: \textit{if the USV is at the moment at dock VII and I start the attack on dock IX, what is the probability that I get caught?} Similarly, for $k=2$ the reasoning would be: \textit{if the USV is at the moment at dock VII, it arrived from dock VIII, and I start the attack on dock VII, what is the probability that I get caught?} It is worth noting that as the observation length $k$ increases, the attacker needs to conduct their surveillance for a longer time in order to obtain a reasonable approximation of the defender strategy.
\end{example}

We define the \df{game value against $\actionable$} to be 
\[
\gamevalue_\actionable(\strategy) = \inf_{\atime \in \mathbb{N}} \min_{i \in \actionable}
\min_{j \in \targets} \E\giventhat*{G^{\atime + |i| - 1}_j}{\cone{L^\atime i}}.
\]
The value $\gamevalue_\actionable(\strategy)$ is the defender's expected payoff under the choice of attack time, actionable observation, and attack plan that is most beneficial for the attacker.
Recall that for $\atime \in \mathbb{N}$, we let $\pstates^\atime$ denote the set of paths of length $\atime$ in $\pspace$ and let $\pstates^\atime i$ denote the set of all paths from $\pstates^\atime$ concatenated with $i \in \pstates^\ast$, i.e., $\pstates^\atime i =\{p \cdot i: p \in \pstates^\atime, p \cdot i \in \pstates^\ast \}$. 

%In the patrolling game instance of our model we will focus on situations where the defender performs their surveillance according to a patrolling schedule, and the attacker is at first absent from the physical space  for the first $\atime$ time units of the patrolling. In such a scenario he enters the game space at time $\atime$ and makes an observation $i \in \pstates^\ast$. Thus, the path taken by the defender until the end of the observation of the attacker (and his decision whether to execute an attack plan or abstain from the game) will be $p\cdot i$, where $p \in \pstates^\atime$.

In the following, we let $\actionable = \pstates^k$ for $k \in \mathbb{N}$, i.e., we consider attackers that base their attack decision on an observation of a context of length $k$.
Note that a setting with $k = 1$ was considered in~\cite{john2023rosso} and an arbitrary $k$ was allowed in~\cite{aamas09}.
A distinctive feature of the present paper is to allow the defender to have a hidden state, so $\strategy$ may depend on a context that is longer than $k$.

\begingroup
\makeatletter
\apptocmd{\thelemma}{\unless\ifx\protect\@unexpandable@protect\protect\footnote{Proofs of all theorems are given in the Appendix.}\fi}{}{}
\makeatother

\begin{restatable}{lemma}{robustnesslemma}\label{lem:robustness}
We have
$
\gamevalue(\strategy) = \inf_{I \subset \pstates^\ast_+} \gamevalue_\actionable(\strategy),
$
so in particular 
$
\gamevalue(\strategy) \leq \gamevalue_\actionable(\strategy)
$
for each $I \subset \pstates^\ast_+$.
\end{restatable}

\endgroup


We say that strategy $\strategy$ constructed against $\actionable$ is \df{robust} if $\gamevalue(\strategy) = \gamevalue_I(\strategy)$.
Let $\strategy^\ast$ be an \df{optimal defender strategy}
and let $\strategy^\ast_\actionable$ be an \df{optimal defender strategy against $\actionable$}.
Note that an optimal defender strategy against $\actionable$ may not be robust, i.e., it is possible that $\gamevalue(\strategy^\ast) > \gamevalue(\strategy^\ast_\actionable)$.
% We will analyze this case in Section~\ref{sec:soundness}.

\subsubsection{Bounded attack resolution time}\label{sec:behavioral strategy}\label{sec:bounded attack time}

For $p \in \pstates^\ast$, we let 
$
\behavioral(p) = \strategy(\cone{p})
$.
It is the probability that the defender will follow history $p$.
%If $p, q \in \pstates^\ast$ and $pq \not\in \pstates^\ast$, then we let $\behavioral(pq) = 0$.
A \df{behavioral strategy} of the defender (cf.~\cite[Definition 3.2]{AIJ2023}) is a map $\behavioral\giventhat*{\cdot}{\cdot} \colon \pstates \times \pstates^\ast \to [0, 1]$ defined by the formula 
$
\behavioral\giventhat*{q}{p} = \frac{\behavioral(pq)}{\behavioral(p)}
$
for $p \in \pstates^\ast$ and $q \in \pstates$, undefined if $\behavioral(p) = 0$ or if $pq$ is not a path. 

A behavioral strategy $\behavioral\giventhat*{\cdot}{\cdot}$ uniquely determines measure $\strategy$ and may be used to sample an element  $p \in \psubshift$ according to $\strategy$, by recursively sampling the next state $p_{k+1}$ according to the distribution $\behavioral\giventhat*{p_{k+1}}{p_0 p_1 \cdots p_{k-1}p_k}$.
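
As an illustration, consider the special case of a Markov chain strategy (the symbols $\pi$ and $T$ below are introduced only for this sketch): let $\pi$ be an initial distribution on $\pstates$, let $T$ be a transition matrix supported on $\pactions$, and set $\behavioral(p_0 p_1 \cdots p_k) = \pi(p_0) \prod_{m=0}^{k-1} T(p_m, p_{m+1})$. Then, whenever $\behavioral(p_0 p_1 \cdots p_k) > 0$ and $p_0 p_1 \cdots p_k q$ is a path,
\[
\behavioral\giventhat*{q}{p_0 p_1 \cdots p_k}
= \frac{\pi(p_0) \left( \prod_{m=0}^{k-1} T(p_m, p_{m+1}) \right) T(p_k, q)}{\pi(p_0) \prod_{m=0}^{k-1} T(p_m, p_{m+1})}
= T(p_k, q),
\]
so in this special case the behavioral strategy depends only on the current location $p_k$. A general defender strategy may instead depend on the entire history.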

We assume \df{a bounded attack resolution time}, i.e., that for each $j \in \targets$ there exists $\tau_j \in \mathbb{N}$ such that for each $p, q \in \psubshift$ 
\[
\text{if } (p_0, \ldots, p_{\tau_j - 1}) = (q_0, \ldots, q_{\tau_j - 1}),
\text{ then } G_j(p) = G_j(q).
\]
In other words, the payoff $\payoff_j$ depends only on the first $\tau_j$ states of a patrol schedule.
In such a case we say that an attack plan $j \in \targets$ \df{resolves within $\tau_j$ turns} (i.e., $\tau_j - 1$ \emph{time steps}). Using this assumption we may define $\payoff_j$ on $\pstates^{\tau_j}$.
Let $p \in \pstates^{\tau_j}$. 
We define $\payoff_j(p)$ to be $\payoff_j(pq)$ for any $q \in \psubshift$ such that $pq \in \psubshift$.
We assume that there are no dead ends in $\pspace$, i.e. that every path can be indefinitely extended.
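
A simple payoff satisfying this assumption (an illustrative choice, not necessarily the payoff used elsewhere in the paper) arises when $\targets \subset \pstates$ and an attack on $j$ is caught precisely when the defender visits $j$ within the first $\tau_j$ turns:
\[
G_j(p) =
\begin{cases}
1 & \text{if } j \in \{p_0, p_1, \ldots, p_{\tau_j - 1}\}, \\
0 & \text{otherwise.}
\end{cases}
\]
Since $G_j(p)$ depends only on $(p_0, \ldots, p_{\tau_j - 1})$, the plan $j$ resolves within $\tau_j$ turns, and for $n \in \mathbb{N}$ we have $G^n_j(p) = 1$ if and only if $j \in \{p_n, \ldots, p_{n + \tau_j - 1}\}$.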

\begin{example}
    In the case of the port of Gdynia, the value of $\tau_j$ would correspond to the time necessary to complete the attack on target $j$. For example, the time might be greater for the docks situated deeper into the port if the attacker has to actually traverse the path between the port entrance and their desired target.
\end{example}

\begin{restatable}{lemma}{gamevaluelemma}\label{lem:game value}
Let $\atime \in \mathbb{N}$, $i \in \pstates^\ast_+$ and $j \in \targets$
and let $\strategy$ be a defender strategy that induces strategy $\behavioral(p) = \strategy(\cone{p})$.
Assume that an attack plan $j \in \targets$ resolves within $\tau_j$ turns.
We have
\begin{align*}
& \E\giventhat*{G_j^{\atime + |i| - 1}}{\cone{L^\atime i}} = \\
& = \frac 1{\sum_{p \in L^\atime i} \behavioral(p)} \sum_{p \in L^\atime i L^{\tau_j-1}} \behavioral(p) \payoff^{\atime + |i| - 1}_j(p).
\end{align*}
\end{restatable}



\subsubsection{Time-invariance}

Intuitively, a defender's strategy that depends on time (in other words, a behavioral strategy that depends on the depth in the game tree) may be vulnerable to a properly timed attack.
In the definition of $\gamevalue_\actionable(\strategy)$, the attacker picks a time of attack $\atime \in \mathbb{N}$ that is most favorable to them.
Therefore, in the present paper we restrict our attention to defender strategies $\strategy$ that are \df{$\shift$-invariant} or \df{time-invariant}, i.e. strategies such that
$
\strategy(L\cdot A) = \strategy(A)
$
for every measurable set $A \subset \psubshift$ and for the set of locations $L$.
An important example of a time-invariant measure is the push-forward of a Markov measure, introduced in Section~\ref{sec:hidden markov model}.

\begin{example}
    In the case of the port of Gdynia, a shift-invariant defender strategy would select the next destination of the USV based on a finite number of previous actions, without looking indefinitely far into the past.
\end{example}

We say that a defender strategy $\strategy$ is \df{discrete} if the range of the behavioral strategy, $\left\{ \behavioral\giventhat*{s}{p} \colon p \in \pstates^\ast, s \in \pstates \right\}$, is finite. Otherwise we say that $\strategy$ is \df{continuous}.
Note that for each $\actionable \subset \pstates^\ast_+$ and for each discrete defender strategy $\strategy$ there exists a time-invariant strategy $\nu$ such that
$
\gamevalue_\actionable(\nu) \geq \gamevalue_\actionable(\strategy).
$
However, we do not know whether every continuous defender strategy can be approximated by a discrete strategy with a close game value against $\actionable$.

\begin{comment}
\begin{restatable}{theorem}{discretetohidden}\label{thm:existence of hidden markov model}
Let $I \subset \pstates^\ast_+$ be a finite set of actionable observations.
For each discrete defender strategy $\strategy$ there exists a
hidden Markov model $\nu$ such that
\[
\gamevalue_I(\nu \circ \projection^{-1}) \geq \gamevalue_I(\strategy).
\]
\end{restatable}
\end{comment}

If strategy $\strategy$ is $\shift$-invariant, then $\gamevalue_I(\strategy)$ can be computed by the following formula that involves only a finite set of parameters and a minimization over a finite set.

\begin{restatable}{lemma}{invariantgamevalue}\label{lem:invariant game value}
Let $\atime \in \mathbb{N}$, $i \in \pstates^\ast_+$ and $j \in \targets$
and let $\strategy$ be a defender strategy that induces strategy $\behavioral(p) = \strategy(\cone{p})$. 
Assume that an attack plan $j \in \targets$ resolves within $\tau_j$ turns.
If $\strategy$ is $\shift$-invariant, then
\begin{align*}
\E\giventhat*{G_j^{\atime + |i| - 1}}{\cone{L^\atime i}}
= \frac 1{\behavioral(i)} \sum_{p \in i L^{\tau_j-1}} \behavioral(p) \payoff^{|i| - 1}_j(p).
\end{align*}
\end{restatable}

\begin{definition}\label{def:probability}
For $i \in \pstates^\ast_+$, $p \in \pstates^\ast$ we let
\[
\behavioral(i \sim p) =
\left\{
\begin{array}{ll}
%P(p) & \text{ if } i = \epsilon,\\
% \infty & \text{ if } P(i) = 0,\\
\frac{\behavioral(i\shift(p))}{\behavioral(i)} & \text{ if } p_0 = i_{|i| - 1}, \\
0 & \text{ if } p_0 \neq i_{|i| - 1}.
\end{array}
\right.
\]
\end{definition}
Writing $\atime = |i| - 1$, the quantity $\behavioral(i \sim p)$
is the conditional probability that the patrol schedule from time $\atime$ onward begins with $p$, given that from time $0$ it begins with $i$.
Observe that these intervals overlap: the last state of $i$ has to be equal to the first state of $p$, otherwise the probability is $0$.
Note that $\behavioral(i \sim p)$ is undefined if $\behavioral(i) = 0$.

\begin{restatable}{theorem}{gamevaluetheorem}\label{thm:game value}
Assume that an attack plan $j \in \targets$ resolves within $\tau_j$ turns.
If $\strategy$ is $\shift$-invariant, then the game value
against $I \subset \pstates^\ast_+$ is equal to
\[
\gamevalue_I(\strategy) = \min_{i \in I} 
\min_{j \in \targets} \sum_{p \in L^{\tau_j}} \behavioral(i \sim p) \payoff_j(p).
\]
\end{restatable}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{A patrolling game}\label{sec:patrolling games}

In this section, we describe a specific method of constructing a physical space $\pspace$, attack plans $\targets$ and payoff functions $\payoff_j$.
The result of this construction is a \df{patrolling game}, which is a key instance of our general model.

\subsection{A patrolling setting}\label{sec:setting}

A patrolling setting (environment) is a model of protecting critical infrastructure viewed statically, before any dynamic interplay between the defender and the attacker is taken into account.
The setting consists of the following:
\begin{enumerate}
\item A set $\units$ of patrolling units, a set $\targets$ of protected targets (corresponding to the attack plans), and a function $\values \colon \targets \to \mathbb{R}$ assigning to each target its value for the defender.
\item  A directed graph $(\locations_u, \routes_u)$ defined for each patrolling unit $u \in \units$. The graph describes the topology of the critical infrastructure that we protect, and consists of: the set of locations that are being patrolled $\locations_u$, and the set of connecting routes (edges) $\routes_u \subset \locations_u \times \locations_u$.

We allow self-loops in $\routes_u$. Each route has its length $\length_u \colon \routes_u \to \mathbb{N}_+$. The lengths can vary between the edges, and are specified in time units. We can think of them as depending both on physical distance between locations and on speed of the patrolling unit.
\item A coverage function $\coverage_u \colon \locations_u \times \targets \to [0, 1]$ defined for each patrolling unit $u \in \units$. For a patrolled location $l \in \locations_u$ and a target $t \in \targets$, the value $\coverage_u(l, t)$ is the probability that the patrolling unit $u$ stationed at location $l$ will catch an intruder within a single unit of time while they attack target~$t$.

\end{enumerate}

It is easier to understand the coverage function when it is binary-valued. Then, for each unit $u$ the set $\{t\in \targets \colon \coverage_u(l,t)=1\}$ can be interpreted as the targets protected by the unit $u$ from the vertex $l$. We generalize this notion to take into account the possibility of imperfect target protection.

\begin{example}
    In the case of the port of Gdynia, the set $\units$ might consist of a USV and a UAV (unmanned aerial vehicle). These two patrolling units might then have different patrolling routes, thus different $(\locations_u, \routes_u)$ graphs (e.g., the UAV could fly directly between any two locations, while the USV can only travel on water). The coverage function could express the fact that a patrolling unit situated at location $l$ corresponding to a given dock $t$ fully protects it ($\coverage_u(l,t)=1$), while also partially protecting dock $t'$ whose corresponding location is connected to $l$ with an edge ($\coverage_u(l,t')=\frac{1}{2}$). The value $\values$ might be greater for military docks, and smaller for the civilian ones.
\end{example}



\subsection{A physical state and action space}\label{sec:physical space}

A \df{physical space} $(\pstates, \pactions)$ (mentioned in Section \ref{sec:general formulation}), representing the dynamics of the defensive force,  provides a unified framework where a single state uniquely represents an arrangement of multiple defensive resources and each action takes a single unit of time. 
We construct $(\pstates, \pactions)$ and a coverage function $\coverage \colon \pstates \times \targets \to [0, 1]$, using patrolling environment data specified in Section~\ref{sec:setting}. 

First, for each unit $u \in \units$, we get rid of the length function $\length_u$ by subdividing long edges of the graph $(\locations_u, \routes_u)$ into several edges of unit length. The procedure is detailed in Appendix~\ref{sec:long edge subdivision}.
We extend the coverage function $\coverage_u$ to intermediate vertices by setting coverage to $0$, i.e. no target is protected when the unit is in an intermediate state.
Note that any other extension would work with our method, e.g., a linear interpolation of the coverage between the two ends of the long edge. 

Then, the physical space $(\pstates, \pactions)$ is defined to be the tensor product of the subdivided graphs.
A coverage function $\coverage \colon \pstates \times \targets \to [0, 1]$ is defined by the formula
\[
\coverage(v, t) = 1 - \prod_{u \in \units} (1 - \coverage_u(\pi_u(v), t)),
\]
where $\pi_u \colon \pstates \to \locations_u$ denotes a projection from the product graph onto its $u$-th factor, i.e., selecting the location of the unit $u$ from the vector of all unit locations. 
While in our formula we assume that each patrolling unit has an independent chance to catch the intruder, any other joint distribution of coverage functions would work.

\begin{example}
    In the case of the port of Gdynia, the states $\pstates$ of the physical space could correspond to the pairs of the position of the USV and the position of the UAV, while the actions $\pactions$ to the transitions of both units to new positions. Assume that each unit provides coverage $1$ for the location where it is positioned, and $\frac{1}{2}$ to adjacent locations. The coverage function $\coverage((l_1,l_2),t)$ would then have the value of $1$ for dock $t$ corresponding to either $l_1$ or $l_2$, $\frac{3}{4}$ for docks adjacent to both $l_1$ and $l_2$, $\frac{1}{2}$ for docks adjacent to either $l_1$ or $l_2$ but not both, and $0$ for all other docks.
\end{example}

We can think of the physical space, $(\pstates, \pactions)$, as a board on which a game between the defender and the attacker is played.
We consider a \df{static} attacker who commits to a single decision to perform an attack on a target $j \in \targets$.
Let $\tau \colon \targets \to \mathbb{N}_+$ denote the \df{attack duration} of targets. Once the attacker commits to attack $j \in \targets$, the defender has $\tau_j$ turns to catch them. The defender patrols the locations according to a patrolling schedule $p \in \psubshift$.
The probability that the defender will successfully defend target $j \in \targets$ is:
\[
D_j(p) = 1-\prod_{t=0}^{\tau_j - 1} \left(1-\coverage(p_t, j)\right).
\]
The formula is based on the assumption that at each moment of time the patrolling units have an independent chance to capture the attacker. As before, this assumption is not essential and the method presented in the paper works with any joint distribution. Let $\values \colon \targets \to \mathbb{R}$ denote the \df{value} of targets, equal for both players. We assume that the game is constant-sum; this assumption is essential in our paper.
Defender's payoff depends both on $p$ and $j$ and is equal to 
\[
\payoff_j(p) = \values(j) D_j(p).
\]
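
As a small numerical illustration with hypothetical values (not taken from any experiment), suppose that an attack plan $j$ has $\tau_j = 3$ and $\values(j) = 10$, and that a schedule $p$ satisfies $\coverage(p_0, j) = 0$ and $\coverage(p_1, j) = \coverage(p_2, j) = \frac 12$. Then
\[
D_j(p) = 1 - (1 - 0)\left(1 - \tfrac 12\right)\left(1 - \tfrac 12\right) = \tfrac 34,
\qquad
\payoff_j(p) = 10 \cdot \tfrac 34 = 7.5 .
\]
In particular, a single turn with $\coverage(p_t, j) = 1$ forces $D_j(p) = 1$ regardless of the coverage in the remaining turns.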



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{A strategy space}\label{sec:memory state and action space}

\subsection{A hidden Markov model}\label{sec:hidden markov model}

The key idea of our work is to let the defender's hidden state be more elaborate than the position of their patrolling units, thus allowing a more complex behavior and giving advantage against certain types of attackers.
To this end, we let $\sspace$ be a \df{strategy state and action space}, used by the defender to control their defensive resources.
We equip $\sspace$ with a graph homomorphism $\projection \colon \sstates \to \pstates$, called \df{projection}.
We use $\projection$ to translate defender's actions in the strategy space (internal scheduling) to actions in the physical space $\pspace$ (movement of patrolling units). 

To explain the paradigm of a hidden Markov model let us consider an example shown in Figure~\ref{fig:star graph}.
The physical space has $4$ locations connected into a star-shaped graph with a center node $x$ and three leaves $\targets = \{ a, b, c \}$ that are possible targets of attack.
There is a single patrolling unit.
The number of turns needed to attack each target is equal to $4$, i.e., $\tau_a = \tau_b = \tau_c = 4$. 
Each target has value $1$ and the patrolling unit protects only the node that it is visiting.
Consider an attacker with observation length~$1$, i.e. consider a set $\actionable = \{ a, b, c, x \}$ of actionable observations.
We look for a strategy $\strategy$ that maximizes $\gamevalue_\actionable(\strategy)$.

First, consider a case where the patrolling unit moves from the center node $x$ to each of the leaf nodes $a, b, c$ with uniform probability $\frac 13$.
This is a Markov chain model that was considered previously in the literature \citep{john2023rosso}.
One of the optimal attacker strategies against it is to attack node $c$ when the patrolling unit is observed at node $a$; the probability that the attack will be intercepted is then equal to $\frac 13$.

A better strategy for the defender against this type of attacker is to never visit the same leaf node twice in a row: if the patrolling unit arrived at position $x$ from leaf $a$, then it should move either to $b$ or to $c$ with probability $\frac 12$.
An optimal attacker strategy does not change, but the probability that the attack will be intercepted increases to $\frac 12$.
Note that this strategy is robust: the payoff will remain the same if we add elements to the set of actionable observations.
The reason is that even if we increase the observation length of the attacker, allowing them to distinguish between the strategy states of the defender, they cannot reduce the probability of capture below $\frac 12$. Their options are then: attacking the leaf node that the defender has just left, attacking a leaf node other than the one that the defender has just left, or attacking a leaf node other than the one that the defender currently occupies. All of these options result in capture with probability $\frac 12$.
Therefore, by Lemma~\ref{lem:robustness}, this is an optimal strategy for the defender.
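
\begin{example}
    As a sanity check, the two interception probabilities above can be recovered from Theorem~\ref{thm:game value}, under the assumption that each chain is started from its stationary distribution (so that $\strategy$ is $\shift$-invariant). Take the actionable observation $i = a$ and the attack plan $j = c$. Every $p \in \pstates^{4}$ with $\behavioral(a \sim p) > 0$ has the form $p = a\,x\,\ell\,x$ for a leaf $\ell$, and $\payoff_c(p) = 1$ if $\ell = c$ and $\payoff_c(p) = 0$ otherwise. Hence
    \begin{align*}
    \sum_{p \in \pstates^{4}} \behavioral(a \sim p)\, \payoff_c(p)
    &= \frac{\behavioral(a\,x\,c\,x)}{\behavioral(a)} \\
    &= \begin{cases}
    \frac 13 & \text{(uniform chain)}, \\
    \frac 12 & \text{(no immediate revisits)},
    \end{cases}
    \end{align*}
    because after $a$ the unit moves to $x$ with probability $1$, then to $c$ with probability $\frac 13$ or $\frac 12$ respectively, and then back to $x$ with probability $1$.
\end{example}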

This strategy may be realized by a Markov chain on a strategy space $\sspace$ (see Figure~\ref{fig:star graph}). The projection $\projection$ translates actions in the strategy space into actions in the physical space~$\pspace$, i.e., into the movements of the patrolling unit.

\begin{figure}[H]
    \centering
    \begin{tikzpicture}[
    Circ/.style={circle, draw, inner sep=2pt, minimum size=2pt},
    MidArrow/.style={
        draw, postaction={decorate,decoration={markings,mark=at position 0.9 with {\arrow{>}}}}}
    ]
    \begin{scope}[shift={(0:-2.5)}]
        \node[] at (-1.5,1.5) {$(S,A)$};
        
        \node[Circ](Centre1) at (0:0.5) {};
        \node[Circ](Centre2) at (0:0) {};
        \node[Circ](Centre3) at (180:0.5) {};
        \node[Circ](Arm1) at (320:2) {};
        \node[Circ](Arm2) at (90:1.5) {};
        \node[Circ](Arm3) at (220:2) {};
    
        \draw[MidArrow] (Arm1) to [bend right=40] (Centre1) {};
        \draw[MidArrow] (Centre2) to node[circle, fill=white, inner sep = 0, fill opacity=.9,text opacity=1] {$\frac 12$} (Arm1) {};
        \draw[MidArrow] (Centre3) to [bend right=20] node[circle, fill=white, inner sep = 0, fill opacity=.9,text opacity=1] {$\frac 12$} (Arm1) {};
        \draw[MidArrow] (Centre1) to [bend right=20] node[circle, fill=white, inner sep = 2, fill opacity=.9,text opacity=1] {$\frac 12$} (Arm2) {};
        \draw[MidArrow] (Arm2) to (Centre2) {};
        \draw[MidArrow] (Centre3) to [bend left=20] node[circle, fill=white, inner sep = 2, fill opacity=.9,text opacity=1] {$\frac 12$} (Arm2) {};
        \draw[MidArrow] (Centre1) to [bend left=20] node[circle, fill=white, inner sep = 0, fill opacity=.9,text opacity=1] {$\frac 12$} (Arm3) {};
        \draw[MidArrow] (Centre2) to node[circle, fill=white, inner sep = 0, fill opacity=.9,text opacity=1] {$\frac 12$} (Arm3) {};
        \draw[MidArrow] (Arm3) to [bend left=40] (Centre3) {};
    
        %\draw[MidArrow] (Arm1) to (Centre1) {};
        %\draw[MidArrow] (Arm2) to (Centre2) {};
        %\draw[MidArrow] (Arm3) to (Centre3) {};

        \draw[dashed] circle[x radius=0.75,y radius=0.3];
    \end{scope}

    \draw[->] (-1,0) -- (0,0) node[anchor=south, inner sep=4pt]{$X$} -- (1,0);
    
    \begin{scope}[shift={(0:2)}]
        \node[] at (1, 1.5) {$(L,R)$};
    
        \node[Circ](Centre) at (0:0) {$x$};
        \node[Circ](Arm1) at (300:1.5) {$a$};
        \node[Circ](Arm2) at (90:1.5) {$b$};
        \node[Circ](Arm3) at (240:1.5) {$c$};
    
        \draw[MidArrow] (Centre) to [bend left=20] node[circle, fill=white, inner sep = 0, fill opacity=.9,text opacity=1] {$\frac 13$} (Arm1) {};
        \draw[MidArrow] (Centre) to [bend left=20] node[circle, fill=white, inner sep = 1, fill opacity=.9,text opacity=1] {$\frac 13$} (Arm2) {};
        \draw[MidArrow] (Centre) to [bend left=20] node[circle, fill=white, inner sep = 0, fill opacity=.9,text opacity=1] {$\frac 13$} (Arm3) {};
    
        \draw[MidArrow] (Arm1) to [bend left=20] (Centre) {};
        \draw[MidArrow] (Arm2) to [bend left=20] (Centre) {};
        \draw[MidArrow] (Arm3) to [bend left=20] (Centre) {};


        \draw[dashed] circle[radius=0.3];
    \end{scope}
    \end{tikzpicture}
    \caption{A strategy space (left) over a star graph with three leaves (right). The hidden states over the center allow the construction of a more sophisticated strategy that increases the payoff from $\frac 13$ to $\frac 12$ against an opponent who bases the attack decision on the position of the single patrolling unit.}
    \label{fig:star graph}
\end{figure}

To define a hidden Markov model, let $\mc$ be a Markov chain on $\sspace$ with a \dfi{transition matrix} $\mcmatrix$ and a \dfi{stationary distribution} $\mcstationary$.
Let $\ssubshift$ be the set of infinite paths in $\sspace$, let $\mcmeasure$ be the \dfi{Markov measure} induced by $\mc$ on $\ssubshift$ (cf. \cite[Definition 1.8]{sarig2009lecture}), and let $\strategy = \mcmeasure \circ \projection^{-1}$ be the \dfi{push-forward} of the measure $\mcmeasure$ from $\ssubshift$ to $\psubshift$.
The measure $\strategy$ is a defender strategy and we call $\mcmeasure$ a \df{hidden Markov model} for $\strategy$. 

Note that by~\cite[Proposition 1.8]{sarig2009lecture} the Markov measure $\mcmeasure$ is $\shift_\ssubshift$-invariant, so its push-forward $\strategy = \mcmeasure \circ \projection^{-1}$ is $\shift_\psubshift$-invariant.
Thus, Theorem~\ref{thm:game value} applies to defender strategies with hidden Markov models. 
The following lemma relates measure $\strategy = \mcmeasure \circ \projection^{-1}$ to the transition probabilities~$\mcmatrix$ and the stationary distribution~$\mcstationary$ of the hidden Markov chain.

\begin{restatable}{lemma}{probabilitycomputation}\label{lem:probability computation}
Assume that $\strategy$ has a hidden Markov model with a stationary distribution $\mcstationary$, transition matrix $\mcmatrix$ and projection
$\projection \colon \sstates \to \pstates$.
Let $\projection_\ast \colon \sstates^\ast \to \pstates^\ast$ be a natural map induced by $\projection$, i.e., the element-wise application of $\projection$.
Then for each $p \in \pstates^\ast$ we have
\[
\behavioral(p) = \strategy(\cone{p}) = \sum_{q \in \projection_\ast^{-1}(p)}
\mcstationary_{q_0} \prod_{i=0}^{|p|-2} \mcmatrix_{q_i, q_{i+1}}.
\]
\end{restatable}
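
\begin{example}
    To illustrate the formula, consider the strategy space of Figure~\ref{fig:star graph}. For this illustration only, we label the hidden states $\hat a, \hat b, \hat c$ (projecting to the leaves $a, b, c$) and $x_a, x_b, x_c$ (projecting to $x$), where $x_\ell$ is the center state entered from $\hat\ell$. The transition matrix satisfies $\mcmatrix_{\hat\ell, x_\ell} = 1$ and $\mcmatrix_{x_\ell, \hat m} = \frac 12$ for $m \neq \ell$, and one can check that the uniform distribution, $\mcstationary_s = \frac 16$ for every hidden state $s$, is stationary. The only element of $\projection_\ast^{-1}(a\,x\,b)$ with non-zero weight is $(\hat a, x_a, \hat b)$, so
    \[
    \behavioral(a\,x\,b) = \mcstationary_{\hat a}\, \mcmatrix_{\hat a, x_a}\, \mcmatrix_{x_a, \hat b} = \tfrac 16 \cdot 1 \cdot \tfrac 12 = \tfrac 1{12},
    \]
    whereas $\behavioral(a\,x\,a) = 0$, since every element of $\projection_\ast^{-1}(a\,x\,a)$ contains a transition of probability $0$.
\end{example}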



\subsection{A space with memory}\label{sec:memory}

In this section, we introduce the key concept of the paper -- \df{state and action spaces with memory}.
%For a state $s \in S$ and an action $(s, s') \in A$ we let $s \cdot a = s'$.
%This is a concatenation of state $s$ with action $a$ and by definition it applies $a$ to $s$.
%If the source state of action $a$ is equal to $s$, then we say that $s$ and $a$ are compatible and $s \cdot a$ is well defined.
%Otherwise it is undefined.
%If we write $s \cdot a_1 \cdot a_2 \cdot \ldots \cdot a_t$, then we implicitly assume that $s$ is compatible with $a_1$, $s \cdot a_1$ is compatible with $a_2$ and so on.
Let $\sspace$ be a strategy state and action space.
Let $Z$ be a function on $S$ with arbitrary range $\rg(Z)$.
We say that $(S, A)$ has \df{memory of length $t$ with respect to $Z$} (where $t \geq 1$) if the following condition holds for each pair of states $r, s \in S$: 
\begin{quote}
if $Z(r) \neq Z(s)$, then for each pair of paths $r r_1 r_2 r_3 \cdots r_{t-1}$, $s s_1 s_2 s_3 \cdots s_{t-1}$ of length $t$ we have $r_{t-1} \neq s_{t-1}$. 
\end{quote}
In other words, if $Z$ differentiates states $r$ and $s$, then the internal defender's state after any $t-1$ actions will still be different.
% We say that $(S, A)$ has \df{memory of length $t$ with respect to $Z$} if the following condition holds.
% \begin{quote}
%   For each pair of states $r, s \in S$ if $Z(r) \neq Z(s)$, then for each pair of paths
%   $r r_1 r_2 r_3 \cdots r_t$, $s s_1 s_2 s_3 \cdots s_t$ of length $t$ that start at $r$ and $s$ respectively we have $r_t \neq s_t$,
%   i.e. if $Z$ differentiates states $r, s$, then the internal defender's state after any $t$ actions will still differentiate states $r, s$.
% \end{quote}
%Recall that a path in $\sstates$ is a sequence such that every pair of consecutive states in the path is an action, i.e. is an element of $\sactions$.
The principle might be easier to understand in its contrapositive form: for all $r, s \in S$, if for some paths $r r_1 r_2 r_3 \cdots r_{t-1}$ and $s s_1 s_2 s_3 \cdots s_{t-1}$ we have $r_{t-1} = s_{t-1}$, then $Z(r) = Z(s)$. Thus, the current internal state $s \in S$ of the defender uniquely determines what the attacker's observation was $t-1$ steps ago. In other words, the states in $S$ contain the information about the last $t$ observations of the attacker. Note that this does not force the state space to be large, as shown in Appendix~\ref{sec:construction}, which also presents several other methods and examples of constructing spaces with memory.

% To see better the meaning of this principle, consider the condition in its contrapositive form: for all $r, s \in S$, if for some paths $r r_1 r_2 r_3 \cdots r_t$ and $s s_1 s_2 s_3 \cdots s_t$ we have $r_t = s_t$, then $Z(r) = Z(s)$.
% Thus, the reading of the condition above is that the current (internal) state $s \in S$ of the defender determines uniquely what attacker's observation was $t$ steps ago. In other words, the states in $S$ are constructed in such a way that 
% the information about the last $t$ observations of the attacker is actually contained
% in every state $s \in S$, being its integral part.  
% Note that it does not force the state space to be large as we will show in Section~\ref{sec:construction}.
Let $M_Z \colon S \times \{0, \ldots, t-1\} \to \rg(Z)$ be a \df{memory function} that maps a pair $(s, i)$ to the past value of $Z$, i.e., its value $i$ steps before the defender reached the state $s$. Lemma~\ref{lem:his} shows that $M_Z$ is well-defined.

\begin{restatable}{lemma}{memoryfunction}\label{lem:his}
Assume that $(\sstates, \sactions)$ has memory of length $t$ with respect to $Z$ for $t \geq 1$ and that each $s \in S$ has at least one incoming edge. Then the memory function $M_Z \colon \sstates \times \{ 0, \ldots, t-1 \} \to \rg(Z)$ that satisfies
\begin{align*}
\text{for each } s \in \sstates \text{ and each } p \in \sstates^t \text{ such that } p_{t-1} = s \\
\text{ we have } M_Z(s, i) = Z(p_{t-1-i})
\end{align*}
is well-defined. 
\end{restatable}
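
\begin{example}
    As an illustration, consider the strategy space of Figure~\ref{fig:star graph} with $Z = \projection$. For this illustration only, denote by $\hat a, \hat b, \hat c$ the hidden states over the leaves and by $x_a, x_b, x_c$ the hidden states over the center, where $x_\ell$ is entered from $\hat\ell$. This space has memory of length $2$ with respect to $\projection$: two states with different projections are either two distinct leaf states, whose successors $x_\ell$ are distinct, or a leaf state and a center state, whose successors lie over $x$ and over a leaf, respectively. The memory function is given by
    \begin{align*}
    M_\projection(x_\ell, 0) &= x, & M_\projection(x_\ell, 1) &= \ell, \\
    M_\projection(\hat\ell, 0) &= \ell, & M_\projection(\hat\ell, 1) &= x .
    \end{align*}
    The space does not have memory of length $3$: the paths $\hat a\, x_a\, \hat c$ and $\hat b\, x_b\, \hat c$ end in the same state although $\projection(\hat a) \neq \projection(\hat b)$.
\end{example}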



\subsection{Constructing spaces with memory}

In our approach, a strategy state and action space is a space with memory of length $t$ with respect to projection $\projection$.
Such spaces may be constructed in several ways.
See Appendix~\ref{sec:construction} for a more detailed discussion.

The most straightforward approach is to construct a \df{space of paths}, where each state is a path of length $t$ in the original physical graph. 
Such a space may be endowed with additional states: a tensor product of a space with memory $t$ with an arbitrary graph is again a space with memory $t$. 
A space of paths may be filtered by heuristics, e.g., we may consider only simple paths as elements of the strategy space.

Interestingly, the number of states in a space with memory $t$ does not have to be large, as seen in the construction of a \df{space of disjoint cycles}. Such a space replicates the usual approach to patrolling with Stackelberg games in matrix form, cf.~\citep{shieh2012protect}.

\subsection{A lift of attacker's observation}\label{sec:lift of observation}

We assumed that attacks are triggered by histories sampled from $\pstates^\ast i$, where $i \in \actionable$ is an actionable observation.
We now fix an \df{observation length} $\olength \in \mathbb{N}$ and assume that the actionable observations are exactly the sequences of length $\olength$ from $\pstates$, i.e., $\actionable=\pstates^\olength$.

Let $\projection$ be a projection from $\sspace$ to $\pspace$ defined in Section~\ref{sec:hidden markov model}.
Let $\observation_\olength \colon \pstates^\ast \cdot \pstates^\olength \to \pstates^\olength$ defined by 
\[
\observation_\olength(p) = (p_{|p| - \olength}, \ldots, p_{|p|-2}, p_{|p|-1})
\]
be the \df{context of length $\olength$} of a history $p \in \pstates^\ast$ such that $|p| \geq \olength$. Intuitively, $\observation_\olength(p)$ selects the last $\olength$ elements of $p$.

If the strategy space $\sspace$ has memory of length at least $\olength$ with respect to $\projection$, then we may \df{lift} the observation $\observation_\olength$ to a function of $s \in \sstates$: 
\[
\lift_\olength(s) = (M_\projection(s, \olength-1), M_\projection(s, \olength-2), \ldots, M_\projection(s, 0)),
\]
where $M_\projection$ is the memory function defined in Section~\ref{sec:memory}. Intuitively, $\lift_\olength(s)$ produces the last $\olength$ states of the physical space $\pspace$ based on the current state $s$ of the strategy space $\sspace$.

Consider $s \in \sstates^\ast$, a path of internal strategy states of length at least $\olength$. Let $p \in \pstates^\ast$ be the result of applying the projection $\projection$ to each element of $s$, i.e., $p = \projection_\ast(s)$ using the notation introduced in Lemma~\ref{lem:probability computation}. 
Then $\observation_\olength(p) = \lift_\olength(s_{|s|-1})$, i.e., the history of length $\olength$ in the physical space is encoded by the \emph{last} state $s_{|s|-1}$ in the strategy space.

Motivated by the above property, we define an \df{attacker's observation function} to be a function $\observation \colon \sstates \to \actionable$, where $\actionable = \pstates^\olength$ is the attacker's set of actionable observations. 
Note that this definition relies on the assumption that the strategy space $\sspace$ has memory of length at least $\olength$, 
i.e., the internal space of the defender is rich enough to encompass the attacker's actionable observations.

Note that the dependence of $\observation$ on $\sstates$ does not mean that the attacker observes the internal state of the defender. 
It means that the strategy state space $\sstates$ is complex enough to reconstruct the attacker's observation.
This assumption is very useful from the technical point of view, as it simplifies the description of the model and its solution. 

\subsection{A switch of perspective}\label{sec:switch of perspective}

We now introduce a crucial formula that switches perspective from the future into the past, allowing us to (almost) linearize a highly non-linear formula given in Theorem~\ref{thm:game value} and Lemma~\ref{lem:probability computation}.

\begin{restatable}{lemma}{switchlemma}\label{lem:switch}
Let $i \in \pstates^\ast$ and $t \in \mathbb{N}$.
Let 
\[
\hat H_{i, t} = \left\{ s \in \sstates \colon \left(\projection^{-1}_\ast(i) \cdot \sstates^t \right) \cap \left( \sstates^\ast \cdot s \right) \neq \emptyset \right\}
\]
be a set of all states in strategy space that are reachable after following path $i$ in the physical space and continuing for $t$ time steps.
If $\sspace$ has a memory of length $|i| + t$ with respect to $\projection$ and $\behavioral$ has a hidden Markov model with stationary distribution $\sigma$, then
$
\behavioral(i) = \sum_{s \in \hat H_{i, t}} \sigma_s.
$
\end{restatable}
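
\begin{example}
    For a concrete check, take the strategy space of Figure~\ref{fig:star graph}; one can verify that it has memory of length $2$ with respect to $\projection$ and that its stationary distribution is uniform, $\sigma_s = \frac 16$. For this illustration only, write $\hat a, \hat b, \hat c$ for the hidden states over the leaves and $x_a, x_b, x_c$ for the hidden states over the center, where $x_\ell$ is entered from $\hat\ell$. With $i = a$ and $t = 1$ we get $\hat H_{a, 1} = \{ x_a \}$ and
    \[
    \behavioral(a) = \sigma_{x_a} = \tfrac 16 ,
    \]
    while with $i = x$ and $t = 1$ we get $\hat H_{x, 1} = \{ \hat a, \hat b, \hat c \}$ and $\behavioral(x) = 3 \cdot \frac 16 = \frac 12$. In both cases $|i| + t = 2$, so the memory assumption of the lemma holds.
\end{example}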

Assume that the strategy space $\sspace$ has memory of length at least $\tau_j$ with respect to $\projection$ and $\observation$, where $\tau_j$ is the duration of the attack plan $j$, while $\projection$ and $\observation$ are defined in Section~\ref{sec:lift of observation}. Notice that the payoff function $G_j$ of the attack plan $j$ depends only on the last $\tau_j$ states of the physical space. Hence, we can lift the payoff function $G_j$: 
\[
\widetilde\payoff_j(s) = G_j(M_\projection(s, \tau_j - 1), M_\projection(s, \tau_j-2), \ldots, M_\projection(s, 0)),
\]
where $\widetilde\payoff_j \colon \sstates \to \mathbb{R}$, and $M_\projection$ is the memory function. Moreover, for $i \in \actionable$ let 
\[
H_{i, t} = \{ s \in \sstates \colon M_\observation(s, t) = i \},
\]
i.e., $H_{i, t}$ is the set of states where $t$ time units ago the attacker's observation returned an actionable observation $i$. 

\begin{restatable}{theorem}{perspective}\label{thm:perspective}
Assume that $\sspace$ has memory of length $\max_{j \in \targets} \tau_j$ with respect to $\projection$ and $\observation$, and that the defender strategy $\strategy$ has a hidden Markov model with a stationary distribution $\sigma$.
Then
\[
\gamevalue_\actionable(\strategy) = \min_{i \in \actionable} \min_{j\in \targets}
\frac {\sum_{s \in H_{i, \tau_j - 1}} \sigma_s \widetilde \payoff_j(s)}
{\sum_{s \in H_{i, \tau_j - 1}} \sigma_s}.
\]
\end{restatable}



\section{An upper bound theorem}\label{sec:upper bounds}

Now we prove an upper bound theorem for patrolling games introduced in Section~\ref{sec:patrolling games}.
As a corollary, we obtain a method of computing upper bounds via the linear program~\eqref{lp:upper bound}, which vastly generalizes methods that exist in the literature. 

\begin{restatable}{theorem}{upperbound}\label{thm:upper bound}
  If $\strategy$ is $\shift$-invariant, then
  \[\gamevalue(\strategy) \leq \min_{j \in \targets} \values(j) \tau_j \sum_{s \in \pstates} \behavioral(s) \coverage(s, j).\]
\end{restatable}

Note that if $\strategy$ has a hidden Markov model, then probabilities $\behavioral(s)$ satisfy network-flow conditions on graph $\pspace$.
Therefore the following linear program computes an upper bound on $\gamevalue(\strategy)$ for any strategy $\strategy$ that admits a hidden Markov model.
\begin{maxi}{\xi, \stationary, \transition}{\xi}{\label{lp:upper bound}}{}
\addConstraint{\sum_{w \in \pstates} \stationary_w = 1}{\ }
\addConstraint{\sum_{v \in \pstates \colon (w, v) \in \pactions} \transition_{w, v} = \stationary_w}{\text{ for } w \in \pstates}
\addConstraint{\sum_{v \in \pstates \colon (v, w) \in \pactions} \transition_{v, w} = \stationary_w}{\text{ for } w \in \pstates}
\addConstraint{\xi \leq \values(j) \tau_j \sum_{s \in \pstates} \stationary_s }{ \coverage(s, j) \text{ for } j \in \targets}
\addConstraint{\xi \in \mathbb{R}}{\ }
\addConstraint{\stationary_w \in [0, 1]}{\text{ for } w \in \pstates}
\addConstraint{\transition_{w, v} \in [0, 1]}{\text{ for } (w, v) \in \pactions}
\end{maxi}
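
For instance, for the star graph of Figure~\ref{fig:star graph}, assuming unit values, $\tau_j = 4$, no self-loops, and coverage $\coverage(s, j) = 1$ if $s = j$ and $0$ otherwise, the program~\eqref{lp:upper bound} can be solved by hand. The flow constraints give $\stationary_x = \stationary_a + \stationary_b + \stationary_c$, hence $\stationary_x = \frac 12$ by normalization, and the remaining constraints read $\xi \leq 4 \stationary_j$ for $j \in \{a, b, c\}$. Therefore
\[
\xi \leq 4 \min_{j \in \{a, b, c\}} \stationary_j \leq 4 \cdot \tfrac 16 = \tfrac 23 ,
\]
with equality for $\stationary_a = \stationary_b = \stationary_c = \frac 16$. The value $\frac 12$ attained by the strategy of Section~\ref{sec:hidden markov model} is consistent with this bound.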

%For the San Francisco police department example introduced in~\cite{john2023rosso}, the linear program~\eqref{lp:upper bound} computes an upper bound $28\%$, which improves a bound $92\%$ obtained via \cite[Theorem 5]{duan2021stochastic}. Note that the bound is computed under an assumption that the attacker may attack at any point in time, even when the patrolling unit is travelling along a long edge of the physical graph.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% SHIELD
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{SHIELD} \label{sec:lp algorithm}

In this section we introduce \textbf{Security Heuristic for Intrusion Exposure and Location Defense (SHIELD)} -- an algorithm that constructs a nearly optimal hidden Markov model strategy for the defender for a patrolling game with a fixed set of actionable observations of the attacker. 
The setup used for the algorithm is identical to the one used in Section~\ref{sec:switch of perspective}.

First, we construct a space $\sspace$ with memory of length~$\max_{i \in I} |i| + \max_{j \in \targets} \tau_j - 1$. 
Although any such space works, the choice restricts the set of available strategies $\strategy$, so the game value $\gamevalue_I(\strategy)$ depends on this choice.

By Theorem~\ref{thm:perspective}, the goal of the defender is to find the stationary distribution $\sigma$ of a Markov chain on $\sspace$ with the maximal value
$\gamevalue^\ast = \max_{\sigma} \gamevalue_\actionable(\strategy)$.
Thus, for a fixed $\xi \in \mathbb{R}$, the following linear program is feasible if and only if $\xi \leq \gamevalue^\ast$.
\begin{maxi}{\stationary, \transition}{0}{\label{lp:game value}}{}
\addConstraint{\sum_{w \in \sstates} \stationary_w = 1}{}
\addConstraint{\sum_{v \in \sstates \colon (w, v) \in \sactions} \transition_{w, v} = \stationary_w}{\text{ for } w \in \sstates}
\addConstraint{\sum_{v \in \sstates \colon (v, w) \in \sactions} \transition_{v, w} = \stationary_w}{\text{ for } w \in \sstates}
\addConstraint{\sum_{w \in H_{i, \tau_j - 1}} \stationary_w \left( \tilde \payoff_j(w) - \xi\right) \geq 0,}{\ i \in \actionable, j \in \targets}
\addConstraint{\stationary_w \in [0, 1]}{\text{ for } w \in \sstates}
\addConstraint{\transition_{w, v} \in [0, 1]}{\text{ for } (w, v) \in \sactions}
\end{maxi}
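
To see why $\xi$ has to be fixed, recall that the proof of Theorem~\ref{thm:perspective} expresses the game value as a minimum of ratios. As a sketch in the notation of~\eqref{lp:game value}, and assuming that every denominator below is positive, the requirement that the game value is at least $\xi$ amounts to
\[
\frac{\sum_{w \in H_{i, \tau_j - 1}} \stationary_w \tilde \payoff_j(w)}{\sum_{w \in H_{i, \tau_j - 1}} \stationary_w} \geq \xi
\quad \Longleftrightarrow \quad
\sum_{w \in H_{i, \tau_j - 1}} \stationary_w \left( \tilde \payoff_j(w) - \xi \right) \geq 0
\]
for all $i \in \actionable$ and $j \in \targets$. The right-hand condition contains the products $\xi \stationary_w$, so it is bilinear when both $\xi$ and $\stationary$ are variables, and it becomes linear in $\stationary$ once $\xi$ is treated as a constant.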

Having the above linear formulation, we can approximate the value of $\gamevalue^\ast$ arbitrarily well by using the bisection method (see pseudocode in Algorithm~\ref{alg:bisect solver}).

\begin{algorithm}[t!]
\caption{Approximating $\gamevalue^\ast$ using the bisection method.}
\label{alg:bisect solver}
\begin{algorithmic}[1]
\small
\Input{Strategy space $\sspace$, set of actionable observations $\actionable$, set of attack plans $\targets$, defender payoff function $\payoff$, $\epsilon > 0$.}
\Output{Lower and upper bounds on $\gamevalue^\ast$ that differ by at most $\epsilon$.}
\State $\gamevalue^\ast_L \gets 0$
\State $\gamevalue^\ast_U \gets \max_{j \in \targets, v \in \sstates}\payoff_j(v)$
\While {$\gamevalue^\ast_U - \gamevalue^\ast_L > \epsilon$}
    \State $\gamevalue^\ast_M \gets (\gamevalue^\ast_U + \gamevalue^\ast_L) / 2$
    \If {linear program~\eqref{lp:game value} is feasible with $\xi=\gamevalue^\ast_M$}
        \State $\gamevalue^\ast_L \gets \gamevalue^\ast_M$
    \Else
        \State $\gamevalue^\ast_U \gets \gamevalue^\ast_M$
    \EndIf
\EndWhile
\State \Return $\gamevalue^\ast_L, \gamevalue^\ast_U$

\end{algorithmic}
\end{algorithm}
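
As a rough guide to the cost of Algorithm~\ref{alg:bisect solver}, note that each pass of the loop halves the gap between the two bounds. Writing $D$ for the initial gap (a symbol we introduce only for this remark), the gap after $k$ passes is $2^{-k} D$, so for $D > \epsilon$ the loop stops after
\[
k = \left\lceil \log_2 \frac{D}{\epsilon} \right\rceil
\]
feasibility checks of~\eqref{lp:game value}. For instance, with the hypothetical values $D = 1$ and $\epsilon = 10^{-3}$, ten linear programs are solved, since $2^{9} = 512 < 10^{3} \leq 1024 = 2^{10}$.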

\begin{example}
    To finish our example of the port of Gdynia, consider the network presented in Figure~\ref{fig:gdynia} with all docks being equally valuable to the defender, and all non-docks being worthless. Assume that each USV provides coverage $1$ for the dock corresponding to the node where it is positioned, and coverage $\frac{1}{2}$ for adjacent docks. Moreover, assume that it takes $3$ time units to attack each of the docks, and that the actionable observations of the attacker are all sequences of length $1$, i.e., the attacker makes their decision based on the current positions of the USVs. In such a setting, the optimal probability of capturing the attacker calculated by our algorithm is $0.25$ if the defender has one USV at their disposal, and it grows to $0.95$ if we add another USV.
\end{example}

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{figures/sf_mc}    
    \caption{The defender utility in subgraphs of the San Francisco network. Each line corresponds to a value computed either via our linear program (LP) or via Monte Carlo simulations (MC) with different values of the attacker's observation length. Each Monte Carlo data point is an average over $10^3$ roll-outs with $10^3$ actions each. The colored areas (extremely narrow) represent $95\%$ confidence intervals.}
    \label{fig:sf-monte-carlo}
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Experimental evaluation}\label{sec:experimental evaluation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We evaluate our solution on real-life and random networks.
%We first evaluate our solution by comparing it to a police department case study presented in the work by \cite{john2023rosso}. Then, we study the performance of our algorithm against an attacker with shorter/longer observation length. Finally, we evaluate our algorithm given various underlying network topologies.
Our implementation is published at
{\tt\footnotesize
github.com/anagorko/stackelberg-games-core}.

\subsection{San Francisco police district}

In the model by \cite{john2023rosso}, the defender controls a single patrol unit, and the set of targets consists of twelve intersections in downtown San Francisco. The targets are connected into a weighted clique with integer weights representing the minutes of travel time between intersections, ranging from $1$ to $9$, with attack times of the targets ranging from $6$ to $11$. The attacker observes the current position of the security unit, i.e., the observation length is $1$. The authors of \cite{john2023rosso} use a JAX-based gradient optimizer to find a patrolling solution with utility equal to $0.102$. Our algorithm is able to identify a patrolling solution with the lower bound of the defender utility equal to $0.193$. \textbf{In other words, we are able to find a defense strategy where the probability of apprehending the attacker is almost two times greater than the state of the art}. 
%\todo[inline]{Czy jestesmy w stanie opisac intuicje dlaczego nasz model daje lepszy wynik niz Rosso?}


\subsection{Sensitivity to observation length}

Our algorithm computes an optimal strategy against an attacker with a given observation length $h$. However, it remains unclear how such a strategy would fare against an attacker with a different observation length. To investigate this, we run Monte Carlo simulations on increasingly large induced subgraphs of the San Francisco network. For each subgraph, we generate $10^3$ roll-outs of the defender strategy consisting of $10^3$ actions each. We then assume that the attacker has observation length $h$, $h+1$, or $h-1$, and that they select the target and observation from the roll-out that yield the smallest average risk of getting caught, i.e., the greatest utility of the attacker.
%Theorem~\ref{thr:monte_carlo} guarantees that with growing roll-out length the average defender utility converges to the expected value. However, notice that the procedure always underestimates the utility of defender. The strategy computed by our algorithm is optimal in the stationary distribution, but the distribution of observations in the roll-out is always different, which can be exploited by the attacker. As an analogy, imagine a repeated game of coin toss, where we perform $n$ tosses first, and then we allow our opponent to select heads or tails. The utility of the opponent will never be smaller than $\frac{n}{2}$.
Figure~\ref{fig:sf-monte-carlo} presents the results. %As can be seen from the figure, 
The theoretical bound computed by the linear program is confirmed by our simulations. Moreover, the strategy is even more successful against an opponent with a shorter observation length. Unfortunately, an attacker with a longer observation length is able to capitalize on the strategy optimized against a weaker opponent and inevitably avoids detection.


% \subsection{Sound projection evaluation}

% \todo[inline]{Sound projection evaluation section}


\subsection{Evaluation on random networks}

To evaluate the effects of the network size and structure on the outcomes of our experiments, we also perform simulations with randomly generated networks. To this end, we use \BAn \citeyear{barabasi1999emergence}, \ERn \citeyear{erdds1959random}, and \WSn \citeyear{watts1998collective} models. We generate networks of varying size, while setting the average degree of a node to $2$. In the case of the \WSn model we set the rewiring probability to $\frac{1}{4}$. For each such network we calculate the utility of the defender using a single security resource. All simulations in this section are run on a computer with an Intel Core i7-11700K CPU and 16 GB of RAM.
\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{figures/ba2_1_util_time}    
    \caption{The left plot presents the mean utility of the defender, while the right plot presents the mean runtime. Each data point is an average over $100$ \BAn networks. The colored areas (very narrow) represent $95\%$ confidence intervals.}
    \label{fig:ba-simulations}
\end{figure}
Figure~\ref{fig:ba-simulations} presents the results of our simulations for the \BAn model; the results for the other two models can be found in the supplementary materials and exhibit similar trends. As can be seen,
both increasing the observation length of the attacker and decreasing the attack time of the nodes can significantly lower the utility of the defender. In particular, an attacker with observation length zero becomes more dangerous if we give them the ability to observe the defender's activities than if we decrease their attack time. Unfortunately, increasing the observation length results in a sharp growth of the runtime required to compute the optimal strategy, exacerbating the danger posed by a well-informed attacker.

\section{Conclusions}

In this work, we proposed a model at the interface of stochastic patrolling and game-theoretic models. We constructed an effective algorithm and showed that it improves upon the state of the art for some settings in the literature.


%\begin{contributions} % will be removed in pdf for initial submission 
%					  % (without ‘accepted’ option in \documentclass)
%                      % so you can already fill it to test with the
%                      % ‘accepted’ class option
%    Briefly list author contributions. 
%    This is a nice way of making clear who did what and to give proper credit.
%    This section is optional.
%
%    H.~Q.~Bovik conceived the idea and wrote the paper.
%    Coauthor One created the code.
%    Coauthor Two created the figures.
%\end{contributions}
%
%\begin{acknowledgements} % will be removed in pdf for initial submission,
%						 % (without ‘accepted’ option in \documentclass)
%                         % so you can already fill it to test with the
%                         % ‘accepted’ class option
%    Briefly acknowledge people and organizations here.
%
%    \emph{All} acknowledgements go in this section.
%\end{acknowledgements}


% References
\bibliography{references}

\newpage

\onecolumn

\title{General Markov Model for Solving Patrolling Games\\(Supplementary Material)}
\maketitle


\appendix


\section{Auxiliary definitions}

\subsection{Long edge subdivision}\label{sec:long edge subdivision}

Long edge subdivision used in Section~\ref{sec:physical space} is described as follows.

Given a route $r \in R_u$ connecting two vertices $l_i$ and $l_j$ such that $|r|>1$, we add $|r|-1$ intermediate vertices  $l^r_1, \ldots, l^r_{|r|-1}$ between the nodes $l_i$ and $l_j$ and connect them by new edges, each of length 1, i.e., instead of having a long edge $r = (l_i, l_j) \in R_u$ we now have the following $|r|$ directed short edges (i.e., each of length 1):
$r_0=(l_i,l^r_1)$, $r_1=(l^r_1,l^r_2)$, $\ldots$, $r_k = (l^r_k,l^r_{k+1})$, $\ldots$, $r_{|r|-1} = (l^r_{|r|-1}, l_j)$.

This way, we obtain a graph $\topology'_u = (L'_u, R'_u)$ with broken down edges, where 
$$L'_u = L_u \cup \bigcup_{r\in R_u: |r| >1} \{l^r_k: k=1, \ldots, |r|-1\},$$
and
\begin{align*} & R'_u = \left(R_u \setminus \{r \in R_u: |r|>1\}\right) \cup \\ &\cup \bigcup_{r=(l_i,l_j)\in R_u: |r|>1} \{(l_i,l^r_1), (l^r_1,l^r_2), \ldots, (l^r_{|r|-1}, l_j)\}.
\end{align*}
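
As a small worked instance of this construction, take a route $r = (l_i, l_j)$ with $|r| = 3$. It is replaced by the two intermediate vertices $l^r_1$ and $l^r_2$ and the three unit-length edges
\[
(l_i, l^r_1), \quad (l^r_1, l^r_2), \quad (l^r_2, l_j),
\]
so that travelling from $l_i$ to $l_j$ still takes $3$ time steps, but the position of the unit is now defined at every step.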



\subsection{Tensor product of graphs}

Given two graphs $G=(V_G, E_G)$ and $H = (V_H, E_H)$, the tensor product $G \times H$ of these graphs is defined as follows: the set of vertices $V_{G \times H}$ is equal to the Cartesian product $V_G \times V_H$ of the sets of vertices of $G$ and $H$, and a pair $((v_0, u_0), (v_1,u_1))$ is an edge of $G \times H$ iff $(v_0, v_1) \in E_G$ and $(u_0, u_1) \in E_H$.
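
For illustration, consider the following hypothetical pair of small directed graphs: $V_G = \{a, b\}$ with $E_G = \{(a, b)\}$, and $V_H = \{0, 1\}$ with $E_H = \{(0, 1), (1, 0)\}$. Then $G \times H$ has the four vertices $(a,0), (a,1), (b,0), (b,1)$ and exactly two edges,
\[
((a, 0), (b, 1)) \quad \text{and} \quad ((a, 1), (b, 0)),
\]
because an edge of the product requires a move along an edge in both coordinates at once.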

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Proofs
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage
\section{Proofs}

\subsection{The proof of Lemma~\ref{lem:robustness}}

\robustnesslemma*
\begin{proof}
From the definition of $\gamevalue_\actionable(\strategy)$,
\[
\gamevalue_\actionable(\strategy) = \inf_{\atime \in \mathbb{N}} \min_{i \in \actionable}
\min_{j \in \targets} \E\giventhat*{G^{\atime + |i| - 1}_j}{\cone{L^\atime i}}.
\]
By setting $I = \{ i \}$ for $i \in \pstates^\ast_+$ and $\atime = 0$, we obtain
\[
\inf_{I \subset \pstates^\ast_+} \gamevalue_\actionable(\strategy) \leq \inf_{i \in \pstates^\ast_+} \gamevalue_{\{ i \}}(\mu)
\leq \inf_{i \in \pstates^\ast_+} \min_{j \in \targets} \E\giventhat*{G^{0 + |i| - 1}_j}{\cone{L^0 i}}
= \gamevalue(\strategy),
\]
with the last equality coming from the definition of $\gamevalue(\strategy)$. Hence 
\[
\inf_{I \subset \pstates^\ast_+} \gamevalue_\actionable(\strategy) \leq \gamevalue(\strategy).
\]

To prove the reverse inequality, first observe that $\cone{L^\atime i} = \bigsqcup_{p \in L^\atime} \cone{pi}$ for each $i \in \actionable$ and $\atime \in \mathbb{N}$, where $\bigsqcup$ denotes disjoint union. Hence, by the law of total expectation,
\[
\E\giventhat*{G^{\atime + |i| - 1}_j}{\cone{L^\atime i}} =
\sum_{p \in L^\atime} \frac{\strategy(\cone{pi})}{\strategy(\cone{L^\atime i})} \E\giventhat{G^{\atime + |i| - 1}_j}{\cone{pi}}.
\]
Since $\sum_{p \in L^\atime} \frac{\strategy(\cone{pi})}{\strategy(\cone{L^\atime i})} = 1$, we have
\[
\sum_{p \in L^\atime} \frac{\strategy(\cone{pi})}{\strategy(\cone{L^\atime i})} \E\giventhat*{G^{\atime + |i| - 1}_j}{\cone{pi}} \geq \min_{p \in L^\atime} \E\giventhat*{\payoff^{\atime + |i| - 1}_j}{\cone{pi}}.
\]
Finally,
\begin{align*}
\gamevalue(\strategy) & =
\inf_{i \in\pstates^\ast_+} \min_{j \in \targets} \E\giventhat*{\payoff^{|i|-1}_j}{\cone{i}} = \\
& = \inf_{\atime \in \mathbb{N}} \min_{p \in \pstates^\atime} \inf_{i \in \pstates^\ast_+} \min_{j \in \targets} \E\giventhat{\payoff^{|pi| - 1}_j}{\cone{pi}} = \\
& = \inf_{\atime \in \mathbb{N}} \inf_{i \in \pstates^\ast_+} \min_{j \in \targets} \left( \min_{p \in \pstates^\atime} \E\giventhat*{\payoff^{\atime + |i| - 1}_j}{\cone{pi}} \right)
\leq \\
& \leq \inf_{\atime \in \mathbb{N}} \inf_{i \in \actionable} \min_{j \in \targets} 
\E\giventhat*{G^{\atime + |i| - 1}_j}{\cone{\pstates^\atime i}} = \gamevalue_\actionable(\strategy).
\end{align*}

\end{proof}



\subsection{A value of the game -- general formulation}\label{proof:game value}

\gamevaluelemma*

\begin{proof}
From the definition of conditional expected value,
\begin{align*}
\E\giventhat*{G_j^{\atime + |i|-1}}{\cone{L^\atime i}}
= \frac 1{\strategy(\cone{L^\atime i})} \int_{\cone{L^\atime i}} G_j^{\atime + |i|-1} \dif{\strategy}.
\end{align*}
Hence, we have to prove that:
\begin{enumerate}
    \item $\displaystyle \strategy(\cone{L^\atime i}) = \sum_{p \in L^\atime i} \behavioral(p)$,
    \item $\displaystyle \int_{\cone{L^\atime i}} G_j^{\atime + |i|-1} \dif{\strategy} = \sum_{p \in L^\atime i L^{\tau_j-1}} \behavioral(p) \payoff^{\atime + |i| - 1}_j(p)$.
\end{enumerate}

From the definition of a cone, we have
\[
  \cone{L^\atime i} = 
  \bigsqcup_{q \in \pstates^\atime} \cone{qi} =
  \bigsqcup_{q \in \pstates^\atime}
  \bigsqcup_{r \in \pstates^{\tau_j-1}} \cone{qir},
\]
where $\bigsqcup$ denotes a disjoint union (notice that $r$ iterates over all possible sequences of length $\tau_j-1$ that can extend $qi$).

Therefore, from the definition of $\behavioral$ we have
\[
\strategy(\cone{L^\atime i}) =
\strategy\left(\bigsqcup_{q \in \pstates^\atime} \cone{qi}\right) =
\sum_{q \in \pstates^\atime} \strategy(\cone{qi}) =
\sum_{q \in \pstates^\atime} \behavioral(qi) =
\sum_{p \in L^\atime i} \behavioral(p)
\]
which completes the proof of the first point.

Moreover, we have
\begin{align*}
\int_{\cone{L^\atime i}} G_j^{\atime + |i|-1} \dif{\strategy} &= \sum_{q \in L^\atime} \sum_{r \in \pstates^{\tau_j-1}} \int_{\cone{qir}} \payoff^{\atime + |i|-1}_j \dif{\strategy} =\\
&= \sum_{q \in L^\atime} \sum_{r \in \pstates^{\tau_j-1}} \strategy(\cone{qir}) \payoff^{\atime + |i|-1}_j(qir) =\\
&= \sum_{q \in L^\atime} \sum_{r \in \pstates^{\tau_j-1}} \behavioral(qir) \payoff^{\atime + |i|-1}_j(qir) =\\
&= \sum_{p \in L^\atime i L^{\tau_j-1}} \behavioral(p) \payoff^{\atime + |i| - 1}_j(p),
\end{align*}
since $\payoff^{\atime + |i|-1}_j$ is constant on $\cone{qir}$ and equal to $\payoff^{\atime + |i| - 1}_j(qir)$ by the assumption that attack plan $j$ resolves within $\tau_j$ turns. This completes the proof of the second point and of the lemma.
\end{proof}



\subsection{A value of the game -- shift-invariant strategy}\label{proof:invariant game value}

\invariantgamevalue*

\begin{proof}
From the definition of conditional expected value,
\[
\E\giventhat*{G_j^{\atime + |i|-1}}{\cone{L^\atime i}}
= \frac 1{\strategy(\cone{L^\atime i})} \int_{\cone{L^\atime i}} G_j^{\atime + |i|-1} \dif{\strategy}.
\]
From $\shift$-invariance of $\strategy$, i.e., $\strategy = \strategy \circ \shift^{-1}$, we have
\[
\strategy(\cone{L^\atime i}) = \strategy(L^\atime \cone{i}) = \strategy(\cone{i}) = 
\behavioral(i).
\]
Using integration by substitution,
\[
\int_{\cone{L^\atime i}} G_j^{\atime + |i|-1} \dif{\strategy} = 
\int_{\shift^{-\atime}(\cone{i})} G_j^{|i|-1} \circ \shift^{\atime} \dif{\strategy} =
\int_{\cone{i}} G_j^{|i|-1} \dif{\strategy}.
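\]
The claim now follows by evaluating the last integral as in the second point of the proof in Section~\ref{proof:game value} with $\atime = 0$.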
\end{proof}

\begin{comment}
%%% Probability computation in hidden Markov models
\subsection{Proof of Lemma~\ref{lem:probability computation}}\label{proof:probability computation}
\probabilitycomputation*
\begin{proof}
\todo[inline]{Write a proof.}    
\end{proof}

%%% Upgrade discrete strategy to a hidden Markov model
\subsection{Proof of Theorem~\ref{thm:existence of hidden markov model}}\label{proof:existence of hidden markov model}
\discretetohidden*
\begin{proof}
\todo[inline]{Write a proof.}       
\end{proof}
\end{comment}


\subsection{The proof of Theorem~\ref{thm:game value}}

\gamevaluetheorem*

\begin{proof}
From the definition of $\gamevalue_\actionable(\strategy)$,
\[
\gamevalue_\actionable(\strategy) = \inf_{\atime \in \mathbb{N}} \min_{i \in \actionable}
\min_{j \in \targets} \E\giventhat*{\payoff^{\atime + |i| - 1}_j}{\cone{L^\atime i}}.
\]
By Lemma~\ref{lem:invariant game value},
\[
\E\giventhat*{\payoff_j^{\atime + |i| - 1}}{\cone{L^\atime i}}
= \frac 1{\behavioral(i)} \sum_{p \in i L^{\tau_j-1}} \behavioral(p) \payoff^{|i| - 1}_j(p).
\]
Hence
\begin{align*}
\gamevalue_\actionable(\strategy) = 
\inf_{\atime \in \mathbb{N}} \min_{i \in \actionable} \min_{j \in \targets}
\frac 1{\behavioral(i)} \sum_{p \in i L^{\tau_j-1}} \behavioral(p) \payoff^{|i| - 1}_j(p) = \\
= \min_{i \in \actionable} \min_{j \in \targets}
\sum_{p \in i L^{\tau_j-1}} \frac{\behavioral(p)}{\behavioral(i)} \payoff^{|i| - 1}_j(p) = \\
= \min_{i \in \actionable} \min_{j \in \targets}
\sum_{p \in L^{\tau_j-1}} \frac{\behavioral(ip)}{\behavioral(i)} \payoff^{|i| - 1}_j(ip) = \\
= \min_{i \in \actionable} \min_{j \in \targets}
\sum_{p \in L^{\tau_j}} \behavioral(i \sim p) \payoff_j(p).
\end{align*}
\end{proof}



\subsection{A non-linear formulation for probability of following a path}

\probabilitycomputation*

\begin{proof}
  Let $q \in \sstates^\ast$ be such that $\projection_\ast(q) = p$.
  From the definition of a Markov measure~\cite[Definition 1.8]{sarig2009lecture}, we have
  \[
  \mcmeasure(\cone{q}) = \stationary_{q_0} \prod_{i=0}^{|p|-2} N_{q_i, q_{i+1}}.
  \]
  We have $\projection^{-1}_\ast(\cone{p}) = \bigsqcup_{q \in \projection_\ast^{-1}(p)} \cone{q}$, where $\bigsqcup$ denotes a disjoint union.
  From the definition of a push-forward measure,
  \begin{align*}
  \strategy(\cone{p}) & = 
  (\mcmeasure \circ \projection^{-1})(\cone{p}) =
  \mcmeasure(\projection^{-1}_\ast(\cone{p})) = \\ & =
  \mcmeasure\left(\bigsqcup_{q \in \projection_\ast^{-1}(p)} \cone{q}\right) = 
  \sum_{q \in \projection_\ast^{-1}(p)} \mcstationary_{q_0} \prod_{i=0}^{|p|-2} N_{q_i, q_{i+1}}.  
  \end{align*}
\end{proof}



\subsection{Existence of a memory function}\label{proof:his}

\memoryfunction*
\begin{proof}
Fix the space $(S,A)$ and the observation function $Z$, and assume that the space has memory of length $t$ with respect to $Z$. We need to demonstrate that for every $i \leq t-1$ the value $M_Z(s,i)$ is well-defined. Indeed, fix $i \leq t-1$ and $s \in S$. Since the space has memory of length $t$ with respect to $Z$, the contrapositive of the condition defining the memory length shows that the observation $Z(p_{t-1-i})$ is uniquely determined for any sequence of actions $a_1, \ldots, a_i \in A$ leading from $p_{t-1-i}$ to $s$. Therefore, the definition $M_Z(s,i):= Z(p_{t-1-i})$ is correct.
\end{proof}



\subsection{The proof of Lemma~\ref{lem:switch}}

\switchlemma*

\begin{proof}
From the definition, we have
\[
\behavioral(i) = \mu(\cone{i}) = \nu(\projection^{-1}_\ast(\cone{i})) = \nu(\projection^{-1}_\ast(i) \cdot \sstates^t \cdot \ssubshift),
\]
where $\nu$ is a hidden Markov model for $\strategy$.
Observe that from the assumption that $\sstates$ has a memory of length $|i| + t$ with respect to $\projection$, we have
\[
\hat H_{i, t} \cap \hat H_{j, t} = \emptyset \text{ for all } j \in \pstates^{|i|} \text{ such that } i \neq j.
\]
It follows that
\[
\projection^{-1}_\ast(i) \cdot \sstates^t = \left\{ q \in \sstates^{|i| + t} \colon q_{|i| + t - 1} \in \hat H_{i, t} \right\} = \sstates^{|i| + t - 1} \cdot \hat H_{i, t}.
\]
Therefore,
\[
\nu(\projection^{-1}_\ast(i) \cdot \sstates^t \cdot \ssubshift) = \nu(\sstates^{|i| + t - 1} \cdot \hat H_{i, t} \cdot \ssubshift) = \nu(\hat H_{i, t} \cdot \ssubshift),
\]
where the last equality follows from $\shift$-invariance of $\nu$.
Finally, from additivity of $\nu$ and from Lemma~\ref{lem:probability computation}, we have
\[
\nu(\hat H_{i, t} \cdot \ssubshift)
= \sum_{s \in \hat H_{i,t}} \nu(\cone{s}) = \sum_{s \in H_{i, t}} \sigma_s.
\]
\end{proof}



\subsection{The proof of Theorem~\ref{thm:perspective}}\label{proof:perspective}

\perspective*

\begin{proof}
We have $i \in I \subset \pstates^\ast$ and 
\begin{align*}
H_{i,\tau_j - 1} & = \{ s \in \sstates \colon M_\observation(s, \tau_j - 1) = i \} = \\
& = \left\{s \in \sstates \colon M_X(s, \tau_j - 1 + k)= i_{|i| - 1 - k} \text{ for } k = 0, 1, \ldots, |i| - 1 \right\} =\\
& = \left\{ s \in \sstates \colon \left(\projection^{-1}_\ast(i) \cdot \sstates^{\tau_j-1} \right) \cap \left( \sstates^\ast \cdot s \right) \neq \emptyset \right\}
= \hat H_{i, \tau_j - 1}.
\end{align*}
Hence by Lemma~\ref{lem:switch}, we have $\behavioral(i) = \sum_{s \in H_{i, \tau_j - 1}} \sigma_s$.
Recall that
\[
\widetilde\payoff_j(s) = G_j(M_X(s, \tau_j-1), M_X(s, \tau_j-2), \ldots, M_X(s, 0)).
\]
From Theorem~\ref{thm:game value}, we have
\begin{align*}
\gamevalue_I(\strategy) & = 
\min_{i \in I} \min_{j \in \targets} \sum_{p \in L^{\tau_j}} \behavioral(i \sim p) \payoff_j(p) = \\
& = \min_{i \in I}  \min_{j \in \targets}  \sum_{p \in i_{|i| - 1} L^{\tau_j - 1}} \frac{\behavioral(i\shift(p))}{\behavioral(i)} \payoff_j(p) = \\
& = \min_{i \in I}  \min_{j \in \targets} \frac {\sum_{p \in i_{|i| - 1} L^{\tau_j - 1}} \behavioral(i\shift(p)) \payoff_j(p)}{\sum_{s \in H_{i, \tau_j - 1}} \sigma_s}.
\end{align*}

Note that $\widetilde \payoff_j$ is constant on $\hat H_{i\shift(p), 0}$ since $\sspace$ has a memory of length $\tau_j$ with respect to $X$.
We also have 
\[
H_{i, \tau_j - 1} = \bigsqcup_{p \in i_{|i|-1} \pstates^{\tau_j - 1}} \hat H_{i \shift(p), 0}.
\]
Hence
\begin{align*}
\sum_{p \in i_{|i|-1} L^{\tau_j - 1}} \behavioral(i\shift(p)) \payoff_j(p)
& =  \sum_{p \in i_{|i|-1} L^{\tau_j - 1}} \left(\sum_{s \in \hat H_{i \shift(p), 0}} \sigma_s\right) G_j(p)
= \\
& =
\sum_{p \in i_{|i|-1} L^{\tau_j - 1}} \left(\sum_{s \in \hat H_{i \shift(p), 0}} \sigma_s \tilde G_j(s)\right) 
= \sum_{s \in H_{i, \tau_j - 1}} \sigma_s \widetilde \payoff_j(s).
\end{align*}

\end{proof}



%%% Upper bound theorem
\subsection{Proof of Theorem~\ref{thm:upper bound}}\label{proof:upper bound}

\upperbound*

\begin{proof}
Directly from the definition we have
\begin{align*}
\gamevalue(\strategy) \leq \min_{j \in \targets} \min_{i \in \pstates} \E\giventhat*{\payoff_j}{\cone{i}}.
\end{align*}
Since the attack plan $j$ resolves in $\tau_j$ turns, we have (cf. proof of Lemma~\ref{lem:invariant game value})
\[
\E\giventhat*{\payoff_j}{\cone{i}} = \frac 1{\behavioral(i)} \sum_{p \in iL^{\tau_j - 1}} \behavioral(p) \payoff_j(p).
\]
Since $\sum_{i \in \pstates} \behavioral(i) = 1$, we have
\begin{align*}
\min_{i \in \pstates} \sum_{p \in iL^{\tau_j - 1}} \frac{\behavioral(p)}{\behavioral(i)} \payoff_j(p)
\leq \sum_{i \in \pstates} \behavioral(i) \sum_{p \in i\pstates^{\tau_j - 1}} \frac{\behavioral(p)}{\behavioral(i)} \payoff_j(p)
= \sum_{p \in \pstates^{\tau_j}} \behavioral(p) \payoff_j(p).
\end{align*}
Hence
\[
\gamevalue(\mu) \leq \min_{j \in \targets} \sum_{p \in \pstates^{\tau_j}} \behavioral(p) \payoff_j(p).
\]

Notice that
\[
D_j(p) = 1-\prod_{t=0}^{\tau_j - 1} \left(1-\coverage(p_t, j)\right)
\leq \sum_{t = 0}^{\tau_j - 1} \coverage(p_t, j),
\]
and 
\[
\sum_{p \in \pstates^{\tau_j}} \behavioral(p) \left(\sum_{t = 0}^{\tau_j-1} \coverage(p_t, j) \right) = \sum_{t=0}^{\tau_j-1} \sum_{p \in \pstates^{\tau_j}} \coverage(p_t, j) \behavioral(p) \leq \sum_{s \in L} \coverage(s, j) \left[ \sum_{t=0}^{\tau_j-1} \sum_{p \in \pstates^{\tau_j}} \delta_{p_t = s} \behavioral(p) \right],
\]
where $\delta$ is the Kronecker delta (i.e., $\delta_{p_t = s} = 1$ if $p_t = s$, and $\delta_{p_t = s} = 0$ otherwise).
Since $\strategy$ is $\shift$-invariant, for each $s \in \pstates$ and each $t$ we have
\[
\sum_{p \in \pstates^{\tau_j}} \delta_{p_t = s} \behavioral(p) = \behavioral(s),
\]
hence
\[
\sum_{p \in \pstates^{\tau_j}} \behavioral(p) \left(\sum_{t = 0}^{\tau_j-1} \coverage(p_t, j) \right) \leq \tau_j \sum_{s\in \pstates} \coverage(s, j) \behavioral(s).
\]
Therefore
\begin{align*}
\gamevalue(\mu) &\leq \min_{j \in \targets} \sum_{p \in \pstates^{\tau_j}} \behavioral(p) \payoff_j(p) \\
&=\min_{j \in \targets} \sum_{p \in \pstates^{\tau_j}} \behavioral(p) \left(1 - \prod_{t = 0}^{\tau_j-1} (1 - \coverage(p_t, j))\right) \values(j) \\
&=\min_{j \in \targets} \values(j) \sum_{p \in \pstates^{\tau_j}} \behavioral(p) \left(1 - \prod_{t = 0}^{\tau_j-1} (1 - \coverage(p_t, j))\right) \\
&\leq \min_{j \in \targets} \values(j) \sum_{p \in \pstates^{\tau_j}} \behavioral(p) \left(\sum_{t = 0}^{\tau_j-1} \coverage(p_t, j) \right) \\
&\leq \min_{j \in \targets} \values(j)\tau_j\sum_{s \in \pstates}\behavioral(s)\coverage(s, j).
\end{align*}

\end{proof}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Examples
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage
\section{Examples}\label{sec:example strategies}

\subsection{The model from \texorpdfstring{\cite{john2023rosso}}{} as an instance of our general model}

%, and $W$ is is a weight matrix: its $(i,j)$-th entry is the integer travel time between waypoint $i$ and waypoint $j$. 
%For $|L|=n$ we have an $n$-state Markov chain with a stochastic transition matrix $P \in [0,1]^{n \times n}$, where each entry $p_{ij}$ represents the probability of the surveillance agent moving along the edge from $l_i$ to $l_j$.  If the chain $P$ is irreducible, it has a unique distribution $\pi = (\pi_1, \ldots, \pi_n) \in [0,1]^n$ s.t. $\pi  P = \pi$, representing the agent's long-term visit frequency to each node in the graph. If we let Let $X_k \in \{1, \ldots, n\}=[n]$ to be the value of  $P$ at time $k$, i.e., the node visited at time $k$, then the first hitting time $t_{ij} = \min\{k: X_0 = l_i, X_k=l_j, k \geq 1 \}$ is a random variable representing the number of time periods between the agent leaving the node $l_i \in L$ and their arrival to node $l_j \in L$. In the model described, a surveillance agent patrols a graph according to $P$, and the attacker chooses a single node $i$ and remains stationary there. The surveillance agent captures the attacker if she visits the attacker's node within $\tau_i \in \mathbb{N}$ time periods. Otherwise, the attacker succeeds. According to the Stackelberg Game formulation, the payoff of the defender is:
%$J_{SG}(P) = \min_{i,j} p\left(t_{ij}(P) \leq \tau_j\right)$. An optimal defender strategy is: $P^\ast = \arg\max_P J_{SG}(P)$. The minimization reflects the attacker's choice of a
%node $j$ to attack while the patrolling unit remains at  the node $i$. The arg-maximization reflects the patroller's desire to maximize the probability of visiting the attacker's node via the choice of $P$. 
Let us describe the framework by \cite{john2023rosso} that analyzes stochastic surveillance strategies of randomized patrolling robots, using the formalism of our model.
 Let $H=(L,R)$ be a physical state and action space. Assume that a Markov chain $P$ is given and that the defender patrols $H$ according to $P$, and let $\mu_P$ be the defender strategy derived from $P$. Consider the set of attack plans $T$ to be a subset of $L$ (where each target $l_j \in T$ implicitly contains the information on the attack time $\tau_j$ of the vertex $l_j$). The set of strategies of the attacker is $$\{(\lambda, l_j): \lambda \in L^\ast, l_j \in L\},$$  where $\lambda$ is a (finite) sequence of nodes of $L$ visited by the patrolling unit until the moment of the attack. If we let $X_k$ be the value of $P$ at time $k$, i.e., the node visited at time $k$, then $$t_{ij} = \min\{k: X_0 = l_i, X_k=l_j\}$$ is a random variable representing the number of time periods between the agent leaving the node $l_i \in L$ and their arrival at the node $l_j \in L$. The payoff of the defender $G_j(p)$ is equal to $1$ if $t_{ij} \leq \tau_j$, and $-1$ otherwise, where $p_0=l_i$, i.e., at time $0$ of the schedule $p$ the defender is located at the node $l_i$. The game value of the strategy $\mu_P$ of the defender is then simply $$V(\mu_P) = \min_{l_i \in L} \min_{j \in T} \mu_P\left(t_{ij} \leq \tau_j\right).$$


\subsection{A star graph}
Consider the physical and strategy spaces shown in Figure~\ref{fig:y-space}.
There is a single patrolling unit. The set of targets is the set of vertices of the physical space and the attack time for each vertex is equal to $3$.
We assume that we are playing against an opponent with observation length $1$, i.e. 
an opponent that observes the current position of the patrolling unit.

\begin{figure}[H]
    \centering
    \begin{tikzpicture}[
    Circ/.style={circle, draw, inner sep=2pt, minimum size=2pt},
    MidArrow/.style={
        draw, postaction={decorate,decoration={markings,mark=at position 0.5 with {\arrow{>}}}}}
    ]
    \begin{scope}[shift={(0:-3)}]
        \node[] at (-1.5,1.5) {$(S,A)$};
        
        \node[Circ](Centre1) at (0:0.5) {};
        \node[Circ](Centre2) at (0:0) {};
        \node[Circ](Centre3) at (180:0.5) {};
        \node[Circ](Arm1) at (320:2) {};
        \node[Circ](Arm2) at (90:1.5) {};
        \node[Circ](Arm3) at (220:2) {};
    
        \draw[MidArrow] (Arm1) to [bend right=100] (Centre1) {};
        \draw[MidArrow] (Centre2) to node[above right] {$\frac 12$} (Arm1) {};
        \draw[MidArrow] (Centre3) to [bend right=20] node[below left] {$\frac 12$} (Arm1) {};
        \draw[MidArrow] (Centre1) to [bend right=20] node[right] {$\frac 12$} (Arm2) {};
        \draw[MidArrow] (Arm2) to (Centre2) {};
        \draw[MidArrow] (Centre3) to [bend left=20] node[left] {$\frac 12$} (Arm2) {};
        \draw[MidArrow] (Centre1) to [bend left=20] node[below right] {$\frac 12$} (Arm3) {};
        \draw[MidArrow] (Centre2) to node[above left] {$\frac 12$} (Arm3) {};
        \draw[MidArrow] (Arm3) to [bend left=100] (Centre3) {};
    
        %\draw[MidArrow] (Arm1) to (Centre1) {};
        %\draw[MidArrow] (Arm2) to (Centre2) {};
        %\draw[MidArrow] (Arm3) to (Centre3) {};

        \draw[dashed] circle[x radius=0.75,y radius=0.3];
    \end{scope}

    \draw[->] (-1,0) -- (0,0) node[anchor=south, inner sep=4pt]{$X$} -- (1,0);
    
    \begin{scope}[shift={(0:2)}]
        \node[] at (1, 1.5) {$(L,R)$};
    
        \node[Circ](Centre) at (0:0) {};
        \node[Circ](Arm1) at (300:1.5) {};
        \node[Circ](Arm2) at (90:1.5) {};
        \node[Circ](Arm3) at (240:1.5) {};
    
        \draw[MidArrow] (Centre) to [bend left=20] node[right] {$\frac 13$} (Arm1) {};
        \draw[MidArrow] (Centre) to [bend left=20] node[left] {$\frac 13$} (Arm2) {};
        \draw[MidArrow] (Centre) to [bend left=20] node[below] {$\frac 13$} (Arm3) {};
    
        \draw[MidArrow] (Arm1) to [bend left=20] (Centre) {};
        \draw[MidArrow] (Arm2) to [bend left=20] (Centre) {};
        \draw[MidArrow] (Arm3) to [bend left=20] (Centre) {};


        \draw[dashed] circle[radius=0.3];
    \end{scope}
    \end{tikzpicture}
    \caption{A strategy space (left) over a star graph with three leaves (right). The hidden states over the center allow for the construction of more sophisticated strategies, as described in Section~\ref{sec:example strategies}.}
    \label{fig:y-space}
\end{figure}

We consider two defense strategies: (1) the patrolling unit is governed by a Markov chain defined on the physical space; (2) the patrolling unit is governed by a Markov chain defined on the strategy space.
In both cases the optimal Markov chain selects its actions with uniform probability.
However, the expected payoff for the defender playing strategy (1) is $\frac 13$ and it increases to $\frac 12$ when he switches to strategy (2).
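
Both values can be checked by hand. The following is a sketch under the assumption that the attack starts when the attacker observes the unit and that the unit intercepts it by reaching the target in one of the next three steps; the labels $c$ (the center) and $a_1, a_2, a_3$ (the leaves) are introduced only for this computation. Under strategy (1), if the unit is observed at a leaf $a_i$ and the attacker targets a leaf $a_j$ with $j \neq i$, the unit returns to $c$ and then has a single chance to move to $a_j$, whereas if the unit is observed at $c$ it has two independent chances:
$$\Pr(\text{capture} \mid a_i) = \frac13, \qquad \Pr(\text{capture} \mid c) = 1 - \left(\frac23\right)^2 = \frac59,$$
so the worst case equals $\frac13$. Under strategy (2), after leaving $a_i$ the unit never returns to it immediately and picks each of the other two leaves with probability $\frac12$, hence $\Pr(\text{capture} \mid a_i) = \frac12$. If the unit is observed at $c$ then, assuming that the three hidden center states are equally likely from the attacker's point of view, with probability $\frac13$ the unit has just left $a_j$ and can reach it only in the third step, and otherwise it can reach it in the first or in the third step:
$$\Pr(\text{capture} \mid c) = \frac13 \cdot \frac12 + \frac23 \left(\frac12 + \frac12 \cdot \frac12\right) = \frac23,$$
so the worst case equals $\frac12$.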

\subsection{A 5-cycle}
Consider a $5$-cycle as presented in Figure~\ref{fig:five-cycle}. Let all nodes be targets with the same value and attack time equal to $2$. 

Let us consider an attacker with memory of length 2. According to the results of the exact solver, a defender strategy with hidden states achieves a capture probability of up to $\frac{1}{3}$, compared to $\frac{1}{4}$ for a strategy without memory. An example of an optimal strategy for the defender is to move in sequences of two edges, either clockwise or counterclockwise. After each sequence, he randomly chooses the direction of the next one, keeping the previous direction with probability $\frac{1}{3}$ and reversing it with probability $\frac{2}{3}$.

\begin{figure}[H]
    \centering
    \begin{tikzpicture}[
    Circ/.style={circle, draw, inner sep=2pt, minimum size=2pt},
    MidArrow/.style={
        draw, postaction={decorate,decoration={markings,mark=at position 0.5 with {\arrow{>}}}}}
    ]
    \node[Circ, label=right:$A$](A) at (0:2) {};
    \node[Circ, label=above:$B$](B) at (72:2) {};
    \node[Circ, label=left:$C$](C) at (144:2) {};
    \node[Circ, label=left:$D$](D) at (216:2) {};
    \node[Circ, label=below:$E$](E) at (288:2) {};

    \draw[MidArrow](A) to [bend left=20] node[auto] {1} (B) {};
    \draw[MidArrow](A) to [bend left=20] node[auto] {1} (E) {};
    \draw[MidArrow](B) to [bend left=20] node[auto] {1} (C) {};
    \draw[MidArrow](B) to [bend left=20] node[auto] {1} (A) {};
    \draw[MidArrow](C) to [bend left=20] node[auto] {1} (D) {};
    \draw[MidArrow](C) to [bend left=20] node[auto] {1} (B) {};
    \draw[MidArrow](D) to [bend left=20] node[auto] {1} (E) {};
    \draw[MidArrow](D) to [bend left=20] node[auto] {1} (C) {};
    \draw[MidArrow](E) to [bend left=20] node[auto] {1} (A) {};
    \draw[MidArrow](E) to [bend left=20] node[auto] {1} (D) {};
    \end{tikzpicture}
    \caption{A physical space based on a 5-cycle.}
    \label{fig:five-cycle}
\end{figure}

Notice that the attacker always benefits from attacking a target two edges away from the defender's current position. Assume that the defender's last move was from $E$ to $A$. Then the probability that the defender moves to $B$ and then to $C$ is the probability that, when he finishes the current sequence, he chooses to continue going counterclockwise, which is $\frac{1}{3}$. The probability that the defender moves to $E$ and then to $D$ is the probability that he has just finished a sequence of two moves and now starts going clockwise, which is $\frac{1}{2}\cdot\frac{2}{3}=\frac{1}{3}$. This shows that the capture probability is $\frac{1}{3}$.
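
The same bookkeeping can be carried out for every target. As a sketch, assume (this is not stated explicitly above) that, from the attacker's point of view, the observed move from $E$ to $A$ is equally likely to be the first or the second move of a sequence, and write $c_X$ for the probability that the defender visits the node $X$ within the next two steps; the symbol $c_X$ is introduced only here. Then
$$c_B = \frac12 \cdot 1 + \frac12 \cdot \frac13 = \frac23, \qquad c_C = \frac12 \cdot \frac13 + \frac12 \cdot \frac13 = \frac13, \qquad c_D = \frac12 \cdot \frac23 = \frac13, \qquad c_E = \frac12 \cdot \frac23 = \frac13,$$
so under this assumption no target other than $A$ is protected with probability lower than $\frac13$.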

\subsection{A \texorpdfstring{$2n+1$}{2n+1}-cycle}
The above strategy can actually be generalized. Let $\pspace$ be a cycle with $2n+1$ vertices. Assume all of them are targets with the same value and the attack time is equal to $n$. 

As above, suppose the attacker has memory of length 2. Consider the following strategy: make $n$ steps in one direction (randomly choosing the clockwise direction or the anti-clockwise one), and then randomly choose the direction of the next sequence of $n$ moves by drawing the same direction as for the previous sequence with probability $\frac{1}{n+1}$ and reverse the direction with probability $1-\frac{1}{n+1}$. 

Again, the attacker benefits most from attacking a target that is $n$ edges away from the defender's current location. Thus, by the same reasoning as before, this strategy is guaranteed to give the defender a capture probability equal to $\frac{1}{n+1}$ against a rational attacker.
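
This claim can be verified directly. As a sketch, assume that the last observed move was counterclockwise and that, from the attacker's point of view, the number $k \in \{1, \ldots, n\}$ of moves already made in the current sequence is uniformly distributed (an assumption made here only for the computation). The target $n$ edges ahead in the counterclockwise direction is reached in time only if the defender makes $n$ further counterclockwise moves, which for every $k$ requires exactly one direction draw to repeat the previous direction. The target $n$ edges away in the clockwise direction is reached in time only if the current sequence has just ended, i.e., $k=n$, and the direction is reversed. Hence
$$\Pr(\text{capture ahead}) = \frac{1}{n+1}, \qquad \Pr(\text{capture behind}) = \frac{1}{n} \cdot \frac{n}{n+1} = \frac{1}{n+1},$$
which for $n=2$ recovers the value $\frac13$ obtained for the 5-cycle.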

\subsection{Comparison with game values computed by \texorpdfstring{\cite{AIJ2023}}{}}

 We can compare the game values of the model from \cite{AIJ2023} to the ones we compute with our model. There, the game value is defined as:
$$V_I(\mu) = \inf_{\sigma} \sum_{i=1}^\infty \gamma^{i-1} P_{b, \mu, \sigma}(s^{(i)})R(s^{(i)}a_1^{(i)}a_2^{(i)}),$$
where $s^{(i)}$ is the $i$-th state of the game (in our terminology: the $i$-th element of the patrol schedule), $\sigma$ is the attacker strategy, $P_{b, \mu, \sigma}$ is the probability of the state in the $i$-th round of the game being $s$, depending on the initial distribution $b$ and the players' strategies, $a^{(i)}_j$ is the action of the $j$-th player in the $i$-th round, and $R$ is the payoff of the defender for a given round. If we apply to the model of \cite{AIJ2023} finiteness and invariance assumptions similar to the ones above, set the discount factor $\gamma = 1$, assume that the actions of the attacker do not affect the defender but that his activity is implicit in the objective function, and replace the sum of the payoffs $R$ with $G_j(p)$, then
the game value from~\cite{AIJ2023} becomes, in the notation of our model,
$$\gamevalue_I(\strategy) = 
\inf_{j \in \targets} \sum_{i \in I} \sum_{p \in L^{\tau_j}} \behavioral(ip) \payoff_j(p).$$ That means that we compute the game value as a worst-case scenario, while \cite{AIJ2023} computes it as an average. We believe that in security scenarios the former is more adequate. 

\newpage
\section{Constructing spaces with memory}\label{sec:construction}

%Recall how our hitherto structure has been built. Given a patrolling setting, we construct a physical space $(L,R)$, and define the strategy space a state and action space $(S,A)$ (allowing to choose more sophisticated strategies then the ones available in the physical space), together with a homomorphism $X: S \rightarrow L$. Given the strategy space, we can construct a space with memory. Having the set $\mathcal{X}$ of bi-infinite sequences over $(L,R)$, we assume that the attacker may observe an arbitrarily long substring of a sequence $x$ of moves played by the defender, and then he chooses and fixes a natural number $h$ (as defined in section \ref{sec:history matching}) that denotes the length of the observed sequence of moves of the defender that the attacker will take into account while computing the frequencies of the defender's visits to the states of the game and trying to infer the strategy of the defender. This results in the following: when the defender proceeds via a sequence $s$ of his internal strategy states, the attacker observes a sequence of physical states $X \circ s$. Then, having the natural number $h$ fixed, we put $Y_t: \mathcal{X} \rightarrow \mathbb{Y}$ at a time $t$ to be the sequence $(x_{t-h+1}, x_{t-h+2}, \ldots, x_t)$. Further, if we assume that the space has a memory of at least $h-1$ with respect to $X$, we may lift the attacker's observation $Y$ to be a function defined on states in $S$. Provided each vertex in $S$ has a positive in-degree, by Lemma \ref{lem:his} we know that the function $M_X$ is well-defined, and we have thus obtained a strategy space with memory. 

There are many ways of constructing strategy spaces with memory. For instance, given a physical space $\pspace$ and a positive integer $k$, we may define the states of $S$ to be $(k+1)$-tuples of nodes from $\pstates$, interpreted as the current position of the patrolling unit together with the $k$ positions visited immediately before. Then, $|S| = |\pstates|^{k+1}$, and $S$ has memory of length $k$ with respect to $X$. For $m \leq k$ the attacker can then observe the current position of the unit and its previous $m$ locations.\footnote{The formal requirement for this to be well defined is for the space to have memory of length $m-1$ with respect to $X$.} This construction is illustrated in Section~\ref{sec:memory_app} of the Appendix. 
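
One possible way to make this construction explicit is sketched below. The edge set $A_k$ and the listing convention (current position first) are assumptions chosen to match the example in Section~\ref{sec:memory_app}, and $R$ denotes the set of edges of the physical space. The unit moves along an edge of the physical space, the new position is prepended to the tuple, and the oldest position is forgotten:
$$A_k = \{((l_0, l_1, \ldots, l_k), (l', l_0, \ldots, l_{k-1})) : (l_0, l') \in R\}, \qquad X(l_0, l_1, \ldots, l_k) = l_0.$$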


%To see examples of specific constructions of spaces with memory (which are instances of general methods of building these spaces), first consider the physical space $(L,R)$ to be the cycle graph $C_5$ with 5 vertices $\{l_0, \ldots, l_4\}$, and edges in both directions, i.e., $R=\{(l_i, l_{i+1}), (l_i, l_{i-1}): i =0, \ldots, 4\}$, where the addition and subtraction are defined modulo 5. Suppose all edges have equal length 1, and that each vertex stores a target of equal positive value. Now assume we have one patrolling unit and define $S:=L^3$, i.e., the states of $S$ are triples of vertices, interpreted as the current position $l$ of the unit, together with the two vertices visited by the unit immediately before $l$, where if e.g.,  $s_i = (l_1,l_3, l_0)$, then it means the current position of the patrol is $l_1$, to which it arrived from $l_3$, and one step earlier it was in $l_0$. In other words, the states of the strategy space can be identified with paths of length 2. Obviously, $|S|=125$. We may define the attacker's observation $Y: S \rightarrow \mathbb{Y}$ in such a way that $\mathbb{Y}=L^2$, for any $s_i = (l_{i_1}, l_{i_2}, l_{i_3}) \in S$ define $Y(s_i)=(l_{i_1}, l_{i_2})$, that is the attacker observes the current position of the patrolling unit and its previous location (the attacker, observing the physical space, remembers the location from which the patrolling unit came to the current position). The formal requirement for this to be well-defined is for the space has memory of length 1 with respect to $X$. The space $S$ has actually memory equal to the length of the paths. Assume now that the sequence of physical states of the patrol is the repeated cycle $l_0, l_1, l_2, l_3, l_4$.  Then, in the strategy space, the sequence of internal strategy space generated by the defender will be $s=((l_0, l_4, l_3), (l_1, l_0, l_4), (l_2, l_1, l_0), \ldots)$.
%Then, the attacker's observations will be $((l_0, l_4), (l_1, l_0), (l_2, l_1), \ldots)$. This is an instance of a strategy space, where each node can be identified with with a path of (given fixed length) nodes from the physical space 

%It is worthwhile to observe in this place that if the strategy space $S$ is a cyclic graph (even when we forget about the direction of the edges), then if for a given natural number $n$, the girth of the graph, i.e., the length of a shortest cycle contained in the graph, is greater or equal than $2n$, then the space has memory at least $n$ (it can be actually larger). 

Consider another example of a strategy space with memory. Given an arbitrary physical space $\pspace$, we can construct the strategy space as a set of mutually disjoint finite cycles $\{C_{k_i}: i = 1, \ldots, |\pstates|\}$, each of length $k_i$. Such a space is equivalent to selecting a mixed strategy in a Stackelberg game where pure strategies are patrols from a predetermined set, similarly to the models built in recent years in many widely applicable works on security games \cite{sinha2018stackelberg}. Since in such a space all moves of the defender except the initial one are deterministic, the memory of the space (with respect to $X$) is actually infinite, despite the fact that the size of the space can be relatively small. 

%Further, consider constructing the strategy space $S$ from the physical space graph $(L,R)$ by adding additional internal states through taking a tensor product of the input graph with an arbitrary graph $G$. In particular, the graph that is multiplied by $L$ might even be the complete graph and the length of memory then depends on  $G$, however as a result of the tensor multiplication the length of the memory of the space cannot decrease. 

Further, consider constructing the strategy space by taking a tensor product of the physical space $\pspace$ and an arbitrary graph $G$. In particular, the graph $G$ might be a clique, in which case it can be seen as an internal memory of the defender. Aside from knowing their location in the physical space, the defender can use $G$ to store an additional piece of information, with the number of distinct states equal to the number of nodes in $G$. The length of memory then depends on $G$; notice, however, that the tensor multiplication cannot decrease the length of the memory.
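
For reference, the standard definition of the tensor product of graphs can be written out as follows. The symbols $V_G$ and $E_G$ for the nodes and edges of $G$ are introduced only here, $R$ denotes the set of edges of the physical space, and taking $X$ to be the projection onto the first coordinate is an assumption of this sketch:
$$S = \pstates \times V_G, \qquad A = \{((l,g),(l',g')) : (l,l') \in R,\ (g,g') \in E_G\}, \qquad X(l,g) = l.$$
In particular, if $G$ has $m$ nodes, then every physical node is split into $m$ copies and $|S| = m\,|\pstates|$.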

%The above illustrates how our model using Markov chains for constructing the strategy of the defender combines with the approach to adversarial patrolling based on modelling them via Stackelberg security games -- we enrich the game-theoretic framework by careful construction of the strategy space with memory.

\subsection{Space with Memory over the 5-cycle}\label{sec:memory_app}

To see a concrete example of a construction of a space with memory, let the physical space $(L,R)$ be the cycle graph $C_5$ with 5 vertices $\{l_0, \ldots, l_4\}$ and edges in both directions, i.e., $$R=\{(l_i, l_{i+1}), (l_i, l_{i-1}): i =0, \ldots, 4\},$$ where the addition and subtraction are defined modulo 5. Suppose all edges have equal length 1, and that each vertex stores a target of equal positive value. Now assume we have one patrolling unit and define $S:=L^3$, i.e., the states of $S$ are triples of vertices, interpreted as the current position $l$ of the unit together with the two vertices visited by the unit immediately before $l$. For example, if $$s_i = (l_1,l_2, l_3),$$ then the current position of the patrol is $l_1$, to which it arrived from $l_2$, and one step earlier it was in $l_3$. In other words, the states of the strategy space can be identified with paths of length 2. Obviously, $|S|=125$. We may define the attacker's observation $Y: S \rightarrow \mathbb{Y}$ with $\mathbb{Y}=L^2$ as follows: for any $$s_i = (l_{i_1}, l_{i_2}, l_{i_3}) \in S$$ define $$Y(s_i)=(l_{i_1}, l_{i_2}),$$ that is, the attacker observes the current position of the patrolling unit and its previous location (the attacker, observing the physical space, remembers the location from which the patrolling unit came to the current position). The formal requirement for this to be well-defined is for the space to have memory of length 1 with respect to $X$. The space $S$ actually has memory equal to the length of the paths. Assume now that the sequence of physical states of the patrol is the repeated cycle $l_0, l_1, l_2, l_3, l_4$.  Then, in the strategy space, the sequence of internal strategy states generated by the defender will be $$s=((l_0, l_4, l_3), (l_1, l_0, l_4), (l_2, l_1, l_0), \ldots).$$
Then, the attacker's observations will be $$((l_0, l_4), (l_1, l_0), (l_2, l_1), \ldots).$$ This is an instance of a strategy space in which each node can be identified with a path of a given fixed length in the physical space. 
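
Not every triple in $L^3$ can occur along a patrol. As a quick count, assuming that the unit moves along an edge of $C_5$ in every step, there are $5$ choices for the current position and $2$ choices for each of the two preceding positions, so that
$$5 \cdot 2 \cdot 2 = 20$$
of the $125$ triples correspond to walks in $C_5$; under this assumption the remaining $105$ states are never visited.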

It is worth observing that if the strategy space $S$ is a cyclic graph (even when we forget about the direction of the edges) and, for a given natural number $n$, the girth of the graph, i.e., the length of a shortest cycle contained in the graph, is greater than or equal to $2n$, then the space has memory of length at least $n$ (it can actually be larger). 



\begin{comment}
Our model extends the approach of the model formulated in the paper \cite{john2023rosso} that analyzes stochastic surveillance strategies of randomized patrolling robots that are represented by a Markov chain subordinate to an environment $H=(L,R)$, where the set of nodes $L$ represents points of interest,and the set of edges $R$ represents routes between these points. A surveillance agent patrols a graph according to a Markov chain $P$, whereas the attacker chooses a single node $l_j$ and remains stationary there. The surveillance agent captures the attacker if she visits the attacker's node within $\tau_j \in \mathbb{N}$ time periods. Otherwise, the attacker succeeds. The (stationary, if $P$ is irreducible) probability distribution $\pi$ of the Markov chain $P$ can be extended to probability measure on patrol schedules, i.e., infinite paths over the graph $H$. Therefore, in  the language of our model, the defender strategy is the probability measure $\mu_P$ over $\mathcal{L}$ derived from the Markov chain $P$. The set of attack plans of the game above can be identified with the set of targets $T \subseteq L$ (each target being one of the vertices in the graph), where each target $l_j$ implicitly contains the information on the attack-time $\tau_j$ of the vertex $l_j$. The set of strategies of the attacker is $\{(\lambda, l_j): \lambda \in L^\ast, l_j \in L\}$,  where $\lambda$ is a (finite) sequence of nodes of $L$ visited by the patrolling unit until the moment of the attack. %To see how the model described is a special case of our framework, 
Observe that given the attacker knows the Markov chain $P$, what matters to them is the last element of the sequence $\lambda$, thus the set of strategies could be actually simplified to the form  $\{(l_i, l_j): l_i, l_j \in L\}$, where $l_i$ is the point of interest where the surveillance unit is located when the attack begins. If we let $X_k \in \{1, \ldots, n\}=[n]$ to be the value of  $P$ at time $k$, i.e., the node visited at time $k$, then $t_{ij}(P) = \min\{k: X_0 = l_i, X_k=l_j, k \geq 1 \}$ is a random variable representing the number of time periods between the agent leaving the node $l_i \in L$ and their arrival to node $l_j \in L$. The payoff of the defender $G_j(p)$ is equal to 1, if $t_{ij} \leq \tau_j$, and 0  otherwise, where $p_0=l_i$, i.e., if at the time 0 of the schedule $p$ the defender is located in the node $l_i$. The game value of the strategy $\mu$ of the defender is then simply $V(\mu) = \min_{l_i \in L} \min_{j \in T} \mu\left(t_{ij} \leq \tau_j\right)$.
\end{comment}



\clearpage
\section{Supplementary Figures}

\begin{figure}[tbh]
    \centering
	\includegraphics[width=.7\linewidth]{figures/er2_1_util_time}    
    \caption{The left plot presents the mean utility of the defender, while the right plot presents the mean runtime. Each data point is an average over $100$ \ERn networks. The colored areas (very narrow in most cases) represent $95\%$ confidence intervals.}
    \label{fig:er-simulations}
\end{figure}

\begin{figure}[tbh]
    \centering
	\includegraphics[width=.7\linewidth]{figures/ws2_1_util_time}    
    \caption{The left plot presents the mean utility of the defender, while the right plot presents the mean runtime. Each data point is an average over $100$ \WSn networks. The colored areas (very narrow) represent $95\%$ confidence intervals.}
    \label{fig:ws-simulations}
\end{figure}



\newpage
\section{Glossary}

\renewcommand{\glossarysection}[2][]{}

\glsaddall
\printglossaries

\end{document}
