\section{Heuristic Graph Search Value Iteration}

In this section, we present our algorithm, called Heuristic Search Value Iteration for Reachability Probabilities (HSVI-RP). The algorithm explores the search space by incrementally constructing a belief MDP graph $G$. The nodes in $G$ are allowed to have multiple parents to alleviate the aforementioned issue of loops in trial-based search. 
The initial node of $G$ is the initial belief $b_0$. 
From $b_0$, the algorithm incrementally unrolls a finite fragment of the full belief MDP through trial-based search, and maintains sound two-sided bounds on maximal reachability probability. The two-sided bounds are used to inform the direction of search, detect $\epsilon$-optimality, and bound the optimal solution when used in an anytime manner. These bounds also have the benefit that bound improvements in one part of the belief space also improve the bounds in other parts of the belief space.

HSVI-RP is outlined in Algorithm~\ref{alg:HSVI-RP} and depicted in Figure~\ref{fig:overview}. At each iteration, a depth-first trial is conducted. An action is heuristically selected at belief $b$, and all successor beliefs $b'$ and their transitions are added to the graph. To select the next belief in the trial, an observation is heuristically selected. When a trial is terminated, we perform Bellman backups over the selected belief nodes. We also perform exact value iteration on $G$ periodically to improve upper bounds. The key is to search the belief MDP efficiently by expanding belief nodes that may be part of the reachable space of an optimal policy. To do so, we maintain and update a set of upper and lower bounds. Each graph node has an associated upper and lower bound value computed from these sets. Then, we propose new search heuristics that take advantage of these bounds and the structure of Problem~\ref{problem}.

The algorithm can be seen as a modification of discounted POMDP trial-based search that addresses its drawbacks for MRPP. Our key modifications are summarized as follows: 
\begin{itemize}
    \item We do not use discounting. Instead of terminating trials with $\gamma^{-t}$, we use an adaptively increasing search depth to incrementally increase the depth of explored beliefs.
    \item We represent the search space as a graph by merging belief states that already exist in the graph. 
    \item We propose new trial-based expansion heuristics to handle indefinite horizon graph search.
    \item To enable the improvability of two-sided bounds while maintaining tractability, we use a combination of local Bellman backups and exact value iteration on $G$.
\end{itemize}
Below, we provide details on the algorithm.

\begin{algorithm}[t]
    \caption{\texttt{HSVI-RP($\mathcal{M}, \epsilon, \mathrm{T}$)}}
    \label{alg:HSVI-RP}
    \textbf{Global}: $\mathcal{M}, G, V^L, \Upsilon^U$
    \begin{algorithmic}[1]
    \STATE Initialize $G$ with $b_0$
    \STATE Initialize $V^L$ with blind policy
    \STATE Initialize $V^U = \Upsilon^U$ with $V_{MDP}$
    \WHILE{$V^U(b_0) - V^L(b_0) > \epsilon$}
            \FOR{$n$ iterations}
                \STATE \texttt{EXPLORE}($b_0, 0, V^U(b_0) - V^L(b_0), d_{\text{trial}}$)
            \ENDFOR
            \STATE $\forall b \in G, V^U(b) = V^L(b)$
            \STATE $\forall b \in F, V^U(b) = \Upsilon^U(b)$
            \STATE Perform Value Iteration for Upper Bounds on belief MDP $G$ to obtain a new $\Upsilon^U$
            \STATE Increase $d_{\text{trial}}$ if no improvements for $i$ iterations
        \ENDWHILE
        \STATE \textbf{return} $G, V^L, \Upsilon^U$
    \end{algorithmic}
\end{algorithm}

\begin{algorithm}[t]
    \caption{\texttt{EXPLORE($b, t, \epsilon, d_{\text{trial}}$)}}
    \label{alg:Explore}
    \textbf{Global}: $\mathcal{M}, G, \Upsilon^U, V^L, \kappa$
    \begin{algorithmic}[1]
        \IF{$V^U(b) - V^L(b) \leq \kappa\cdot \epsilon \;\; \textbf{ or } \;\; t > d_{\text{trial}}$}
            \STATE \textbf{return}
        \ENDIF
        \STATE $A' \gets \{a \; : \; \underset{a'}{\max} \,Q^U(b,a') - Q^{U}(b,a) < \xi\}$
        \STATE $a^* \gets \underset{a \in A'}{\argmax}\left[Q^U(b,a) + {c_{a}\sqrt{N(b)}}/{\big(1 + N(b,a)\big)} \right]$
        \STATE $o^* \gets \underset{o}{\argmax} \left[\text{WEU}(b, t, \epsilon) + P(o|b, a^*) \frac{c_z \sqrt{N(b,a^*)}}{1 + N(b')} \right]$
        \STATE Add all $b'$ from taking $a^*$ at $b$ to belief MDP graph $G$
        \STATE Update $b_{t+1}$ using $a^*, o^*$ with Eq.~\eqref{eq:beliefupdate}
        \STATE \texttt{EXPLORE}($b_{t+1}, t + 1, \epsilon, d_{\text{trial}}$)
         \STATE Perform local updates on bounds $\Upsilon^U, V^L$ at belief $b$
         \STATE \textbf{return}
    \end{algorithmic}
\end{algorithm}


\paragraph{Lower and Upper Bounds}
An important reason for the effectiveness of trial-based search is the use of bounds that allow improvements in one part of the belief space to improve the bounds in other parts of the belief space. Thus, we utilize bound representations that have this property.

We use a set $\Gamma$ of $\alpha$-vectors to represent the lower bound. To initialize sound lower bounds, we use the blind policy \citep{kochenderfer2022algorithms}, evaluated for some number $i \geq 0$ of steps, which yields a lower bound on reachability probabilities. This set of $\alpha$-vectors also represents the policy for execution. Since $V^*(b) \geq V^{\pi}(b) = \max_{\alpha \in \Gamma}(\alpha^T b)$, the action at belief $b$ is the one associated with the maximizing vector $\arg\max_{\alpha \in \Gamma}(\alpha^T b)$.
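
As a small worked illustration, with hypothetical numbers chosen only for exposition, consider two states and $\Gamma = \{\alpha_1, \alpha_2\}$ with $\alpha_1 = (0.2, 0.8)^T$ and $\alpha_2 = (0.6, 0.3)^T$. At the belief $b = (0.5, 0.5)^T$,
\begin{equation*}
    \alpha_1^T b = 0.5, \qquad \alpha_2^T b = 0.45, \qquad \max_{\alpha \in \Gamma}(\alpha^T b) = 0.5,
\end{equation*}
so the lower bound at $b$ is $0.5$ and the maximizer is $\alpha_1$. At $b' = (0.9, 0.1)^T$, the two values are $0.26$ and $0.57$, so the maximizer switches to $\alpha_2$. The lower bound is thus piecewise linear and convex in the belief, which is why adding a single $\alpha$-vector can raise it over a whole region of the belief space.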

We use a belief point set $\Upsilon^U$ of belief-value points $(b_i, V^U(b_i))$ to represent upper bounds. The upper bound value $V^U(b)$ at any belief $b$ is the projection of $b$ onto the convex hull formed by the points in $\Upsilon^U$. We denote this projection as $\Upsilon^U(b)$, so that $V^U(b) = \Upsilon^U(b)$. To initialize sound upper bounds, we use the $V_{MDP}$ method \citep{hauskrecht2000value}, which uses optimal values obtained on the fully observable underlying MDP. The MDP optimal value function provides values at the corners of the belief simplex, which are the initial points in the upper bound point set. These bounds can be further improved using the $Q_{MDP}$ or Fast Informed Bound methods. An upper bound for a belief is computed using an LP or a sawtooth approximation \citep{hauskrecht2000value}.
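
To make the projection concrete, the following sketch uses one common form of the sawtooth approximation. The symbols $e_s$ (the corner belief for state $s$), $V^U_c$ (the corner interpolation), $v_i$ (the stored value of point $b_i$), and all numbers are introduced here only for illustration. Let $V^U_c(b) = \sum_s b(s)\, V^U(e_s)$. For each non-corner point $(b_i, v_i) \in \Upsilon^U$,
\begin{equation*}
    V^U_i(b) = V^U_c(b) + \big(v_i - V^U_c(b_i)\big) \min_{s \,:\, b_i(s) > 0} \frac{b(s)}{b_i(s)},
\end{equation*}
and the sawtooth value is the smallest of $V^U_c(b)$ and all $V^U_i(b)$. Suppose there are two states with $V^U(e_1) = V^U(e_2) = 1$, and a single interior point $b_1 = (0.5, 0.5)^T$ with $v_1 = 0.6$. For the query belief $b = (0.75, 0.25)^T$, we obtain $V^U_c(b) = 1$, $V^U_c(b_1) = 1$, and $\min\{0.75/0.5,\, 0.25/0.5\} = 0.5$, so
\begin{equation*}
    V^U_1(b) = 1 + (0.6 - 1) \cdot 0.5 = 0.8 .
\end{equation*}
In this two-state case the LP gives the same answer, since $b = 0.5\, e_1 + 0.5\, b_1$ and $0.5 \cdot 1 + 0.5 \cdot 0.6 = 0.8$. In general the sawtooth value is no smaller than the LP value, so it remains a sound but possibly looser upper bound.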


\paragraph{Value Updates}

The Bellman update equation allows us to update and improve the bounds through dynamic programming. The Bellman operator $\mathbb{B}$ is defined as:
\begin{align}
    Q(b,a) &= R(b,a) + \mathbb{E}[V(b_{t+1})],\\
    [\mathbb{B}V](b) &= \max_a Q(b,a) \;\;\;\forall b \in B.
\end{align}
$\mathbb{B}$ is defined over the entire belief space. For discounted POMDPs, it has been shown that performing an asynchronous local Bellman update over the belief states sampled in each trial is more efficient:
\begin{align}
    [\mathbb{B}V](b) &= \max_a Q(b,a) \quad \forall b \in B_{\text{trial}} \subseteq B.
\end{align}
\noindent Our trial backup step performs an asynchronous local Bellman update for both lower and upper bounds. Each application of $\mathbb{B}$ on a belief state adds an $\alpha$-vector to $\Gamma$ \citep{Trey2004HSVI}, and updates the upper bound point set. As more of the belief space is explored during graph search, successive local updates lead to uniform improvement in the lower bounds and propagate improvements in the upper bound. Asynchronous local backups over trial sampled beliefs are effective for improving lower bounds.
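
As a worked instance of a single local backup, write the expectation explicitly as a sum over observations, with $b'_{a,o}$ denoting the successor belief after taking $a$ and observing $o$ (a notation used only in this example):
\begin{equation*}
    Q(b,a) = R(b,a) + \sum_{o} P(o \mid b, a)\, V(b'_{a,o}).
\end{equation*}
Assume, purely for illustration, two actions with $R(b,a_1) = R(b,a_2) = 0$ and two observations. If $P(\cdot \mid b, a_1) = (0.7, 0.3)$ with successor values $(0.9, 0.4)$, and $P(\cdot \mid b, a_2) = (0.5, 0.5)$ with successor values $(0.8, 0.4)$, then
\begin{align*}
    Q(b,a_1) &= 0.7 \cdot 0.9 + 0.3 \cdot 0.4 = 0.75, \\
    Q(b,a_2) &= 0.5 \cdot 0.8 + 0.5 \cdot 0.4 = 0.60,
\end{align*}
and $[\mathbb{B}V](b) = \max\{0.75, 0.60\} = 0.75$. The same computation is carried out with $V^L$ for the lower bound and with $V^U$ for the upper bound.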

\paragraph{Periodic Exact Upper Bound Value Iteration}
\label{sec:exactvalueiteration}

Unlike for discounted POMDPs, local backups over the upper bound point set may not lead to improvements of the upper bounds. Bellman backups over an upper bound may never improve it because the Bellman operator for MRPP (which is reducible to the stochastic shortest path problem) is semicontractive rather than contractive \citep{bertsekas2022abstractdp}. Consequently, there may be many fixed points for the value function upper bound, with the optimal upper bound solution being the \emph{least fixed point}, denoted by lfp$[V^U]$.

Intuitively, this issue arises when there are loops, and thus states in end components may have upper bound values higher than lfp$[V^U]$. 
This may cause a Bellman update to not decrease the upper bound value. Consider the example in Figure~\ref{fig: counterexample} again. If $b_4$ is added to the graph, and the upper bound value of $b_3$ is updated via Bellman backup to a value below $1$, the backup for upper bounds at $b_2$ (and hence $b_1$) will not decrease in value because action $b$ gives the highest upper bound value of $1$. When using local backups over the upper bound set, backups over belief states that are in an end component may not improve their upper bound values.

In finite state MDPs, when initialized with a suitable under-approximate value function, value iteration (VI) converges to lfp$[V]$ \emph{from below} \citep{hartmanns2020optimistic}. By treating the frontier nodes of the partially explored belief MDP as an ``upper bound target set'' and via a suitable initialization, we can achieve a similar result for upper bounds for a given $G$.

Let $F$ be the set of frontier nodes of $G$. An upper bound on the maximal probability of reaching $\mathrm{T}$ by first going to a belief node $b \in F$ is
\begin{equation*}
    P^{\pi^*}_{G}(\lozenge \mathrm{T}) = \max_{\pi} \sum_{b \in F} P^{\pi}_{G}(\lozenge b)\, V^U(b) \geq P^{\pi^*}_{M}(\lozenge \mathrm{T}).
\end{equation*}
Thus, this reduces to computing maximal values, given upper bound values of the frontier nodes. 

What remains is a suitable initialization. 
Intuitively, we want to fix $V^U$ for the frontier nodes, and under-approximate it for all the other nodes.
Hence, for each $b \in F$, we set $V^U(b)$ to $\Upsilon^U(b)$.   
For all other nodes, the upper bound values are set to an under-approximation, such as their lower bounds, i.e., $V^U(b) = V^L(b)$. Then, running VI on the upper bound values over $G$ until convergence yields a new upper bound point set $\Upsilon^{U'}$, which is the least fixed point for $G$ and $\Upsilon^U$.
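
A minimal example, with a hypothetical graph and numbers, shows why the initialization matters. Let $G$ contain one non-frontier node $b \notin \mathrm{T}$ with two actions and zero immediate reward: $a_1$ loops back to $b$ with probability $1$, and $a_2$ moves to a frontier node $f$ with probability $1$, where $V^U(f) = \Upsilon^U(f) = 0.6$. The Bellman equation on $G$ then reads
\begin{equation*}
    V^U(b) = \max\{V^U(b),\; 0.6\},
\end{equation*}
so every value in $[0.6, 1]$ is a fixed point. Starting from the loose bound $V^U(b) = 1$, a backup returns $\max\{1, 0.6\} = 1$ and the bound never improves. Starting instead from an under-approximation such as $V^U(b) = V^L(b) = 0.2$, one sweep gives $\max\{0.2, 0.6\} = 0.6$, which is the least fixed point and a valid upper bound given $V^U(f)$.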

While VI is crucial for improving upper bounds, it has a large computational overhead because it must be conducted over all nodes in the belief graph. Therefore, we only periodically re-initialize the upper bound values and re-compute lfp$[V^U]$ over $G$ as more beliefs are added to $G$. This allows continual improvement of the upper bound over iterations by reusing the least fixed point from previous iterations instead of starting from a loose upper bound. As more of the belief space is expanded, the upper bound values improve towards $P^{\pi^*}_{M}(\lozenge \mathrm{T})$. 

\paragraph{Trial-based Graph Exploration}

Here, we present a trial-based belief exploration technique modified for graphs in MRPP. We also propose a technique to handle loops.

\textit{Action Selection: \quad}
As discussed in Section~\ref{sec: HSVI2problems}, many actions may have the same upper bound value, so HSVI2's action selection method is ineffective. We propose an action selection heuristic based on the Upper Confidence Bound (UCB) \citep{coquelin2007bandit}, considering the upper bound $Q$ values plus a term based on the number of times that action has been selected. We additionally only consider actions that have upper bound values within some user-specified action selection radius $\xi$ from the highest upper bound:
\begin{align}
    \label{eq:actionselection}
    &A' = \{a \; : \; \max_{a'}Q^U(b,a') - Q^{U}(b,a) < \xi\},\\
    &a^* = \arg\max_{a \in A'}\left[Q^U(b,a) + {c_{a}\sqrt{N(b)}}/{(1 + N(b,a))} \right] \nonumber
\end{align}
where $c_a$ is an exploration constant, and $N(b)$ and $N(b,a)$ are the number of times $b$ has been visited and the number of times action $a$ has been chosen at $b$, respectively. This heuristic incentivizes exploration of other actions that have similar upper bounds. A lower $\xi$ favors actions with higher upper bounds, improving upper bounds faster, but may reduce efficiency by limiting exploration.
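
For a concrete instance of the selection rule, take the hypothetical values $Q^U(b,a_1) = 1.0$, $Q^U(b,a_2) = 0.98$, $Q^U(b,a_3) = 0.7$, $\xi = 0.05$, $c_a = 0.1$, and visit counts $N(b) = 9$, $N(b,a_1) = 8$, $N(b,a_2) = 1$. The gaps to the best upper bound are $0$, $0.02$, and $0.3$, so the candidate set contains only $a_1$ and $a_2$. Their scores are
\begin{align*}
    a_1 &: \; 1.0 + \frac{0.1 \sqrt{9}}{1 + 8} \approx 1.033, \\
    a_2 &: \; 0.98 + \frac{0.1 \sqrt{9}}{1 + 1} = 1.13,
\end{align*}
and $a_2$ is selected even though its upper bound is slightly lower, because it has been tried far less often. Action $a_3$ is never considered, regardless of its visit count.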

\textit{Observation Selection: \quad}
Similarly, for observation selection, just using Eq.~\eqref{eq:HSVI2heuristic} is ineffective, as the same sequence of observations may be repeatedly chosen at a belief. Instead, we use a heuristic that is weighted based on the Excess Uncertainty, probability of reaching that observation, and number of times $N(b')$ the resulting belief has been chosen.
\begin{align}
    \label{eq:observationheuristic}
    o^* \gets \arg\max_{o} \left[\text{WEU}(b, t, \epsilon) + P(o|b, a^*) \frac{c_z \sqrt{N(b,a^*)}}{1 + N(b')} \right]
\end{align}
where $c_z$ is an exploration constant, and $N(b,a^*)$ is the number of times action $a^*$ has been chosen at node $b$. This heuristic incentivizes choosing successor belief states that have not been explored often.
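
To illustrate the effect of the exploration term, the following example treats the weighted excess uncertainty term associated with each observation as a given number $w_o$. The symbol $w_o$ and all values are assumptions made only for this illustration. Let $c_z = 0.1$ and $N(b,a^*) = 4$, and consider two observations with $P(o_1 \mid b, a^*) = 0.6$, $P(o_2 \mid b, a^*) = 0.4$, $w_{o_1} = 0.30$, $w_{o_2} = 0.26$, and successor visit counts $N(b'_1) = 4$, $N(b'_2) = 0$. The scores are
\begin{align*}
    o_1 &: \; 0.30 + 0.6 \cdot \frac{0.1 \sqrt{4}}{1 + 4} = 0.324, \\
    o_2 &: \; 0.26 + 0.4 \cdot \frac{0.1 \sqrt{4}}{1 + 0} = 0.34,
\end{align*}
so $o_2$ is selected. Without the exploration term, $o_1$ would be chosen again, which is the repeated-selection behavior that the heuristic is meant to avoid.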

Higher values of $c_a$ and $c_z$ encourage more exploration, but values that are too high can hinder convergence by favoring exploration too heavily over exploitation.

\begin{remark}[Observation Heuristic Randomization]
    \label{remark:observation mixing}
    A heuristic that mixes between Eq.~\eqref{eq:HSVI2heuristic} and Eq.~\eqref{eq:observationheuristic} also works well empirically. When randomization is used, with probability $p$, we randomize between using Eq.~\eqref{eq:HSVI2heuristic} and \eqref{eq:observationheuristic}.
\end{remark}

Additionally, to address the problem of loops in graph search, we also keep track of the beliefs, actions, and observations that have been sampled during a trial, to not repeatedly choose the same sequences of actions and observations. When an action is selected, we only consider observations that do not lead to beliefs that are already sampled during the trial. Such beliefs are part of loops. If no observations are available from that action, we avoid selecting that action, and consider another action instead. If no more actions are available, we skip to the next belief in the sampled sequence that is part of the loop, and continue the trial. The trial ends if all beliefs have no actions available. Alternatively, one can maintain a global history list to not select the same histories more than once, but maintaining this history list may be ineffective if many histories end up in the same beliefs, and can be computationally demanding due to the number and length of histories in an indefinite horizon problem.

\textit{Adaptive Trial Termination: \quad}
We define a maximum depth $d_{\text{trial}}$ for each trial, which is increased adaptively as the number of iterations increases. We use a simple heuristic: $d_{\text{trial}}$ is increased by $d_{\text{inc}}$ when neither bound has changed by at least $0.01$ over $n$ successive trials. A trial terminates when either of two conditions holds: $t > d_{\text{trial}}$ or $V^{U}(b_t) - V^{L}(b_t) \leq \kappa\cdot(V^U(b_0) - V^L(b_0))$ for $0 < \kappa < 1$. Parameters $d_{\text{trial}}$ and $d_{\text{inc}}$ control the rate of increase of search depth. Higher values are beneficial for long horizon problems, but increasing the depth too quickly may reduce search efficiency.
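
As a numerical illustration with hypothetical settings, let $\kappa = 0.5$, $d_{\text{trial}} = 10$, $d_{\text{inc}} = 5$, and suppose the gap at the root is $V^U(b_0) - V^L(b_0) = 0.4$, so the gap threshold is $\kappa \cdot 0.4 = 0.2$. A trial that reaches a belief $b_3$ with $V^U(b_3) - V^L(b_3) = 0.15$ stops at depth $t = 3$, since $0.15 \leq 0.2$. A trial whose gaps stay above $0.2$ continues until $t = 11 > d_{\text{trial}}$ and stops there. If neither bound at the root then changes by at least $0.01$ over $n$ successive trials, the maximum depth becomes $d_{\text{trial}} = 10 + 5 = 15$.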

\paragraph{Pruning}
The size of the $\alpha$-vector set affects backups significantly. To keep the problem tractable as more of the search space is expanded, we prune dominated elements in the lower bound $\alpha$-vector set in a manner similar to HSVI2. $\alpha$-vectors are pruned when they are pointwise dominated by other $\alpha$-vectors. Pruning is conducted when the size of the set has increased by $10\%$ since the last pruning operation.
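
For example, with hypothetical vectors over two states, let $\alpha_1 = (0.5, 0.7)^T$, $\alpha_2 = (0.4, 0.6)^T$, and $\alpha_3 = (0.9, 0.1)^T$. Since $\alpha_2(s) \leq \alpha_1(s)$ for both states, we have $\alpha_2^T b \leq \alpha_1^T b$ for every belief $b$, so $\alpha_2$ is pointwise dominated and can be removed without changing $\max_{\alpha \in \Gamma}(\alpha^T b)$. Neither $\alpha_1$ nor $\alpha_3$ dominates the other, because $\alpha_3$ is larger in the first state and $\alpha_1$ is larger in the second, so both are kept. Under the $10\%$ rule, if $\Gamma$ contained $50$ vectors after the last pruning operation, the next one is triggered once the set reaches $55$ vectors.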