\section{From Discounted-Sum to Reachability Probabilities}
\label{sec: HSVI2problems}

Given the success of trial-based tree search algorithms for discounted POMDPs and their application to probabilistic reachability problems \citep{Bouton2020PointBasedModelChecking}, a natural question is whether these algorithms can directly approximate MRPP well. Unfortunately, they lose their desired theoretical properties, and there are problems on which they perform poorly. We discuss the issues that arise when applying these algorithms to MRPP. We focus on HSVI2 \citep{Smith2005HSVI2}, but the arguments also hold for other trial-based discounted-sum POMDP algorithms, e.g., \citep{Kurniawati-RSS08-SARSOP, zhang2015please}.

\paragraph{Incorrect converged solution if $\gamma < 1$ is used.}

Discounting causes trial-based search to converge to an under-approximation of the optimal probability, and this under-approximation is strict for all but trivial problems. More importantly, the upper bound obtained with $\gamma < 1$ is not a valid upper bound on the optimal probability.

\begin{proposition}
    The optimal value $V^{\pi^\gamma}(b_0)$ computed via Eq.~\eqref{eq:probability expected total reward} with discount $\gamma < 1$ strictly under-approximates the optimal probability value $P^{\pi^*}_{\mathcal{M}}(\lozenge \mathrm{T})$ for Problem~\ref{problem} if it takes more than one time step to reach $\mathrm{T}$ in the POMDP.
\end{proposition}
\begin{proof}
    Let $\pi^*$ and $\pi^\gamma$ be the policies that maximize Eq.~\eqref{eq:probability expected total reward} with $\gamma = 1$ and $\gamma < 1$, respectively.  Then, 
    \begin{align*}
            V^{\pi^{\gamma}}&(b_0) = R(b_0,\pi^{\gamma}(b_0)) + \mathbb{E} \Big[ \sum_{t=1}^{\infty} \gamma^{t} R\left(b_{t}, \pi^{\gamma}(b_{t})\right) \Big]\\
            &< R(b_0,\pi^{\gamma}(b_0)) + \mathbb{E} \Big[ \sum_{t=1}^{\infty} R\left(b_{t}, \pi^{\gamma}(b_{t})\right)  \Big]\\
            &\leq R(b_0,\pi^{\gamma}(b_0)) + \mathbb{E} \Big[ \sum_{t=1}^{\infty} R\left(b_{t}, \pi^*(b_{t})\right) \Big] \leq P^{\pi^*}_{\mathcal{M}}(\lozenge \mathrm{T}).
    \end{align*}
\end{proof}

Consider a problem instance in which the optimal policy requires $n$ steps to reach the optimal probability $p$. In the worst case, using $\gamma < 1$ yields an optimal value of $\gamma^n p$. Hence, as $n$ grows, the discounted value can become an arbitrarily small fraction of the true probability.
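
As a rough illustration with hypothetical numbers (chosen for this example only, not taken from any benchmark), suppose $p = 1$ and $n = 100$. Then
\[
    \gamma = 0.99:\ \gamma^{n} p = 0.99^{100} \approx 0.366,
    \qquad
    \gamma = 0.95:\ \gamma^{n} p = 0.95^{100} \approx 0.006,
\]
so even a discount factor close to one can report a value far below the true reachability probability when the target is many steps away.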

\begin{figure}
    \centering
    \scalebox{0.85}{
    \begin{tikzpicture}[->,shorten >=1pt,auto,node distance=1.25cm,semithick]
  \tikzstyle{every state}=[draw,circle,minimum size=1cm]
  \node[state] (1) {$b_1$};
  \node[state, right=of 1] (2) {$b_2$};
  \node[state, right=of 2] (3) {$b_3$};
  \node[state, dotted, right=of 3] (4) {$b_4$};
    \node[draw,dotted,fit=(1),label=above:$o_1$] {};
  \node[draw,dotted,fit=(2),label=above:$o_2$] {};
  \node[draw,dotted,fit=(3),label=above:$o_1$] {};
  \path (1) edge[loop below] node {a: 0.6} (1)
            edge node {a: 0.4} (2)
        (2) edge node {a : 1} (3)
            edge[loop below] node {c : 1} (2)
         (2) edge[bend left] node {b: 1} (1)
          (3) edge[loop below] node {a: 1} (3)
         (3) edge[dotted] node {b: 1} (4);
     \draw[dotted] (4) -- +(1,0);
    \end{tikzpicture}
    }
    \caption{Belief MDP with loops on which HSVI2 is ineffective. The lower and upper bound values are initially $[0, 1]$ for all belief states, and $b_4$ has not yet been explored.}
    \label{fig: counterexample}
\end{figure}

\begin{figure*}[ht!]
    \centering
    \includegraphics{figures/Overview_Final.pdf}
    \caption{Overview of HSVI-RP. The algorithm incrementally constructs a belief \textit{graph} using trial-based search, and maintains upper and lower bound values for each belief node. In each trial, actions and observations are selected using a heuristic based on the upper and lower bound values to visit a sequence of belief nodes (orange). After each trial, value bounds for each visited node are updated using local Bellman backups. Every $n$ iterations, upper bound values for all nodes are reset, and frontier nodes (purple) are re-initialized using the upper bound value set $\Upsilon^U$, and value iteration is performed. This allows better improvement of upper bound values for MRPP.}
    \label{fig:overview}
\end{figure*}

\paragraph{Trials may not terminate for $\gamma = 1$.} When $\gamma < 1$, $\epsilon \gamma^{-t}$ is a strictly increasing and unbounded function of $t$, so HSVI2 is guaranteed to converge. However, when $\gamma = 1$, HSVI2 may not terminate for some POMDPs. Unsurprisingly, a similar phenomenon has been shown for Goal-POMDPs \citep{Horak2018GoalHSVI}, which are also indefinite-horizon problems. Consider the belief MDP in Figure~\ref{fig: counterexample}, with some initialized value function bounds (e.g., with the blind policy for the lower bound \citep{kochenderfer2022algorithms} and the $V_{MDP}$ method for the upper bound \citep{hauskrecht2000value}). Starting from $b_1$, Eq.~\eqref{eq:iemax} chooses action $a$ and selects observation $o_1$, since it has the largest WEU, which returns the trial to $b_1$. The trial is thus stuck at $b_1$ indefinitely, since the termination condition is never met.
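
To make the stalling explicit, the following sketch writes $V^U$ and $V^L$ for the upper and lower bounds (notation assumed here for illustration) and assumes the usual HSVI2 rule that a trial stops at depth $t$ once the bound gap at the current belief $b_t$ satisfies
\[
    V^U(b_t) - V^L(b_t) \leq \epsilon \gamma^{-t}.
\]
For $\gamma = 1$ the right-hand side equals $\epsilon$ at every depth, so the threshold no longer grows with $t$. If the trial revisits $b_1$ at every depth, then as long as the gap at $b_1$ stays above $\epsilon$, the condition fails for all $t$ and the trial does not stop.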

\paragraph{Loops.} In a POMDP, it is possible to choose actions and observations such that the same belief states are reached repeatedly, forming a loop. This is depicted in Figure~\ref{fig: counterexample}, where taking action $a$ at $b_1$ and $b$ at $b_2$ is a policy that keeps the agent within $\{b_1, b_2\}$. Without properly accounting for loops, the algorithm may either get stuck in a loop indefinitely during exploration, or otherwise be ineffective because it repeatedly explores the same sequences of beliefs.

In a discounted POMDP, the IE-MAX heuristic of HSVI2 performs well, since an action with the highest upper bound will be revealed to be suboptimal if its upper bound eventually decreases below the upper bound of another action. However, in MRPP, due to the presence of loops, many actions may have similar or identical upper bound values at a given belief. The IE-MAX heuristic may repeatedly choose the same actions and be stuck in a loop indefinitely, so that new beliefs at the frontier are not expanded to improve the upper bounds. For example, in the belief MDP in Figure~\ref{fig: counterexample}, actions $a$ and $b$ have the same value at $b_3$, so $a$ may be chosen and $b_4$ is never expanded. Further, trial-based local Bellman backups for MRPP may not converge for upper bounds due to the presence of these so-called \emph{end components}.\footnote{An end component is a sub-belief-MDP with a set of belief states $B' \subseteq B$ for which there exists a policy that enforces that, from any state in $B'$, only the states in $B'$ are visited infinitely often.} We discuss this in detail in Section~\ref{sec:exactvalueiteration}.
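
As a small sketch of why the upper bound can stall, consider $b_3$ in Figure~\ref{fig: counterexample}. Assume, for illustration only, that $b_3$ is not a target belief (zero immediate reward), and write $V^U$ for the upper bound (notation assumed here). With $\gamma = 1$, a local backup at $b_3$ reads
\[
    V^U(b_3) \leftarrow \max\big\{ \underbrace{V^U(b_3)}_{\text{action } a}, \; \underbrace{V^U(b_4)}_{\text{action } b} \big\}.
\]
Starting from $V^U(b_3) = 1$, the backup returns $1$ regardless of $V^U(b_4)$. More generally, any value $u$ with $V^U(b_4) \leq u \leq 1$ is a fixed point of this update, so repeated local backups alone need not lower the upper bound at $b_3$ toward its true value.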
