\section{Proof of Lemma 1}

\begin{proof}
    We first show that the initial upper and lower bounds are sound. 
    
    We initialize the lower bound $\alpha$-vectors with a blind policy for $i$ steps, whose value is a lower bound on the maximal reachability probability. Upper bounds are initialized with the $V_{MDP}$ technique, which assumes full observability after the first step. Since the agent can only do better with full observability, the computed value function is an upper bound on the optimal value function \citep{kochenderfer2022algorithms}. Therefore, the initialized bounds are sound.

    Now, we show that iterations of belief exploration and backups preserve soundness of the bounds. 

    An $\alpha$-vector obtained from a Bellman backup of an $\alpha$-vector set is proven to remain a lower bound as long as the $\alpha$-vector set is a lower bound \citep{kochenderfer2022algorithms}.

    A Bellman backup of an upper bound point $V^U$ is:
    \begin{align*}
        V^{U'}(b) &= \max_{a \in A}\big(\mathbb{E}(R(s,a)) + \sum_{o}Pr(o|b,a)V^U(\tau(b,a,o))\big) \\
        &\geq \max_{a \in A}\big(\mathbb{E}(R(s,a)) + \sum_{o}Pr(o|b,a)V^*(\tau(b,a,o))\big) \\
        &= V^*(b),
    \end{align*}
    i.e., the upper bound point remains an upper bound.
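
    The inequality step relies only on monotonicity of the backup. As a sketch, assuming $V^U(b') \geq V^*(b')$ at every successor belief $b' = \tau(b,a,o)$ and using $Pr(o|b,a) \geq 0$, we have for each fixed $a \in A$:
    \begin{align*}
        \sum_{o}Pr(o|b,a)V^U(\tau(b,a,o)) - \sum_{o}Pr(o|b,a)V^*(\tau(b,a,o))
        = \sum_{o}Pr(o|b,a)\big(V^U(\tau(b,a,o)) - V^*(\tau(b,a,o))\big) \geq 0,
    \end{align*}
    and taking the maximum over $a \in A$ on both sides preserves the inequality.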
    
    Finally, we show that Exact Upper Bound Value Iteration preserves the upper bound property of the upper bound point set. Let $V^U(l)$ be an upper bound on the maximal probability of reaching $\mathrm{T}$ from $l \in L$ given $G$. Exact Upper Bound Value Iteration computes an upper bound on the maximal probability of reaching $\mathrm{T}$ by first going to a node $l \in L$:
    \begin{align*}
        P^{\pi^*}_{G}(\lozenge \mathrm{T}) = \max_{\pi}\{\mathbb{E}_{l \in L}[P^{\pi}_{G}(\lozenge l) + V^U(l)]\} \geq P^{\pi^*}_{M}(\lozenge \mathrm{T}).
    \end{align*}
    Since Exact Upper Bound Value Iteration is initialized with a sound upper bound $\Upsilon_i^U$ at frontier nodes, convergence of value iteration implies that the new upper bound point set $\Upsilon_{i+1}^U$ is also an upper bound.
\end{proof}

\section{Proof of Theorem 1}

\begin{proof}

    Let the current trial depth be $d_{\text{trial}} = d > 1$. We show that our algorithm eventually expands all beliefs reachable within $d$ steps. 

    Consider belief $b_0$ at the root of the graph, which is always selected during a trial. By the action selection method of Eq.~\eqref{eq:actionselection}, every action is selected infinitely often as $n \rightarrow \infty$, since the second term $c_a \cdot \frac{\sqrt{N(b)}}{1 + N(b,a)}$ strictly increases with $N(b)$ as long as $a$ is not selected. The observation selection method of Eq.~\eqref{eq:observationheuristic} behaves in a similar manner: $c_o \cdot \frac{\sqrt{N(b)}}{1 + N(b_{t+1})}$ strictly increases as long as observation $o$ is not selected. Therefore, every action-observation pair at $b_0$ is selected infinitely often as $n \rightarrow \infty$, and all beliefs reachable within $1$ step (depth $1$) will eventually be expanded.
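
    As a worked illustration of why the bonus of an unselected action cannot be dominated forever, assume that the first term of Eq.~\eqref{eq:actionselection} is a probability bound and hence lies in $[0,1]$ (this is an assumption, since that term is not restated here). If action $a$ has been selected $k$ times at $b$ and is not selected again, then
    \begin{align*}
        c_a \cdot \frac{\sqrt{N(b)}}{1 + k} > 1 \iff N(b) > \left(\frac{1 + k}{c_a}\right)^2.
    \end{align*}
    For example, with $c_a = 0.01$ and $k = 0$, the bonus of $a$ alone exceeds the largest possible value of the first term once $N(b) > 10^4$, and it keeps growing for as long as $a$ remains unselected.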

    Next, during a trial, when a depth $1$ belief is reached, all actions and observations are again eventually selected infinitely often as $n \rightarrow \infty$. By induction, all beliefs reachable within $d$ steps will eventually be expanded.
    
    
    Suppose that the algorithm has searched all beliefs reachable within $d$ steps and constructed the belief MDP $G_d$. Note that since the belief MDP $G_d$ is a graph, policies over $G_d$ include not only $d$-step trajectories but also indefinite-horizon policies. Let $V^*_{G_d}(b_0)$ be the optimal value function for the belief MDP $G_d$ for Problem~\ref{problem}, i.e., the maximal probability of reaching $s \in \mathrm{T}$ \emph{only within $G_d$}. Thus,
    \begin{align*}
        V^*_{G_d}(b_0) \leq V^*(b_0),
    \end{align*}
    since there may be $s \in \mathrm{T}$ reachable from $b_0$ that are not in $G_d$ (i.e., not reachable within $d$ steps).
    
    Let $V^U_{0, \emptyset}, V^L_{0, \emptyset}$ be the upper and lower bounds at the initial iteration, and let $V^U_{n, G_d}, V^L_{n, G_d}$ be the upper and lower bound fixed points computable with the belief MDP $G_d$ at iteration $n$, i.e., $V^L_{n, G_d}$ has converged to its fixed point (using asynchronous local updates) and $V^U_{n, G_d}$ has converged to its least fixed point (using Exact Upper Bound Value Iteration). At $b_0$, $V^L_{n, G_d}(b_0)$ upper bounds $V^*_{G_d}(b_0)$, since the $\alpha$-vectors represent conditional plans that include the probability of reaching $s \in \mathrm{T}$ that are not in $G_d$:
    \begin{align*}
        V^*_{G_d}(b_0) \leq V^L_{n, G_d}(b_0) \leq V^*(b_0) \leq V^U_{n, G_d}(b_0).
    \end{align*}

    Also, $d_{\text{trial}} \rightarrow \infty$ as $n \rightarrow \infty$. Assume that an optimal policy can be represented by a finite $N$-memory belief-based policy. Since the sets of states, actions, and observations are finite, this implies that there exists a trial depth $d_{\text{trial}} = M \geq N$ such that, for the finite belief MDP $G_M$,
    \begin{align*}
         V^*_{G_M}(b_0) = V^*(b_0) \implies V^L_{n, G_M}(b_0) = V^*(b_0).
    \end{align*}
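
    To make the final step explicit, the following is a sketch under the additional assumption (not stated above) that the explored graphs are nested, so that $V^*_{G_d}(b_0) \geq V^*_{G_M}(b_0)$ for every $d \geq M$. Combining this with the chain of inequalities above gives, for every such $d$:
    \begin{align*}
        0 \leq V^*(b_0) - V^L_{n, G_d}(b_0) \leq V^*(b_0) - V^*_{G_d}(b_0) \leq V^*(b_0) - V^*_{G_M}(b_0) = 0.
    \end{align*}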

    Therefore, 
     \begin{align*}
        \lim_{n\rightarrow \infty}|V^*(b_0) - V^L_n(b_0)| = 0.
    \end{align*} 
\end{proof}

The assumption that there exists a finite-memory optimal policy is consistent with the result that the decision problem for POMDPs is undecidable, even in the discounted-sum case. Additionally, note that this assumption is not restricted to POMDPs with a finite reachable belief space; it requires only that a finite belief MDP is sufficient to compute an optimal policy.

We remark that this proof considers the worst-case convergence of the algorithm, in which all beliefs and trajectories are expanded in an unbounded manner to reach an optimal solution. In practice, we can get near-optimal policies without needing to expand all possible nodes.

\section{Benchmark Problems}

\paragraph{Nrp8} 

This problem is a non-repudiation protocol for information transfer, introduced as a discrete-time POMDP model by \citet{norman2017verification}. The goal is to compute the maximum probability of malicious behavior, i.e., that a recipient $R$ gains an unfair advantage by obtaining information from an originator $O$ while denying participation in the information transfer.

\paragraph{Crypt4}

This problem models the dining cryptographers protocol as a POMDP \citep{norman2017verification}. A group of $N$ cryptographers are having dinner at a restaurant. The bill has to be paid anonymously: one of the cryptographers might be paying for the dinner, or it might be their master. The cryptographers respect each other's privacy, but would like to know if the master is paying for dinner. The goal is to know if the cryptographers' master is paying. See \citet{norman2017verification} for more details.

\paragraph{Rocks12}

The rock sample problem was considered for model checking by \citet{Bouton2020PointBasedModelChecking}. It models a rover exploring a planet, tasked with collecting rocks. However, the rocks can be either \texttt{good} or \texttt{bad}, and their status is not directly observable. The robot is equipped with a long-range sensor, but sensing rock states is noisy. The problem ends when the robot reaches an exit area, with the state labelled as \texttt{exit}. We consider the formula $\phi_2 = \lozenge \texttt{good} \wedge \lozenge \texttt{exit}$ from \citet{Bouton2020PointBasedModelChecking}.

\paragraph{Grid Avoid}

This is a classical POMDP problem introduced as a benchmark problem for MRPP by \citet{norman2017verification}. There is $1$ obstacle and $1$ target state in a $4 \times 4$ grid, and the goal is to reach the target state while avoiding the obstacle. In Grid-av 4-0.1, the agent has a $0.1$ probability of staying still when attempting to move to another grid cell. The agent has an initial belief distribution of being in any of the non-obstacle or target states. We extend the problem to a $10 \times 10$ grid with $3$ obstacles in Grid-av 10-0.3, and the probability of staying still when attempting to move is increased to $0.3$. In Grid-av 20-0.5, there are $5$ obstacles and the probability of staying still is increased to $0.5$. In both Grid-av 10-0.3 and Grid-av 20-0.5, the agent has an initial belief distribution of being in any of the non-obstacle states within the first $5 \times 5$ grid.

\paragraph{Drone}

In Drone N-R, the agent has to reach a target state in an $N\times N$ grid, while avoiding a stochastically moving obstacle. The obstacle is only visible within a limited radius $R$ \citep{Bork2020overapp}.  

\paragraph{Refuel}

In RefuelN, the agent's goal is to reach a target state in an $N\times N$ grid. There is uncertainty in movement, and the agent's own position is not directly observable. There are static obstacles, and movement requires energy. The agent starts with $N-2$ energy, and each move action uses $1$ energy. Energy can be refilled at recharging stations.

\begin{table}[ht]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
Model          & States & State-action pairs & Observations \\ \hline
Nrp8 & 125 & 161 & 41\\\hline
Crypt4 & 1972 & 4612 & 510\\\hline
Rocks12 & 6553 & $3 \cdot 10^4$& 1645\\\hline
Grid-av 4-0.1  & 17     & 62                 & 3            \\ \hline
Grid-av 10-0.3 & 101    & 389                & 3            \\ \hline
Grid-av 20-0.5 & 401    & 1580               & 3            \\ \hline
Drone 4-1      & 1226   & 3026               & 384          \\ \hline
Drone 4-2      & 1226   & 3026               & 761          \\ \hline
Refuel-06      & 208    & 565                & 50           \\ \hline
Refuel-08      & 470    & 1431               & 66           \\ \hline
Refuel-20      & 6834   & 25k                & 174          \\ \hline
\end{tabular}
\caption{Size of Benchmark Problems}
\label{tab: size}
\end{table}

\section{Algorithm Details and Parameters}

\paragraph{Discounted-Sum POMDP} We used the technique in \citep{Bouton2020PointBasedModelChecking} together with SARSOP \citep{Kurniawati-RSS08-SARSOP} (toolbox implementation in C++) to compute solutions. 

\paragraph{PRISM} We used the toolbox implemented by \citet{norman2017verification}. We varied the parameter \emph{resolution} and report the best results.

\paragraph{Overapp} We used the implementation in \citep{Bork2020overapp} in the toolbox STORM. We report the best results over the recommended parameters found in the paper.

\paragraph{STORM, PAYNT, SAYNT} We used the implementation of all three algorithms from the toolbox available in \citep{andriushchenko2023symbiotic}. This toolbox is implemented in C++. The STORM implementation has multiple parameter settings: cut-off, clip2, clip4, or expanding \{2, 5, 10, 20\} million belief states. We report the best result for each experiment over these settings. We use the default parameters for PAYNT, which searches the space of $k$-memory FSCs with increasing $k$, and report the best result. For SAYNT, we use the parameters recommended in the toolbox. SAYNT outputs two values (one for STORM and one for PAYNT); we report the best value. Overall, the results are similar to those in the original publications of these algorithms. 

\paragraph{HSVI-RP} We use $c_{a} = 0.01$, $\xi = 0.1$, $c_{z} = 0.01$, initial $d_{\text{trial}} = 200, d_{\text{inc}} = 10, \kappa = 0.01$, and performed Exact Upper Bound Value Iteration every $10$ exploration trials for all experiments. For all experiments except Drone 4-1 and Drone 4-2, we used our proposed heuristic, so there is no randomization in the algorithm. For Drone 4-1 and Drone 4-2, we randomized (probability $0.5$) between our proposed heuristic and the original HSVI2 heuristic, and report the mean results over $10$ runs.

The benchmark problems have specifications in the form of co-safe LTL. We use the technique by \citeauthor{Bouton2020PointBasedModelChecking} to compute product automata and accepting product target states.

\paragraph{Evaluation Validity} This evaluation focuses on the potential for trial-based search to obtain policies with tight two-sided bounds for maximal reachability probabilities through comparisons with state-of-the-art methods. It is important to note that some of the algorithms are implemented in different toolboxes and programming languages with varying levels of code optimization. To mitigate some of the related issues, we conducted all experiments on the same CPU when possible. Nonetheless, we do not draw definite conclusions on the relative speed of each algorithm due to their implementation differences.

\section{Convergence Plots}

Figure~\ref{fig: evolution} plots the evolution of our two-sided bounds for the evaluated benchmarks. The dashed and dotted lines give the other algorithms' final results for comparison. We omit the bounds obtained by PRISM as they are largely uninformative.

\begin{figure}[ht]
    \centering
    \begin{subfigure}{0.44\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/grid4_graph.pdf}
        \label{fig:sub1}
    \end{subfigure}
    \begin{subfigure}{0.44\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/grid10_graph.pdf}
        \label{fig:sub2}
    \end{subfigure}

    \begin{subfigure}{0.44\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/grid20_graph.pdf}
        \label{fig:sub3}
    \end{subfigure}
    \begin{subfigure}{0.44\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/drone41_graph.pdf}
        \label{fig:sub4}
    \end{subfigure}
    \begin{subfigure}{0.44\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/drone42_graph.pdf}
        \label{fig:sub5}
    \end{subfigure}
    \begin{subfigure}{0.44\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/refuel6_graph.pdf}
        \label{fig:sub6}
    \end{subfigure}

    \begin{subfigure}{0.44\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/refuel8_graph.pdf}
        \label{fig:sub7}
    \end{subfigure}
    \begin{subfigure}{0.44\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/refuel20_graph.pdf}
        \label{fig:sub8}
    \end{subfigure}
    \caption{Evolution of lower and upper bound values over time. Overapp computes upper bounds, while STORM, PAYNT, and SAYNT compute lower bounds.}
    \label{fig: evolution}
\end{figure}