\section{Empirical Evaluation}

\input{tables/sarsop_vs_us}

\input{tables/benchmark_table}

In this section, we evaluate our proposed algorithm with the aim of answering the following questions.
\begin{itemize}
    \item[\textbf{Q1.}] \emph{How well do discounted trial-based POMDP algorithms perform for MRPP?} We study the effect of discounting on solution quality for indefinite-horizon reachability. We use SARSOP \cite{Kurniawati-RSS08-SARSOP} together with the approach of \citet{Bouton2020PointBasedModelChecking}, with varying discount factors.

    \item[\textbf{Q2.}] \emph{How does our approach compare to state-of-the-art belief-based approaches?} We compare our approach to PRISM \citep{norman2017verification}, STORM \citep{Bork2022underapproximating}, and Overapp \citep{Bork2020overapp}. PRISM computes two-sided bounds using a grid-based discretization to approximate the belief MDP. STORM and Overapp compute lower and upper bounds, respectively, by expanding beliefs in a breadth-first manner.

    \item[\textbf{Q3.}] \emph{How does our approach compare to other state-of-the-art approaches?} We compare our approach to PAYNT \citep{Andriushchenko2022InductivePOMDP} and SAYNT \citep{andriushchenko2023symbiotic}. PAYNT is an inductive synthesis algorithm that searches in the space of (small-memory) FSCs. SAYNT integrates PAYNT and STORM by using the FSCs computed by one to improve the other in an anytime loop.
\end{itemize}

\textbf{Benchmarks and Setup \quad} We implemented a prototype of HSVI-RP in Julia under the POMDPs.jl framework~\citep{egorov2017pomdps}, and used the available open-source toolboxes for the other algorithms. We use benchmark MRPPs from~\citet{Bork2022underapproximating}, with variants in size and difficulty. Details on the problems can be found in the Appendix. For HSVI-RP, observation heuristic randomization (Remark~\ref{remark:observation mixing}) with $p = 0.5$ was used for the Drone problems, and we report the mean of the bounds obtained from $10$ runs. All experiments were run on a single core of a machine equipped with an Intel i7-11700K @ 3.60GHz CPU and $32$ GB of RAM. Our code is open source and available on GitHub\footnote{\url{https://github.com/aria-systems-group/HSVI-RP}}.

\textbf{Q1: Performance of trial-based discounted-sum algorithms. \quad}
Table~\ref{tab:discountsumresults} summarizes the key results for Q1. They show that there is no clear way to use a discounted POMDP algorithm to obtain good performance on indefinite-horizon maximal reachability probability problems. Typical discount factors in discounted-sum POMDP problems range from $0.95$ to $0.999$. Unsurprisingly, such discount factors under-estimate the optimal probabilities. One should therefore push the discount factor as close to $1$ as possible to obtain the best probabilities. However, for problems with more loops in the belief transitions, such as Refuel6 and Refuel8, increasing the discount factor makes trials deeper, and the search may become ineffective (or trials may not terminate). In all cases, HSVI-RP performs as well as or better than discounted SARSOP used directly.
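
To illustrate why discounting under-estimates reachability probabilities, consider a simplified encoding, assumed here purely for illustration, in which a unit reward is collected once, upon first entering the target set. Let $\tau$ denote the (random) step at which a policy $\pi$ first reaches the target, with $\tau = \infty$ if it never does, and let $\gamma \in (0,1)$ be the discount factor; the symbols $\tau$, $\mathrm{P}^{\pi}$, and $V_\gamma^{\pi}$ are introduced only for this sketch. The discounted value then satisfies
\[
V_\gamma^{\pi} \;=\; \sum_{t=0}^{\infty} \gamma^{t}\, \mathrm{P}^{\pi}(\tau = t) \;\le\; \sum_{t=0}^{\infty} \mathrm{P}^{\pi}(\tau = t) \;=\; \mathrm{P}^{\pi}(\tau < \infty),
\]
and the gap grows with the number of steps needed to reach the target. For instance, under this encoding a policy that reaches the target with probability $1$ after exactly $50$ steps has discounted value $0.95^{50} \approx 0.077$ for $\gamma = 0.95$ and $0.999^{50} \approx 0.951$ for $\gamma = 0.999$.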

Table~\ref{tab:fullresults} reports a set of benchmarks for algorithms that do not use discounting\footnote{These algorithms are also designed to handle minimizing or maximizing expected (positive) rewards. While it is possible to extend our approach to handle such reward structures under some assumptions, we focus on maximizing reachability probabilities in this work.} to answer Q2--Q3. PRISM and HSVI-RP provide two-sided bounds. STORM, PAYNT, and SAYNT provide under-approximations, while Overapp provides over-approximations. We also provide plots of the evolution of our value bounds over time in the Appendix. We report the computed values, the time taken to achieve them, and the number of beliefs expanded for the compared belief-based approaches (PRISM, STORM, Overapp, and HSVI-RP). Note that PAYNT does not expand beliefs, and although SAYNT expands beliefs, it does not report the number of beliefs expanded. As reported in \citet{andriushchenko2023symbiotic}, SAYNT typically reduces the memory usage of STORM by a factor of $3$--$4$.

\textbf{Q2: Comparison to belief expansion-based approaches. \quad} HSVI-RP generally performs better than the other belief-based approaches, exceeding the accuracy of their under- and over-approximations with faster convergence on most of the problems. Additionally, HSVI-RP expands orders of magnitude fewer beliefs than STORM and Overapp, and therefore requires less memory. None of the three compared algorithms was able to improve much on its best computed values, due to the memory intensity of grid-based approximations and breadth-first belief expansion, whereas HSVI-RP was able to improve its computed policies over time thanks to its trial-based methodology. These results strongly suggest that depth-first trial-based heuristic search that utilizes $\alpha$-vector and upper bound point set representations is an effective belief expansion methodology for MRPP.

\textbf{Q3: Comparison to other approaches. \quad}
HSVI-RP is highly competitive with both PAYNT and SAYNT, achieving better lower bounds on many of the problems while also providing upper bounds. Further, HSVI-RP generally (except on the Drone problems) finds optimal solutions faster than both methods. For Refuel6, Refuel8, Grid-av 4, and Grid-av 10, SAYNT computed near-optimal solutions within $377$s but continued computing until timeout, without detecting that these solutions were near-optimal. In contrast, both of HSVI-RP's bounds converged for these problems, allowing termination with a near-optimal policy. Hence, two-sided bounds help identify when a near-optimal policy has been found.
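
The role of two-sided bounds in detecting near-optimality can be sketched as follows, using notation assumed here for illustration: $\underline{V}$ and $\overline{V}$ denote the computed lower and upper bounds, $V^{*}$ the optimal reachability probability, $b_0$ the initial belief, and $\epsilon > 0$ a user-chosen tolerance. Since the bounds bracket the optimal value,
\[
\underline{V}(b_0) \;\le\; V^{*}(b_0) \;\le\; \overline{V}(b_0),
\qquad
\overline{V}(b_0) - \underline{V}(b_0) \le \epsilon
\;\Longrightarrow\;
V^{*}(b_0) - \underline{V}(b_0) \le \epsilon .
\]
Thus, once the gap at $b_0$ falls below $\epsilon$, any policy whose value is at least $\underline{V}(b_0)$ is within $\epsilon$ of optimal, and the search can stop. An algorithm that provides only a lower bound has no such certificate and may keep computing after a near-optimal solution has already been found.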

\textbf{Further Discussion. \quad}
Although the efficiency of our approach is promising, we note that convergence towards the optimal probabilities can be slow for larger problems that require deep trials. Our preliminary analysis indicates that computation time is dominated by Exact Upper Bound Value Iteration and by Bellman backups over large numbers of $\alpha$-vectors during deep trials. Heuristic search using two-sided bounds is also less effective when the upper bound values are uninformative (as is the case for the Drone problems), possibly due to the presence of large ECs. Better model checking techniques, such as on-the-fly detection and handling of ECs, may improve belief exploration and convergence.


Similar to STORM, HSVI-RP benefits greatly from seeding with a good policy. Our policy initialization is the blind policy, which achieves a lower bound of $0$ for most problems. In contrast, SAYNT's inductive synthesis approach finds good initial policies quickly. SAYNT performs best among these algorithms by leveraging the strengths of both the belief exploration of STORM and the FSC generation of PAYNT. An approach that integrates HSVI-RP with PAYNT may be a promising direction for scalable and fast verification and policy synthesis.