\documentclass[letterpaper]{article} % DO NOT CHANGE THIS
\usepackage{amsmath, amssymb}
\usepackage{aaai24}  % DO NOT CHANGE THIS
\usepackage{times}  % DO NOT CHANGE THIS
\usepackage{helvet}  % DO NOT CHANGE THIS
\usepackage{courier}  % DO NOT CHANGE THIS
\usepackage[hyphens]{url}  % DO NOT CHANGE THIS
\usepackage{graphicx} % DO NOT CHANGE THIS
\urlstyle{rm} % DO NOT CHANGE THIS
\def\UrlFont{\rm}  % DO NOT CHANGE THIS
\usepackage{natbib}  % DO NOT CHANGE THIS AND DO NOT ADD ANY OPTIONS TO IT
\usepackage{caption} % DO NOT CHANGE THIS AND DO NOT ADD ANY OPTIONS TO IT
\frenchspacing  % DO NOT CHANGE THIS
\setlength{\pdfpagewidth}{8.5in} % DO NOT CHANGE THIS
\setlength{\pdfpageheight}{11in} % DO NOT CHANGE THIS

% Use the postscript times font!
\usepackage{times}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{amsfonts}
\usepackage{amsthm}
\usepackage{booktabs}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage[svgnames]{xcolor}
\usepackage{tikz}
\usetikzlibrary{automata, positioning}
\usepackage{multirow}

\usepackage{enumitem}

\usepackage[hang,flushmargin]{footmisc}

% the following package is optional:
%\usepackage{latexsym}

% See https://www.overleaf.com/learn/latex/theorems_and_proofs
% for a nice explanation of how to define new theorems, but keep
% in mind that the amsthm package is already included in this
% template and that you must *not* alter the styling.
\newtheorem{example}{Example}
\newtheorem{remark}{Remark}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{definition}{Definition}
\newtheorem{proposition}{Proposition}
\newtheorem{assumption}{Assumption}
\newtheorem{problem}{Problem}
\newtheorem{claim}{Claim}

\newcommand{\qh}[1]{\textcolor{purple}{[QH: #1]}}
\newcommand{\bk}[1]{\textcolor{olive}{[BK: #1]}}
\newcommand{\zl}[1]{\textcolor{orange}{[ZL: #1]}}
\newcommand{\tb}[1]{\textcolor{cyan}{[TB: #1]}}
\newcommand{\ml}[1]{\textcolor{blue}{[ML: #1]}}
\newcommand{\zs}[1]{\textcolor{teal}{[ZS: #1]}}
% \usepackage[normalem]{ulem} % only for \sout in \zsedit below
% \newcommand{\zsedit}[2]{{\color{teal} \sout{#1}#2}}
% \newtheorem{innercustomthm}{Theorem}
% \newenvironment{customthm}[1]
  % {\renewcommand\theinnercustomthm{#1}\innercustomthm}
  % {\endinnercustomthm}
% 
% These are recommended to typeset listings but not required. See the subsubsection on listing. Remove this block if you don't have listings in your paper.
\usepackage{newfloat}
\usepackage{listings}
\DeclareCaptionStyle{ruled}{labelfont=normalfont,labelsep=colon,strut=off} % DO NOT CHANGE THIS
\lstset{%
	basicstyle={\footnotesize\ttfamily},% footnotesize acceptable for monospace
	numbers=left,numberstyle=\footnotesize,xleftmargin=2em,% show line numbers, remove this entire line if you don't want the numbers.
	aboveskip=0pt,belowskip=0pt,%
	showstringspaces=false,tabsize=2,breaklines=true}
\floatstyle{ruled}
\newfloat{listing}{tb}{lst}{}
\floatname{listing}{Listing}
%
% Keep the \pdfinfo as shown here. There's no need
% for you to add the /Title and /Author tags.
\pdfinfo{
/TemplateVersion (2024.1)
}

% DISALLOWED PACKAGES
% \usepackage{authblk} -- This package is specifically forbidden
% \usepackage{balance} -- This package is specifically forbidden
% \usepackage{color (if used in text)
% \usepackage{CJK} -- This package is specifically forbidden
% \usepackage{float} -- This package is specifically forbidden
% \usepackage{flushend} -- This package is specifically forbidden
% \usepackage{fontenc} -- This package is specifically forbidden
% \usepackage{fullpage} -- This package is specifically forbidden
% \usepackage{geometry} -- This package is specifically forbidden
% \usepackage{grffile} -- This package is specifically forbidden
% \usepackage{hyperref} -- This package is specifically forbidden
% \usepackage{navigator} -- This package is specifically forbidden
% (or any other package that embeds links such as navigator or hyperref)
% \indentfirst} -- This package is specifically forbidden
% \layout} -- This package is specifically forbidden
% \multicol} -- This package is specifically forbidden
% \nameref} -- This package is specifically forbidden
% \usepackage{savetrees} -- This package is specifically forbidden
% \usepackage{setspace} -- This package is specifically forbidden
% \usepackage{stfloats} -- This package is specifically forbidden
% \usepackage{tabu} -- This package is specifically forbidden
% \usepackage{titlesec} -- This package is specifically forbidden
% \usepackage{tocbibind} -- This package is specifically forbidden
% \usepackage{ulem} -- This package is specifically forbidden
% \usepackage{wrapfig} -- This package is specifically forbidden
% DISALLOWED COMMANDS
% \nocopyright -- Your paper will not be published if you use this command
\nocopyright
% \addtolength -- This command may not be used
% \balance -- This command may not be used
% \baselinestretch -- Your paper will not be published if you use this command
% \clearpage -- No page breaks of any kind may be used for the final version of your paper
% \columnsep -- This command may not be used
% \newpage -- No page breaks of any kind may be used for the final version of your paper
% \pagebreak -- No page breaks of any kind may be used for the final version of your paper
% \pagestyle -- This command may not be used
% \tiny -- This is not an acceptable font size.
% \vspace{- -- No negative value may be used in proximity of a caption, figure, table, section, subsection, subsubsection, or reference
% \vskip{- -- No negative value may be used to alter spacing above or below a caption, figure, table, section, subsection, subsubsection, or reference
\setcounter{secnumdepth}{2} 
% \title{Recursively-Constrained Partially Observable Markov Decision Processes}
% \title{Recursively-Constrained POMDPs}

\author{
    %Authors
    % All authors must be in the same font size and format.
    Qi Heng Ho\equalcontrib \textsuperscript{\rm 1},
    Martin Feather\textsuperscript{\rm 2},
    Federico Rossi\textsuperscript{\rm 2},
    Morteza Lahijanian\textsuperscript{\rm 1},
    Zachary N. Sunberg\textsuperscript{\rm 1}
}

\affiliations{
    %Afiliations
    University of Colorado Boulder\textsuperscript{\rm 1},
    Jet Propulsion Laboratory, California Institute of Technology\textsuperscript{\rm 2}
    % George Ferguson\textsuperscript{\rm 4},
    % Hans Guesgen\textsuperscript{\rm 5}
    % Note that the comma should be placed after the superscript
%
% See more examples next
}

\begin{document}

% |      |     SARSOP $(\gamma = 0.95)$              |         SARSOP $(\gamma = 0.98)$               |          SARSOP $(\gamma = 0.99)$                                 |         SARSOP $(\gamma = 0.999)$             |          SARSOP $(\gamma = 0.99999)$        |  SARSOP $(\gamma = 1 - \epsilon)$      | Ours                |
% |---------|---------------------|----------------------|----------------------|---------------------|--------------------|--------------------|---------------------|
% | Grid4   | [0.76, 0.76]  $<1s$   | [0.86, 0.86] $<1s$      | [0.92, 0.92] $<1s$      | [0.923, 0.923] $<1s$   | [0.928, 0.928] $<1s$  | [0.928, 0.928] $<1s$  | [0.928, 0.928] $<1s$   |
% | Grid20  | [0.028, 0.049] $-$    | [0.155, 0.212] $-$     | [0.332, 0.38] $-$     | [0.709, 0.721] $-$    | [0.781, 0.782] $<1s$ | [0, 1] $-$           | [0.782, 0.783] $33s$ |
% | Refuel6 | [0.21, 0.21] $<1s$    | [0.32, 0.33] $<1s$      | [0.39, 0.39] $3.62s$   | [0.63, 0.63] $200s$   | [0.2, 0.98] $-$      | [0.18, 0.98] $-$     | [0.67, 0.67] $1.4s$   |
% | Refuel8 | [0.184, 0.184] $1.4s$ | [0.314, 0.314] $4.24s$ | [0.374, 0.375] $9.83s$ | [0.438, 0.439] $339s$ | [0.218, 0.987] $-$   | [0, 0.988] $-$       | [0.445, 0.445] $20s$  |

\section{Official Comment to all reviewers}

We thank all the reviewers for providing detailed and thoughtful reviews, comments, and questions. We respond to each reviewer with an individual rebuttal and share the release of our source code and **additional new experiment results** here.

**Release of (anonymized) source code**: [https://github.com/UAISubmission746/HSVIRP](https://github.com/UAISubmission746/HSVIRP)

We have open-sourced (and anonymized) our code and benchmark data. Tables with all results, including the additional experiments, can also be found in the repository.

**Additional Experiments**:

1. For Table 1, we additionally compare against SARSOP with lower discount factors (0.95, 0.98, and 0.99) to further evaluate the effect of discounting. The best results are in bold. A dash ($-$) indicates that the algorithm did not converge within the time limit (900s).

% | | SARSOP $(\gamma = 0.95)$ | SARSOP $(\gamma = 0.98)$ | SARSOP $(\gamma = 0.99)$ | SARSOP $(\gamma = 0.999)$ | SARSOP $(\gamma = 0.99999)$ | SARSOP $(\gamma = 1 - \epsilon)$ | Ours |
% |---------|---------------------|----------------------|----------------------|---------------------|--------------------|--------------------|---------------------|
% | Grid4 | [0.758, 0.758], $<1s$ | [0.857, 0.857], $<1s$ | [0.892, 0.892], $<1s$ | [0.923, 0.923], $<1s$ | **[0.928, 0.928], $<1s$** | **[0.928, 0.928], $<1s$** | **[0.928, 0.928], $<1s$** |
% | Grid20 | [0.028, 0.049], $-$ | [0.155, 0.212], $-$ | [0.332, 0.38], $-$ | [0.709, 0.721], $-$ | [0.781, 0.782], $<1s$ | [0, 1], $-$ | [**0.782, 0.783**], **33s** |
% | Refuel6 | [0.21, 0.21], $<1s$ | [0.32, 0.33], $<1s$ | [0.39, 0.39], $3.62s$ | [0.63, 0.63], $200s$ | [0.20, 0.98], $-$ | [0.18, 0.98], $-$ | [**0.67, 0.67**], **1.4s** |
% | Refuel8 | [0.184, 0.184], $1.4s$ | [0.314, 0.314], $4.24s$ | [0.374, 0.375], $9.83s$ | [0.438, 0.439], $339s$ | [0.218, 0.987], $-$ | [0, 0.988], $-$ | [**0.445, 0.445**], **20s** |

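These additional experiments show that lowering the discount factor even slightly further can degrade the computed values severely. For example, on Refuel8, SARSOP with $\gamma = 0.95$ converges to [0.184, 0.184], whereas our method obtains [0.445, 0.445].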

2. For Table 2, we also evaluated three additional benchmark problems. We consider the benchmark problems Crypt4 and Nrp8, introduced in the PRISM paper (Norman et al. 2017), which induce finite belief MDPs. To further enrich the evaluation, we also evaluated our algorithm on the RockSample problem considered by Bouton et al. (2020) ($\phi_2$). The best results are in bold.


% | | PRISM | STORM | PAYNT | SAYNT | Ours | Overapp |
% |---------|-----------------------------------|-------------------------|------------|------------|----------------------------------|-------------------------|
% | Crypt4 | [**0.33**, 0.77], 33s, 124K beliefs | **0.33**, **<1s**, 560 beliefs | **0.33**, **<1s** | **0.33**, **<1s** | [**0.33, 0.33**], 15.6s, **480 beliefs** | **0.33**, **<1s**, 560 beliefs |
% | Crypt4 | | | | | [**0.33**, 1.0], **<1s**, **25 beliefs** | |
% | Nrp8 | [**0.125**, 0.189], 58s, 745K beliefs | **0.125**, **<1s**, 50 beliefs | **0.125**, **<1s** | **0.125**, **<1s** | [**0.125, 0.125**], **<1s, 32 beliefs** | **0.125**, **<1s**, 50 beliefs |
% | Rocks12 | N/A | 0.63, 1223s, 2M beliefs | **0.75**, 5.7s | **0.75**, 5.6s | [**0.75, 0.75**], **2.8s, 770 beliefs** | **0.75**, **<1s**, 2.5K beliefs |

From these additional experiments, we see that our algorithm (HSVI-RP) generally remains the best among existing methods. This is in spite of the fact that our method computes both lower and upper bounds together, whereas the other methods (except PRISM) compute only one of the bounds. Specifically, on Crypt4, our algorithm takes $<1s$ to reach the optimal lower bound, and another 15.6s for the upper bound to converge. Since Crypt4 has many observations, more iterations are required to effectively search the space. Note that PRISM cannot run the Rocks problem because it assumes that the target state is fully observable, which is not the case in Rocks.


\section{Reviewer 1 (weak accept)}

% \begin{table*}
% \centering
% \begin{tabular}{c||c|c|c|c|c|c}
% \toprule
% & \textbf{PRISM} &  \textbf{STORM} & \textbf{PAYNT} & \textbf{SAYNT} & \textbf{Ours} & \textbf{Overapp}\\
% \hline
% \hline
% \multirow{4}{*}{Crypt4}
%  &  &  &  & & [0.33, 0.33] &  \\
%  &  [0.33, 0.77] & 0.33 & 0.33 & 0.33 & 15.6s, 480 beliefs & 0.33 \\\cline{6-6}
%  & 33s, 124K & $<1s$, 560 & $<1s$ & $<1s$ & [0.33, 1.0] & $<1s$, 560 \\
%  &   &  & & & $<1s$, 25 beliefs &  \\
% \hline
% \multirow{2}{*}{Nrp8}
%  & [0.125, 0.189] & 0.125 & 0.125 & 0.125 & [0.125, 0.125] & 0.125 \\
%  & 58s, 745K & $<1s$, 50 & $<1s$ & $<1s$ & $<1s$, 32 beliefs &  $<1s$, 50 \\
% \hline
% \multirow{2}{*}{Rocks12}
%  & N/A & 0.63 & 0.75 & 0.75 & [0.75, 0.75] & 0.75 \\
%  & & 1223s, 2M & 5.7s & 5.6s & 2.8s, 770 beliefs & $<1s$, 2.5K \\\hline
% \end{tabular}
% \end{table*}

Thank you for the encouraging comments and insightful questions. We are glad that the reviewer finds the work well-founded and that our empirical results support the benefits of our proposed method.

**Q4.1-2: Additional figures/schematics**

That's a very good suggestion. Since the page limit allows for 2 extra pages for the final submission, we will utilize that space to add a schematic of the algorithm, together with more details in the pseudocode for Algorithm 1 to make the overall HSVI-RP algorithm and ideas in the paper easier to follow.

**Q4.3/Q5.1: Limited number of problems experimented on. Can you please comment on how the problems were chosen and why they are good representatives?**

To evaluate our algorithm against related work, we focused on existing MRPP benchmark problems that induce infinite belief MDPs, which are more difficult than those that induce finite belief MDPs. Hence, we used all such benchmark problems with the $P_{max}$ query presented in related papers: STORM (Bork et al. 2020), PAYNT (Andriushchenko et al. 2022), and SAYNT (Andriushchenko et al. 2023).

Per reviewers' suggestions, we performed additional evaluations on the benchmark problems Crypt4 and Nrp8, introduced in the PRISM paper (Norman et al. 2017), which induce finite belief MDPs. To further enrich the evaluations, we also benchmarked our algorithm on the RockSample problem of Bouton et al. (2020) ($\phi_2$).

All the new results can be found in our official comment to all reviewers above. These results show that our algorithm (HSVI-RP) generally remains the best among existing methods. This is in spite of the fact that our method computes both lower and upper bounds together, whereas the other methods (except PRISM) compute only one of the bounds. Specifically, on Crypt4, our algorithm takes $<1s$ to reach the optimal lower bound, and another 15.6s for the upper bound to converge. Since Crypt4 has many observations, more iterations are required to effectively search the space.
 
**Q5.2: Are there cases where you predict your algorithm would suffer?**
 
As discussed in the evaluation section, iterations of HSVI-RP can take a long time in very large problems which require deep trials. This occurs due to the large constructed belief graph and size of the alpha-vector set. The algorithm is also less effective when there are many long loops, where the upper bounds are not very informative.

Additionally, the trial-based method is less efficient in problems with a large branching factor or when many actions/observations have similar values since many iterations are required to effectively search the space. In the discounted POMDP literature for trial-based search, there are some approaches, such as PLEASE (Zhang et al. 2015), which attempt to combat this issue by creating additional branches during a trial. This could be an interesting future direction.

**Q5.3: Can you please comment a bit more about the different algorithms compared against, and how they were parameterized.**

We have provided a short summary of each algorithm in the related work and experiments sections. PAYNT uses a counterexample-guided inductive synthesis approach that searches the space of finite state controllers using a combination of abstraction-refinement and counterexamples. SAYNT uses reference policies from STORM to accelerate synthesis in PAYNT, and uses the policies obtained from PAYNT to improve belief expansion search in STORM. SARSOP works similarly to HSVI2, with different heuristics during each exploration trial. Unfortunately, due to space constraints, we defer the algorithm details to the cited works. We used the toolboxes provided in the cited papers, which come with a set of recommended parameter settings. For the compared algorithms, we reported the best parameters among the available settings; these are available in the data files in the GitHub repository. The results were similar to those reported in the original papers. For SARSOP, we used the default parameters in the tool, which generally perform well for discounted POMDP problems.

**Q5.4: When comparing bounds, do any of the competing algorithms produce anytime bounds?**

Except for PRISM and STORM, all the competing algorithms produce anytime bounds in the sense that their computed bounds improve monotonically through an abstraction-refinement procedure. However, each abstraction-refinement iteration may itself require some time. For the compared algorithms, we reported the lowest runtime to reach the best value.
 
**Q5.5: Discounting rewards only produces incorrect upper bounds.**

Yes, thank you for catching this inaccuracy. We will update the text. 

**Q5.6: The regret bound for MRPP should be (0,1)**

Yes, thank you for noticing this. (0,$\infty$) is redundant as the probability values are bounded between 0 and 1. We will update the text accordingly.

\newpage
\section{Reviewer 2 (weak accept)}

We thank the reviewer for the thoughtful comments and suggestions.

**Q3: Public availability of algorithm.**

We have open-sourced (and anonymized) our code and benchmark results at [https://github.com/UAISubmission746/HSVIRP](https://github.com/UAISubmission746/HSVIRP).

**Q4: Novelty.**

We agree that trial-based search for POMDPs is known to be effective. However, we believe that HSVI-RP is a novel advancement in the literature for MRPP for the following reasons: (i) There is no prior work that uses trial-based search for the undiscounted MRPP. The undiscounted and sparse reward nature of MRPP makes many common techniques for POMDPs unsuitable. In Section 3, we discuss specific reasons why existing trial-based algorithms perform poorly for MRPP, which justify and motivate our proposed modifications. (ii) Existing methods for MRPP are mainly developed by the formal methods community, where probabilistic guarantees ($\gamma = 1$) are of utmost importance. On the other hand, trial-based methods are developed by the AI and robotics communities, where $\gamma < 1$ is typically considered. This work bridges these communities by drawing techniques from both and developing a new algorithm that significantly improves the state of the art.

**Q4: Missing analysis on upper bound convergence.**

Thanks for the feedback. Initially, we didn't include this analysis because we can guarantee convergence of the upper bound only for certain problems, not in general. For instance, we can prove that the upper bound converges for POMDPs that induce finite belief MDPs. However, the upper bound does not converge for MRPPs in which lowering the upper bound requires reaching a belief state through an unbounded number of state transitions from the initial belief. This is related to the undecidability of POMDPs. Nonetheless, we emphasize that HSVI-RP provides sound upper bounds that guide the search for optimal policies. Note that among the compared algorithms, only PRISM (Norman et al. 2017) provides asymptotic convergence guarantees, while the other algorithms only provably provide sound bounds.

Per the reviewer's comment, we'll add this discussion and a more in-depth analysis in the final version.

**Q4: Performance comparison with Goal-HSVI.**

Thanks for this comment, which gives us an opportunity to expand. We considered comparing against Goal-HSVI since it seemed very relevant to our work. However, after digging into the details, we realized that Goal-HSVI cannot be used for MRPP. Goal-HSVI requires that the target state is reachable with probability 1, and the problem is to minimize costs that are strictly positive. These assumptions do not hold for the reward structure induced by MRPP. In fact, goal POMDPs (solved by Goal-HSVI) and MRPPs are both special but distinct cases of the stochastic shortest path problem, and different modifications have to be made to HSVI for it to work on each problem.

**Q4: HSVI-RP vs large, general model checking tools such as STORM and PRISM.**

We agree that STORM and PRISM are established model checking tools, and encompass more than just POMDP problems. But, we emphasize that they are the state-of-the-art algorithms for MRPP and other undiscounted objectives. The compared algorithms in STORM and PRISM are specialized algorithms for POMDPs with indefinite horizon objectives.

**Q4: Impact of the different parameters.**

Thank you for the comment. We will provide a discussion on parameters in the final version. Below, we provide a summary.

Parameter $\xi$: defines a radius around the best upper bound within which actions are considered. Lower values of $\xi$ favor actions with higher upper bounds, improving upper bounds faster, but may reduce efficiency by limiting exploration.

Parameters $c_a$ and $c_z$: exploration constants for action and observation selection. Higher values encourage more exploration but can hinder convergence if set too high due to too little exploitation.

Parameter $n$: ratio of exploration trials to upper bound Value Iterations (VI). While VI is crucial to improve upper bounds, it has a large computational overhead because it has to be conducted for all nodes in the belief graph. $n$ balances exploration against this costly upper bound improvement step.

Parameters $d_{trial}$ and $d_{inc}$: control the rate of increase of search depth. Higher values are beneficial for long horizon problems but may reduce search efficiency if the depth is increased too quickly.
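To illustrate the role of $\xi$ only, here is a minimal sketch under an assumed reading of "radius around the best upper bound": an action is kept if its upper bound is within $\xi$ of the largest one. The exact rule is Eq. (10) in the paper and may differ; the function name, action names, and numbers below are hypothetical.

```python
def candidate_actions(upper_q, xi):
    """Keep actions whose upper bound is within xi of the best upper bound.

    Assumed reading of the xi radius, not the paper's Eq. (10).
    """
    best = max(upper_q.values())
    return [a for a, q in upper_q.items() if q >= best - xi]

# Hypothetical upper bounds on reachability probability per action.
upper_q = {"north": 0.90, "east": 0.85, "wait": 0.40}

narrow = candidate_actions(upper_q, xi=0.01)  # small radius: near-greedy
wide = candidate_actions(upper_q, xi=0.10)    # larger radius: more exploration
```

With the small radius only the action with the highest upper bound survives, while the larger radius also admits the second-best action, which matches the exploration trade-off described above.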

**Q4: On strategy synthesis.**

As discussed in Section 2, the optimal value function can be under-approximated arbitrarily well by a set of alpha vectors. This set of alpha vectors implicitly represents the policy. Since $V^*(b) \geq V^{\pi}(b) = \max_{\alpha \in \Gamma}(\alpha^T b)$, the action at belief $b$ is the one associated with the maximizing alpha vector $\arg\max_{\alpha \in \Gamma}(\alpha^T b)$.
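As a minimal sketch of this extraction step (not the code in our repository), assume each alpha vector is stored as a row of a NumPy array and is tagged with the action it was generated for. The names `greedy_action`, `alpha_vectors`, and `alpha_actions`, and the 2-state numbers, are hypothetical.

```python
import numpy as np

def greedy_action(belief, alpha_vectors, alpha_actions):
    """Return the action and value of the alpha vector maximizing alpha^T b."""
    values = alpha_vectors @ belief      # one lower-bound value per alpha vector
    best = int(np.argmax(values))        # index of the maximizing alpha vector
    return alpha_actions[best], float(values[best])

# Illustrative 2-state example; each row is one alpha vector.
alpha_vectors = np.array([[0.9, 0.1],
                          [0.2, 0.8]])
alpha_actions = ["left", "right"]        # action attached to each alpha vector

belief = np.array([0.3, 0.7])
action, value = greedy_action(belief, alpha_vectors, alpha_actions)
```

The returned value is the lower bound $V^{\pi}(b)$ at that belief, and the returned action is the one the policy executes there.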

**Q5: Typos/suggestions**

Thanks for catching the errors and suggesting these changes. We'll update the paper accordingly.

% The best parameter choice is problem specific. From our empirical tests, we found that the performance of the algorithm was not very sensitive to these parameters. In the paper, we used the same parameters for all the problems, which was chosen heuristically.

% $\kappa$ tells us how much improvement in the bounds should be targeted for each trial. For larger problems with longer horizon, $\kappa$ is not very significant, because trial termination is dominated by $d_{\text{trial}}$. For smaller problems, a larger $\kappa$ leads to smaller but potentially more frequent updates in the bounds in each trial.
% Comparatively, we reported the best score among varying parameters for the compared algorithms in Table 2. 

\newpage
\section{Reviewer 3 (borderline reject)}

% > I feel space is needlessly used on content that doesn't really add to the contribution. In particular, most of of Sec 3 seems largely known/obvious. First, the reduction of MRPP to expected total reward must, by definition, be undiscounted and it is clear that it is incorrect for discount < 1. So the counterexample/proof for Proposition 1 seems unnecessary. It also seems strange to present this as trial-based methods giving "incorrect bounds". Similarly, the lack of convergence when the discount = 1 is also expected, since lambda $< 1$ is a requirement for existing results.

Thank you for your helpful and constructive review. We are pleased you share our perspective on our modifications to HSVI and its potential.

**Q4.1: Sec 3 seems largely known/obvious.**

We agree that the discussions on discounting and reward structure are unnecessary for a reader familiar with MRPP. However, we felt the need to include Sec. 3 as it provides justification and insight for our modifications to HSVI2 for MRPP. Our second point, about trial termination with $\gamma = 1$, directly motivates an adaptive search-depth trial termination strategy, while the third point motivates the graph representation, search heuristics, and upper bound backup techniques. While it is not surprising that discounting under-approximates the true value, we argue that a discussion showing that discounting can lead to arbitrarily bad solutions is valuable. Of note, these points are not considered in (Bouton et al. 2020) when using SARSOP for MRPP; hence our decision to include them for clarity.
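To make the "arbitrarily bad" point concrete, consider a simplified setting that is an assumption for illustration and not one of our benchmarks: a policy reaches the target with probability 1, but only after exactly $T$ transitions, and a reward of 1 is received on arrival. The true reachability probability is 1 for every $T$, while the discounted value is $\gamma^T$, which tends to 0 as $T$ grows for any fixed $\gamma < 1$.

```python
def discounted_reach_value(gamma, steps):
    """Discounted value of reaching the target with probability 1 after
    exactly `steps` transitions, with reward 1 on arrival (assumed model)."""
    return gamma ** steps

# The undiscounted value is 1.0 in every case; only the horizon changes.
for gamma in (0.95, 0.99, 0.999):
    values = [round(discounted_reach_value(gamma, t), 3) for t in (10, 100, 1000)]
    print(gamma, values)
```

Even a discount factor very close to 1 loses most of the value once the required horizon is long enough, so no fixed $\gamma < 1$ is safe across problems.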

**Q4.2: Theorem 1 clarity/detail.**

We will provide a detailed discussion of lower bound convergence. The action selection radius $\xi$ is introduced in Eq. (10). The type of optimal policy matters since it is impossible to approximate an optimal policy that requires infinite memory, even as the number of iterations goes to infinity, due to the inapproximability of undiscounted infinite horizon POMDPs (Madani, Hanks, and Condon, AIJ'03). HSVI-RP's lower bound convergence is contingent on being able to search trials of finite depth, as depth increases to infinity.

**Q4.2: Upper bound Analysis.**

Thanks for the feedback. Initially, we didn't include this analysis because we can guarantee convergence of the upper bound only for certain problems, not in general. For instance, we can prove that the upper bound converges for POMDPs that induce finite belief MDPs. However, the upper bound does not converge for MRPPs in which lowering the upper bound requires reaching a belief state through an unbounded number of state transitions from the initial belief. This is related to the undecidability of POMDPs. Nonetheless, we emphasize that HSVI-RP provides sound upper bounds that guide the search for optimal policies. Note that among the compared algorithms, only PRISM provides guarantees on asymptotic convergence, while the other algorithms only provably provide sound bounds.

Per the reviewer's comment, we will add this discussion and a more in-depth analysis in the final version.

**Q4.2: Clear statement about what differs wrt normal HSVI.**

That's a good suggestion. The proof of HSVI2 relies heavily on discounting to bound the trial depths required. Loops are also not an issue due to discounting. In contrast, HSVI-RP's convergence for MRPP stems from the graph representation, termination criteria, and trial-based expansion technique, which allow adequate exploration of the belief MDP. We will add this discussion in the final version.

**Q4.3: Necessity of Q1 in Sec 5.** 

For many POMDP algorithms such as HSVI/SARSOP, the discount factor can be used as a tuning parameter, where discount is increased if more accuracy is required. Q1 analyzes the effects of this straightforward method for MRPP, and empirically validates our modifications. 

SARSOP was chosen as it has been shown to perform better than HSVI. There may be multiple reasons for the resulting bound gaps, but Table 1 mainly shows that this method can perform poorly, even with a discount factor very close to 1. Without these results, it is unclear that current trial-based methods are practically insufficient. Case in point: Reviewer 3Pxw asks whether even lower discount factors would achieve similar performance, and our additional experiments (in our official comment above) show that lowering $\gamma$ further yields very poor performance.

**Q4.4: The various tools/papers compared to and cited use a larger set. Why are these not considered?**

For comparison with related work, we focused on existing MRPP benchmark problems that induce infinite belief MDPs, which are more difficult than those that induce finite belief MDPs. Hence, we used all such benchmark problems with the $P_{max}$ query presented in related papers: STORM, PAYNT, and SAYNT.

Per reviewers' suggestions, we performed additional evaluations. All the new results can be found in our official comment to all reviewers above. For more details, please refer to our response to Reviewer zRvP (Q4.3/Q5.1).

**Q4.5: Reproducibility.**

We have open-sourced our code and benchmark results at [https://github.com/UAISubmission746/HSVIRP](https://github.com/UAISubmission746/HSVIRP).

**Q5.1: Why is absolute value needed?**

It is not necessary; we will remove it.

**Q5.2: What is a blind policy, $V_{MDP}$, WEU?**

We apologize for the lack of clarity. WEU is short for Weighted Excess Uncertainty (Eq. (3)), and blind policy and $V_{MDP}$ are discussed in Section 4. We will make sure to introduce them properly.

% References:

% Madani, O., Hanks, S., \& Condon, A. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147(1-2), 5-34. 2003.

% From these additional experiments, we see that our algorithm (HSVI-RP) generally remains the best amongst existing methods. This is in spite the fact that our method computes both lower and upper bounds together, whereas other methods (except PRISM) compute only one of the bounds. Specifically, on Crypt4, our algorithm takes $<1s$ to reach the optimal lower bound, and another 15.6s for the upper bound to converge. Since Crypt4 has many observations, more iterations are required to effectively search the space. Finally, note that PRISM cannot run the Rocks problem as it assumes that the target state is fully observable, which is not the case in Rocks.

\newpage
\section{Reviewer 4 (weak accept)}

% \begin{table*}[t!]
% \centering
% \begin{tabular}{l|l|l|l||l|l|l|l}
% % \hline
% % \toprule
% & \multicolumn{6}{c|}{\underline{ \hspace{60mm}\textbf{SARSOP}\hspace{60mm}}} & \multicolumn{1}{c}{\textbf{Ours}} \\ 
% % \cline{2-4}
% & $\gamma = 0.97$ & $\gamma = 0.98$ & $\gamma = 0.99$ & $\gamma = 0.999$ & $\gamma = 0.99999$ & $\gamma = 1 - 10^{-16}$ & \\ 
% \hline
% \hline
% \multirow{2}{*}{Grid-av 4} & [ 0.758, 0.758] & [0.857, 0.857] & [0.892, 0.892] & $[0.923, 0.923]$ & $\mathbf{[0.928, 0.928]}$ & $\mathbf{[0.928, 0.928]}$ & $\mathbf{[0.928, 0.928]}$ \\
% & $<1s$ & $\mathbf{<1s}$ & $\mathbf{<1s}$ & $\mathbf{<1s}$ \\ \hline
% \multirow{2}{*}{Grid-av 20} & [0.028, 0.049] & [0.15515, 0.212294] & [0.332, 0.380] &  $[0.709, 0.721]$ & $[0.781, 0.782]$ & [0, 1]& $\mathbf{[0.782, 0.783]}$ \\
% & $-$ & $-$ & $-$ & $\mathbf{33s}$ \\ \hline
% \multirow{2}{*}{Refuel6} & [0.21, 0.21] & [ 0.32   0.33] & [0.39, 0.39]  &  $[0.63, 0.63]$ & $[0.2, 0.98]$ & $[0.18, 0.98]$ & $\mathbf{[0.67, 0.67]}$ \\
% & $<1s$ & $<1s$& 3.62 & $200s$ & $-$ & $-$ & $\mathbf{1.4s}$ \\ \hline
% \multirow{2}{*}{Refuel8} & [0.184, 0.184] & [0.314, 0.314] & [0.374, 0.375] & $[0.438, 0.439]$ & $[0.218, 0.987]$ & $[0, 0.988]$ & $\mathbf{[0.445, 0.445]}$ \\
% &1.4s & 4.24 & 9.83s &  $339s$ & $-$ & $-$ & $\mathbf{20s}$ \\ 
% % \hline
% % \bottomrule
% \end{tabular}
% \end{table*}

% |         | SARSOP              |                      |                      |                     |                    |                    | Ours                |
% |---------|---------------------|----------------------|----------------------|---------------------|--------------------|--------------------|---------------------|
% | Grid4   | [0.76, 0.76]  <1s   | [0.86, 0.86] <1s     | [0.92, 0.92] <1s     | [0.923, 0.923] <1s  | [0.928, 0.928] <1s | [0.928, 0.928] <1s | [0.928, 0.928] <1s  |
% | Grid20  | [0.028, 0.049] -    | [0.155, 0.212] -     | [0.332, 0.38] -      | [0.709, 0.721] -    | [0.781, 0.782] <1s | [0, 1] -           | [0.782, 0.783] <33s |
% | Refuel6 | [0.21, 0.21] <1s    | [0.32, 0.33] <1s     | [0.39, 0.39] 3.62s   | [0.63, 0.63] 200s   | [0.2, 0.98] -      | [0.18, 0.98] -     | [0.67, 0.67] 1.4s   |
% | Refuel8 | [0.184, 0.184] 1.4s | [0.314, 0.314] 4.24s | [0.374, 0.375] 9.83s | [0.438, 0.439] 339s | [0.218, 0.987] -   | [0, 0.988] -       | [0.445, 0.445] 20s  |

We thank the reviewer for their careful review and constructive feedback.

**Q4: The experimental validation is narrow and therefore not very convincing.**

For comparison with related work, we focused on existing MRPP benchmark problems that induce infinite belief MDPs, which are more difficult than those that induce finite belief MDPs. Hence, we used all the benchmark problems with the $P_{max}$ query presented in related papers: STORM (Bork et al. 2020), PAYNT (Andriushchenko et al. 2022), and SAYNT (Andriushchenko et al. 2023).

Per reviewers' suggestions, we performed additional evaluations. All the new results can be found in our official comment to all reviewers above. For more details, please refer to our response to Reviewer zRvP (Q4.3/Q5.1).

**Q5: SARSOP: Wouldn't similar solutions be obtained with a bit lower gammas (e.g. 0.99, 0.98, 0.95), with correspondingly (much?) lower runtimes?**

For many POMDP algorithms such as HSVI and SARSOP, the discount factor can be treated as a tuning parameter that is moved closer to 1 when more accuracy is required. Therefore, we decided to present a set of experiments that answer Q1 using varying discount factors that are very close to 1. We have also run additional experiments to analyze the effect of lower discount factors. The results for $\gamma = 0.99, 0.98, 0.95$ can be found in the official comment to all reviewers above and in the anonymized GitHub repository: [https://github.com/UAISubmission746/HSVIRP](https://github.com/UAISubmission746/HSVIRP)

From the results, we see that setting $\gamma$ even slightly lower can lead to very poor performance, even though it generally (though not always) leads to lower runtimes. This poor performance of SARSOP for MRPP experimentally validates our discussion in Section 3 and our proposed modifications. For MRPP, the optimal policy can be difficult to find using existing trial-based algorithms designed for discounted problems, since the sparse (goal) reward may only be obtained after a long horizon, even for problems with small state spaces.
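
As a rough illustration of why discounting interacts badly with long horizons (the horizon of 100 steps below is a hypothetical number chosen for illustration, not taken from the benchmarks), suppose the goal is reached with probability 1, but only after $T$ steps. The undiscounted reachability probability is 1, whereas a discounted model that gives reward 1 at the goal assigns the value $\gamma^{T}$:

$$\gamma^{T}: \qquad 0.99^{100} \approx 0.366, \qquad 0.95^{100} \approx 0.006.$$

Under these assumptions, even a seemingly mild discount makes a long-horizon goal look almost as bad as failure, which is consistent with the degradation described above.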

**Q5: The pseudocode in Algorithm 1 ignores too many details to make sufficient sense.**

Thank you for the feedback. Since the page limit allows for 2 extra pages for the final submission, we will add more details in the pseudocode for Algorithm 1 and add a schematic of the algorithm, to make the overall HSVI-RP algorithm and ideas easier to follow.

**Q5: In Table 1, the "number of belief states explored" is only given for STORM and Ours.**

STORM, Overapp, and our method are mainly based on belief expansion, so the number of belief states explored is most relevant for Q2 and is therefore reported. The number of beliefs for Overapp in Refuel20 was mistakenly omitted; it should be 177K beliefs. PAYNT does not expand beliefs, and although SAYNT expands beliefs, it does not report the number of beliefs expanded. The number of grid belief points for PRISM is 100 (grid-4), 97 (grid-10), and 6K (Refuel6). These are reflected in the updated tables in the linked repository.

**Q5: What is meant by "best"?**

"Best" is the result that achieves the best value, with ties broken by the lowest runtime if multiple methods achieve the same value. We apologize for a typo in the Refuel6 entry, where the times taken for STORM and PAYNT are swapped. STORM should be 1.4s (and bold) and PAYNT should be 77.8s. These results are in line with the results reported in the original papers. We will fix the table in the final version. An updated table can be found at [https://github.com/UAISubmission746/HSVIRP](https://github.com/UAISubmission746/HSVIRP)

**Q5: Suggestions / Citation / Bibliography /  notation**

Thank you for the suggestions and for catching these inaccuracies. We will update the citation of undecidability to (Madani, Hanks and Condon 2003), and cite the original works when introducing point-based methods. We will clarify the reduction of POMDPs to MDPs with an infinite number of states. We will also correct the bibliography issues.

\newpage
\section{Reviewer 5 (weak accept)}

We thank the reviewer for the thoughtful comments and questions.

**Q4.1: Over-approximation convergence.**

We'd like to clarify that we do not claim HSVI-RP has upper bound convergence guarantees, and we apologize for the confusion. In the paper, we say that we empirically observe convergence for some of the benchmark problems. We can guarantee upper bound convergence in certain cases, e.g., when the POMDP induces a finite belief MDP. However, there are MRPPs in which the upper bound does not converge. These are MRPPs where lowering the upper bound requires an infinite number of explored beliefs, which is related to the undecidability of infinite horizon POMDPs (Madani, Hanks, and Condon, AIJ'03). We will add this discussion and ensure clarity in the final version.

We also remark that, for MRPP, there is a gap in the literature for a well-performing algorithm with two-sided convergence. Among the compared algorithms, only PRISM provides guarantees on asymptotic convergence; the other algorithms have provably sound bounds. Most of the algorithms provide one-sided bounds with no algorithm utilizing both bounds to guide search. HSVI-RP is a step in this direction with provably sound two-sided bounds used to guide the search. HSVI-RP has proven convergence of the lower bound, and achieves convergence of both bounds for some problems.
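
To make "sound two-sided bounds" concrete, the following is a sketch in assumed notation (the symbols $\underline{V}$, $\overline{V}$, and $b_0$ for the initial belief are illustrative and may differ from those in the paper). Soundness means

$$\underline{V}(b_0) \;\le\; P_{max}(b_0) \;\le\; \overline{V}(b_0),$$

so the distance from the lower bound to the optimum is at most the gap, $P_{max}(b_0) - \underline{V}(b_0) \le \overline{V}(b_0) - \underline{V}(b_0)$. When the two bounds meet, the computed value is certified as optimal without any separate convergence argument.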

**Q4.2: Assumption of existence of optimal finite memory policy.**

We should clarify that the assumption is that an optimal finite-memory policy exists. In a POMDP, it is known that the best memoryless policy (mapping observations to actions) can be arbitrarily suboptimal in the worst case (Littman ICSAB'94). HSVI-RP finds belief-based policies, which are history-dependent. However, in general, optimal belief-based policies for indefinite-horizon POMDPs can require infinite memory, and the problem is undecidable (Madani, Hanks, and Condon, AIJ'03). Thus, the finite memory assumption is a practical requirement for any computed policy.

Note that for all benchmark problems except Drone, HSVI-RP finds policies with upper and lower bounds almost converged to a single value, indicating optimality.
It is unclear whether the Drone problem has an optimal finite memory policy, but we show that our method still performs well on it compared to existing methods, and improves monotonically over time, as seen in Figure 2 in Appendix E. To reduce subjectivity, we will reword "mild conditions" to "some conditions".

**Q4.4: Experimental setup in Appendix.**

We'll add these details into the main text.

**Q5.Question1: What is the idea of the extension to rewards?**

For clarity, we do not claim that HSVI-RP can be directly extended to other reward structures (e.g. maximizing a non-negative reward). The idea is to tackle the problem as a stochastic shortest path problem. Our lower bounds directly work for maximizing non-negative rewards. The main difficulty is in initializing upper bounds. The direction we are considering is problems where it is reasonable to restrict ourselves to proper policies (Bertsekas TAC'18). With proper policies, we can use the $V_{MDP}$ upper bound from an MDP. For future work, it may be possible to extend HSVI-RP to handle such reward structures under certain conditions such as proper policies.
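
For reference, the standard form of the $V_{MDP}$ bound mentioned above can be written as follows (the notation is assumed here for illustration: $b(s)$ is the belief mass on state $s$, and $V_{MDP}(s)$ is the optimal value of the underlying fully observable MDP at $s$):

$$\overline{V}_0(b) \;=\; \sum_{s} b(s)\, V_{MDP}(s) \;\ge\; V^{*}(b).$$

The inequality holds because an agent that observes the state can do no worse than one acting under partial observability. The bound is only useful when $V_{MDP}(s)$ is finite, which is where we expect the restriction to proper policies to enter.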

**Q5.Question2: Runtime for STORM on Refuel6.**

We apologize for the confusion. There was a typo for Refuel6, where the times taken for STORM (should be 1.4s) and PAYNT (should be 77.8s) are swapped. These results are in line with the results reported in the original papers. We'll fix it in the final version. Nonetheless, the behavior the reviewer points out does occur for STORM on the Rocks12 benchmark, as shown in our additional experiments for Q1. In Rocks12, the target state is partially observable, which violates an assumption made in STORM. This may affect the heuristics used in STORM.

**Q5.Comment2: "In addition, they [other methods] suffer from scalability" may need a reference**

Thanks for pointing this out!  This has been our observation (specifically for STORM and PAYNT), but we shouldn't state it formally. For that reason, we will remove that statement. 

**Q5.Comment3: On Bork et al. (2020) method**

Both the breadth-first exploration of a discretized belief space and cut-offs are used. Cut-offs are used to circumvent needing to explore the entire belief MDP.

**Q5.Comment5: Table 2 - What does * indicate?**

The asterisk (*) indicates that we used the best reported results from the cited paper, as we obtained significantly worse results across the parameters when running the method on our machine. For the results without an asterisk, we reproduced results similar to those in the cited papers. We will clarify this in the final version.

**Q5.Comment6: STORM and Overapp are both part of Storm.**

Thanks for this comment. Yes, they are both in the STORM toolbox, and there is an option to obtain both bounds sequentially. However, as they are not part of the same algorithm, the bounds are not used to inform each other, unlike this work. We will clearly state this in the paper.

% **Q5.spelling/grammar**

% Thank you for the suggestions and feedback. We will update the paper accordingly.


% \ml{do we need the reference?}
% References:

% M. Littman (1994). Memoryless policies: theoretical limitations and practical results. In Proc. 3rd Intl' Conference on Simulation of Adaptive Behavior.

% the assumption that there exists an optimal policy only occurs in Theorem 1, though the paper is written as if the assumption implicitly holds (in definitions etc.). In addition, I believe some of the benchmarks used do not confirm to this assumption.

% Some details of the experimental setup for the evaluation are only part of the supplementary material (tool configurations)

% > In Table 2, there is a bit of inconsistency how runtimes <1s are reported (see Grid-av 4-0.1 "ours" vs. Refuel6 Overapp)

% Response: We will fix it to report $<1$ instead.

% The extension to expected reward specifications is hinted at, though details are missing, the claim that it is possible to extend HSVI-RP to reward structures (Section 5) needs more explanation. I do not think that it is easy to see how this is supposed to work. 

\end{document}