\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{xcolor}

\newcommand{\qh}[1]{\textcolor{purple}{[QH: #1]}}
\newcommand{\bk}[1]{\textcolor{blue}{[BK: #1]}}
\newcommand{\zs}[1]{\textcolor{red}{[ZS: #1]}}
\usepackage[normalem]{ulem} % only for \sout in \zsedit below
\newcommand{\zsedit}[2]{{\color{red} \sout{#1}#2}}
\newcommand{\ml}[1]{\textcolor{cyan}{[ML: #1]}}

\begin{document}

\section{Reviewer 1: GJpu}

We thank the reviewer for the thoughtful comments and questions.

**Q4.1: Policy as a function of history.**

Thanks for the comment, which gives us an opportunity to clarify. For an RC-POMDP, we do not need to explicitly store the full history. Since the evolution of the augmented belief-admissible cost state $\bar{b}_t$ is Markovian, and since RC-POMDPs satisfy Bellman's principle of optimality (BPO), only a belief and an admissible cost are needed to execute the policy. In other words, $\bar{b}_t$ is a sufficient statistic of the history for RC-POMDPs. We use $h_t$ in the equations for the RC-POMDP problem formulation and in other theoretical discussions for clarity and notational consistency.

It should also be noted that, like all HSVI-based algorithms, our algorithm uses a tree that implicitly contains the history in the path from the root node to another node, but the policy produced by the algorithm is still sound outside this tree. In the final version, we will clarify that history is not needed for policy execution.

**Q4.2: State space has to be augmented with current "cumulative cost".**

We would like to clarify that we do not augment the state space with cumulative cost. Instead, we directly augment the belief space with cumulative cost. A belief state is a point in the probability simplex over the states, so a belief state has $n$ entries for a state space of cardinality $n$. Since the number of costs is generally much smaller than the state space cardinality, this belief space augmentation adds only a few entries to the belief and incurs little performance overhead. Therefore, the search space of the problem is not increased significantly.
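
As a rough illustration of this size argument, let $k$ denote the number of cost constraints (a symbol introduced here only for illustration). An augmented belief state $\bar{b}_t$ pairs a belief with $n$ entries with an admissible cost with $k$ entries, so that

$$ \text{number of entries of } \bar{b}_t = n + k, \qquad k \ll n. $$

For instance, assuming a single cost constraint ($k = 1$) on a problem with $n = 3201$ states (the size of CRS(5,7)), the augmented belief state has $3202$ entries rather than $3201$.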

**Q5.2: Do any of them (compared algorithms) correspond to function-approximating the occupation measure of the belief state? Have you tried it? Does it work? How does it compare?**

Thanks for this comment, which made us think carefully about approaches based on occupation measures. An optimal policy of a C-POMDP can be obtained by solving an LP using a discounted occupancy measure (Altman 1999, Lee et al. 2018). However, solving this LP exactly is intractable since the belief space may have infinite cardinality. Instead, existing methods solve the dual LP by computing an optimal solution $\lambda^*$. CC-POMCP (Lee et al. 2018) uses Monte-Carlo tree search with a subgradient method to approximate $\lambda^*$, while CGCP (Walraven and Spaan 2018) uses a column generation method to solve the LP defined over the entire policy space.
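
For concreteness, the dual problem mentioned above can be sketched in the following standard Lagrangian form for a single cost constraint. Here $b_0$ denotes the initial belief and $\hat{c}$ the cost threshold; this notation is assumed for illustration and may differ from that of the cited works:

$$ \min_{\lambda \geq 0} \; \max_{\pi} \; \left[ V_R^{\pi}(b_0) - \lambda \left( V_C^{\pi}(b_0) - \hat{c} \right) \right]. $$

For a fixed $\lambda$, the inner maximization is an unconstrained POMDP with the scalarized reward $R - \lambda C$, which is why these methods can reuse unconstrained POMDP solvers.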

CC-POMCP performs well, but only has asymptotic guarantees on constraint satisfaction and convergence. CGCP enjoys convergence to optimality, and computes sound solutions. Since our work focuses on the problem formulations of C-POMDP and RC-POMDP with guarantees on constraint satisfaction, we compare to CGCP in our evaluation. In our evaluation, CGCP exhibits the drawbacks of C-POMDP policies, validating our theory and justification for RC-POMDPs. This is because the unintuitive behavior stems from the C-POMDP formulation itself, rather than the method for solving it.

For RC-POMDP, it is not clear to us how to formulate a method that uses occupancy measure with the recursive cost constraints. Instead, we develop a tree-based policy search algorithm that finds admissible policies, in order to validate our theory and proposed RC-POMDP formulation.

**Q4.3: In some sense Theorem 2 and 3 are not surprising once the state space is augmented, and the Bellman operator is defined in that particular way.**

Yes, the reviewer is correct, and it is in fact by design. One of the main focuses of our work is to provide a problem formulation in which Bellman's principle of optimality is satisfied. This allows planning to remain consistent over successive decision steps, which mitigates the discussed drawbacks of C-POMDP policies and enables dynamic programming techniques.

**Q4.4: Experimental results are not surprising since the other algorithms have been designed with different objectives in mind.**

Yes, the reviewer is again correct in that the results follow and validate our theoretical analysis and highlight the advantages of the proposed problem/approach. We stress that the purpose of our experiments is to evaluate the behavior of C-POMDP policies on common POMDP problems with constraints. In particular, the C-RockSample problem is a C-POMDP benchmark problem from (Lee et al. 2018). These results serve as examples of POMDPs in which C-POMDP solutions exhibit the undesirable ("stochastically self-destructive") behavior, thereby validating our theoretical discussion and justification for proposing RC-POMDPs.

**Q5.1: There is typo in Theorem 3 (asymptotic claim).**

Thank you for catching this. We will update the asymptotic result of $V_C^{\pi^n}$ to $V_C^{\pi^\infty}$. Note that although $V_R^{\pi^*}$ is a unique fixed point, $V_C^{\pi^\infty}$ does not converge to a fixed point $V_C^{\pi^*}$, since an optimal reward-value function may have multiple cost-value functions that satisfy the cost constraints.

\section{Reviewer 2: PKtD}

We thank the reviewer for the encouraging review. We are glad that the reviewer finds the contribution of our work significant.

\section{Reviewer 3: JAak}

We thank the reviewer for the helpful comments.

**Q4.1: Constraints too stringent.**

RC-POMDPs are indeed more stringent than C-POMDPs, but we disagree that they are "infinitely more stringent" than C-POMDPs (as mentioned in the reviewer's summary in Q1). We stress that the imposed constraints are designed to be stringent enough to circumvent the unintuitive behavior of C-POMDP policies, but not as stringent as a worst-case constraint. In fact, as discussed in Remark 2 in Section 3, RC-POMDPs fall in between the two extreme cases. C-POMDPs bound the expected total cost of state trajectories, enabling belief trajectories with low expected cost to compensate for those with high expected cost. Conversely, a worst-case constraint formulation, which never allows any violations during execution, may be overly conservative. RC-POMDPs strike a balance between the two: they bound the expected total cost for all belief trajectories, only allowing cost violations during execution due to state uncertainty. This is illustrated by our experiments, which generally show that, while RC-POMDP policies avoid pathological behaviors, their rewards remain competitive with the rewards of optimal C-POMDP policies.

**Q4.2/Q5: Why not simply increase cost?**

Thank you for the question. Increasing cost is a practical method for many situations. However, there are two main limitations of just increasing the cost:

1. Increasing cost doesn't prevent the agent from traversing tunnel A. For example, suppose one policy (traverse tunnel B) incurs zero cost and zero reward, while another policy (traverse tunnel A) has high reward but incurs constraint-violating cost. Then, a C-POMDP optimal policy allocates as much probability as possible to the high-cost policy without violating the constraint. The cost for tunnel A can be made arbitrarily large, and the resulting mixed policy reduces the probability of traversing A, but it remains nonzero. In Example 1, if the cost of tunnel A is increased to a large value $c$, the expected cost is $0.8c$. With cost threshold $0 < \hat{c} < \infty$, the optimal C-POMDP policy traverses A with probability $\frac{\hat{c}}{0.8c}$ and traverses B with probability $1 - \frac{\hat{c}}{0.8c}$. The unintuitive behavior of C-POMDPs is not addressed by simply increasing cost.

2. A POMDP is inherently characterized by state uncertainty. If cost in a state is too high (so as to behave like a worst-case constraint), the C-POMDP problem may be infeasible as long as some probability mass is in that state, as all policies violate the constraints. Thus, increasing cost can become overly conservative.
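
The probability in item 1 follows from a short calculation, sketched here. Let $p$ denote the probability that the mixed policy traverses tunnel A (a symbol introduced only for this illustration), and assume that traversing B incurs zero cost and that $\hat{c} \leq 0.8c$. The expected-cost constraint then reads

$$ p \cdot 0.8c + (1 - p) \cdot 0 \leq \hat{c} \quad \Longleftrightarrow \quad p \leq \frac{\hat{c}}{0.8c}. $$

Because traversing A yields the higher reward, the optimal C-POMDP policy takes the largest feasible $p$, which is strictly positive for every finite $c$. With the purely illustrative values $\hat{c} = 1$ and $c = 100$, this gives $p = 1/80 = 0.0125$; raising the cost to $c = 1000$ only reduces it to $p = 0.00125$.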

In many partially observable (even safety-critical) problems, there may be undesirable state-actions that are best modeled with expected-cost thresholds, rather than worst-case constraints. An example of this is a sample exploration mission in which there is some risk of getting stuck when traversing some partially observable terrain, but traversal may be necessary for mission completion. This motivates RC-POMDPs, which strike a balance between C-POMDPs and worst-case constraints.

**Q4.3/Q5: Comparison with no-regret learning approach.**

Thank you for the interesting and important question. The mentioned work (Kalagarla et al. 2022) takes a primal-dual approach similar to that of the CGCP algorithm (Walraven and Spaan 2018) discussed and compared to in our paper, but with a different dual update procedure. CGCP solves a sequence of unconstrained POMDP optimizations in the form of $R - \lambda C$, and finds optimal (mixed) policies for C-POMDPs. In contrast, Kalagarla et al. 2022 finds a deterministic policy, which may be suboptimal for C-POMDPs since optimal policies for C-POMDPs are generally stochastic (Kim et al. 2011). In a sense, Kalagarla et al. 2022 restricts the search to deterministic policies.

In our counter-example (CE) in Figure 1, which is an abstracted variation of Example 1, the optimal C-POMDP policy is deterministic. The algorithm of Kalagarla et al. 2022 would compute exactly that optimal policy. However, as discussed in Section 2.1, that policy exhibits stochastic self-destruction.

Empirically, from Table 1 in Section 6, CGCP finds the optimal C-POMDP deterministic policy. There exists an admissible policy for CE, found by our algorithm, but CGCP always decides to traverse tunnel A. The result for CGCP-CL (CGCP with replanning) in CE shows that there are indeed replanning inconsistencies stemming from the violation of the optimal substructure property, even for problems which have deterministic optimal policies (or when restricted to deterministic policies). Therefore, in CE and in general, Kalagarla et al. 2022 does not find admissible policies and would have replanning issues.

In summary, the algorithm of Kalagarla et al. 2022 still suffers from the shortcomings of the C-POMDP formulation. We will discuss this, and consider comparing to it, in the final version of the paper.

\section{Reviewer 4: tsMM}

We thank the reviewer for the helpful comments.

**Q4/Q5.1: Why not consider a Multi-objective POMDP by considering a vector of value (R, C) by considering as soon as C is greater than a threshold, C becomes infinite?**

Thanks for this question. It helps us to explain our method from the Multi-objective POMDP (MO-POMDP) perspective. In essence, our algorithm performs exactly the suggested approach. At each node of the tree, we keep track of upper and lower bounds on R and C. As soon as the lower bound on C is greater than a threshold, that part of the search space is pruned from consideration, and the admissibility horizon is set to 0. This is equivalent to saying C becomes infinite. Performing your suggestion would be functionally equivalent to our current algorithm. The only difference is that we also keep track of the admissibility horizon, which is useful for explainability purposes for the case when no admissible policy exists.

It is also worth noting that the suggested MO-POMDP perspective explains our algorithmic approach well, but the problem it solves is still the RC-POMDP problem, and not the C-POMDP problem. We will include a discussion on the MO-POMDP perspective in the final version of the paper.

**Q5.2: Experiment with very large cost threshold (so as to be equivalent to an unconstrained POMDP).**

Thank you for the suggestion! This is an interesting question that touches on how generalizable and efficient our RC-POMDP algorithm and other C-POMDP algorithms are for problems that are less constraining. Since all policies are admissible in such a case (and so our algorithm does not have issues with conservatism), our algorithm is guaranteed to asymptotically converge to the optimal solution, but we would expect it to converge at a slower rate than a "fast" unconstrained POMDP algorithm.

Since our algorithm needs to keep track of admissible cost values, we mainly use a policy tree representation. This representation is less efficient than the $\alpha$-vector policy representation used in SARSOP and other "fast" offline unconstrained POMDP algorithms, which allows value improvements at a belief state to directly improve values at other belief states. Therefore, we would expect our algorithm to be less efficient in converging to near-optimal solutions than a more specialized unconstrained POMDP algorithm or a C-POMDP algorithm that builds on these unconstrained POMDP algorithms.

To answer this question, we have conducted some additional preliminary experiments. Due to time and computation constraints, we focused on our algorithm (RC-POMDP), CGCP (optimal C-POMDP), and SARSOP (Kurniawati et al. 2008) (unconstrained POMDP algorithm). We will include a more comprehensive empirical analysis in the final version of the paper.

| Problem (time limit) | Ours (Reward, Cost) | CGCP (Reward, Cost) | Unconstrained POMDP (SARSOP) Reward |
|----------------------|---------------------|---------------------|-------------------------------------|
| C-Tiger (300s)       | (-1.4, 3.2)         | (1.90, 3.2)         | 1.93                                |
| CE (300s)            | (12.0, 4.5)         | (12.0, 5.0)         | 12.0                                |
| Tunnels (300s)       | (1.92, 1.6)         | (1.56, 1.92)        | 1.92                                |
| CRS(4,4) (300s)      | (16.9, 2.2)         | (16.9, 2.4)         | 16.9                                |
| CRS(5,7) (300s)      | (14.9, 2.1)         | (14.8, 3.6)         | 23.9                                |
| CRS(5,7) (1000s)     | (15.3, 2.2)         | (24.0, 4.5)         | 24.0                                |

As seen in the table, our algorithm performs similarly to CGCP and the unconstrained POMDP algorithm SARSOP for most smaller problems. The C-Tiger problem benefits greatly from the $\alpha$-vector representation, since the optimal policy repeatedly cycles among a small set of belief states (which our algorithm treats as different augmented belief-admissible cost states). For slightly larger problems (CRS(5,7)), the efficient $\alpha$-vector representation and other heuristics of SARSOP (which CGCP takes advantage of, since it repeatedly calls SARSOP) enable much faster convergence than the policy tree-based method of our approach. Nonetheless, as the time limit is increased, our algorithm does slowly but surely improve values.

An interesting future direction would be to look into how an RC-POMDP algorithm can take advantage of the relative relevance of cost to reward in different parts of the search space. Note that algorithms like CGCP and CC-POMCP (Lee et al. 2018) are implicitly guided by the relative relevance of cost through the dual formulation ($R - \lambda C$). For an effectively unconstrained problem, these C-POMDP algorithms can converge relatively quickly by finding the optimal $\lambda^* = 0$, which reduces to the unconstrained POMDP problem.

\section{Reviewer 5: abPA}

We thank the reviewer for the insightful comments and questions.

**Q4.1: Scalability.**

That is a good point. You are correct that the primary focus of the work is the discussion on C-POMDPs and the RC-POMDP formulation; the algorithm is the secondary focus of the paper. That is why we decided to use smaller problems for evaluations which make illustrations and comparisons easier.

Currently, the algorithm scales to moderately sized problems. The largest problem we considered in the paper is CRS(5,7), which has state, action, and observation spaces of sizes (3201, 12, 3). The largest problem we found that we could scale to is CRS(11,11), which has state, action, and observation spaces of sizes (247809, 16, 3), and on which our algorithm obtains a reward-value of 5.99 in 300s. Our algorithm finds an admissible policy but does not converge given the short time limit. Since we also wanted to compare with CGCP-CL, which takes the full time limit for each action call, increasing the time limit was prohibitively time intensive for the main purpose of comparing the behaviors of RC-POMDPs to C-POMDPs and seeing the prevalence of the stochastic self-destruction behavior of C-POMDP policies. We will add results for CRS(11,11) to the final version of the paper.

We emphasize that the problems our algorithm can currently solve are large enough to be useful in real-life safety-critical applications, such as route decisions for navigation-based problems and high-level decisions for robotic search and rescue.

**Q4.2: Other baseline algorithms.**

Thanks for the suggestion. For a few reasons, we believed our chosen comparison algorithms were adequate for the purpose of our initial evaluation. CGCP finds optimal C-POMDP policies, and CGCP-CL evaluates the replanning performance of CGCP. These algorithms support our theoretical discussion. CPBVI and CPBVI-D are the closest to our algorithm with a point-based value iteration technique. These algorithms illustrate the performance of our algorithm compared to similar methods.

We considered comparing to CALP (Poupart et al. 2015), but since CGCP is shown to be more performant with better scalability, we focused on CGCP. The projected gradient ascent (PGA) technique of Wray and Czuprynski 2022 is very interesting. Since the algorithm solves C-POMDP, we would expect that the policies still exhibit stochastic self-destruction behaviors.

That said, the PGA approach is promising, and it may be possible to design an algorithm for RC-POMDPs using PGA. We will add a discussion on Wray and Czuprynski 2022, and other techniques that compute controllers directly in our related work section. For future work, where the focus is the algorithm, we will compare to a wider class of algorithms and larger benchmark problems.

**Q5.1: Description of rewards**

That is a very good suggestion. We will add a description of reward that incentivizes traversing tunnel A.

**Q5.2: More case studies on stochastic self-destruction.**

Thanks for the insightful comment. Although we agree that the violation rate metric is not a perfect one, it is still informative in that it shows the percentage of trials that exhibit the stochastic self-destruction behavior.

In our current benchmarks, all of the problems exhibit stochastic self-destruction behavior. The difference between the violation rates of CGCP and CGCP-CL (CGCP with closed-loop replanning) is higher for some models, which might signal that these problems show stronger effects of the violation of the optimal substructure property, and thus of stochastic self-destruction.

Per your suggestion, we plan to study which models showcase strong cases of stochastic self-destruction, and develop more metrics that signal stochastic self-destruction.

**Q5.3: Summary of benchmarks**

We will add a summary of the considered benchmarks with their sizes as (states, actions, observations): C-Tiger: (2, 3, 2), CE: (5, 2, 2), Tunnels: (53, 3, 5), CRS(4,4): (201, 8, 3), CRS(5,7): (3201, 12, 3), CRS(11,11): (247809, 16, 3).

**Q5.4: Future Work**

Thanks for the suggestion. We will add a section on future work as we believe RC-POMDP opens up many avenues for research. Here, we provide a (non-exhaustive) summary.

1. Analyzing classes (or conditions) of RC-POMDPs that are approximable.

2. RC-POMDPs provide arguably more desirable policies than C-POMDPs, but the cost constraints remain in expectation. For some applications, probabilistic or risk measure constraints may be more desirable than expectation constraints. These formulations may also benefit from the recursive constraints that we propose for RC-POMDPs.

3. Scalability: Our algorithm can benefit from better policy search heuristics and more efficient policy representations. We also plan to explore other methods, such as finite state controllers (e.g., PGA), occupation measures, and other approximation methods (e.g., online tree search (Lee et al. 2018, Jamgochian et al. 2023, etc.)).

\end{document}