\section{Experimental Evaluation}

To the best of our knowledge, this work is the first to propose and solve RC-POMDPs, and thus there are no existing algorithms to compare to directly. The purpose of our evaluation is to (i) empirically compare the \emph{behavior} of policies computed for RC-POMDPs with those computed for C-POMDPs, and (ii) evaluate the performance of our proposed algorithm for RC-POMDPs. To this end, we compare our algorithm, \textbf{ARCS}\footnote{Our code is open-sourced at \url{https://github.com/CU-ADCL/RC-PBVI.jl}.}, against the following offline algorithms:

\begin{itemize}[nosep] %, label={},leftmargin=0pt
    \item \textbf{CGCP} \cite{walraven2018cgcp}: Algorithm that computes near-optimal policies for C-POMDPs using a primal-dual approach.
    \item \textbf{CGCP-CL}: Closed-loop CGCP with updates on belief and admissible cost at each time step.
    \item \textbf{Exp-Gradient} \cite{kalagarla22aNoRegret}: Algorithm that computes mixed policies using a no-regret, primal-dual learning approach based on an Exponentiated Gradient method.
    \item \textbf{CPBVI} \cite{kim2011cpbvi}: Approximate dynamic programming that uses admissible cost as a heuristic.
    \item \textbf{CPBVI-D}: We modify CPBVI to compute deterministic policies to evaluate its efficacy for RC-POMDPs.
    % \item \textbf{ARCS} (Ours).
\end{itemize}

Since our comparison between RC-POMDPs and C-POMDPs mainly concerns constraints, we do not compare to online C-POMDP algorithms such as CC-POMCP~\cite{Lee2018ccpomcp}, which can handle larger problems but do not have anytime guarantees on constraint satisfaction.

We consider the following environments: (i) \textbf{CE}: the counterexample in Figure~\ref{fig:counterexample}, (ii) \textbf{C-Tiger}: a constrained version of the Tiger POMDP \cite{Kaelbling1998pomdp}, (iii) \textbf{CRS}: Constrained RockSample \cite{Lee2018ccpomcp}, and (iv) \textbf{Tunnels}: a scaled version of Example~\ref{ex:caveexample}, shown in Figure~\ref{fig:tunnels problem}. Details on each problem, the experimental setup, and the algorithm implementations are in the Appendix. For all algorithms except CGCP-CL, solve time is limited to $300$ seconds and online action selection to $0.05$ seconds. For CGCP-CL, $300$ seconds were given to re-compute each action. We report the mean discounted cumulative reward and cost, and the constraint violation rate, in Table~\ref{tab:results}. The constraint violation rate is the fraction of trials in which $d(h_t)$ becomes negative, which means that Eq.~\eqref{eq:pre-recursive constraints} is violated.
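
As an illustration of this metric, suppose the evaluation consists of $N$ simulated trials, and let $h_t^{(i)}$ denote the history at time step $t$ of trial $i$. The symbols $N$, $h_t^{(i)}$, and $\widehat{p}_{\mathrm{viol}}$ are introduced here only for this sketch and are not used elsewhere. Under these assumptions, the violation rate can be written as
\[
    \widehat{p}_{\mathrm{viol}} \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\, \exists\, t \geq 0 \,:\, d\big(h_t^{(i)}\big) < 0 \,\right],
\]
where $\mathbf{1}[\cdot]$ is the indicator function. A trial therefore counts as a violation as soon as the admissible cost is negative at any time step, regardless of the cumulative cost that trial eventually incurs. For instance, the violation rate of $0.51$ reported for CGCP on CE means that roughly half of the trials reached a history with negative admissible cost.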

\begin{table}[thb]
    \centering
    \scalebox{0.85}{
    \begin{tabular}{l | l | c | c| c} 
    Env. & Algorithm & Violation Rate & Reward & Cost\\ 
    \hline
    \multirow{2}{*}{CE}  & CGCP & $0.51$ & $\color{DarkBlue} \mathbf{12.00} $ & $5.19$ \\
                                                & CGCP-CL & $\mathbf{0.00}$ & $6.12$ & $3.25$ \\
                                                ($\hat{c}=5$) & Exp-Gradient & $0.49$ & $11.87$ & $4.98$ \\
                                                & CPBVI & $\mathbf{0.00} $ & $8.39$ & $4.38$ \\
                                                & CPBVI-D & $\mathbf{0.00} $ & $6.10$ & $3.54$ \\
                                                & Ours & $\mathbf{0.00}$ & $\color{DarkGreen}\mathbf{10.00}$ & $5.00$ \\\hline
    \multirow{2}{*}{C-Tiger} &  CGCP & $0.75$ & $-1.69$ & $3.00$ \\ 
                                                & CGCP-CL  & $0.14$ & $-2.98$ & $2.93$ \\
                                                & Exp-Gradient & $1.00$ & $\color{DarkBlue}\mathbf{1.81}$ & $3.22$ \\
                                                ($\hat{c}=3$)& CPBVI & $0.15$ & $-11.11$ & $2.58$\\
                                                &  CPBVI-D & $0.09$ & $-9.49$ & $2.76$ \\
                                                & Ours & $\mathbf{0.00}$ & $\color{DarkGreen}\mathbf{-5.75}$ & $2.98$\\
    \hline
    \multirow{2}{*}{CRS(4,4)} & CGCP & $0.51$ & $\color{DarkBlue}\mathbf{10.43}$ & $0.51$ \\ 
                                            & CGCP-CL & $0.78$ & $1.68$ & $0.72$ \\
                                            ($\hat{c}=1$) & Exp-Gradient & $0.30$ & $10.38$ & $0.92$ \\
                                             & CPBVI & $\mathbf{0.00} $ & $-0.40 $ & $0.52$ \\
                                             & CPBVI-D & $\mathbf{0.00} $ & $0.64$ & $0.47$\\
                                            & Ours & $\mathbf{0.00}$ & $\color{DarkGreen}\mathbf{6.52}$ & $0.52$\\
    \hline
    \multirow{2}{*}{CRS(5,7)} & CGCP & $0.41$ & $\color{DarkBlue} \mathbf{11.98}$ & $1.00$\\ 
                                            & CGCP-CL & $0.18$ & $9.64$ & $0.99$ \\
                                            ($\hat{c}=1$) & Exp-Gradient & $0.30$ & $11.90$ & $1.31$ \\
                                            & CPBVI & $\mathbf{0.00} $ & $0.00 $ & $0.00 $ \\
                                             & CPBVI-D & $\mathbf{0.00} $& $0.00 $ & $0.00 $\\
                                            & Ours & $\mathbf{0.00}$ & $\color{DarkGreen}\mathbf{11.77}$ & $0.95$ \\          
    \hline
    \multirow{2}{*}{CRS(7,8)} &  CGCP & $0.36$ & $10.78$ & $0.945$\\  
                                            & CGCP-CL & $0.20$ & $\color{DarkBlue}\mathbf{11.17}$ & $0.931$ \\
                                            ($\hat{c}=1$) & Exp-Gradient & $0.32$ & $10.03$ & $1.15$ \\
                                            & CPBVI & $\mathbf{0.00}$ & $0.0$ & $0.0$ \\ 
                                             & CPBVI-D & $\mathbf{0.00}$& $0.0$ & $0.0$\\ 
                                             & Ours & $\mathbf{0.00}$ & $\color{DarkGreen}\mathbf{6.61}$ & $0.960$ \\
    \hline
    \multirow{2}{*}{Tunnels} & CGCP & $0.50$ & $1.61$ & $1.01$\\ 
                                                & CGCP-CL & $0.31$ & $1.22$ & $0.68$ \\
                                                ($\hat{c}=1$) & Exp-Gradient & $0.48$ & $1.35$ & $0.82$ \\
                                                & CPBVI & $0.90$ & $\color{DarkBlue} \mathbf{1.92}$ & $1.62$\\
                                                & CPBVI-D & $0.89$ & $\color{DarkBlue} \mathbf{1.92}$ & $1.57$\\
                                                & Ours & $\mathbf{0.00}$ & $\color{DarkGreen}\mathbf{1.03}$ & $0.44$
    \end{tabular}
    }
    \caption{Benchmark results. We report the mean for each metric. We bold the best violation rates in \textbf{black}, the highest reward with violation rate greater than $0$ in \textcolor{DarkBlue}{blue}, and the highest reward with $0$ violation rate in \textcolor{DarkGreen}{green}. The standard error of the mean and the problem parameters can be found in the Appendix.}
\label{tab:results}
\end{table}

\begin{figure} \label{fig:tunnels}
    \centering
    \includegraphics[width=0.6\linewidth]{figures/Cave.pdf}
    \caption{Tunnels. There is a cost of $1$ for rock traversal (red regions) and $0.5$ for backtracking. Trajectories from CGCP (blue) and ARCS (green) are displayed, with opacity approximately proportional to frequency of trajectories.}
    \label{fig:tunnels problem}
\end{figure}

In all environments, ARCS found admissible policies ($k = \infty$). In contrast, CGCP, Exp-Gradient, CPBVI, and CPBVI-D only guarantee an admissible horizon of $k = 1$, since the C-POMDP constraint is imposed only at the initial belief. CGCP-CL may have a closed-loop admissible horizon greater than $1$, but does not provide guarantees, as indicated by its violation rates.

The benchmarking results show that the policies computed by ARCS generally achieve cumulative reward competitive with that of policies computed for C-POMDPs, without any constraint violations and thus without pathological behavior. ARCS also generally outperforms CPBVI and CPBVI-D across metrics, as neither could search the problem space sufficiently to find good solutions in large RC-POMDPs.

Although the C-POMDP policies generally satisfy the C-POMDP expected cost constraints, their high violation rates across the environments strongly suggest that \emph{stochastic self-destruction} in C-POMDPs is not an exceptional phenomenon, but is intrinsic to the C-POMDP problem formulation. This behavior is illustrated in the Tunnels problem, shown in Figure~\ref{fig:tunnels problem}. CGCP (in blue) decides to traverse tunnel $A$ $51\%$ of the time even when it observes that $A$ is rocky, and traverses tunnel $B$ $49\%$ of the time. In contrast, ARCS never traverses tunnel $A$, since such a policy is inadmissible. Instead, it traverses $B$ or $C$ depending on its observation of rocks in tunnel $B$, to maximize rewards while remaining admissible.

Finally, the closed-loop inconsistency of C-POMDP policies is evident when comparing open-loop CGCP with closed-loop CGCP-CL. In most cases (all except CRS(7,8)), the cumulative reward decreases when going from CGCP to CGCP-CL, sometimes drastically. The violation rate also decreases in most environments, but generally not to $0$, suggesting that planning with C-POMDPs instead of RC-POMDPs can lead to myopic behavior that cannot be addressed by re-planning. In CE, CRS(4,4), and CRS(5,7), CGCP-CL attains lower reward than ARCS, and in both of these CRS problems it still incurs constraint violations. Therefore, even for closed-loop planning, RC-POMDPs can be more advantageous than C-POMDPs.

\subsection*{Unconstrained POMDP problems}

Next, we evaluate how well the RC-POMDP framework and our proposed algorithm perform on problems whose constraints are relaxed to the point that they become equivalent to unconstrained POMDPs. We evaluate ARCS (RC-POMDP), CGCP (C-POMDP algorithm), and SARSOP (unconstrained POMDP algorithm) on the same benchmark problems with a very high constraint threshold, $\hat{c} = 1000$.

For these problems, all policies are admissible, and our algorithm is guaranteed to asymptotically converge to the optimal solution. However, since our algorithm needs to keep track of admissible cost values, we utilize a policy tree representation. This representation is less efficient than the $\alpha$-vector policy representation used in SARSOP and CGCP, which allows value improvements at one belief state to directly improve values at other belief states.

Table~\ref{tab:unconstrained results} reports the lower bound on reward and the upper bound on cost computed by each algorithm, with a time limit of $300$ seconds. As seen in Table~\ref{tab:unconstrained results}, our algorithm performs similarly to CGCP and the unconstrained POMDP algorithm SARSOP for most smaller problems. The C-Tiger problem benefits greatly from the $\alpha$-vector representation, since the optimal policy repeatedly cycles among a small set of belief states (which our algorithm treats as different augmented belief-admissible cost states). For slightly larger problems (CRS(5,7)), the efficient $\alpha$-vector representation and other heuristics of SARSOP (which CGCP takes advantage of, since it repeatedly calls SARSOP) enable much faster convergence than our policy tree-based approach. Nonetheless, as the time limit is increased to $1000$ seconds, our algorithm slowly improves its values.

\begin{table}[thb]
    \centering
    \scalebox{0.85}{
    \begin{tabular}{l | l | c | c} 
    Env. & Algorithm & Reward & Cost\\ 
    \hline
    \multirow{2}{*}{CE} &SARSOP (POMDP) & $\mathbf{12.0}$ & - \\
    &  CGCP (C-POMDP)& $\mathbf{12.0}$ & $5.0$  \\ 
    & Ours (RC-POMDP) & $\mathbf{12.0}$ & $5.0$\\
    \hline
    \multirow{2}{*}{C-Tiger} & SARSOP (POMDP) & $\mathbf{1.93}$ & -\\
    & CGCP (C-POMDP) & $1.90$ & $3.2$ \\
    & Ours (RC-POMDP) & $-1.4$ & $3.2$ \\\hline
    \multirow{2}{*}{Tunnels} & SARSOP (POMDP) & $\mathbf{1.92}$ & -\\
    &CGCP (C-POMDP) & $\mathbf{1.92}$ & 1.6\\ 
    & Ours (RC-POMDP) & $\mathbf{1.92}$ & 1.6  \\
    \hline
    \multirow{2}{*}{CRS(4,4)} &SARSOP (POMDP) & $\mathbf{16.9}$ & - \\
    & CGCP (C-POMDP) & $\mathbf{16.9}$ & 2.4 \\ 
    & Ours (RC-POMDP) & $\mathbf{16.9}$ & 2.2 \\
    \hline
    \multirow{2}{*}{CRS(5,7)} & SARSOP (POMDP) & $\mathbf{23.9}$ & - \\
    &CGCP (C-POMDP) & $14.8$ & $3.6$\\ 
    & Ours (RC-POMDP) & $14.9$ & $2.1$\\
    \hline
    \multirow{2}{*}{CRS(5,7)} & SARSOP (POMDP) & $\mathbf{24.0}$ & - \\
    &CGCP (C-POMDP) & $\mathbf{24.0}$ & $4.5$\\ 
    ($1000$\,s) & Ours (RC-POMDP) & $15.3$ & $2.2$
    \end{tabular}
    }
    
    \caption{Under-approximations of the computed policies (lower bounds on reward values and upper bounds on cost values), with the best reward values in bold. SARSOP, as an unconstrained POMDP algorithm, only considers reward values.}
\label{tab:unconstrained results}
\end{table}