
\section{Dynamic Programming for RC-POMDPs}

With the theoretical foundation above, we present a first algorithm that approximately solves Problem~\ref{prob: rcpomdp} with scalar cost and admissibility constraint $k = \infty$. We leave the multi-dimensional and finite-$k$ cases for future work. The algorithm is called Admissibility Recursively Constrained Search (ARCS). ARCS exploits the Markovian property of the belief-admissible cost formulation in Proposition~\ref{prop: markovian}, together with Theorems~\ref{thm:deterministic rcpomdp}--\ref{thm: fixed point rcpomdp}, to perform point-based dynamic programming in the space of deterministic and admissible policies, building on unconstrained POMDP methods \citep{shani2013survey}.

ARCS is outlined in Algorithm~\ref{alg:RCPBVI}. It takes as input the RC-POMDP $\mathcal{M}$ and $\epsilon > 0$, a target error between the computed policy and an optimal policy at $b_0$. ARCS explores the search space by incrementally sampling points in the history space. These points form nodes in a policy tree $T$. At each iteration, a \texttt{SAMPLE} step expands a sequence of points starting from the root. Then, a Bellman \texttt{BACKUP} step is performed for each sampled node. Finally, a \texttt{PRUNE} step removes sub-optimal nodes. These three steps are repeated until an admissible $\epsilon$-optimal policy is found. Pseudocode for \texttt{SAMPLE}, \texttt{BACKUP}, and \texttt{PRUNE} is provided in the appendix.

\begin{algorithm}[t]
    \caption{Admissibility Recursively Constrained Search (ARCS)}
    \label{alg:RCPBVI}
    \texttt{ARCS($\mathcal{M}, \epsilon$)}
    \begin{algorithmic}[1]
        \STATE Initialize cost-minimizing policy $\hat{\pi}_{c}^{min} = \Gamma_{c_{min}}$.
        \STATE $(\alpha_r, \alpha_c) \gets \argmin_{(\alpha_r, \alpha_c) \in \Gamma_{c_{min}}} \alpha_c^T b_0$.
        \STATE $\lowervalue \gets \alpha_r^T b_0, \uppercost \gets \alpha_c^T b_0$.
        \STATE Initialize $\uppervalue$ and $\lowercost$ for $b_0$ with FIB.
        \STATE Initialize $k_0$ with Eqs.~\eqref{eq: k admissible guarantee} and \eqref{eq: infinite admissibility}.
        \STATE $T \gets v_0 = (b_0, \hat{c}, k_0, \uppervalue, \lowervalue, \uppercost, \lowercost, \emptyset, \emptyset, \emptyset, \emptyset)$.
        \REPEAT
        \STATE $B_{sam} \gets$ \texttt{SAMPLE}($\epsilon$).
        \FORALL{$v \in B_{sam}$}
            \STATE \texttt{BACKUP}($v$).
        \ENDFOR
        \STATE \texttt{PRUNE}().
        \UNTIL{termination conditions are satisfied}
        \STATE \textbf{return} $T, \Gamma_{c_{min}}$.
    \end{algorithmic}
\end{algorithm}

\textbf{Policy Tree Representation \quad} 
We represent the policy with a policy tree $T$. A node in $T$ is a tuple $v = (b, d, k, \uppervalue, \lowervalue, \uppercost, \lowercost, \upperQvalue, \lowerQvalue, \upperQcost, \lowerQcost)$, where $b$ is a belief, $d$ is a history-dependent admissible cost bound, $k$ is a lower bound on the admissible horizon, $\uppervalue$ and $\lowervalue$ are the two-sided bounds on the reward-value, $\uppercost$ and $\lowercost$ are the two-sided bounds on the cost-value, $\upperQvalue$ and $\lowerQvalue$ are the two-sided bounds on the $Q$ reward-value, and $\upperQcost$ and $\lowerQcost$ are the two-sided bounds on the $Q$ cost-value. The root of $T$ is the node $v_0$ with $b = b_0$, $d = \hat{c}$, and admissible horizon lower bound $k_0$.

From Theorem~\ref{thm: fixed point rcpomdp}, a key aspect of effective dynamic programming for RC-POMDPs is computing admissible policies. This can be approximated by minimizing $V_{C}(\hat{b}_t)$. As a pre-processing step, we first approximate a minimum cost-value policy $\pi_{c}^{min} = \arg\inf_{\pi} V^\pi_{C}$. An arbitrarily tight under-approximation (upper bound) $\hat{\pi}_{c}^{min}$, represented as a set of $|S|$-dimensional hyperplanes called $\alpha$-vectors, can be computed efficiently with a POMDP algorithm \citep{hauskrecht2000value}. The reward-value obtained by $\hat{\pi}_{c}^{min}$ is also a lower bound on the optimal reward-value. Thus, $\hat{\pi}_{c}^{min}$ is represented by a set of $\alpha$-vector pairs $(\alpha_r, \alpha_c) \in \Gamma_{c_{min}}$. It is used to initialize our policy and is followed from the leaf nodes of $T$ onward.

% is used at leaf nodes and histories not in $T$ during execution. 
% We assume that a sufficiently well approximated $\pi_{c_{min}}$ is pre-computed, but further computation can be done on the fly to improve $\pi_{c_{min}}$.

To initialize a new node, the belief $b'$ is computed with Eq.~\eqref{eq:bayes}, and $d'$ is computed recursively with Eq.~\eqref{eq:history dependent cost recursive}. We initialize $\lowervalue$ and $\uppercost$ with $\hat{\pi}_{c}^{min}$, and initialize $\lowercost$ and $\uppervalue$ independently using the Fast Informed Bound (FIB) \cite{hauskrecht2000value}. Finally, $k'$ is initialized to a lower bound on the admissible horizon.

\textbf{Admissible Horizon Lower Bound \quad}
\label{sec:admissible horizon} It is computationally intractable to exhaustively search the possibly infinite policy space. Thus, we maintain a lower bound on the admissible horizon of the policy for every node. This bound is used to establish admissibility beyond the current search depth of the tree, and to improve search efficiency via pruning. To initialize the admissible horizon guarantee of a leaf node, we compute a lower bound on the admissible horizon $k$ obtained when following $\hat{\pi}_{c}^{min}$.
% , which is the minimum cost policy used for leaf nodes and future histories not in $T$. 
% \qh{there is confusion on how non leaf-node $k$.}

\begin{lemma}
    \label{lemma:k-admissible}
    Let $C_{max} = \max_{b \in B}C(b,\hat{\pi}_{c}^{min}(b))$ be the maximum $1$-step cost that $\hat{\pi}_{c}^{min}$ incurs at any time step across the entire belief space.
    Then, for a node $v$ with $v.d < 0$, $k = 0$. For a leaf node $v$ with $v.d \geq 0$, $\hat{\pi}_{c}^{min}$ is at least $k$-admissible with
    \begin{align}
        \label{eq: k admissible guarantee}
        k = \big\lfloor \log \big(1 - ({v.d}/{C_{max}}) \cdot (1-\gamma)\big) / \log(\gamma) \big\rfloor,
    \end{align}
    and $\hat{\pi}_{c}^{min}$ is admissible from history $h$ if
    \begin{align}
        \label{eq: infinite admissibility}
         % \frac{C_{max}}{(1-\gamma)} \leq v.d
         {C_{max}}/{(1-\gamma)} \leq v.d
         \text{\;\; or \;\;} v.\overline{V}^{\hat{\pi}_{c}^{min}}_C = 0 \leq v.d.
    \end{align}
\end{lemma}
% \noindent
A proof is provided in the appendix.
This lemma provides sufficient conditions for admissibility of computed policies. We compute an upper bound on the parameter $C_{max}$,
\begin{align}
    C_{max} \leq V_{C, max}^{\hat{\pi}_{c}^{min}} = \max_{b \in \Delta(S)} \min_{(\alpha_r, \alpha_c) \in \Gamma_{c_{min}}}\alpha_c^T b,
\end{align}
where $\alpha_c$ refers to a cost $\alpha$-vector. This can be solved efficiently with the maximin LP \cite{williams1990model}:
\begin{align}
    \label{eq:maximinLP}
    \max_{z, b} \, z \;\; \text{ s.t. } \;\;\;
\alpha_c^T b \geq z \;\; \forall (\alpha_r, \alpha_c) \in \Gamma_{c_{min}}, \;\; b \in \Delta(S).
\end{align}
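
As an illustrative sketch (the numbers below are hypothetical and not taken from our evaluation), consider a two-state problem with $\gamma = 0.95$ in which $\Gamma_{c_{min}}$ contains two cost $\alpha$-vectors, $\alpha_c^1 = (0, 2)^T$ and $\alpha_c^2 = (1, 0.5)^T$. Writing $b = (p, 1-p)^T$, the LP in Eq.~\eqref{eq:maximinLP} reduces to
\begin{align*}
    \max_{p \in [0,1]} \, \min\big\{ 2(1-p), \; 0.5 + 0.5p \big\},
\end{align*}
whose optimum lies where the two terms are equal, i.e., at $p = 0.6$ with $z = 0.8$, so $C_{max} \leq 0.8$. Substituting this bound for $C_{max}$ in Eq.~\eqref{eq: k admissible guarantee} is conservative, since a larger $C_{max}$ can only decrease $k$. For a leaf node with $v.d = 4$,
\begin{align*}
    k = \big\lfloor \log\big(1 - (4/0.8) \cdot 0.05\big) / \log(0.95) \big\rfloor = \big\lfloor \log(0.75) / \log(0.95) \big\rfloor = \lfloor 5.61 \rfloor = 5.
\end{align*}
Indeed, $0.8 \sum_{t=0}^{4} 0.95^t \approx 3.62 \leq 4$, whereas $0.8 \sum_{t=0}^{5} 0.95^t \approx 4.24 > 4$. The first condition in Eq.~\eqref{eq: infinite admissibility} would hold at this node only if $v.d \geq 0.8 / 0.05 = 16$.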

\textbf{Sampling \quad}
% In point-based value iteration methods, the two main classes of sampling strategies are random search \cite{spaan2005perseus} and heuristic search \cite{Smith2005HSVI, Kurniawati-RSS08-SARSOP}. 
ARCS uses a mixture of random sampling \cite{spaan2005perseus} and heuristic search (SARSOP) \cite{Kurniawati-RSS08-SARSOP}. Our empirical evaluations suggest that this mixture strikes an effective balance between finding policies with high cumulative reward and finding policies that are admissible. At each \texttt{SAMPLE} step, ARCS expands the search space from the root of $T$ with either heuristic sampling or random sampling. For heuristic sampling, we use the same sampling strategy and sampling termination condition as SARSOP: actions with the highest $\upperQvalue$ are chosen, along with observations that make the largest contribution to the gap at the root of $T$, and sampling terminates based on a combination of selective deep sampling and a gap termination criterion. With random sampling, actions and observations are chosen randomly while traversing the tree until a new node is reached and added to the tree. The sampled points are then passed to \texttt{BACKUP}.

\textbf{Backup \quad}
The \texttt{BACKUP} operation at node $v$ updates the values stored in $v$ by propagating information from the children of $v$ back to $v$. First, the values of $\upperQvalue, \lowerQvalue$ and $\upperQcost$ are computed for each action using Eq.~\eqref{eq:q-reward} and Eq.~\eqref{eq:q-cost} for rewards and costs, respectively. Then, the RC-POMDP backup in Eq.~\eqref{eq:rcbackup} is used to update $\uppervalue, \lowervalue, \lowercost, \uppercost$. The action selected to update $\lowervalue$ is also used to update $k$ by back-propagating the minimum $k$ over all of its children. If no action is feasible, all current policies from that node are inadmissible; in that case, we update the reward- and cost-values using the action with the minimum $Q$-cost value, and set $k = 0$.

\textbf{Pruning \quad}
% To keep the size of $T$ small, we prune nodes and actions that are suboptimal. Nodes and actions that are pruned are not considered during action and observation selection for \texttt{SAMPLE} and \texttt{BACKUP}. \qh{Bring it here - Due to space considerations, we leave details in the Appendix.}

To keep the size of $T$ small and improve tractability, we prune nodes and node-actions that are suboptimal, using the following criteria. First, for each node $v \in B_{sam}$, if $v.\lowercost > v.d$, no admissible policy exists from $v$, so $v$ and its subtree are pruned. Next, we prune actions as follows. Let $k(v,a)$ be the admissible horizon guarantee of taking action $a$ at node $v$, i.e., the minimum guarantee over the resulting successor nodes. Between two actions $a$ and $a'$, if $k(v,a') = \infty \text{ and } v.\upperQvalue(a) < v.\lowerQvalue(a')$, we prune the node-action $(v,a)$ (disallow taking action $a$ at node $v$), since action $a$ can never be taken by the optimal admissible policy. Next, if all node-actions $(v,a)$ are pruned, $v$ is also pruned. Finally, the node-action $(v,a)$ is pruned if any successor node from taking $a$ at $v$ is pruned. Pruned nodes and node-actions are excluded from action and observation selection in \texttt{SAMPLE} and \texttt{BACKUP}.
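
As a hypothetical illustration of the action-pruning rule, suppose a node $v$ has two actions with $v.\upperQvalue(a) = 3.0$, $v.\lowerQvalue(a') = 3.5$, and $k(v,a') = \infty$. Then
\begin{align*}
    v.\upperQvalue(a) = 3.0 < 3.5 = v.\lowerQvalue(a'),
\end{align*}
so $(v,a)$ is pruned: $a'$ is known to be admissible and guarantees a reward-value of at least $3.5$, while $a$ can achieve at most $3.0$. If instead $k(v,a')$ were finite, $(v,a)$ would be kept, because $a'$ is not yet known to be admissible and $a$ may still turn out to be the best admissible action.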

\begin{proposition}
    \texttt{PRUNE} only removes sub-optimal policies.
\end{proposition}

\textbf{Termination Condition \quad}
ARCS terminates when two conditions are met: (i) it has found an admissible policy, i.e., $v_0.k = \infty$, which holds when all leaf nodes $v_{leaf}$ reachable under the policy satisfy Eq.~\eqref{eq: infinite admissibility}, and (ii) it has found an $\epsilon$-optimal policy, i.e., the gap criterion at the root is satisfied, $v_0.\uppervalue - v_0.\lowervalue \leq \epsilon$.
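
For example, with hypothetical values $\epsilon = 0.1$, $v_0.\uppervalue = 10.02$, and $v_0.\lowervalue = 9.95$, the gap is $0.07 \leq \epsilon$, so condition (ii) holds. ARCS nevertheless continues if $v_0.k$ is still finite, and stops only once $v_0.k = \infty$ as well.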

\begin{remark}
    ARCS can be modified to work in an anytime fashion given a time limit, and output the best computed policy and its admissible horizon guarantee $v_0.k$.
\end{remark}

\subsection{Algorithm Analysis}

Here, we analyze the theoretical properties of ARCS.

\begin{lemma}[Bound Validity]
    \label{lemma:validity}
     Given an RC-POMDP with admissibility constraint $k = \infty$, let $T$ be the policy tree after some iterations of ARCS. Let $V_R^*$ be the reward-value of an optimal admissible policy. At every node $v$ with $\bar{b} = (b, d)$ and admissible horizon guarantee $v.k = \infty$,
     it holds that: \\
     $v.\lowervalue \leq V_R^*(\bar{b}) \leq v.\uppervalue$ \;\; and \;\; $ v.\uppercost \leq d.$
\end{lemma}

\begin{theorem}[Soundness]
    \label{thm:soundness}
    Given an RC-POMDP with admissibility constraint $k = \infty$ and $\epsilon > 0$, if ARCS terminates with a solution, the returned policy is admissible and $\epsilon$-optimal.
\end{theorem}

ARCS is not complete: it may fail to terminate, both because the admissible horizon is computed conservatively and because some problems require searching infinitely deep to find an admissible policy. However, ARCS can find $\epsilon$-optimal admissible policies for many problems, such as the ones in our evaluation. These are problems where a finite search depth suffices to establish admissibility despite this conservatism. We leave the analysis of such classes of RC-POMDPs to future work.