\section{Recursively-Constrained POMDPs}
To mitigate the pathological behaviors and obtain a (policy-independent) optimal substructure property, we aim to align optimal policies computed at a current belief with optimal policies computed at future (successor) beliefs. We propose a new problem formulation called Recursively-Constrained POMDP (RC-POMDP), which imposes additional recursively defined constraints on a policy. 

An RC-POMDP has the same tuple as a C-POMDP, but with recursive constraints on beliefs at future time steps. 
These constraints require a policy to satisfy a history-dependent bound on the expected cumulative cost at every future belief state. Intuitively, 
we bound the cost value at every belief such that the constraint at the initial node is respected.

The expected cumulative cost of the trajectories associated with history $h_t$ is given as:
\begin{align}
    \label{eq: W}
    W(h_t) = \sum_{\tau=0}^{t-1}  \gamma^{\tau}  \mathbb{E}_{s_\tau \sim b_\tau}\left[C(s_\tau, a_\tau) \mid h_\tau \right].
\end{align}
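As a purely illustrative instance of Eq.~\eqref{eq: W} (the numbers are hypothetical and not taken from any experiment), suppose $\gamma = 0.9$ and the expected immediate costs along a history $h_3$ are $1$, $0$, and $2$ at $\tau = 0, 1, 2$, respectively. Then
\begin{align*}
    W(h_3) = 0.9^{0} \cdot 1 + 0.9^{1} \cdot 0 + 0.9^{2} \cdot 2 = 1 + 0 + 1.62 = 2.62.
\end{align*}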
We can direct the optimal policy at each time step $t$ by imposing that the total expected cumulative cost satisfies the initial cost constraint $\hat{c}$.
% , as a sum of the expected cumulative cost up to time step $t$ and expected future cumulative cost,
For a given $h_{t}$ and its corresponding $b_{t}$, the expected cumulative cost at $b_0$ is given by:
\begin{align}
    V^\pi_{C \mid h_t}(b_0) = W(h_{t}) + \gamma^{t}V_{C}^{\pi}(b_{t}).
\end{align}
Therefore, the following constraint should be satisfied by a policy $\pi$ at each future belief state:
\begin{align}
    \label{eq:precursor constraints}
    W(h_{t}) + \gamma^{t}V_{C}^{\pi}\left(b_{t} \right) \leq \hat{c}.
\end{align}
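To make the history dependence explicit, the constraint above can be rearranged, assuming $\gamma \in (0, 1]$ so that $\gamma^{t} > 0$, into a bound on the cost value at $b_t$ alone:
\begin{align*}
    V_{C}^{\pi}(b_{t}) \leq \frac{\hat{c} - W(h_{t})}{\gamma^{t}}.
\end{align*}
As a hypothetical numerical illustration, with $\hat{c} = 5$, $\gamma = 0.9$, $t = 3$, and $W(h_3) = 2.62$, the remaining budget at $b_3$ is
\begin{align*}
    \frac{5 - 2.62}{0.9^{3}} = \frac{2.38}{0.729} \approx 3.265,
\end{align*}
so two histories that reach the same belief with different accumulated costs generally face different bounds.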

We define the admissibility of a policy $\pi$ accordingly.
\begin{definition}[Admissible Policy]
    A policy $\pi$ is \emph{$k$-admissible} for a $k \in \mathbb{N}_0 \cup \{\infty \}$ if $\pi$ satisfies Eq.~\eqref{eq:precursor constraints} for all $t \in \{0, \ldots, k-1\}$ and all histories $h_t$ of length $t$ induced by $\pi$ from $b_0$. A policy is called \emph{admissible} if it is $\infty$-admissible.
\end{definition}
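As a simple check of the definition, consider $t = 0$. Assuming the usual convention that an empty sum is zero, Eq.~\eqref{eq: W} gives $W(h_0) = 0$, and Eq.~\eqref{eq:precursor constraints} reduces to
\begin{align*}
    V_{C}^{\pi}(b_{0}) \leq \hat{c},
\end{align*}
i.e., a bound on the expected cumulative cost at the initial belief only. Under this reading, a $1$-admissible policy is constrained only at $b_0$, and each larger $k$ adds the constraints for all induced histories of length up to $k-1$.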

Since RC-POMDP policies are constrained based on history, it is not sufficient to directly use belief-based policies. Thus, we consider history-based policies in this work. A history-based policy maps a history $h_t$ to a probability distribution over actions, i.e., an element of $\Delta(A)$.

The RC-POMDP optimization problem is formalized below.

\begin{problem}[RC-POMDP Planning Problem]
    \label{prob: rcpomdp}
    Given a C-POMDP and an admissibility constraint $k \in \mathbb{N} \cup \{\infty\}$, compute an optimal policy $\pi^*$ that is $k$-admissible, i.e., $\forall h_t$,
        \begin{align}
            &\pi^*(h_t) =  \arg\max _{\pi} V_{R}^{\pi}(h_t) \label{eq: reward objective} \\ 
            & \text { s.t.}\;\; W(h_{t}) + \gamma^{t}V_{C}^{\pi}\left(b_t\right) \leq \hat{c}  \;\; \forall t \in \{0, \dots, k-1\}  \label{eq:pre-recursive constraints}.
        \end{align}
\end{problem}

Note that Problem~\ref{prob: rcpomdp} is an infinite-horizon problem since the optimization objective~\eqref{eq: reward objective} is infinite horizon. The admissibility constraint $k$ is a user-defined parameter. In this work, we focus on $k = \infty$, i.e., admissible policies. 

\begin{remark}
    In POMDPs, reasoning about cost is done in expectation due to state uncertainty.
    C-POMDPs bound the expected total cost of state trajectories, enabling belief trajectories with low expected costs to compensate for those with high expected costs. Conversely, a worst-case constraint formulation of the problem, which never allows any violations during execution, may be overly conservative. RC-POMDPs strike a balance between the two: they bound the expected total cost for all belief trajectories, allowing cost violations during execution only when these arise from state uncertainty.
\end{remark}