\section{Constrained POMDPs}

POMDPs model sequential decision-making problems under transition uncertainty and partial observability.

\begin{definition}[POMDP]
    \label{def:pomdp}
    A \emph{Partially Observable Markov Decision Process} (POMDP) is a tuple  $\mathcal{P} = (S, A, O, T, R, Z, \gamma, b_0)$, where:
    $S, A,$ and $O$ are finite sets of states, actions and observations, respectively, 
    $T : S \times A \times S \rightarrow [0,1]$ is the transition probability function,
    $R : S \times A \rightarrow [R_{\min}, R_{\max}]$, for $R_{\min}, R_{\max} \in \mathbb{R}$, is the immediate reward function,
    $Z : S \times A \times O \rightarrow [0,1]$ is the probabilistic observation function,
    $\gamma \in [0,1)$ is the discount factor,
    and $b_0 \in \Delta(S)$ is an initial belief, where $\Delta(S)$ is the probability simplex (the set of all probability distributions) over $S$.
\end{definition}
\noindent
We denote the probability distribution over states in $S$ at time $t$ by $b_t \in \Delta(S)$ and the probability of being in state $s$ at time $t$ by $b_t(s)$.

The evolution of an agent according to a POMDP model is as follows.
At each $t \in \mathbb{N}_0$, the agent has a belief $b_t$ of its state $s_t$ as a probability distribution over $S$ and takes action $a_t \in A$. Its state evolves from $s_t \in S$ to $s_{t+1} \in S$ according to $T(s_t,a_t,s_{t+1})$, and it receives an immediate reward $R(s_t,a_t)$ and observation $o_{t} \in O$ according to observation probability
% $P(o_{t+1} | s_t, a_t, s_{t+1}) =$  
$Z(s_{t+1}, a_t, o_{t})$. The agent then updates its belief 
% $b_{t+1}(s) = P(s_{t+1} = s | h_t)$, 
% and $b_{t+1} = \tau(b_t,a_t,o_t)$ denotes the successor of belief $b$ when taking action $a$ and receiving observation $o$, and can be computed recursively 
%that is, for $s_{t+1} = s'$,  $b_{t+1}(s') \propto Z(s', a_t, o_{t}^+) \sum_{s \in S} b_{t}(s) T(s,a_t,s')$.
using Bayes' theorem;
that is, for $s_{t+1} = s'$,
\begin{align}
    \label{eq:bayes}
    b_{t+1}(s') \propto Z(s', a_t, o_{t}) \sum_{s \in S} T(s,a_t,s') b_{t}(s) .
\end{align}
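Written out with its normalizing constant, the update in Eq.~\eqref{eq:bayes} takes the following standard form (a sketch; the dummy index $s''$ is introduced here only for the normalization sum):
\begin{align*}
    b_{t+1}(s') = \frac{Z(s', a_t, o_{t}) \sum_{s \in S} T(s,a_t,s') \, b_{t}(s)}{\sum_{s'' \in S} Z(s'', a_t, o_{t}) \sum_{s \in S} T(s,a_t,s'') \, b_{t}(s)} .
\end{align*}
The denominator is the probability of receiving $o_t$ after taking $a_t$ at belief $b_t$, so the update is well defined only for observations with positive probability.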

Then, the process repeats. Let $h_{t} = \{a_0, o_0, \cdots, a_{t-1}, o_{t-1}\}$ denote the history of the actions and observations up to but not including time step $t$; thus, $h_0 = \emptyset$. The belief at time step $t$ is therefore $b_{t} = P(s_{t} \mid b_0, h_t)$. For readability, we do not explicitly include $b_0$, as all variables are conditioned on $b_0$.

The agent chooses actions according to a policy $\pi: \Delta(S) \to \Delta(A)$, which maps a belief $b$ to a probability distribution over actions. A policy $\pi$ is called \emph{deterministic} if $\pi(b)$ places all of its probability mass on a single action for every $b \in \Delta(S)$. A policy is typically evaluated according to the expected rewards it accumulates over time.
Let $R(b,a) = \mathbb{E}_{s \sim b}[R(s,a)]$ be the expected reward for the belief-action pair $(b,a)$. 
The \textit{expected discounted sum of rewards} that the agent receives under policy $\pi$ starting from belief $b_t$ is
\begin{align}
    \label{eq: total reward}
    V_{R}^{\pi}(b_t) &= \mathbb{E}_{\pi, T, Z} \Big[ \sum_{\tau=t}^{\infty} \gamma^{\tau - t} R\left(b_{\tau}, \pi(b_{\tau})\right) \mid b_t \Big].
\end{align}
Additionally, the $Q$ reward-value is defined as
\begin{align}
    \label{eq:q-reward}
    Q^\pi_R(b_t,a)  &= R(b_t, a) + \gamma \, \mathbb{E}_{T,Z}[V^\pi_R(b_{t+1})].
\end{align}
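For a finite observation set, the expectation in Eq.~\eqref{eq:q-reward} can be expanded as a sum over observations. The following is a sketch in which $P(o \mid b, a)$, $b^{a,o}$, and $\pi(a \mid b)$ are shorthand introduced here for illustration: $P(o \mid b, a) = \sum_{s' \in S} Z(s', a, o) \sum_{s \in S} T(s, a, s') \, b(s)$ is the probability of observing $o$ after taking $a$ at $b$, $b^{a,o}$ is the belief obtained from Eq.~\eqref{eq:bayes}, and $\pi(a \mid b)$ is the probability that $\pi(b)$ assigns to $a$. Then
\begin{align*}
    Q^\pi_R(b,a) &= R(b,a) + \gamma \sum_{o \in O} P(o \mid b,a) \, V^\pi_R(b^{a,o}), \\
    V^\pi_R(b) &= \sum_{a \in A} \pi(a \mid b) \, Q^\pi_R(b,a).
\end{align*}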
The objective of POMDP problems is often to find a policy that maximizes $V_R^{\pi}(b_0)$.

As an extension of POMDPs, Constrained POMDPs add a constraint on the expected cumulative costs.
\begin{definition}[C-POMDP]
    \label{def:cpomdp}
    A \emph{Constrained POMDP} (C-POMDP) is a tuple $\mathcal{M} = (\mathcal{P}, C, \hat{c})$, where $\mathcal{P}$ is a POMDP as in Def.~\ref{def:pomdp}, 
    $C: S \times A \rightarrow \mathbb{R}^n_{\geq 0}$ is a cost function that maps each state-action pair to an $n$-dimensional vector of non-negative costs, and
    $\hat{c} \in \mathbb{R}^n_{\geq 0}$ is an $n$-dimensional vector of expected cost thresholds from the initial belief state $b_0$.
\end{definition}

In C-POMDPs, by executing action $a \in A$ at state $s \in S$, the agent receives a cost vector $C(s,a)$ in addition to the reward $R(s,a)$. Let $C(b,a) = \mathbb{E}_{s \sim b}[C(s,a)]$.
The \textit{expected discounted sum of costs} incurred by the agent under $\pi$ from belief $b_t$ is
\begin{align}
    \label{eq: total cost}
    V^{\pi}_C(b_t) = \mathbb{E}_{\pi, T, Z} \Big[ \sum_{\tau=t}^\infty \gamma ^{\tau - t} C(b_\tau, \pi(b_\tau)) \mid b_t \Big].
\end{align}
\noindent
Additionally, the $Q$ cost-value is defined as
\begin{align}
    \label{eq:q-cost}
    Q^\pi_C(b_t,a)  &= C(b_t, a) + \gamma \, \mathbb{E}_{T,Z}[V^\pi_C(b_{t+1})].
\end{align}

In C-POMDPs, the constraint $V^\pi_C(b_0) \leq \hat{c}$, where $\leq$ refers to the component-wise inequality, is imposed on the POMDP optimization problem as formalized below.
\begin{problem}[C-POMDP Planning Problem]
    \label{prob: cpomdp}
    Given a C-POMDP, compute a policy $\pi^*$ that maximizes the total expected reward in Eq.~\eqref{eq: total reward} from the initial belief $b_0$ while the total expected cost vector in Eq.~\eqref{eq: total cost} is bounded by $\hat{c}$, i.e.,
    \begin{equation}
    \label{eq:original problem}
        \pi^* = \arg \max _{\pi} V_{R}^{\pi}(b_{0}) \quad \text { s.t. } \quad V_{C}^{\pi}\left(b_{0}\right) \leq \hat{c}.
    \end{equation}
\end{problem}

Unlike POMDPs, which have at least one deterministic optimal policy
\cite{Sondik1978pomdp}, C-POMDPs may require randomization in their optimal policies, and hence an optimal deterministic policy may not exist \cite{kim2011cpbvi}.

Next, we discuss why the solutions to Problem~\ref{prob: cpomdp} may not be desirable and an alternate formulation is necessary.

\subsection{Optimal Substructure Property}

A problem has the optimal substructure property if \emph{an optimal solution to the problem contains optimal solutions to its subproblems} \citep{algorithmsbook2009cormen}. Additionally, \citeauthor{algorithmsbook2009cormen} note that these subproblems must be independent of each other. If this holds for Problem~\ref{prob: cpomdp}, then the optimal policy $\pi^*(b_0)$ at $b_0$ can be computed recursively by finding the optimal policy $\pi^*(h_t)$ for each successive history \emph{for the same planning problem}. Thus, a natural subproblem to Eq.~\eqref{eq:original problem} is the history-based subproblem $(\mathcal{M}, h_t)$, with $\pi^*(h_t) = \arg \max _{\pi} V_{R}^{\pi}(h_t) \text { s.t. } V_{C}^{\pi}\left(b_{0}\right) \leq \hat{c}$\footnote{Constraining $V_C^{\pi}(h_t) \leq \hat{c}$ also violates the property as the constraint is defined only at $b_0$.}. We show that this subproblem violates the optimal substructure property, which makes it difficult to apply standard dynamic programming techniques\footnote{Some approaches use dynamic programming \citep{Isom2008PiecewiseLDP, kim2011cpbvi}, but they do not find optimal policies.}.

Since the constraint of Eq.~\eqref{eq:original problem} is defined only at $b_0$, the subproblem at $h_t$ must consider the expected cumulative cost of the policy from $b_0$. It is not enough to compute the expected total cost obtained from $b_0$ to $h_t$, as an optimal cost-value from $h_t$ depends on the cost-values of other subproblems. We illustrate this with an example. Consider the POMDP (depicted as a belief MDP) in Figure~\ref{fig:counterexample}, which is a simplified version of Example~\ref{ex:caveexample}. Without loss of generality, let $\gamma = 1$. The agent starts at $b_0$ with cost threshold $\hat{c} = 5$. Actions $a_A$ and $a_B$ represent going through tunnels A and B, and $r$ and $nr$ are the observations that tunnel A is rocky and not rocky, respectively.

    \begin{figure}[thb]
    \begin{minipage}{.49\linewidth}
    \begin{tikzpicture}[node distance=1.2cm, auto, every state/.style={circle, draw, minimum size=0.5cm}]
   % Belief nodes
   \node[state] (b1) {$b_0$};
   \node[state, below left of=b1, rectangle] (a1) {$a_A$};
   \node[state, below left of=a1] (b2) {$b_1$};
   \node[state, below right of=a1] (b3) {$b_2$};
   \node[state, below right of=b1, rectangle] (a2) {$a_B$};
   \node[state, below left of=b2, rectangle] (a3) {$a_A$};
   \node[state, right of=a3, rectangle] (a4) {$a_B$};
    \node[state, below right of=b3, rectangle] (a6) {$a_B$};
    \node[state, left of = a6, rectangle] (a5) {$a_A$};
    \node[state, below of= a6] (b4) {$b_3$};
   % Transitions
   \path
   (b1) edge[->] node{} (a2)
        edge[->]  node{} (a1)
    (a2) edge[bend left,->] node[right]{\tiny $1$} (b4)
   (a1) edge[->]  node[sloped, above]{\footnotesize $r\,$} node[sloped, below]{\footnotesize $0.5$} (b2)
   (a1) edge[->]  node[sloped, above]{\footnotesize $nr$} node[sloped, below]{\footnotesize $0.5$} (b3)
   (b2) edge[->]  node{} (a3)
        edge[->]  node{} (a4)
   (b3) edge[->]  node{} (a5)
        edge[->]  node{} (a6)
      (a3.south) edge[->]  node[left,below]{\tiny $1$} (b4)
   (a4.south) edge[->]  node[right]{\tiny $\;\;1$} (b4)
      (a5) edge[->]  node[right]{\tiny $1$} (b4)
   (a6) edge[->]  node[right]{\tiny {\!\!1}} (b4)
   (b4) edge[loop right] node {} (b4);
    \end{tikzpicture}
    \end{minipage}
    \qquad 
    \begin{minipage}{.4\linewidth}
    \renewcommand{\arraystretch}{1.3}
    \begin{tabular}{c|cc}
    $R$ & $a_A$ & $a_B$   \\
    \hline
    $b_0$ & 0 & 10 \\
    $b_1$ & 12  & 0 \\
    $b_2$ & 12  & 0 \\
    \end{tabular}
    
    \begin{tabular}{c|cc}
    $C$ & $a_A$ & $a_B$   \\
    \hline
    $b_0$ & 0 & 5 \\
    $b_1$ & 8 & 5 \\
    $b_2$ & 2  & 5
    \end{tabular}
    \end{minipage}
    \caption{Counter-example POMDP with associated reward and cost functions. The action at $b_3$ has $0$ reward and cost.}
    \label{fig:counterexample}
\end{figure}

By examining the reward function, we see that action $a_A$ returns the highest reward everywhere except $b_0$. Action $a_B$ returns a higher reward at $b_0$. Let $\pi_A$ be the policy that chooses $a_A$ at every belief, and $\pi_B$ the one that chooses $a_B$ at $b_0$. 
The cost-values for these policies are
$V^{\pi_A}_C(b_0) = V^{\pi_B}_C(b_0) = 5 \leq \hat{c}$, and the reward-values are $V^{\pi_A}_R(b_0) = 12$, $ V^{\pi_B}_R(b_0) = 10$.
Note that both policies satisfy the constraint, and any policy that chooses $a_B$ at $b_1$ or $b_2$, or that randomizes between $\pi_A$ and $\pi_B$, has a reward-value less than $V^{\pi_A}_R(b_0)$; hence, $\pi_A$ is the optimal policy. However, when planning at $b_1$, i.e., $h_1$, it is impossible to decide that $a_A$ is optimal without first knowing that action $a_A$ at $b_2$ incurs a cost of $2$ and is optimal there. The decisions at $b_1$ and $b_2$ therefore cannot be computed separately as independent subproblems.
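
As a worked check, the quoted values can be recomputed from the tables in Figure~\ref{fig:counterexample} with $\gamma = 1$ and observation probabilities of $0.5$ each:
\begin{align*}
    V^{\pi_A}_C(b_0) &= 0 + 0.5 \cdot 8 + 0.5 \cdot 2 = 5, \\
    V^{\pi_A}_R(b_0) &= 0 + 0.5 \cdot 12 + 0.5 \cdot 12 = 12, \\
    V^{\pi_B}_C(b_0) &= 5, \qquad V^{\pi_B}_R(b_0) = 10.
\end{align*}
Deviating from $\pi_A$ by taking $a_B$ only at $b_1$ gives reward $0.5 \cdot 0 + 0.5 \cdot 12 = 6$ at cost $0.5 \cdot 5 + 0.5 \cdot 2 = 3.5$, and deviating only at $b_2$ gives reward $6$ at cost $0.5 \cdot 8 + 0.5 \cdot 5 = 6.5 > \hat{c}$. Mixing $\pi_A$ and $\pi_B$ with probabilities $p$ and $1-p$ (where $p$ is a mixing weight introduced here for illustration) yields reward $12p + 10(1-p) = 10 + 2p$, which is below $12$ for every $p < 1$.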

To get around this dependence, we can include information about how much cost the policy incurs at other subproblems and how much cost policies can incur from $h_t$, obtaining a \emph{policy-dependent} subproblem $(\mathcal{M}, h_t, \pi)$. This subproblem definition exhibits the optimal substructure property only if we relax the restriction of subproblems being independent. Nonetheless, the optimal solution to a subproblem $(\mathcal{M}, h_t, \pi)$ is only guaranteed to be optimal for the full problem if an optimal policy $\pi^*$ is already provided.

\subsubsection{Pathological Behavior: Stochastic Self-Destruction}

A main consequence of history-dependent subproblems violating the optimal substructure property and instead requiring policy-dependent subproblems is that optimal policies may exhibit unintuitive behaviors during execution.

In the above example, the optimal policy from $b_0$ first chooses action $a_A$. Suppose that $h_1$ is reached. The cost constraint at $b_1$ remains at $5$ since no cost has been incurred. However, the optimal C-POMDP policy chooses action $a_A$ and incurs a cost of $8$, which violates the constraint, even though another action, $a_B$, incurs a lower expected cost and satisfies the constraint. Therefore, in $50\%$ of executions, when $h_1$ is reached, the agent intentionally violates the cost constraint to obtain higher expected rewards, even though a policy that satisfies the cost constraint exists. We term this pathological behavior \emph{stochastic self-destruction}.

This unintuitive behavior is mathematically correct in the C-POMDP framework because the policy still satisfies the constraint at the initial belief state in expectation. An optimal C-POMDP policy exploits the nature of the constraint in Eq.~\eqref{eq:original problem}
to intentionally violate the cost constraint for some belief trajectories. A concrete manifestation of this phenomenon is in the stochasticity of the optimal policies for C-POMDPs. These policies randomize between deterministic policies that violate the expected cost threshold but obtain higher expected reward, and those that satisfy the cost threshold but obtain lower expected reward.

Another consequence is a mismatch between optimal policies planned from a current time step and optimal policies planned at future time steps. In the example in Figure~\ref{fig:counterexample}, if re-planning is conducted at $b_1$, the re-planned optimal policy selects $a_B$ instead of $a_A$. In fact, the policy that initially takes $a_B$ at $b_0$ achieves a higher expected reward than the original policy that takes $a_A$ at $b_0$ and re-plans at future time steps. This phenomenon can therefore lead to poor performance of the closed-loop system during execution.
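
As a worked check of this comparison, assume that the re-planner applies the full threshold $\hat{c} = 5$ at whichever belief is reached, since $a_A$ incurs no cost at $b_0$. At $b_1$, $a_A$ costs $8 > \hat{c}$, so re-planning selects $a_B$ with reward $0$; at $b_2$, $a_A$ costs $2 \leq \hat{c}$ and yields reward $12$. The resulting closed-loop expected reward is
\begin{align*}
    0 + 0.5 \cdot 0 + 0.5 \cdot 12 = 6 < 10 = V^{\pi_B}_R(b_0).
\end{align*}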

\begin{remark}
    The pathological behavior arises from the C-POMDP problem formulation, not from the algorithms designed to solve C-POMDPs. Further, this issue cannot be addressed by simply restricting solutions to deterministic policies, since they also exhibit the pathological behavior, as seen in the example in Figure~\ref{fig:counterexample}.
\end{remark}