\section{Introduction}

Partially Observable Markov Decision Processes (POMDPs) are powerful models for sequential decision making due to their ability to account for transition uncertainty and partial observability. 
Their applications range from autonomous driving \cite{Pendleton2017autonomousvehicles} and robotics \cite{Lauri2023POMDPRobotics} to geology \cite{WANG2022groundwater}, asset maintenance \cite{PAPAKONSTANTINOU2014maintenance}, and human-computer interaction \cite{Chen2020trustpomdp}. Constrained POMDPs (C-POMDPs) extend POMDPs by imposing a bound on expected cumulative costs while seeking policies that maximize expected total reward. C-POMDPs address the need to consider multiple objectives in applications such as an autonomous rover that has both a navigation task and an energy usage budget, or a human-AI dialogue system with a constraint on dialogue length. However, we observe that optimal policies computed for C-POMDPs exhibit pathological behavior in some problems, which can run counter to the C-POMDP's intended purpose.

\begin{example}[Cave Navigation]
    \label{ex:caveexample}
    Consider a rover agent in a cave with two tunnels, $A$ and $B$, each of which may contain rocky terrain. Traversing tunnel $A$ has a higher expected reward than traversing tunnel $B$. To model wheel damage, traversing rocky terrain incurs a cost of $10$, and traversing any other terrain incurs a cost of $0$. The agent receives noisy observations (correct with a probability of $0.8$) of a tunnel's terrain type, and hence has to maintain a \emph{belief} (probability distribution) over the terrain type in each tunnel. The task is to navigate to the end of a tunnel while ensuring that the expected total cost is below a threshold of $5$. The agent's initial belief assigns $0.5$ probability of rocks and $0.5$ probability of no rocks to tunnel $A$, and $0$ probability of rocks and $1.0$ probability of no rocks to tunnel $B$.
\end{example}

In this example, suppose the agent receives an observation that leads to an updated belief of $0.8$ probability that tunnel $A$ is rocky. Intuitively, the agent should avoid tunnel $A$, since the expected cost of navigating it is $8$, which violates the cost constraint of $5$. However, an optimal policy computed from a C-POMDP decides to go through the rocky region, violating the constraint and damaging the wheels. The C-POMDP framework deems this policy admissible because, due to a low probability of observing that tunnel $A$ is rocky in the first place, the expected cost from the initial time step is still within the threshold. Such pathological behavior is clearly unsuitable for some applications, particularly safety-critical ones.
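
As a worked check of these numbers, the following sketch assumes that the agent receives a single observation $o = \mathrm{rocky}$ of tunnel $A$ and that the observation model is symmetric (correct with probability $0.8$ for either terrain type). The symbols $b_0$, $b'$, and $o$ are notation introduced here for illustration only. With $b_0(\mathrm{rocky}) = 0.5$, Bayes' rule gives the updated belief
\[
    b'(\mathrm{rocky})
    = \frac{\Pr(o \mid \mathrm{rocky})\, b_0(\mathrm{rocky})}
           {\Pr(o \mid \mathrm{rocky})\, b_0(\mathrm{rocky}) + \Pr(o \mid \mathrm{no\ rocks})\, b_0(\mathrm{no\ rocks})}
    = \frac{0.8 \cdot 0.5}{0.8 \cdot 0.5 + 0.2 \cdot 0.5}
    = 0.8,
\]
so the expected cost of traversing tunnel $A$ from this belief is
\[
    10 \cdot b'(\mathrm{rocky}) + 0 \cdot b'(\mathrm{no\ rocks}) = 10 \cdot 0.8 = 8 > 5.
\]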

In this paper, we first provide the key insight that the pathological behavior stems from a violation of the optimal substructure property over successive decision steps, and hence of the standard form of Bellman's Principle of Optimality (BPO). To mitigate the pathological behavior and preserve the optimal substructure property, we propose an extension of C-POMDPs that adds history-dependent cost constraints at each reachable belief, which we call Recursively-Constrained POMDPs (RC-POMDPs). We prove that deterministic policies are sufficient for optimality in RC-POMDPs and that RC-POMDPs satisfy BPO. These results suggest that RC-POMDPs are highly amenable to standard dynamic programming techniques, which is not true for C-POMDPs. RC-POMDPs provide a good balance between the BPO-violating expectation constraints of C-POMDPs and constraints on the worst-case outcome, which are overly conservative for POMDPs with inherent state uncertainty.
Then, we present a point-based dynamic programming algorithm to approximately solve RC-POMDPs. Experimental evaluation shows that the pathological behavior is a prevalent phenomenon in C-POMDP policies, and that our algorithm for RC-POMDPs computes policies that obtain expected cumulative rewards competitive with C-POMDPs without exhibiting such behavior.

In summary, this paper contributes (i) an analysis showing that C-POMDPs do not exhibit the optimal substructure property over successive decision steps, together with the consequences of this fact, (ii) the introduction of RC-POMDPs, a novel extension of C-POMDPs through the addition of history-dependent cost constraints, (iii) proofs that all RC-POMDPs have at least one deterministic optimal policy, that they satisfy BPO, and that the Bellman operator has a unique fixed point under suitable initializations, (iv) a dynamic programming algorithm for RC-POMDPs, and (v) a series of illustrative benchmarks that demonstrate the advantages of RC-POMDPs.

\paragraph{Related Work}
Several solution approaches exist for C-POMDPs with expectation constraints \cite{Nijs2021cmdps-survey}. These include offline methods \cite{Isom2008PiecewiseLDP, kim2011cpbvi, Poupart2015calp, walraven2018cgcp, kalagarla22aNoRegret, Wray2022pga} and online methods \cite{Lee2018ccpomcp, Jamgochian_Corso_Kochenderfer_2023}. These works suffer from the pathological behavior discussed above. This paper shows that this behavior is rooted in the violation of optimal substructure by C-POMDPs and proposes a new problem formulation that obeys BPO.

BPO violation has also been discussed in fully-observable Constrained MDPs (C-MDPs) with state-action frequency and long-run average cost constraints \cite{HAVIV199625, CHONG2012108}.  
To overcome it, \citet{HAVIV199625} proposes an MDP formulation with sample path constraints. In C-POMDPs with expected cumulative costs, this BPO-violation problem remains unexplored. Additionally, adopting the worst-case sample path constraints of the MDP solution would be overly conservative for POMDPs, which are inherently characterized by state uncertainty. This paper fills that gap by studying the BPO violation in C-POMDPs and addressing it by imposing recursive expected cost constraints.

From the algorithmic perspective, the closest work to ours is the C-POMDP point-based value iteration (CPBVI) algorithm~\cite{kim2011cpbvi}. There, samples of admissible costs, defined by \citet{PIUNOVSKIY2000} for C-MDPs, are used together with belief points as a heuristic to improve the computational tractability of point-based value iteration for C-POMDPs. However, since CPBVI is designed for C-POMDPs, the policies synthesized by CPBVI may still exhibit pathological behavior. In this paper, we formalize the use of history-dependent expected cost constraints and provide a thorough analysis of them. We show that this problem formulation eliminates the pathological behavior of C-POMDPs.