%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
\usepackage{xr}
% \usepackage[british]{babel}



\usepackage{amssymb,latexsym,amsfonts,amsmath,amsthm,mathrsfs}
\usepackage{bbm}
\usepackage{multicol}
\usepackage{multirow}
%\usepackage{algorithmic}
%\usepackage{cite}
\usepackage{graphicx}
\graphicspath{ {./images/} }
%\usepackage{enumerate}
\usepackage{url}
\usepackage{caption}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{dsfont}
\usepackage{algorithm,algorithmic}
\usepackage{nidanfloat}
\usepackage{subcaption}
\usepackage{comment}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{definition}{Definition}
\newtheorem{assumption}{Assumption}
\newtheorem{remark}{Remark}
\newtheorem{properties}{Properties}[section]
\newtheorem{example}{Example}[section]
\newtheorem{corollary}{Corollary}[section]
\newtheorem{proposition}{Proposition}
\newtheorem{claim}{Claim}[section]
\newtheorem{observation}{Observation}[section]

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


\newcommand{\Py}{{\mathbb P}}
\newcommand{\E}{{\mathbb E}}
\newcommand{\R}{{\mathbb R}}
\newcommand{\U}{{\mathcal U}}
\newcommand{\Y}{{\mathcal Y}}
\newcommand{\X}{{\mathcal X}}
\newcommand{\G}{{\mathcal G}}
\newcommand{\ltlf}{\textsc{LTL}_f}
\newcommand{\rmax}{r_\textsc{max}}
\newcommand{\supp}{\mathrm{supp}}

\newcommand{\dk}[1]{\normalsize{{\color{magenta}(DK:\ #1)}}}
\newcommand{\kris}[1]{\normalsize{{\color{blue}(KR:\ #1)}}}
\newcommand{\dkm}[1]{\normalsize{{\color{orange}(DK:\ #1)}}}
\newcommand{\kri}[1]{\normalsize{{\color{green}(PN:\ #1)}}}
\newcommand{\RJ}[1]{\normalsize{{\color{red}(RJ:\ #1)}}}

\newcommand{\pn}[1]{\normalsize{{\color{cyan}(PN:\ #1)}}}


\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}

\myexternaldocument{kalagarla_619-supp}
%\title{Synthesis of Optimal Policies Under  Linear  Temporal  Logic  Specifications and Partial Observability}
\title{Optimal Control of Partially Observable Markov Decision Processes with Finite Linear Temporal Logic Constraints}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2022 paper}{Krishna C. Kalagarla}{}}
\author[1]{Dhruva Kartik}
\author[1]{Dongming Shen}
\author[1]{Rahul Jain}
\author[1]{Ashutosh Nayyar}
\author[1]{Pierluigi Nuzzo}
%\author[1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Department of Electrical and Computer Engineering\\
    University of Southern California\\
    Los Angeles, CA, USA
}

  
  \begin{document}
\maketitle

\begin{abstract}
Autonomous agents often operate in environments where the state is partially observed. In addition to maximizing their cumulative reward, agents must execute complex tasks with rich temporal and logical structures. These tasks can be expressed  using temporal logic languages like finite linear temporal logic ($\ltlf$). This paper, for the first time, provides a structured framework for designing agent policies that maximize the reward while ensuring that the probability of satisfying the temporal logic specification is sufficiently high. We reformulate the problem as a constrained partially observable Markov decision process (POMDP) and  provide a novel approach that can leverage off-the-shelf unconstrained POMDP solvers for solving it. Our approach guarantees approximate optimality and constraint satisfaction with high probability. We demonstrate its effectiveness by implementing it on several models of interest.
\end{abstract}

\section{Introduction}\label{sec:intro}


 Markov Decision Processes (MDPs)~\citep{Puterman:1994:MDP:528623} can model a wide range of scenarios involving sequential decision-making in dynamically evolving environments. They are often used in settings like robotics, cyber-physical systems, and safety-critical autonomous systems. Traditional planning in MDPs involves a reward structure over the state-action space whose cumulative sum over the time-horizon is maximized to achieve a desired objective. This approach has been successful for tasks like reachability and obstacle avoidance. However, designing an appropriate reward function can at times be tricky, and an incorrect reward formulation can easily lead to unsafe and undesired behaviors. This is primarily due to the fact that instantaneous rewards in MDPs depend only on the current \emph{system state} and the agent's current action. When the agent's task is characterized by complex temporal objectives, the agent needs to track the \emph{status of the task} it is performing in addition to the system state. One might be able to incorporate some of the simpler task specifications by appropriately modifying the MDP model (e.g., by adding an absorbing state that denotes obstacle collision). However, manually constructing an MDP reward function that captures substantially complicated specifications is not always possible.
 
To overcome this issue, increasing attention has been directed over the past decade towards leveraging temporal logic specifications~\citep{baier2008principles} and formal methods to formulate and solve control and planning problems in the presence of uncertainty. Several temporal logics exist that are capable of capturing a wide range of task specifications, including surveillance, reachability, safety, and sequentiality. The synthesis of MDP policies which maximize the probability of satisfaction of temporal logic specifications has also been extensively studied~\citep{ding2011ltl,lahijanian2011control,aksaray2016q}. However, while certain objectives are well expressed by temporal logic constraints, others are better framed as a ``soft'' reward maximization task. Therefore, several recent efforts~\citep{kalagarla2021optimal,kalagarla2021model,guo2018probabilistic} have focused on reward maximization objectives for MDPs together with temporal logic constraints.

MDPs model environments where the states are fully observable and do not account for many real-life scenarios with partial state observability. These scenarios can instead be captured by Partially Observable Markov Decision Processes (POMDPs). Unfortunately, however, the aforementioned methods for synthesizing policies that satisfy temporal logic specifications in MDPs cannot be directly applied to the setting of POMDPs. In theory, any POMDP can be translated into an equivalent MDP whose state is the agent's posterior belief on the system state \citep{bertsekas1995dynamic}. However, the reachable belief space grows exponentially with the time horizon. Due to this extremely large belief space, the synthesis methods developed for MDPs become intractable in the context of POMDPs.

% Moving away from the classical reward maximization paradigm, \citep{wang2018bounded,wang2021online} consider the objective of safe reachability for POMDPs and leverage a satisfiability modulo theory (SMT) solver  to find a valid policy over the reachable belief space. Our approach is more general, in that it considers more complex temporal logic tasks together with reward maximization objectives.
%  But this objective is less expressible as opposed to temporal logic and does not handle additional reward maximization.
Recently, a few approaches have been proposed to address the complexity issues that arise in POMDP planning for temporal logic specifications.
 The focus of these approaches is to \textit{maximize the satisfaction} of temporal logic specifications. They include simulations over the belief space \citep{haesaert2018temporal}, discretization of the belief space \citep{norman2015verification}, and restricting the space of policies to finite state controllers \citep{ahmadi2020stochastic, sharan2014finite, chatterjee2015qualitative}. However, none of the above approaches addresses temporal logic and reward maximization objectives simultaneously. Lately, deep recurrent neural network based approaches \citep{carr2020verifiable,carr2019counterexample} have also been proposed to handle POMDPs with temporal logic specifications. 


In this paper, we address this problem by expanding the traditional POMDP framework to incorporate temporal logic specifications. Specifically, we aim to design policies for the agent such that the agent's reward is maximized while ensuring that the temporal logic specification is satisfied with high probability. Our focus is on processes which eventually stop, but we allow for the stopping time of the process to be random. The rewards are accumulated and the temporal logic specification must be satisfied over the duration of the process. 

We focus on finite linear temporal logic ($\ltlf$)~\citep{de2013linear}, a temporal extension of propositional logic, to express complex task specifications. $\ltlf$ is a variant of linear temporal logic (LTL)~\citep{baier2008principles}, interpreted over finite traces. In $\ltlf$, one can start with simple atomic predicates and compose them using operators such as conjunction, negation, ``until,'' ``always,'' if-then, ``next'' (immediately), to obtain richer specifications. For example, starting with the atomic predicates ``injured individual found,'' ``seek help,'' and ``hit obstacle,'' we can construct the specification ``Always do not (hit obstacle) and, if (injured individual found), then immediately (seek help).''
% A key property of $\ltlf$ specifications is that 
Given an $\ltlf$ specification, a deterministic finite automaton (DFA) can be constructed such that the agent's trajectory satisfies the specification if and only if it is accepted by the DFA \citep{zhu2017symbolic}. The internal state of this DFA essentially tracks the status of the task associated with the $\ltlf$ formula. The key idea underlying our approach is that augmenting the system state with the DFA's internal state enables us to track both the system as well as the status of our task. We can then simultaneously reason about the POMDP rewards and the temporal logic specification by formulating the planning problem as a constrained POMDP problem.

We provide a scheme which can use any off-the-shelf unconstrained POMDP solver \citep{kurniawati2008sarsop,somani2013despot,silver2010monte} to solve the constrained POMDP problem, thus leveraging existing results from unconstrained POMDP planning. This idea of leveraging well-studied unconstrained POMDP planners was also used to find policies maximizing temporal logic satisfaction in POMDPs~\citep{liuleveraging,bouton2020point}. 

There are a few other approaches for solving constrained POMDPs. One such approach is to iteratively construct linear programs~\citep{poupart2015approximate}, which yields an approximate solution for the constrained POMDP problem. However, this method has been shown to suffer from scalability issues~\citep{lee2018monte}. A primal-dual approach based on Monte Carlo Tree Search (MCTS)~\citep{lee2018monte} has been used to address these scalability issues. We solve the constrained POMDP problem using a similar primal-dual method. A key difference is that, instead of using an MCTS approach, we use an approximate unconstrained POMDP solver, SARSOP~\citep{kurniawati2008sarsop}. This solver returns policies for unconstrained POMDPs along with bounds on their optimality gaps. This enables us to establish a concrete relationship between the number of iterations required and the approximation error using principles from no-regret learning.
Column generation algorithms~\citep{walraven2018column} also use a similar primal-dual approach, but with a different dual parameter update procedure. In these algorithms, convergence to optimality is shown, but the number of iterations required to get an approximate solution is not known. Our method, on the other hand, gives a precise relationship between the approximation error and the number of iterations.
 
To the best of our knowledge, this is the first paper on the synthesis of reward-optimal POMDP policies with temporal logic constraints. Our contributions can be summarized as follows:
\begin{enumerate}
    \item We formulate a novel problem of reward maximization in POMDPs under $\ltlf$ constraints. This formulation can incorporate several non-trivial specifications such as ordering, reactivity, etc., which cannot be well expressed by classical POMDP reward-like constraints.
    
    \item For POMDPs that stop in finite time almost surely, we provide a structured methodology for synthesizing approximately optimal policies which maximize a cumulative reward under the constraint that the probability of satisfying a temporal logic specification stated as an $\ltlf$ formula is beyond a desired threshold.
    
    \item We construct a constrained product POMDP expressing both the reward maximization and temporal logic objectives. We show that solving this constrained POMDP is equivalent to solving the original POMDP problem with the $\ltlf$ constraint.
    
    \item For a large class of stopping times, we provide a planning scheme to solve the constrained POMDP. This scheme can leverage any off-the-shelf approximate solver that can solve \emph{unconstrained} POMDPs with stopping times. Different from current works on constrained POMDPs, we provide theoretical guarantees on the near-optimality of the returned policy by using a no-regret online learning approach.
    
    \item Unconstrained POMDP solvers in a general stopping time setting are uncommon. We describe two specific models of stopping times for which existing POMDP solvers can be used: (i) fixed-horizon stopping and (ii) geometric stopping. Our algorithm employs   a finite-horizon POMDP solver under case (i) and a \emph{discounted} infinite-horizon POMDP solver under case (ii).
    
    \item We apply our approach to numerically solve several models and discuss its effectiveness.
\end{enumerate}

% Partially Observable Markov Decision Processes (POMDPs) can model a wide range of scenarios involving sequential decision-making in dynamically evolving environments with partial state observability. POMDPs are often used in settings like robotics, cyber-physical systems, safety-critical autonomous systems, etc. Several techniques exist in the literature \citep{somani2013despot,lee2018monte,khonji2019approximability} that can compute near-optimal policies for POMDPs. These methods have been successful in tasks like reachability, obstacle avoidance,  etc. However, they are not suitable for tasks with complex temporal and logical structure. This is primarily because the instantaneous rewards and constraints in POMDPs depend only on the current \emph{system state} and the agent's current action. However, when the agent's task is characterized by complex temporal objectives, the agent needs to track the \emph{status of the task} it is performing in addition to the system state. One might be able to incorporate some of the simpler specifications by appropriately modifying the POMDP (like adding an absorbing state that denotes obstacle collision etc.). However, manually constructing a POMDP reward function that captures substantially complicated specifications is not always possible. 

% To  overcome this issue, increasing attention has been directed over the past decade or so towards using temporal logics \citep{baier2008principles} to unambiguously capture complex design objectives. Our main goal in this paper is to expand the POMDP framework to incorporate temporal logic specifications. More explicitly, we would like to design policies for the agent such that the agent's reward is maximized while ensuring that the temporal logic specification is satisfied with high probability. Our focus is on processes which eventually stop. We allow the stopping time of the process to be random. The reward (and reward-like constraints) are accumulated and the temporal logic specification must be satisfied over the duration of the process.

%  Several temporal logics exist that are capable of capturing a wide range of task specifications. In this paper, we focus on finite linear temporal logic ($\ltlf$)~\citep{de2013linear}, a temporal extension of propositional logic to express complex task specifications. This is a variant of linear temporal logic (LTL)~\citep{baier2008principles} interpreted over finite traces. In $\ltlf$, one can start with simple atomic predicates and compose them using temporal logic operators such as conjunction, negation, until, always, if-then, next (immediately) etc. to obtain richer specifications. For example, starting with the atomic predicates ``injured individual found", ``seek help" and ``hit obstacle", we can construct the specification ``Always do not (hit obstacle) and if (injured individual found) then immediately (seek help)".
% A key property of $\ltlf$ specifications is that given an $\ltlf$ specification, a deterministic finite automaton (DFA) can be constructed such that the agent's trajectory satisfies the specification if and only if it is accepted by the DFA \citep{zhu2017symbolic}. The internal state of this DFA essentially tracks the status of the task associated with $\ltlf$ formula. The core of our thesis is that augmenting the system state with DFA's internal state enables us to track both the system as well as the status of our task and thus, allowing us to simultaneously reason about the POMDP rewards and the temporal logic specification.

% The main contributions in this paper are as follows:
% \begin{enumerate}
%     \item For POMDPs that stop in finite time almost surely, we provide a structured methodology for synthesizing optimal policies which maximize a cumulative reward under the constraint that the probability of satisfying a temporal logic specification stated as an $\ltlf$ formula is beyond a desired threshold.
    
%     \item We construct a constrained product POMDP expressing both the reward maximization and temporal logic objectives. We show that solving this constrained POMDP is equivalent to solving the original POMDP problem with the $\ltlf$ constraint.
    
%     \item For a large class of stopping times, we provide a planning scheme to solve the constrained POMDP. This scheme can leverage any off-the-shelf approximate solver that can solve \emph{unconstrained} POMDPs with stopping times. Different from current works on constrained POMDPs, we provide theoretical guarantees on the optimality of the returned policy by using a no-regret online learning approach.
    
%     \item Unconstrained POMDP solvers in a general stopping time setting are uncommon. We describe two specific models of stopping times for which existing POMDP solvers can be used: (i) fixed horizon stopping and (ii) geometric stopping. Our algorithm under case (i) can employ a finite-horizon POMDP solver and under case (ii), employ a \emph{discounted} infinite-horizon POMDP solver.
%     \item We apply our approach to numerically solve several models and discuss its effectiveness.
    
% \end{enumerate}


% % \begin{enumerate}
% %     \item $\ltlf$ formulas can be used to specify, in a declarative manner, a wide range of temporal properties that the agent should satisfy. Such properties cannot directly be captured by a standard POMDP model. We provide a methodology to transform this $\ltlf$ spec into an equivalent constrained POMDP, i.e., with reward constraints. This is done by first obtaining the DFA associated with the $\ltlf$ spec and then forming the product POMDP.
% %     \item For a large class of stopping times, we provide a no-regret learning based methodology to solve the constrained POMDP. This method only requires access to an approximate solver that can solve \emph{unconstrained} POMDPs with stopping times. The resulting strategy is a mixture of at most \emph{two} pure strategies.
% %     \item General stopping time POMDP solvers are uncommon. We consider two specific models of stopping: (i) fixed horizon stopping and (ii) geometric stopping. Case (i) reduces to a finite-horizon POMDP solver and case (ii) reduces to a \emph{discounted} infinite-horizon POMDP solver.
% %     \item We use our approach for solving several models of interest.
% % \end{enumerate}

% \subsection{Related Work} \label{sec:related}

% POMDPs are extensively used to model various situations in AI and planning. Many POMDP algorithms have been developed \citep{kurniawati2008sarsop,silver2010monte,somani2013despot} to improve tractability and efficiency of unconstrained POMDP planning. We benefit from these advances in POMDP planning as we provide a scheme which can leverage any off-the-shelf POMDP solver.

% Different from the traditional POMDP planners which focus on reward maximization, \citep{wang2018bounded,wang2021online} consider the objective of safe-reachability for POMDPs. This is achieved by employing SMT solvers over the reachable belief space to find a valid policy. But, this objective is less expressible as opposed to temporal logic and does not handle reward maximization.

% Motivated by the expressibility of temporal logic, synthesis of MDP policies which maximize the probability of satisfaction of temporal logic specifications has been studied extensively
% \citep{ding2011ltl,lahijanian2011control,aksaray2016q}. Recently, several works \citep{kalagarla2021optimal,kalagarla2021model, guo2018probabilistic} have considered reward objectives for MDPs in addition to the temporal logic constraints. But these methods cannot be tractably applied to the setting of POMDPs where the states are not fully observable.  

% There have been several different approaches for synthesis of POMDP policies \textit{maximizing the satisfaction} of temporal logic specifications.
% They include simulations over the belief space \citep{haesaert2018temporal}, discretization of the belief space \citep{norman2015verification} and restricting the space of policies to finite state controllers \citep{ahmadi2020stochastic, sharan2014finite, chatterjee2015qualitative}. Similar to our approach, \citep{liuleveraging,bouton2020point} leverage advances in POMDP planning algorithms to find policies maximizing temporal logic satisfaction.  Lately, even recurrent neural network based approaches \citep{carr2020verifiable,carr2019counterexample} have also been proposed to handle POMDPs with temporal logic specifications. But, they lack proof of correctness and currently handle only temporal logic objectives. None of the above works handle both temporal logic and reward maximization objectives together in the context of POMDPs. 


% To the best of our knowledge, this is the first paper on synthesis of POMDP policies with temporal logic constraints.

\section{Preliminaries} \label{sec:prelim}



We denote the sets of real and natural numbers by $\mathbb{R}$ and $\mathbb{N}$, respectively. $\mathbb{R}_{\geq 0}$ is the set of non-negative reals. For a given finite set $S$, $S^{*}$ denotes the set of all finite sequences taken from $S$. The indicator function $\mathds{1}_{S}(s)$ evaluates to $1$ when $s\in S$ and 0 otherwise. For a singleton set $\{s_0\}$, we will denote $\mathds{1}_{\{s_0\}}(s)$ with $\mathds{1}_{s_0}(s)$ for simplicity. The probability simplex over the set $S$ is denoted by $\Delta{S}$. For a string $s$, $|s|$ denotes the length of the string.


\subsection{Labeled POMDPs}

\paragraph{Model.} A Labeled Partially Observable Markov Decision Process (POMDP) is defined as a tuple $\mathscr{M} = ({S},{A},P,\varpi,{O},Z,AP,L,r,T)$, where ${S}$ is a finite state space; ${A}$ is a finite action space; $P_t: {S} \times {A} \to \Delta {S}$ is the transition probability function at time $t$, such that $P_t(s,a;s')$ is the probability of transitioning from state $s$ to state $s'$ on taking action $a$; $\varpi \in \Delta {S}$ is the initial state distribution; ${O}$ is a finite observation space; $Z_t:{S} \to \Delta {O}$ is the observation probability function, such that $Z_t(s;o)$ is the probability of seeing observation $o$ in state $s$ at time $t$; $AP$ is a set of atomic propositions, e.g., indicating the presence of an obstacle or a goal; $L: {S} \to 2^{AP}$ is a labeling function which indicates the set of atomic propositions that are true in each state, e.g., $L(s) = \{a\}$ indicates that only the atomic proposition $a$ is true in state $s$; and $r_t: {S} \times {A} \to \mathbb{R}$ is a reward function, such that $r_t(s,a)$ is the reward obtained on taking action $a\in {A}$ in state $s \in {S}$ at time $t$. We denote by $S_t$, $A_t$, and $O_t$ the state, action, and observation at time $t$, respectively. We say that the system is time-invariant when the reward function $r_t$ and the transition and observation probability functions $P_t$ and $Z_t$ do not depend on time $t$. The POMDP runs for a random time horizon $T$. This random time may be determined exogenously (independently) of the POMDP or it may be a stopping time with respect to the information process $\{I_t: t \geq 0\}$, where $I_t$ is the agent's information at time $t$ as defined below.

\paragraph{Pure and Mixed Policies.}
At any given time $t$, the information available to the agent is the collection of all the observations $O_{0:t}$ and all the past actions $A_{0:t-1}$. We denote this information with $I_t = \{O_{0:t},A_{0:t-1}\}$. A \emph{control law} $\pi_t$ maps the information $I_t$ to an action in the action space $A$, i.e., $A_t = \pi_t(I_t)$. The collection of control laws $\pi := (\pi_{0},\pi_1,\dots)$ over the entire horizon is referred to as a \emph{policy}. We refer to such deterministic policies as pure policies and denote the set of all pure policies with $\mathcal{P}$. 

A mixed policy $\mu$ is a distribution on a finite collection of pure policies. Under a mixed policy $\mu$, the agent randomly selects a pure policy $\pi \in \mathcal{P}$ with probability $\mu(\pi)$ before the POMDP begins. The agent uses this randomly selected policy to select its actions during the course of the process. More formally, $\mu:\mathcal{P}\to [0,1]$ is a mapping. The support of the mixture $\mu$ is defined as
\begin{align}
    \supp(\mu):=\{\pi \in \mathcal{P}:\mu(\pi)\neq 0\}.
\end{align}
The set $\mathcal{M}_p$ of all mixed policies is given by
\begin{align}
    \mathcal{M}_p:=\left\{\mu:|\supp(\mu)|<\infty ,\sum_{\pi \in \supp(\mu)}\mu(\pi)=1\right\}.
\end{align}
Clearly, the set $\mathcal{M}_p$ of mixed policies is convex.
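The convexity can be checked directly; the following is a brief sketch in which $\mu_1,\mu_2 \in \mathcal{M}_p$ and $\lambda \in [0,1]$ are arbitrary and the symbol $\mu_\lambda$ is introduced only for this illustration. Define $\mu_\lambda := \lambda\mu_1 + (1-\lambda)\mu_2$ pointwise on $\mathcal{P}$. Then
\begin{align*}
    \supp(\mu_\lambda) &\subseteq \supp(\mu_1)\cup\supp(\mu_2),\\
    \sum_{\pi \in \supp(\mu_\lambda)}\mu_\lambda(\pi) &= \lambda\sum_{\pi \in \supp(\mu_1)}\mu_1(\pi)\\
    &\quad + (1-\lambda)\sum_{\pi \in \supp(\mu_2)}\mu_2(\pi) = 1,
\end{align*}
so $\supp(\mu_\lambda)$ is finite, $\mu_\lambda$ takes values in $[0,1]$, and hence $\mu_\lambda \in \mathcal{M}_p$.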

% \paragraph{Stopping Time.}
% $T$ denotes a stopping time process with respect to a filtration $\mathcal{F}_t$, i.e., $\{T = t\} \in \mathcal{F}_t$.

% \begin{assumption}\label{finiteassump}
% The POMDP $\mathscr{M}$ is such that for every policy $\pi$, the expected value of the stopping time $T$ is finite, i.e.
% \begin{align}
%     \E_\pi^{\mathscr{M}}[T] < \infty ~~\forall \pi.
% \end{align}
% \end{assumption}


\begin{assumption}\label{finiteassumpstrong}
The POMDP $\mathscr{M}$ is such that the expected value of the stopping time $T$ is bounded uniformly over all pure policies $\pi$, i.e., there exists a constant $T_{\textsc{max}}^{\mathscr{M}}$ such that
\begin{align}
    \E_\pi^{\mathscr{M}}[T] < T_{\textsc{max}}^{\mathscr{M}} < \infty, ~~\forall \pi.
\end{align}
\end{assumption}
Assumption~\ref{finiteassumpstrong} ensures that the stopping time $T$ is finite almost surely, i.e., $\Py_\mu^\mathscr{M}[T<\infty]=1$, and that the total expected reward $\mathcal{R}^{\mathscr{M}}(\mu)$, defined below, satisfies $\mathcal{R}^{\mathscr{M}}(\mu) < \infty$ for every policy $\mu$.
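The first claim can be checked with Markov's inequality; the following is a brief sketch in which the integer $n$ is introduced only for illustration. For any pure policy $\pi$ and any $n \geq 1$,
\begin{align*}
    \Py_\pi^{\mathscr{M}}[T \geq n] \leq \frac{\E_\pi^{\mathscr{M}}[T]}{n} < \frac{T_{\textsc{max}}^{\mathscr{M}}}{n},
\end{align*}
and the right-hand side vanishes as $n \to \infty$, so $\Py_\pi^{\mathscr{M}}[T = \infty] = 0$. Since a mixed policy $\mu$ is a finite convex combination of pure policies, $\E_\mu^{\mathscr{M}}[T] = \sum_{\pi \in \supp(\mu)}\mu(\pi)\E_\pi^{\mathscr{M}}[T] < T_{\textsc{max}}^{\mathscr{M}}$, and the same argument applies. For the second claim, if we additionally assume (for this sketch only) that the rewards are uniformly bounded, i.e., $|r_t(s,a)| \leq \rmax$ for all $t$, $s$, and $a$, then the sum of $T+1$ reward terms gives $|\mathcal{R}^{\mathscr{M}}(\mu)| \leq \rmax\,\E_\mu^{\mathscr{M}}[T+1] \leq \rmax\,(T_{\textsc{max}}^{\mathscr{M}}+1)$.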


A \emph{run} ${\xi}$ of the POMDP is the sequence of  states and actions $(s_0, a_0)(s_1,a_1)\ldots (s_{T},a_{T})$. We consider both $T$ finite as well as $T= \infty$. The total expected reward associated with a policy $\mu$ is given by
\begin{align}
    \mathcal{R}^{\mathscr{M}}(\mu) &= \E_\mu^\mathscr{M}\left[\sum_{t=0}^T r_t(S_t,A_t)\right]\\
    &=\sum_{\pi \in \supp(\mu)}\left[\mu(\pi)\E_\pi^\mathscr{M}\left[\sum_{t=0}^T r_t(S_t,A_t)\right]\right].
\end{align}
Note that $\mathcal{R}^{\mathscr{M}}(\mu)$ is a linear function of $\mu$.
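To illustrate how the random horizon enters this expression, consider the following sketch under three assumptions made only for this example: the horizon $T$ is exogenous, i.e., independent of the state and action processes (which we take to be defined for all $t \geq 0$); $T$ is geometrically distributed with $\Py[T=t] = (1-\gamma)\gamma^t$ for $t = 0,1,\ldots$, where $\gamma \in [0,1)$ is a parameter introduced here; and the rewards are uniformly bounded. Then $\Py[T \geq t] = \gamma^t$ and, writing $\mathds{1}_{\{T \geq t\}}$ for the indicator of the event $\{T \geq t\}$,
\begin{align*}
    \mathcal{R}^{\mathscr{M}}(\mu) &= \E_\mu^\mathscr{M}\left[\sum_{t=0}^\infty \mathds{1}_{\{T \geq t\}}\, r_t(S_t,A_t)\right]\\
    &= \sum_{t=0}^\infty \gamma^t\, \E_\mu^\mathscr{M}\left[r_t(S_t,A_t)\right],
\end{align*}
which is an infinite-horizon discounted reward with discount factor $\gamma$. In this example $\E[T] = \gamma/(1-\gamma)$, so Assumption~\ref{finiteassumpstrong} holds with any constant $T_{\textsc{max}}^{\mathscr{M}} > \gamma/(1-\gamma)$.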


\subsection{Finite Linear Temporal Logic Specification}

We use $\ltlf$~\citep{de2013linear}, a temporal extension of propositional logic, to express complex task specifications.
This is a variant of linear temporal logic (LTL)~\citep{baier2008principles} interpreted over finite traces. 
Given a set $AP$ of atomic propositions, i.e., Boolean variables that have a unique truth value ($\mathsf{true}$ or $\mathsf{false}$) for a given system state, $\ltlf$ formulae are constructed inductively as  follows: 
%
\begin{equation*}
    \varphi := \mathsf{ true } \ | \ a \ | \ \neg  \varphi \ | \ \varphi_1 \wedge \varphi_2 \ | \ \textbf{X} \varphi \ | \ \varphi_1 \textbf{U} \varphi_2, 
\end{equation*}
%
where $a \in AP$, $\varphi$, $\varphi_1$, and $\varphi_2$ are $\ltlf$ formulae, $\wedge$ and $\neg$ are the logic conjunction and negation, and $\textbf{U}$ and $\textbf{X}$ are the \emph{until} and \emph{next} temporal operators. Additional temporal operators such as \emph{eventually} ($\textbf{F}$) and \emph{always} ($\textbf{G}$) are derived as $\textbf{F} \varphi := \mathsf{ true } \textbf{U} \varphi$ and $\textbf{G} \varphi := \neg \textbf{F} \neg \varphi$. For example, $ \varphi = \textbf{F}a\wedge(\textbf{G}\neg b)$ expresses the specification that a state where atomic proposition $a$ holds true has to be
\emph{eventually} reached by the end of the trajectory and states where atomic proposition $b$ holds true have to be \emph{always} avoided.
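As a further illustration, the informal specification from Section~\ref{sec:intro} (``Always do not (hit obstacle) and, if (injured individual found), then immediately (seek help)'') can be sketched in this syntax. The atomic proposition names $\mathit{hit}$, $\mathit{found}$, and $\mathit{help}$ are hypothetical and introduced only for this example, and implication is the usual abbreviation $\varphi_1 \rightarrow \varphi_2 := \neg(\varphi_1 \wedge \neg\varphi_2)$:
\begin{equation*}
    \varphi = \textbf{G}\,\neg \mathit{hit} \ \wedge \ \textbf{G}\,(\mathit{found} \rightarrow \textbf{X}\, \mathit{help}).
\end{equation*}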

$\ltlf$ formulae are interpreted over finite-length words $w = w_0w_1 \ldots w_{last} \in {(2^{AP})}^{*}$, where each letter $w_i$ is a set of atomic propositions and $last = |w| - 1$ is the index of the last letter of the word $w$. 
Given a finite word $w$ and an $\ltlf$ formula $\varphi$, we inductively define when $\varphi$ is \textit{true} for $w$ at step $i$ ($0 \leq i < |w|$), written $w,i \models \varphi$, as follows:
\allowdisplaybreaks\begin{align*}
    w,i &\models \mathsf{ true },\\
    w,i &\models a \text{ iff  } a \in w_i,\\
    w,i &\models \varphi_1 \wedge \varphi_2 \text{ iff  } w,i \models \varphi_1 \text{ and } w,i \models \varphi_2,\\
    w,i &\models \neg \varphi \text{ iff  } w,i \not \models \varphi,\\
    w,i &\models \textbf{X} \varphi \text{ iff  } i+1 < |w| \text{ and } w,i+1  \models  \varphi, \\
    w,i &\models \varphi_1 \textbf{U}  \varphi_2\text{ iff  } \exists \ k \text{ s.t. } i \leq k < |w| \text{ and } w,k \models \varphi_2\\ \text{ and } & \forall j, \  i \leq  j  < k , \ w,j \models \varphi_1,\\
    w,i &\models \textbf{G} \varphi \text{ iff  } \forall j, \ i \leq j < |w| , \ w,j \models \varphi,\\
    w,i &\models \textbf{F} \varphi\text{ iff  } \exists \ j, \ i \leq j < |w| \text{ s.t. } w,j \models \varphi,
\end{align*}
where iff is shorthand for `if and only if.' A formula $\varphi$ is \textit{true} in $w$, denoted by $w \models \varphi$, iff $w,0 \models \varphi$.
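
As a small illustration (the alphabet and words here are chosen only for exposition), let $AP = \{a,b\}$ and consider the word $w = \emptyset\,\{a\}\,\emptyset$ with $|w| = 3$. Since $a \in w_1$, we have $w,1 \models a$, so $j=1$ is a witness for $w,0 \models \textbf{F}a$. Since $b \notin w_j$ for $j = 0,1,2$, we also have $w,0 \models \textbf{G}\neg b$, and hence $w \models \textbf{F}a \wedge (\textbf{G}\neg b)$. In contrast, the word $w' = \{b\}\,\{a\}$ violates the same formula because $w',0 \models b$. The finite-trace semantics is visible at the last position: $w,2 \not\models \textbf{X}\,\mathsf{true}$, because $i+1 = 3$ is not smaller than $|w| = 3$.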

Given a POMDP $ \mathscr{M}$ and an $\ltlf$ formula $\varphi$, a run $\xi = s_0,a_0,s_1,a_1,\ldots,s_T,a_T$ of the POMDP under policy $\mu$ is said to satisfy $\varphi$ if the
word $w = L(s_0)L(s_1)\ldots L(s_T) \in {(2^{AP})}^{T+1}$ generated by the run satisfies $\varphi$. The probability that a run of $\mathscr{M}$ satisfies $\varphi$ under policy $\mu$ is denoted by $\Py_{\mu}^{\mathscr{M}}(\varphi)$.

We refer the reader to the experimental Section \ref{exp} for various examples of $\ltlf$ specifications, especially ones expressing sequentiality, which cannot be expressed by standard reward functions.



\subsection{Deterministic Finite Automaton (DFA)}
The language defined by an $\ltlf$ formula, i.e., the set of words satisfying the formula, can be captured by a Deterministic Finite Automaton (DFA) ~\citep{zhu2017symbolic}. 

We denote a DFA by a tuple $\mathscr{A} = (Q, \Sigma, q_0, \delta, F)$, where $Q$ is a finite set of states, $\Sigma$ is a finite alphabet, $q_0 \in Q$ is an initial state, $\delta :
Q \times \Sigma   \to  {Q}$  is a transition function, and $F \subseteq Q$ is the set of accepting states.

A run $\xi_{\mathscr{A}}$ of $\mathscr{A}$ over a finite word $w = w_0\ldots w_n$ (with $w_i \in \Sigma$) is the sequence of states $q_0q_1\ldots q_{n+1} \in Q^{n+2}$ such that $q_{i+1} = \delta(q_i,w_i)$ for $i = 0,\ldots,n$. The run is accepting if and only if $q_{n+1} \in F$. A word $w \in \Sigma^{*}$ is accepted by $\mathscr{A}$ if and only if the run $\xi_{\mathscr{A}}$ of $\mathscr{A}$ on $w$ is accepting.

Finally, we say that an $\ltlf$ formula is equivalent to a DFA $\mathscr{A}$ if and only if the language defined by the formula is the language accepted by $\mathscr{A}$. For any $\ltlf$ formula $\varphi$ over $AP$, we can construct an equivalent DFA with input alphabet $2^{AP}$ ~\citep{zhu2017symbolic}.
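
As an illustrative sketch, one possible DFA for $\varphi = \textbf{F}a \wedge (\textbf{G}\neg b)$ over $\Sigma = 2^{\{a,b\}}$ has states $Q = \{q_0,q_1,q_2\}$, initial state $q_0$, accepting set $F = \{q_1\}$, and, for $\sigma \in \Sigma$,
\begin{align*}
\delta(q_0,\sigma) &= \begin{cases} q_2, & b \in \sigma,\\ q_1, & a \in \sigma,\ b \notin \sigma,\\ q_0, & \text{otherwise,}\end{cases}\\
\delta(q_1,\sigma) &= \begin{cases} q_2, & b \in \sigma,\\ q_1, & \text{otherwise,}\end{cases}\\
\delta(q_2,\sigma) &= q_2.
\end{align*}
Here $q_1$ records that $a$ has been seen and $q_2$ is a rejecting sink entered once $b$ is seen. This hand-built automaton is only meant to convey the idea and need not coincide with the output of any particular tool. On the word $\emptyset\,\{a\}\,\emptyset$, the run is $q_0 q_0 q_1 q_1$, which ends in $F$, so the word is accepted.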





\section{Problem Formulation and Solution Strategy}

Given a labeled POMDP $\mathscr{M}$ and an $\ltlf$ specification $\varphi$, our objective is to design a policy $\mu$ that maximizes the total expected reward $\mathcal{R}^{\mathscr{M}}(\mu)$ while ensuring that the probability $\Py^{\mathscr{M}}_\mu(\varphi)$ of satisfying the specification $\varphi$ is at least $1-\delta$. More formally, we would like to solve the following constrained optimization problem
\begin{equation}\tag{P1} \label{probform}
    \begin{aligned}
   \textbf{LTL$_f$-POMDP:}~~~\underset{\mu}{\text{ max }} \quad & {\mathcal R}^{\mathscr{M}}(\mu)\\ \mathrm{s.t.} \quad  & \Py_{\mu}^{\mathscr{M}}(\varphi) \geq  1-\delta.
\end{aligned}
\end{equation}
If \eqref{probform} is feasible, then we denote its optimal value with $\mathcal{R}^*$. If \eqref{probform} is infeasible, then $\mathcal{R}^* = -\infty$.

\subsection{Constrained Product POMDP}

Given the labeled POMDP $ \mathscr{M}$ and a DFA $\mathscr{A}$ capturing the $\ltlf$ formula $\varphi$,
we follow the construction of \citet{ding2013strategic} for MDPs to construct a constrained product POMDP $\mathscr{M} ^{\times} = (S^{\times},  A^{\times},P^{\times},s_{0}^{\times},r^{\times},r^f,\varpi,{O},Z^\times)$, which incorporates the transitions of $ \mathscr{M}$  and $ \mathscr{A}$, the observations and the reward function of $\mathscr{M}$, and the acceptance set of $\mathscr{A}$.

In the constrained product POMDP $\mathscr{M} ^{\times}$, $S^{\times} = ({S} \times Q)$ is the set of states, $A^{\times} = {A}$ is the action set, and $s_{0}^{\times} = (s_0,q_0)$ is the initial state, where $s_0$ is drawn from the distribution $\varpi$ and $q_0$ is the initial state of the DFA. For each $s,s'\in S, q,q' \in Q$, and $a \in A$, we define the transition function $P^{\times}_t((s,q),a;(s',q'))$ at time $t$ as
%
\begin{equation} \label{eq:prodtrans}
\begin{aligned}
& \begin{cases} P_t(s,a;s'), &\mbox{if } q' = \delta(q,L(s)), \\
0, & \text{otherwise.}
\end{cases}\\
\end{aligned}    
\end{equation}
The reward functions are defined as 
\begin{align}
    r^{\times}_{t}((s,q),a) &= {r}_t(s,a),~ \forall s,q,a,\label{prodrewdef}\\
    \label{finalrewdef}r^f((s,q)) &=
    \begin{cases}
    1,~ &\text{if }q \in F\\
    0,~ &\text{otherwise}.
    \end{cases}
\end{align}
%
% \begin{equation} \label{eq:prodrew}
% \begin{aligned}
% r^{\times}((s,q),a) & = \begin{cases} {r}(s,a), &\mbox{if } s \neq \aleph,\\
% 0, & \text{otherwise.}
% \end{cases}\\
% \end{aligned}    
% \end{equation}
The observation space ${O}$ is the same as in the original POMDP $\mathscr{M}$. The observation probability function $Z^\times((s,q);o)$ is defined as $Z(s;o)$ for every $s\in S,q\in Q,o\in O$. We denote the state of the product POMDP $\mathscr{M}^\times$ at time $t$ with $X_t = (S_t,Q_t)$ in order to avoid confusion with the state $S_t$ of the original POMDP $\mathscr{M}$.

At any given time $t$, the information available to the agent is $I_t = \{O_{0:t},A_{0:t-1}\}$. Control laws and policies in the product POMDP are the same as in the original POMDP $\mathscr{M}$. We define two reward functions in the product POMDP: (i) a reward $\mathcal{R}^{\mathscr{M}^\times}(\mu)$ associated with the original POMDP $\mathscr{M}$, and (ii) a reward $\mathcal{R}^{f}(\mu)$ associated with reaching an accepting state in the DFA $\mathscr{A}$.
The reward $\mathcal{R}^{\mathscr{M}^\times}(\mu)$ is defined as
\begin{align}
    \mathcal{R}^{\mathscr{M}^\times}(\mu) = \E_\mu\left[\sum_{t=0}^T r_t^\times(X_t,A_t)\right].
\end{align}
The reward $\mathcal{R}^{f}(\mu)$ is defined as
\begin{align}
    \mathcal{R}^{f}(\mu) = \E_\mu\left[ r^f(X_{T+1})\right].
\end{align}
Due to Assumption \ref{finiteassumpstrong}, the stopping time $T$ is finite almost surely, and therefore, the reward $\mathcal{R}^{f}(\mu)$ is well-defined.
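
To see why $\mathcal{R}^{f}(\mu)$ tracks satisfaction of $\varphi$, note that by \eqref{eq:prodtrans} the DFA component of the product state evolves deterministically given the POMDP state, that is, with probability one,
\begin{equation*}
Q_{t+1} = \delta(Q_t, L(S_t)), \quad Q_0 = q_0.
\end{equation*}
Unrolling this recursion, $Q_{T+1}$ is the state reached by $\mathscr{A}$ after reading the word $L(S_0)L(S_1)\ldots L(S_T)$. By \eqref{finalrewdef}, $r^f(X_{T+1}) = 1$ exactly when $Q_{T+1} \in F$, that is, when this word is accepted by $\mathscr{A}$.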

In the constrained product POMDP, we are interested in solving the following constrained optimization problem
\begin{equation} \tag{P2}\label{prodprobform}
    \begin{aligned}
   \textbf{C-POMDP:}~~~\underset{\mu}{\text{ max }} \quad & {\mathcal R}^{\mathscr{M}^\times}(\mu)\\ \mathrm{s.t.} \quad  & \mathcal{R}^{f}(\mu) \geq  1-\delta.
\end{aligned}
\end{equation}
\begin{theorem}[Equivalence of Problems \eqref{probform} and \eqref{prodprobform}]\label{equiv1}
For any policy $\mu$, we have
\begin{align}
    \mathcal{R}^{\mathscr{M}^\times}(\mu) &= \mathcal{R}^{\mathscr{M}}(\mu)\\
    \mathcal{R}^{f}(\mu) &= \Py_{\mu}^{\mathscr{M}}(\varphi).
\end{align}
Therefore, a policy $\mu^*$ is an optimal solution to Problem \eqref{probform} if and only if it is an optimal solution to Problem \eqref{prodprobform}.
\end{theorem}
\begin{proof}
See Appendix \ref{equiv1proof}.
\end{proof}



\section{A No-regret Learning Approach for Solving the Constrained POMDP}
Problem \eqref{prodprobform} is a POMDP policy optimization problem with constraints. Solving unconstrained optimization problems is generally easier than solving constrained optimization problems. In this section, we describe a general methodology that reduces the constrained POMDP optimization problem \eqref{prodprobform} to a series of unconstrained POMDP problems. These unconstrained problems can be solved using any off-the-shelf solver. The main idea is to first transform Problem \eqref{prodprobform} into a max-min problem using the Lagrangian function. This max-min problem can then be solved approximately using a no-regret algorithm such as the exponentiated gradient (EG) algorithm.

The Lagrangian function associated with Problem \eqref{prodprobform} is
\begin{align}
    L(\mu,\lambda)=\mathcal{R}^{\mathscr{M}^\times}(\mu) + \lambda (\mathcal{R}^{f}(\mu)-1+\delta).
\end{align}
Let
\begin{align}
    l^*:=\sup_{\mu}\inf_{\lambda\geq 0} L(\mu,\lambda)\tag{P3}\label{supinf}.
\end{align}
The constrained optimization problem in \eqref{prodprobform} is equivalent to the sup-inf optimization problem above \citep{boyd2004convex}. That is, if an optimal solution $\mu^*$ exists in Problem \eqref{prodprobform}, then $\mu^*$ is a maximizer in \eqref{supinf}, and if \eqref{prodprobform} is infeasible, then $l^*=-\infty$. Further, the optimal value of Problem \eqref{prodprobform} is equal to $l^*$. Consider the following variant of \eqref{supinf} wherein the Lagrange multiplier $\lambda$ is bounded:
\begin{align}
    l^*_B:=\sup_{\mu}\inf_{0\leq\lambda\leq B} L(\mu,\lambda)\tag{P4}\label{bsupinf}.
\end{align}
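To see what the bound $B$ changes, write $g(\mu) := \mathcal{R}^{f}(\mu)-1+\delta$ (a shorthand introduced here only for this illustration). Since $L(\mu,\lambda)$ is affine in $\lambda$, the inner infimum can be computed in closed form:
\begin{align*}
\inf_{\lambda\geq 0} L(\mu,\lambda) &= \begin{cases}\mathcal{R}^{\mathscr{M}^\times}(\mu), & g(\mu)\geq 0,\\ -\infty, & \text{otherwise,}\end{cases}\\
\inf_{0\leq\lambda\leq B} L(\mu,\lambda) &= \mathcal{R}^{\mathscr{M}^\times}(\mu) + B\min\{0,g(\mu)\}.
\end{align*}
Thus \eqref{supinf} enforces the constraint exactly, whereas \eqref{bsupinf} replaces it with a linear penalty of slope $B$ on the violation, so a larger $B$ penalizes violations more heavily.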
\begin{lemma}\label{epsopt}
Let $\bar{\mu}$ be an $\epsilon$-optimal strategy in sup-inf problem \eqref{bsupinf}, i.e.,
\begin{align}
    l^*_B \leq \inf_{0\leq \lambda \leq B}L(\bar{\mu},\lambda) + \epsilon,\label{epsoptcond}
\end{align}
for some $\epsilon>0$. Then, we have
\begin{align}
    \mathcal{R}^{\mathscr{M}^\times}(\bar{\mu}) &\geq \mathcal{R}^*-\epsilon,~~\text{and}\label{rewsat}\\
   \mathcal{R}^{f}(\bar{\mu})&\geq 1-\delta - \epsilon^f,\label{consat}
\end{align}
where $\epsilon^f = \frac{R_m-\mathcal{R}^* +\epsilon}{B}$
% \begin{align}
%     \epsilon^f = \frac{R_m-\mathcal{R}^* +\epsilon}{B},
% \end{align}
and $R_m:= \sup_\mu \mathcal{R}^{\mathscr{M}^\times}({\mu})$ is the maximum achievable reward.
\end{lemma}
\begin{proof}
See Appendix \ref{epsoptproof}.
\end{proof}
Lemma \ref{epsopt} suggests that if we can find an $\epsilon$-optimal mixed policy $\bar{\mu}$ of the bounded sup-inf problem \eqref{bsupinf}, then $\bar{\mu}$ is approximately optimal and approximately satisfies the constraint in Problem \eqref{prodprobform} and therefore, due to Theorem \ref{equiv1}, also in Problem \eqref{probform}.

We use the exponentiated gradient (EG) algorithm to find an $\epsilon$-optimal policy $\bar{\mu}$ for Problem \eqref{bsupinf}.
Let $f(\lambda) = \sup_\mu L(\mu,\lambda)$.
% \begin{align}
%     f(\lambda) = \sup_\mu L(\lambda,\mu).
% \end{align}
A sub-gradient of the function $f(\cdot)$ at $\lambda$ is given by ($\mathcal{R}^f(\mu_\lambda)-1+\delta$), where
\begin{align}
    \mu_\lambda =  \arg\sup_\mu L(\mu,\lambda).\label{mulam}
\end{align}
\begin{remark}
For solving an unconstrained POMDP, it is sufficient to consider pure strategies, and therefore, most solvers optimize only over the space of pure strategies. Thus, $\mu_\lambda$ is supported on a single pure strategy for every $\lambda$.
\end{remark}

The EG algorithm uses this sub-gradient to iteratively update $\lambda$. The value of $\lambda$ at the $k$-th iteration is denoted by $\lambda_k$ and the corresponding maximizing policy $\mu_{\lambda_k}$ is simply denoted by $\mu_k$. The EG algorithm is described in detail in Algorithm \ref{algjr}. Computing the sub-gradient involves two key steps: solving the unconstrained POMDP in \eqref{mulam} and evaluating the constraint $\mathcal{R}^f(\mu)$. The algorithm does not depend on which methods are used for solving the unconstrained POMDP and evaluating the constraint.
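
The update in Algorithm \ref{algjr} can be read as a multiplicative update on the odds $\lambda_k/(B-\lambda_k)$. Writing $g_k := \hat{p}_k-1+\delta$ (a shorthand used only in this illustration), a direct calculation from the update rule gives
\begin{align*}
\frac{\lambda_{k+1}}{B-\lambda_{k+1}} &= \frac{\lambda_k}{B-\lambda_k}\,e^{-\eta g_k},\\
\lambda_{k+1} &= \frac{B}{1+\exp\left(\eta\sum_{j=1}^{k} g_j\right)},
\end{align*}
where the second line uses $\lambda_1 = B/2$. Hence $\lambda_k$ always stays in $(0,B)$, decreases when the constraint is satisfied with slack ($g_k>0$), and increases when it is violated ($g_k<0$). For instance, with the purely hypothetical values $B=5$, $\eta = 0.1$, $1-\delta = 0.75$, and $\hat{p}_1 = 0.9$, we get $g_1 = 0.15$ and $\lambda_2 = 5/(1+e^{0.015}) \approx 2.48$, slightly below $\lambda_1 = 2.5$.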


The following theorem states that the average policy $\bar{\mu}$ obtained from Algorithm \ref{algjr} is an $\epsilon$-optimal policy for Problem \eqref{bsupinf}.



\begin{algorithm}[tb]
  \caption{Exponentiated Gradient Algorithm}
  \label{algjr}
\begin{algorithmic}
%   \STATE {\bfseries Input:} data $x_i$, size $m$
%   \REPEAT
    \STATE Input: Constrained product POMDP $\mathscr{M} ^{\times}$
    \STATE Initialize $\lambda_1 = B/2$
  \FOR{$k=1,\dots,K$}
\STATE $\mu_k \leftarrow \textsc{opt}(\mathscr{M} ^{\times},\lambda_k) = \arg\sup_\mu L(\mu,\lambda_k)$
\STATE $\hat{p}_k \leftarrow \textsc{eval}(\mu_k) = \mathcal{R}^{f}(\mu_k)$
\STATE $\lambda_{k+1}= B\frac{\lambda_k e^{-\eta(\hat{p}_k-1+\delta)}}{B+\lambda_k(e^{-\eta(\hat{p}_k-1+\delta)}-1)}$
  \ENDFOR
%   \UNTIL{$noChange$ is $true$}
\STATE Output: $\bar{\mu}=\frac{\sum_{k=1}^K\mu_k}{K}$, $\bar{\lambda}=\frac{\sum_{k=1}^K\lambda_k}{K}$
\end{algorithmic}
\end{algorithm}

\begin{theorem}\label{lagrangethm}
Under Assumption \ref{finiteassumpstrong} and if $\eta = \sqrt{\frac{\log 2}{2KB^2}}$, the strategy $\bar{\mu}$ returned by Algorithm \ref{algjr} satisfies
\begin{align}
    l^*_B \leq \inf_{0\leq \lambda \leq B}L(\bar{\mu},\lambda) + 2B\sqrt{2\log2/K}.
\end{align}
Therefore,
\begin{align}
    \mathcal{R}^{\mathscr{M}}(\bar{\mu}) &\geq \mathcal{R}^*-2B\sqrt{2\log2/K}\\
     \Py_{\bar{\mu}}^{\mathscr{M}}(\varphi)&\geq 1-\delta+ \frac{\mathcal{R}^*-R_m-2B\sqrt{2\log2/K}}{B}.
\end{align}
\end{theorem}
\begin{proof}
The proof of this theorem is a variation of the proof of the von Neumann theorem in Section 8.3 of \citet{hazan2016introduction}. See Appendix \ref{lagrangethmproof} for details.
\end{proof}
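
As a rough illustration of how the bound scales (taking $\log$ to be the natural logarithm), requiring the optimization error $2B\sqrt{2\log 2/K}$ to be at most a target $\epsilon$ is equivalent to
\begin{equation*}
K \geq \frac{8B^2\log 2}{\epsilon^2}.
\end{equation*}
For example, with the hypothetical values $B = 5$ and $\epsilon = 0.5$, this gives $K \geq 8\cdot 25\cdot \log 2/0.25 \approx 554.5$, i.e., $K = 555$ iterations suffice for the bound. For a fixed $\epsilon$, increasing $B$ shrinks the constraint slack $\epsilon^f$ in Lemma \ref{epsopt}, but the number of iterations needed grows quadratically in $B$.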


In Theorem \ref{lagrangethm}, we implicitly assume that Algorithm \ref{algjr} has access to an exact unconstrained POMDP solver and a method for evaluating $\mathcal{R}^f(\mu)$ exactly. In practice, however, methods for solving POMDPs and evaluating policies are approximate. A result similar to Theorem \ref{lagrangethm} can be obtained even with approximate solvers by using the arguments in Appendix D of \citet{kalagarla2021model}.

% \begin{remark}
% As per Algorithm \ref{algjr}, we may have to keep track of $K$ policies. This can at times be prohibitively large. In order to keep the support of our mixed policy small, we can obtain a \emph{basic feasible solution} of the following LP. This leads to a mixed policy whose support size is at most two.
% \begin{equation} \tag{BFS}\label{bfs}
%     \begin{aligned}
%   \underset{w}{\text{ max }} \quad & \sum_{k=1}^K w_k{\mathcal R}^{\mathcal{M}^\times}(\mu_k)\\ \mathrm{s.t.} \quad  & \sum_{k=1}^K w_k\mathcal{R}^{f}(\mu_k) \geq  1-\delta-o(1/\sqrt{K})\\
%   &\sum_{k=1}^K w_k \leq 1\\
%   & w_k \geq 0, \quad \forall k.
% \end{aligned}
% \end{equation}
% \end{remark}



\subsection{Fixed Stopping Time}
Consider the case when the horizon $T$ is a constant. With a slight abuse of notation, we denote this constant with $T$. In this case, Assumption \ref{finiteassumpstrong} is trivially true, and therefore, Theorem \ref{lagrangethm} holds. The Lagrangian function in this case is given by
\begin{align}
    &L(\mu,\lambda) \\
    &= \E_\mu\left[\left(\sum_{t=0}^T r_t^\times(X_t,A_t)\right)+\lambda(r^f(X_{T+1})-1+\delta) \right].\nonumber
\end{align}
Clearly, for a given $\lambda$, we can maximize $L(\mu,\lambda)$ over $\mu$ using a finite-horizon POMDP solver \citep{walraven2019point}. The resulting policy $\mu_\lambda$ is a pure policy (potentially time-varying) and selects actions based on the product POMDP's posterior belief, where, for a state $x \in S^\times$, the posterior belief $b_t \in \Delta S^\times$ at time $t$ is defined as $b_t(x) = \Py[X_t = x \mid I_t]$.
% \begin{align}
%     b_t(x) = \Py[X_t^\times = x \mid I_t].\label{beliefdef}
% \end{align}
The constraint $\mathcal{R}^f(\mu)$ for any policy $\mu$ can be evaluated by Monte-Carlo simulation. Therefore, with the help of a finite-horizon POMDP solver and the Monte-Carlo method for constraint evaluation, we can employ Algorithm \ref{algjr} to approximately solve Problem \eqref{probform}.


%\subsection{Constrained Belief State MDP and Occupation Measure Linear Program}
% For an instance $x \in S^\times$, define the posterior belief $b_t \in \Delta S^\times$ at time $t$ as
% \begin{align}
%     b_t(x) = \Py[X_t^\times = x \mid I_t].
% \end{align}
% The initial belief is given by $b_0((s,q)) = \mu_0(s)\mathds{1}_{q_0}(q)$ for $s \in {S}, q \in Q$.







\subsection{Geometrically-distributed Time Horizon}\label{geometric}
% \dk{Geometric horizon: potential issues and justification}
% \begin{enumerate}
%     \item Stopping model: Perform task until I come back. The arrival process can be modeled as a geometric process. Given the mean arrival time, geometric process has maximum entropy.
%     \item Accounts for a wide range of horizons instead of focusing on a particular horizon.
%     \item Light-tailed: Although technically unbounded, extremely long horizons can be ignored since the tail decays exponentially.
% \end{enumerate}


Let $\{E_t: t=0,1,2,...\}$ be a sequence of i.i.d. Bernoulli random variables with $\Py[E_0 = 1] = 1-\gamma$ ($\gamma<1$). Let the time-horizon $T$ be defined as
\begin{align}
    T = \min\{t: E_t = 1,t=0,1,\cdots\}.
\end{align}
This stopping time $T$ has a geometric distribution with probability mass function $(1-\gamma)\gamma^t$. The mean of this stopping time is $\gamma/(1-\gamma)$ for every policy, and therefore, it satisfies Assumption \ref{finiteassumpstrong}.
%Given the mean value of the stopping time, the geometric distribution has maximum entropy and therefore, makes minimal assumptions. 
This type of stopping time is useful in situations where the process stops when an exogenous event occurs ($E_t=1$). The occurrence time of such exogenous events is typically modeled as a geometric (memoryless) distribution. 
We observe that, under this stopping model, it is possible that the process stops in just a few steps (or even one step). However, when $\gamma$ is close to 1, the probability that the process stops quickly is very small. Because of this property, this geometric stopping time can also be used to approximately model bounded horizon problems with a sufficiently large $\gamma$.
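
For concreteness, the tail and the mean of $T$ follow directly from the probability mass function:
\begin{align*}
\Py[T \geq n] &= \sum_{t=n}^{\infty}(1-\gamma)\gamma^t = \gamma^n,\\
\E[T] &= \sum_{n=1}^{\infty}\Py[T\geq n] = \frac{\gamma}{1-\gamma}.
\end{align*}
For example, with $\gamma = 0.99$, the mean horizon is $99$ steps and the probability of stopping within the first ten steps is $\Py[T < 10] = 1-0.99^{10} \approx 0.096$.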

We now show that solving the unconstrained POMDP in \eqref{mulam} reduces to solving an equivalent discounted-reward POMDP. Discounted-reward POMDP solvers have been extensively studied and several implementations of them are readily available \citep{kurniawati2008sarsop,somani2013despot}. Therefore, we can use any off-the-shelf discounted POMDP solver for this stopping model.

Let $\mathscr{M}$ be any \emph{time-variant} POMDP  and let $\mathscr{A}$ be a DFA  capturing the $\ltlf$ formula $\varphi$.
\begin{lemma}\label{disc}
For a given $\lambda$, maximizing $L(\mu,\lambda)$ over $\mu$ under the geometric stopping criterion is equivalent to maximizing the following discounted reward
\begin{align}
    \E_\mu\left[\sum_{t=0}^\infty \gamma^{t}\left(r_t^\times(X_t,A_t)+ \frac{\lambda(1-\gamma)}{\gamma}r^f(X_{t})\right)\right].
\end{align}
\end{lemma}
\begin{proof}
See Appendix \ref{discproof}.
\end{proof}
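
The following is only a sketch of the main calculation, under the assumption (consistent with the exogenous stopping model above) that the variables $E_t$ are independent of the state-action process under every policy; the complete argument is in Appendix \ref{discproof}. Since $\Py[T\geq t] = \gamma^t$ and $\Py[T = t] = (1-\gamma)\gamma^t$,
\begin{align*}
\E_\mu\Big[\sum_{t=0}^{T} r_t^\times(X_t,A_t)\Big] &= \sum_{t=0}^{\infty}\gamma^t\,\E_\mu\big[r_t^\times(X_t,A_t)\big],\\
\E_\mu\big[r^f(X_{T+1})\big] &= \sum_{t=0}^{\infty}(1-\gamma)\gamma^t\,\E_\mu\big[r^f(X_{t+1})\big]\\
&= \frac{1-\gamma}{\gamma}\sum_{t=1}^{\infty}\gamma^t\,\E_\mu\big[r^f(X_{t})\big].
\end{align*}
The constant $\lambda(\delta-1)$ in $L(\mu,\lambda)$ and the value $r^f(X_0)$, which is determined by $q_0$ alone, do not depend on $\mu$, so they do not affect the maximizer.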


For a given $\lambda$, we can therefore maximize $L(\mu,\lambda)$ over $\mu$ using an infinite-horizon discounted-reward POMDP solver \citep{kurniawati2008sarsop}. The resulting policy $\mu_\lambda$ is a pure stationary policy and selects actions based on the product POMDP's posterior belief. Such a discounted-reward solver and a Monte-Carlo estimator can be used in Algorithm \ref{algjr} to solve Problem \eqref{probform} when the stopping time is geometrically distributed.
% The product POMDP $\mathscr{M} ^{\times}$ that is obtained from $\mathscr{M}$ and $\mathscr{A}$ with the geometric stopping time is as follows.
% In the product MDP $\mathscr{M} ^{\times}$, $S^{\times} = ({S} \times Q\times \{0,1\})$ is the set of states, $A^{\times} = {A}$ is the action set, and $s_{0}^{\times} = (s_0,q_0,e)$ is the initial state where $s_0$ is drawn from the distribution $\varpi$ and $e$ is 0 independently with probability $\gamma$. For each $s,s'\in S, q,q' \in Q, e,e' \in \{0,1\}$ and $a \in A$, we define the transition function as follows
% %
% \begin{equation} \label{eq:prodtrans}
% \begin{aligned}
% &P^{\times}((s,q,e),a;(s',q',e'))\\=
% & \begin{cases} \gamma P_t(s,a;s'), &\mbox{if } q' = \delta(q,L(s)),e'=0, \\
%  (1-\gamma)P_t(s,a;s'), &\mbox{if } q' = \delta(q,L(s)),e'=1, \\
% 0, & \text{otherwise.}
% \end{cases}\\
% \end{aligned}    
% \end{equation}

% The reward functions $r^\times$ and $r^f$ are defined as 
% \begin{align}
%     r^{\times}_t((s,q,e),a) &= {r}_t(s,a)(1-e)\ \forall s,q,e,a\\
%     r^f((s,q,e)) &=
%     \begin{cases}
%     1 &\text{if }q \in F\\
%     0 &\text{otherwise}.
%     \end{cases}
% \end{align}



% %
% % \begin{equation} \label{eq:prodrew}
% % \begin{aligned}
% % r^{\times}((s,q),a) & = \begin{cases} {r}(s,a), &\mbox{if } s \neq \aleph,\\
% % 0, & \text{otherwise.}
% % \end{cases}\\
% % \end{aligned}    
% % \end{equation}
% The observation space ${O}$ is the same as in the original POMDP $\mathscr{M}$. The observation function $Z^\times((s,q,e);o)$ is defined as $Z(s;o)$ for every $s\in S,q\in Q,e\in \{0,1\},o\in O$. We denote the state of the product POMDP $\mathscr{M}^\times$ at time $t$ with $X_t = (S_t,Q_t,E_t)$ in order to avoid confusion with the state $S_t$ of the original POMDP $\mathscr{M}$. Even though the information of the agent is expanded to include $E_{0:t}$, the agent's strategy space is essentially the same as in the original POMDP $\mathscr{M}$. This is because until $T-1$, $E_t$ is always $0$. At $T$, $E_t$ changes to $1$ but since the reward $r^\times$ is $0$ when $E_t=1$, the agent's action does not matter at time $T$.

% At any given time $t$, the information available to the agent is $I_t = \{O_{0:t},A_{0:t-1},E_{0:t}\}$. 


\section{Experiments}\label{exp}
We consider a collection of gridworld problems in which an agent needs to maximize its reward while satisfying an $\ltlf$ specification. In all our experiments, we use the geometric stopping (discounted) setting described in Section \ref{geometric}. Our primary reason for focusing on geometric stopping is the availability of a wide range of infinite-horizon discounted-reward solvers. The focus of our experiments is to demonstrate how our approach of constructing the product POMDP and using Algorithm \ref{algjr} results in behaviors that maximize the reward and satisfy the $\ltlf$ specification. We would like to emphasize that our approach can be extended to any other stopping time model as long as it has an associated unconstrained solver and a reward estimator. The computational complexity of our approach is about $K$ (number of iterations in Algorithm \ref{algjr}) times the complexity of solving the unconstrained POMDP and evaluating the constraint. Therefore, the scalability of our algorithm largely depends on the scalability of the methods for solving and evaluating unconstrained POMDPs.

In all of our experiments, we use the SARSOP solver for finding an approximately optimal policy $\mu_k$ at iteration $k$ of Algorithm \ref{algjr}, and Monte-Carlo simulations to estimate the constraint function. Additional details on the hyper-parameters and runtime of our experiments can be found in Appendix \ref{expdetails}. We further use the online tool LTL\textsubscript{f}2DFA \citep{francesco_fuggitti_2019_3888410} based on MONA \citep{monamanual2001} to generate an equivalent DFA for an $\ltlf$ formula.


\begin{table}
    \centering
    \caption{Reward and constraint performance of the policy $\bar{\mu}$ under various models and specifications.}\label{tab:data}
    \begin{tabular}{lccccc}
      \toprule % from booktabs package
      \bfseries Model & \bfseries Spec & $\mathcal{R}^{\mathscr{M}}(\bar{\mu})$ & $\mathcal{R}^f(\bar{\mu})$ & $1-\delta$ & $B$ \\
      \midrule % from booktabs package
      $\mathscr{M}_1$ & $\varphi_1$& $1.72$ & $0.75$ & $0.75$ & $5$\\ %1.1
      $\mathscr{M}_2$ & $\varphi_1$& $0.95$ & $0.70$ & $0.70$ & $8$\\ %1.2
      $\mathscr{M}_3$ &$\varphi_2$ & $0.83$ & $0.76$ & $0.75$ & $5$\\ %2
      $\mathscr{M}_4$ & $\varphi_3$& $0.80$ & $0.71$ & $0.70$ & $6$\\ %3
      $\mathscr{M}_5$  & $\varphi_4$& $0.83$ & $0.71$ & $0.70$ & $6$\\ %4
      $\mathscr{M}_6$  &$\varphi_5$ & $1.01$ & $0.79$ & $0.80$ & $10$\\ %5new
      $\mathscr{M}_7$  &$\varphi_6$ & $4.28$ & $0.82$ & $0.80$ & $25$\\ %6new
      $\mathscr{M}_8$  &$\varphi_1$ & $2.73$ & $0.81$ & $0.85$ & $20$\\ %Fa1
      $\mathscr{M}_9$  &$\varphi_4$ & $1.68$ & $0.81$ & $0.75$ & $10$\\ %sequence1 (previous)
      \bottomrule % from booktabs package
    \end{tabular}
\end{table}



\subsection{Location Uncertainty}
In all the experiments in this subsection, the agent's transitions in the gridworld are stochastic. That is, if the agent decides to move in a certain direction, it moves in that direction with probability $0.95$ and,  with probability $0.05$, it moves one step with uniform probability in any direction that is not opposite to its intended direction. The agent also receives a noisy observation on where it is currently located. The observation is uniformly distributed among the locations neighboring the agent's current location. The default grid size is $4\times 4$ and the discount factor is $0.99$. The details on the reward structures can be found in Appendix \ref{expdetails}.



\paragraph{Reach-Avoid Tasks.}
In this problem, we are interested in reaching a goal state $a$ and always avoiding dangerous states $b$. This can be specified using $\ltlf$ as $ \varphi_1 = \textbf{F}a\wedge(\textbf{G}\neg b)$.
In this case, we consider a $4 \times 4$ grid (model $\mathscr{M}_1$ with a single obstacle $b$) and an $8\times 8$ grid (model $\mathscr{M}_2$ with two obstacles $b$). 

\paragraph{Ordered Tasks.}
In this problem, we are interested in reaching states $a, b$, and $c$ in a certain order. If we are interested in reaching $b$ after $a$, the corresponding specification is $\varphi_2=\textbf{F}(a \wedge \textbf{F}b)$. Similarly, if we want to visit $a$, $b$, and $c$ in that order, the specification is $\varphi_3=\textbf{F}(a \wedge \textbf{F}(b \wedge \textbf{F}c))$. Under the specification $\textbf{F}(a \wedge \textbf{F}b)$, it is possible that the agent visits $b$, then $a$, and then $b$. To ensure that a strict order is maintained, we can have the specification $\varphi_4 = \neg b \textbf{U} (a \wedge \textbf{F}b)$. These tasks were performed on models $\mathscr{M}_3,\mathscr{M}_4$, and $\mathscr{M}_5$ (see Appendix \ref{expdetails}). 
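
To make the difference between $\varphi_2$ and $\varphi_4$ concrete, consider the word $w = \{b\}\,\{a\}\,\{b\}$ (chosen here only for illustration). It satisfies $\varphi_2$, since $w,1 \models a \wedge \textbf{F}b$. It violates $\varphi_4$: the only position where $a \wedge \textbf{F}b$ holds is $k=1$, but $w,0 \not\models \neg b$. The word $\emptyset\,\{a\}\,\{b\}$, on the other hand, satisfies both specifications.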

\paragraph{Reactive Tasks.}
In this problem, we consider a more complicated specification. There are four states of interest: $a,b,c$, and $d$. The agent must eventually reach $a$ or $b$. However, if it reaches $b$, then it must visit $c$ without visiting $d$. This can be expressed as $\varphi_5= \textbf{F}(a \vee b) \wedge \textbf{G}(b \to (\neg {d}  \textbf{U}c))$. This task was performed on model $\mathscr{M}_6$ (see Appendix \ref{expdetails}). 

Another task specification is the following: eventually reach $a$; if you visit $b$ immediately after reaching $a$, then eventually visit $c$; otherwise, visit $d$. This can be expressed as $\varphi_6 = \textbf{F}a \wedge \textbf{G}(((a\wedge\textbf{X}b)\to \textbf{F}c)\wedge ((a\wedge\textbf{X}\neg b)\to \textbf{F}d))$. This task was performed on model $\mathscr{M}_7$ (see Appendix \ref{expdetails}). 


\subsection{Predicate Uncertainty}
In all the experiments in this subsection, the agent's transitions in the gridworld are deterministic. That is, if the agent decides to move in a certain direction, it moves in that direction with probability $1$. The uncertainty is in the location of objects that the agent may have to reach or avoid. The agent receives observations that may convey some information about the objects' locations. A detailed description of the observation model is provided in Appendix \ref{expdetails}. The grid size in these models is $4\times 4$ and the discount factor is $0.99$.

\paragraph{Reach-Avoid Tasks.} The reach-avoid specification ($\varphi_1$) is the same as earlier. However, the agent does not know which location to avoid. The agent must therefore gather enough information to assess where the undesirable state is and act accordingly. This task was performed on model $\mathscr{M}_8$ (see Appendix \ref{expdetails}).

\paragraph{Ordered Tasks.}
The agent needs to visit states $a$ and $b$ strictly in that order. Therefore, the specification is $\varphi_4$. However, the agent does not know where $b$ is located. Once again, it must gather enough information and then traverse the grid accordingly. This task was performed on model $\mathscr{M}_9$ (see Appendix \ref{expdetails}).

For each model discussed above, we use Algorithm \ref{algjr} to generate a mixed policy $\bar{\mu}$. The corresponding reward $\mathcal{R}^\mathscr{M}(\bar{\mu})$ and the constraint $\mathcal{R}^f(\bar{\mu})$ (which is the same as the satisfaction probability $\Py_{\bar{\mu}}^{\mathscr{M}}(\varphi)$) are shown in Table \ref{tab:data}. The reward and the constraint have been estimated by running $200$ Monte-Carlo simulations. We observe that the probability of satisfying the constraint generally exceeds the required threshold. Occasionally, the constraint is violated, albeit only by a small margin. This is consistent with our result in Theorem \ref{lagrangethm}.
Since we cannot exactly compute the optimal feasible reward $\mathcal{R}^*$, it is difficult to assess how close our policy is to optimality. Nonetheless, we observe that the agent behaves in a manner that achieves high reward in all of these models. A more detailed discussion on this can be found in Appendix \ref{expdetails}.

\begin{figure}
     \centering
     \begin{subfigure}[b]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{images/tr}
         \caption{Top-right obstacle}
         \label{fig:y equals x}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{0.22\textwidth}
         \centering
         \includegraphics[width=\textwidth]{images/bl}
         \caption{Bottom-left obstacle}
         \label{fig:three sin x}
     \end{subfigure}
        \caption{Trajectories in model $\mathscr{M}_8$ and specification $\varphi_1$}
        \label{fa1pic}
\end{figure}

\subsection{Discussion}
In this section, we discuss the interplay between reward maximization, constraint satisfaction, and partial observability for executing the reach-avoid task in model $\mathscr{M}_8$. The state in this model comprises two parts: (i) the agent's location and (ii) the object $b$'s location. The object can only be in the bottom-left corner or the top-right corner (see Figure \ref{fa1pic}). The agent receives high reward when it remains in the top-right corner, moderate reward in the bottom-left corner, and no reward everywhere else. Further, the agent does not know the obstacle's location a priori. If the agent gets close to the obstacle, it can detect the obstacle with some probability. The agent's detection capability is better when it is in the bottom-left region than when it is in the top-right region (see Appendix \ref{expdetails}).

In order to balance the reward, constraint satisfaction, and information acquisition, our agent acts as follows. It first heads towards the location $a$ (since it has to eventually visit it) via the bottom-left region without hitting the corner. Since the agent's detection capability is higher in the bottom-left region, it acquires information on where the object is located. After reaching $a$, it goes to the top-right corner if the object is \emph{not} located there and to the bottom-left corner otherwise. Some typical trajectories of the agent are shown in Figure \ref{fa1pic}.

Figure \ref{fig:my_label} depicts the performance of the policies $\mu_k$ generated while executing Algorithm \ref{algjr}. In the vast majority of iterations, the constraint is satisfied. The Lagrange multiplier $\lambda_k$ decreases as long as the constraint is satisfied. Eventually, it becomes too small and the constraint is violated, which is when we observe a spike in the reward. These spikes add to the average reward. Since the constraint violation at these iterations is substantial, the Lagrange multiplier then increases. This iterative process ensures that constraint violations occur rarely. Since the mixed policy $\bar{\mu}$ selects one of the policies $\mu_k$ uniformly at random, its satisfaction probability remains close to the threshold (see Table \ref{tab:data}).


\begin{figure}
    \centering
    \includegraphics[scale=0.5]{Fa1_paper2.png}
    \caption{This plot depicts how the Lagrange multiplier $\lambda_k$, the reward $\mathcal{R}^\mathscr{M}(\mu_k)$ and the probability of satisfaction $\mathcal{R}^f(\mu_k)$ evolve with $k$ in Algorithm \ref{algjr} under model $\mathscr{M}_8$ with the reach-avoid specification $\varphi_1$. }
    \label{fig:my_label}
\end{figure}




\section{Conclusions}
In this paper, we provided a methodology for designing policies that maximize the total expected reward while ensuring that the probability of satisfying a linear temporal logic over finite traces ($\ltlf$) specification is sufficiently high. By augmenting the system state with the state of the DFA associated with the $\ltlf$ specification, we constructed a constrained product POMDP. Solving this constrained product POMDP is equivalent to solving the original problem. We provided an alternative constrained POMDP solver based on the exponentiated gradient (EG) algorithm and derived approximation bounds for it. We identified two types of stopping time (fixed and geometric) for which readily available unconstrained POMDP solvers can be used by our constrained POMDP solver. For geometric stopping time models, we computed near-optimal policies that satisfy the $\ltlf$ specification with sufficiently high probability. We observed in our experiments that our approach results in policies that effectively balance information acquisition (exploration), reward maximization (exploitation), and satisfaction of the specification, which is very difficult to achieve using classical POMDPs.


\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    This research was supported in part by the National Science Foundation under Awards 1839842, 1846524, and 2139982, the Office of Naval Research under Award N00014-20-1-2258, and the Defense Advanced Research Projects Agency under Award HR00112010003.
\end{acknowledgements}

%\newpage
% \bibliographystyle{IEEEtran}
\bibliography{kalagarla_619}

\begin{comment}

\appendix

% NOTE: necessary when ptmx or no mathfont class option is given
\section{Proof of Theorem \ref{equiv1}}\label{equiv1proof}
For any policy $\mu$, we have
\begin{align}
    \mathcal{R}^{\mathscr{M}^\times}(\mu) &= \E_\mu\left[\sum_{t=0}^T r_t^\times(X_t,A_t)\right]\\
    &= \E_\mu\left[\sum_{t=0}^T r_t^\times((S_t,Q_t),A_t)\right]\\
    & \stackrel{a}{=} \E_\mu\left[\sum_{t=0}^T r_t(S_t,A_t)\right] = \mathcal{R}^{\mathscr{M}}(\mu).
\end{align}
Here, the equality in $(a)$ follows from the definition of $r_t^\times$ in \eqref{prodrewdef}. Further, using $\eqref{finalrewdef}$, we have 
\begin{align}
    r^f(X_{T+1}) &= r^f((S_{T+1},Q_{T+1}))  = \mathds{1}_F(Q_{T+1}).
\end{align}
Following the acceptance condition of the DFA $\mathscr{A}$, which is equivalent to the $\ltlf$ specification $\varphi$, a run $\xi$ of the POMDP satisfies $\varphi$ if and only if the word generated by the run satisfies the acceptance condition of the DFA $\mathscr{A}$, i.e., its run on $\mathscr{A}$, denoted $\xi_{\mathscr{A}}$, ends in the acceptance set $F$. Hence,
\begin{align}
    \mathcal{R}^{f}(\mu) = \E_\mu\left[ r^f(X_{T+1})\right] = \Py_{\mu}^{\mathscr{M}}(\varphi).
\end{align}



\section{Proof of Lemma \ref{epsopt}}\label{epsoptproof}
We have
\begin{align}
    \mathcal{R}^* &= l^* \\
    &\leq l^*_B \\
    &\leq \inf_{0\leq \lambda \leq B}L(\bar{\mu},\lambda) + \epsilon\\
    &= \mathcal{R}^{\mathscr{M}^\times}(\bar{\mu}) + \inf_{0\leq \lambda \leq B}\lambda(\mathcal{R}^{f}(\bar{\mu})-1+\delta) + \epsilon.
\end{align}
There are two possible cases: (i) $\mathcal{R}^{f}(\bar{\mu})-1+\delta \geq 0$ and (ii) $\mathcal{R}^{f}(\bar{\mu})-1+\delta < 0$. 

If case (i) is true, then \eqref{consat} is trivially satisfied. Further, in this case,
\begin{align}
    \inf_{0\leq \lambda \leq B}\lambda(\mathcal{R}^{f}(\bar{\mu})-1+\delta) = 0.
\end{align}
Therefore, $\mathcal{R}^* \leq \mathcal{R}^{\mathscr{M}^\times}(\bar{\mu}) + \epsilon$ and hence, \eqref{rewsat} is satisfied.

If case (ii) is true, we have
\begin{align}
    \inf_{0\leq \lambda \leq B}\lambda(\mathcal{R}^{f}(\bar{\mu})-1+\delta) &= B(\mathcal{R}^{f}(\bar{\mu})-1+\delta)\\
    &<0.
\end{align}
Therefore, $\mathcal{R}^* \leq \mathcal{R}^{\mathscr{M}^\times}(\bar{\mu}) + \epsilon$ and hence, \eqref{rewsat} is satisfied.
Further, we have
\begin{align}
    B(\mathcal{R}^{f}(\bar{\mu})-1+\delta) &\geq \mathcal{R}^*- \mathcal{R}^{\mathscr{M}^\times}(\bar{\mu})-\epsilon\\
    &\geq \mathcal{R}^*- R_m-\epsilon.
\end{align}
The last inequality holds because $R_m$ is the maximum achievable reward. Hence, \eqref{consat} is satisfied.
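
To make the last step explicit, the bound above can be rearranged as follows. This is only a sketch, assuming $B > 0$; the form in which \eqref{consat} is stated in the main text takes precedence:
\begin{align*}
    \mathcal{R}^{f}(\bar{\mu}) \geq 1-\delta - \frac{R_m - \mathcal{R}^* + \epsilon}{B}.
\end{align*}
For a fixed $\epsilon$, the worst-case constraint violation therefore decays as $1/B$.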


\section{Proof of Theorem \ref{lagrangethm}}\label{lagrangethmproof}
Consider the dual of \eqref{bsupinf}. Let
\begin{align}
    u^*_B:=\inf_{0\leq\lambda\leq B}\sup_{\mu} L(\mu,\lambda)\tag{P5}\label{binfsup}.
\end{align}
We have
\begin{align}
    l^*_B &\stackrel{a}{\leq} u^*_B\\
     &= \inf_{0\leq\lambda\leq B}\sup_{\mu} L(\mu,\lambda)\\
    &\stackrel{}{\leq}  \sup_{\mu}L(\mu,\bar{\lambda})\\
    &\stackrel{b}{=} \frac{1}{K}\sum_{k=1}^KL(\mu_{\bar{\lambda}},{\lambda}_k)\\
    &\stackrel{c}{\leq} \frac{1}{K}\sum_{k=1}^KL(\mu_k,{\lambda}_k)\\
    &\stackrel{d}{\leq} \frac{1}{K}\inf_{0\leq\lambda\leq B}\sum_{k=1}^KL(\mu_k,{\lambda})+ 2B\sqrt{2\log2/K}\\
    &\stackrel{e}{=} \inf_{0\leq\lambda\leq B}L(\bar{\mu},{\lambda})+ 2B\sqrt{2\log2/K}.
\end{align}
The inequality in $(a)$ holds because of weak duality \citep{boyd2004convex}. Equality in $(b)$ holds because of the bilinearity (affine structure) of $L(\cdot)$. The inequality in $(c)$ holds because $\mu_k$ is the maximizer associated with $\lambda_k$. Inequality $(d)$ follows from Corollary 5.7 in \citet{hazan2016introduction}. Equality in $(e)$ is again a consequence of the bilinearity of $L(\cdot)$.
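
As a rough numerical illustration of the additive error term, consider the following sketch. It assumes that $\log$ denotes the natural logarithm and borrows $B = 5$ and $K = 100$ from the $\mathscr{M}_1$ row of Table~\ref{tab:add_data}; since $\eta$ is treated as a tuned hyperparameter in the experiments, the bound need not describe those runs exactly:
\begin{align*}
    2B\sqrt{\frac{2\log 2}{K}} = 10\sqrt{\frac{1.386}{100}} \approx 10 \times 0.1177 \approx 1.18.
\end{align*}
Conversely, the error term is at most a target $\epsilon > 0$ whenever $K \geq 8B^2\log 2/\epsilon^2$.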



\section{Proof of Lemma \ref{disc}}\label{discproof}
The rewards $\mathcal{R}^{\mathscr{M}^\times}(\mu)$ and $\mathcal{R}^{f}(\mu)$ in the corresponding product POMDP are given by
\begin{align}
    \mathcal{R}^{\mathscr{M}^\times}(\mu) &= \E_\mu\left[\sum_{t=0}^T r_t^\times(X_t,A_t)\right]\\
    &= \E_\mu\left[\sum_{t=0}^\infty \gamma^{t}r_t^\times(X_t,A_t)\right]\\
    \mathcal{R}^{f}(\mu) &= \E_\mu\left[ r^f(X_{T+1})\right]\\
    &= (1-\gamma)\E_\mu\left[\sum_{t=0}^\infty \gamma^{t}r^f(X_{t+1})\right]\\
    &= \frac{(1-\gamma)}{\gamma}\E_\mu\left[\sum_{t=1}^\infty \gamma^{t}r^f(X_{t})\right].
\end{align}
Therefore, we have
\begin{align}
    &L(\mu,\lambda) \\
    &= \E_\mu\left[\sum_{t=0}^\infty \gamma^{t}\left(r_t^\times(X_t,A_t)+ \frac{\lambda(1-\gamma)}{\gamma}r^f(X_{t})\right)\right]\nonumber \\
    &\qquad-\frac{\lambda(1-\gamma)}{\gamma}\E[r^f(X_{0})]-{\lambda(1-\delta)}.\nonumber
\end{align}
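
The discounted forms used above rely on the geometric stopping time. As a sketch, assume that $T$ is independent of the state-action process with $\Pr(T = t) = (1-\gamma)\gamma^t$ for $t \geq 0$ (an assumption stated here for illustration, consistent with the factor $(1-\gamma)$ above), so that $\Pr(T \geq t) = \gamma^t$, and that the rewards are bounded so that sums and expectations can be interchanged. Then
\begin{align*}
    \E_\mu\left[\sum_{t=0}^T r_t^\times(X_t,A_t)\right] &= \sum_{t=0}^\infty \Pr(T \geq t)\,\E_\mu\left[r_t^\times(X_t,A_t)\right] = \sum_{t=0}^\infty \gamma^{t}\,\E_\mu\left[r_t^\times(X_t,A_t)\right],\\
    \E_\mu\left[r^f(X_{T+1})\right] &= \sum_{t=0}^\infty \Pr(T = t)\,\E_\mu\left[r^f(X_{t+1})\right] = (1-\gamma)\sum_{t=0}^\infty \gamma^{t}\,\E_\mu\left[r^f(X_{t+1})\right].
\end{align*}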

\section{Additional Details on Experiments}\label{expdetails}

\subsection{Model Description}

In this subsection, we provide further details on the grid world POMDP models used in our experiments. The images corresponding to the various models indicate the state space and the labeling function, e.g., in Fig.~\ref{fig:model1}, we have $L[(1,2)] = \{b\}, L[(3,3)] = \{a\}$ and $L[(i,j)] = \{\}$ for all other grid locations $(i,j)$. In all models, the agent starts from the grid location $(0,0)$. Further, the reward for all actions is $0$ in all grid locations, unless specified otherwise. In the supplementary material, we also provide videos that capture some representative behaviors of the policies generated by Algorithm~\ref{algjr}. We discuss them in greater detail below.

\subsubsection{Location Uncertainty}

\paragraph{Reach-Avoid Tasks.}

In model $\mathscr{M}_1$, reward $r((0,3),a) = 2$ and $r((3,3),a) = 1$ for all actions $a$. We observe that the agent satisfies the reach-avoid constraint with high probability and ends up in the top-right corner where the reward is highest. A representative trajectory for this model can be found in the video \texttt{mu1\_1.mp4}.\\
In model $\mathscr{M}_2$, reward $r((1,6),a) = 3, r((4,3),a) = 3$ and $r((7,7),a) = 1$ for all actions $a$. In this model, we observe two characteristic behaviors. In the first, the agent reaches the goal state $a$ and remains there (see video \texttt{mu2\_1.mp4}). This behavior ensures that the specification is met, but the reward is relatively lower. In the second, the agent goes towards the location $(4,3)$ and tries to remain there to obtain a higher reward (see video \texttt{mu2\_2.mp4}). However, since the obstacle is very close and the transitions are stochastic, the agent is prone to violating the constraint. Nonetheless, this violation is rare enough that the overall satisfaction probability exceeds the desired threshold.

\begin{figure}[h]
     \centering
     \begin{subfigure}[b]{0.17\textwidth}
         \centering
         \includegraphics[width=\textwidth]{case1.1_54_00000.png}
         \caption{Model $\mathscr{M}_1$}
         \label{fig:model1}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{0.17\textwidth}
         \centering
         \includegraphics[width=\textwidth]{case1.2_18_00000.png}
         \caption{Model $\mathscr{M}_2$}
         \label{fig:model2}
     \end{subfigure}
        \caption{Reach-Avoid Tasks}
\end{figure}

\paragraph{Ordered Tasks.}

For models $\mathscr{M}_3, \mathscr{M}_4$ and $\mathscr{M}_5$, reward $r((3,3),a) = 1$ for all actions $a$. In model $\mathscr{M}_3$, the agent visits $a$ and then $b$ in that order most of the time (see video \texttt{mu3\_1.mp4}). Very rarely, the agent narrowly misses one of the goals due to the stochasticity in transitions and partial observability (see video \texttt{mu3\_2.mp4}).
In model $\mathscr{M}_4$, the agent is almost always successful in satisfying the constraint and maximizing the reward (see video \texttt{mu4\_1.mp4}). In model $\mathscr{M}_5$, we see both successes (see video \texttt{mu5\_1.mp4}) and failures (see video \texttt{mu5\_2.mp4}). However, the failure probability is within the threshold, as suggested by Table~\ref{tab:add_data}.


\begin{figure}[h]
     \centering
     \begin{subfigure}[b]{0.17\textwidth}
         \centering
         \includegraphics[width=\textwidth]{case2_18_00000.png}
         \caption{Model $\mathscr{M}_3$}
         \label{fig:model3}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{0.17\textwidth}
         \centering
         \includegraphics[width=\textwidth]{case3_5_00000.png}
         \caption{Model $\mathscr{M}_4$}
         \label{fig:model4}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{0.17\textwidth}
         \centering
         \includegraphics[width=\textwidth]{case4_36_00000.png}
         \caption{Model $\mathscr{M}_5$}
         \label{fig:model5}
     \end{subfigure}
        \caption{Ordered Tasks}
\end{figure}

\paragraph{Reactive Tasks.}

In model $\mathscr{M}_6$, reward $r((3,0),a) = 1$ and $r((3,3),a) = 2$ for all actions $a$. In this case, the agent goes to $a$ and remains there, thus satisfying the constraint (see video \texttt{mu6\_1.mp4}). Occasionally, the agent also goes to state $b$ and remains there to obtain a large reward. However, this violates the constraint since, if the agent ever visits $b$, it must eventually go to $c$ (see video \texttt{mu6\_2.mp4}). \\
In model $\mathscr{M}_7$, reward $r((3,0),a) = 5$ and $r((0,3),a) = 2$ for all actions $a$. In this model, the agent goes to $a$ and then to $b$ so that it can go to $c$. If it does not go to $b$ immediately after reaching $a$, then it is compelled to go to $d$. We observe that the agent consistently visits $b$ after $a$ (see video \texttt{mu7\_1.mp4}).

\begin{figure}[h]
     \centering
     \begin{subfigure}[b]{0.17\textwidth}
         \centering
         \includegraphics[width=\textwidth]{case5_78_00000.png}
         \caption{Model $\mathscr{M}_6$}
         \label{fig:model6}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{0.18\textwidth}
         \centering
         \includegraphics[width=\textwidth]{case6_22_00000.png}
         \caption{Model $\mathscr{M}_7$}
         \label{fig:model7}
     \end{subfigure}
        \caption{Reactive Tasks}
\end{figure}

\subsubsection{Predicate Uncertainty}

In the experiments of this section, there are two possible locations for object $b$: $(3,0)$ and $(0,3)$. In both cases, whenever the agent is `far' away (Manhattan distance greater than 1) from the object $b$, it gets an observation `F' indicating that it is \emph{far} with probability $1$. When the object is at the bottom left and the agent is adjacent to it, the agent gets an observation `C' with probability $0.9$ indicating that the object is \emph{close}. But if object $b$ is at the top right and the agent is adjacent to it, the agent gets an observation `C' only with probability $0.1$. Therefore, the detection capability of the agent is stronger when the object is in the bottom-left location as opposed to when it is in the top-right location.
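
The asymmetry in detection capability can be quantified with a single Bayesian belief update. The following is only a sketch: it assumes a uniform prior over the two object locations, written $\mathrm{BL}$ (bottom left) and $\mathrm{TR}$ (top right), and assumes that an agent adjacent to the object receives `F' with the complementary probability. These symbols and assumptions are introduced here for illustration. If the agent is adjacent to the bottom-left corner (and hence far from the top-right corner) and observes `F', then
\begin{align*}
    \Pr(\mathrm{BL} \mid \text{`F'}) = \frac{0.1 \times 0.5}{0.1 \times 0.5 + 1 \times 0.5} = \frac{1}{11} \approx 0.09,
\end{align*}
whereas observing `C' there implies $\Pr(\mathrm{BL} \mid \text{`C'}) = 1$. If instead the agent is adjacent to the top-right corner and observes `F', then
\begin{align*}
    \Pr(\mathrm{TR} \mid \text{`F'}) = \frac{0.9 \times 0.5}{0.9 \times 0.5 + 1 \times 0.5} = \frac{9}{19} \approx 0.47.
\end{align*}
Under these assumptions, one observation near the bottom-left corner moves the belief from $0.5$ to roughly $0.09$ or to $1$, while an `F' near the top-right corner leaves it almost unchanged.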

\paragraph{Reach-Avoid Tasks.}

In model $\mathscr{M}_8$, reward $r((3,0),a) = 2$ and $r((0,3),a) = 4$ for all actions $a$. In this model, generally, the agent first collects some information from the bottom-left, reaches $a$ and goes to the rewarding location that is not an obstacle (see videos \texttt{mu8\_1.mp4}, \texttt{mu8\_2.mp4}, \texttt{mu8\_3.mp4}). We see rare instances where the agent completely ignores the constraint and maximizes the reward (see video \texttt{mu8\_4.mp4}).

\begin{figure}[h]
     \centering
     \begin{subfigure}[b]{0.17\textwidth}
         \centering
         \includegraphics[width=\textwidth]{fa1_90_00000.png}
         \caption{Model $\mathscr{M}_8$ with obstacle at $(0,3)$}
         \label{fig:model8a}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{0.17\textwidth}
         \centering
         \includegraphics[width=\textwidth]{fa1_91_00000.png}
         \caption{Model $\mathscr{M}_8$ with obstacle at $(3,0)$}
         \label{fig:model8b}
     \end{subfigure}
        \caption{Reach-Avoid Tasks}
\end{figure}

\paragraph{Ordered Tasks.}

In model $\mathscr{M}_9$, reward $r((0,0),a) = 2$ for all actions $a$. In this model, we observe that the agent mostly succeeds in satisfying the constraint and maximizing the reward (see videos \texttt{mu9\_1.mp4} and \texttt{mu9\_2.mp4}).
\begin{figure}[h]
     \centering
     \begin{subfigure}[b]{0.17\textwidth}
         \centering
         \includegraphics[width=\textwidth]{seq_44_00000.png}
         \caption{Model $\mathscr{M}_9$ with obstacle at $(0,3)$}
         \label{fig:model9a}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{0.17\textwidth}
         \centering
         \includegraphics[width=\textwidth]{seq_45_00000.png}
         \caption{Model $\mathscr{M}_9$ with obstacle at $(3,0)$}
         \label{fig:model9b}
     \end{subfigure}
        \caption{Ordered Tasks}
\end{figure}
\subsection{Hyper-parameters and Runtimes}


The parameter $\delta$ in all the experiments was chosen in the following manner: (i) We first solve a POMDP problem in which we are only interested in maximizing the probability of satisfaction of the $\ltlf$ constraint. Let this probability be denoted by $p_{max}$. The SARSOP solver gives concrete approximation bounds on its solution and therefore on our estimate of $p_{max}$. (ii) Since any threshold $1-\delta$ larger than $p_{max}$ is infeasible, we choose a $\delta$ such that $1-\delta$ is around $0.9p_{max}$. The values $\eta$ and $B$ are hyperparameters in our experiments. The $\eta$ suggested by Theorem~\ref{lagrangethm} is guaranteed to result in convergence, but in practice, slightly larger values of $\eta$ can lead to faster convergence.
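
For instance, with a hypothetical value $p_{max} = 0.83$ (chosen here purely for illustration; it is not a reported result), this rule gives
\begin{align*}
    1-\delta \approx 0.9 \times 0.83 \approx 0.75, \qquad \delta \approx 0.25.
\end{align*}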


In Table~\ref{tab:add_data}, we provide additional hyper-parameters that were used in our experiments. The parameter $simu$ denotes the number of Monte-Carlo simulations that were used to estimate the constraint in each iteration. $T_{solve}$ is the total time (over $K$ iterations) spent in solving the unconstrained POMDP using the SARSOP solver \citep{kurniawati2008sarsop}. $T_{simu}$ is the total time spent in simulating policies generated by the SARSOP solver. $T_{total}$ is the overall computation time for that model.

Most of our models have a state space of size $16$ ($4\times 4$). However, the runtimes (see Table~\ref{tab:add_data}) for these models are drastically different. This is because of two factors: (i) DFA size and (ii) complexity of the POMDP problem. The size of the DFA can be large for a complex task, which naturally scales up the state space of the product POMDP. SARSOP returns an alpha-vector policy \citep{kurniawati2008sarsop}. When the POMDP is complex, the alpha-vector policy returned by SARSOP may have many alpha vectors. This implies that whenever the agent has to make a decision, it needs to solve a fairly large maximization problem, which makes the simulations time-consuming.
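
The per-decision cost mentioned above can be made explicit. As a sketch in standard alpha-vector notation (the symbols $\Gamma$ for the set of alpha vectors, $b$ for the belief over product states $x$, and $a_\alpha$ for the action attached to $\alpha$ are introduced here for illustration), the agent selects
\begin{align*}
    \alpha^\star \in \arg\max_{\alpha\in\Gamma}\sum_{x} \alpha(x)\, b(x), \qquad a = a_{\alpha^\star}.
\end{align*}
Each decision thus costs on the order of $|\Gamma|\,|S|\,|Q|$ operations, so both a larger DFA and a larger set of alpha vectors slow down the simulations.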
\begin{table*}[b]
    \centering
    \caption{Performance Value and Hyper-parameters}\label{tab:add_data}
    \begin{tabular}{lccccccccccccc}
      \toprule % from booktabs package
      \bfseries Model & \bfseries Spec & $|S|$ & $|Q|$ & $\mathcal{R}^{\mathscr{M}}(\bar{\mu})$ & $\mathcal{R}^f(\bar{\mu})$ & $1-\delta$ & $B$ & $\eta$ & $K$ & $simu$ & $T_{solve}$ & $T_{simu}$ & $T_{total}$ \\
      \midrule % from booktabs package
      $\mathscr{M}_1$ & $\varphi_1$&16 &3 & $1.72$ & $0.75$ & $0.75$ & $5$ & $2$ & $100$ & $200$ & $142$ & $3518$ & $3661$\\ %1.1
      $\mathscr{M}_2$ & $\varphi_1$&64 &3 &$0.95$ & $0.70$ & $0.70$ & $8$ & $2$ & $50$ & $100$ & $17299$ & $7825$ & $25125$\\ %1.2
      $\mathscr{M}_3$ &$\varphi_2$ &16 &3 & $0.83$ & $0.76$ & $0.75$ & $5$ & $2$ & $100$ & $200$ & $158$ & $3614$ & $3773$\\ %2
      $\mathscr{M}_4$ & $\varphi_3$&16 &4 & $0.80$ & $0.71$ & $0.70$ & $6$ & $2$ & $100$ & $200$ & $1893$ & $14534$ & $16428$\\ %3
      $\mathscr{M}_5$  & $\varphi_4$&16 &4& $0.83$ & $0.71$ & $0.70$ & $6$ & $2$ & $100$ & $200$ & $368$ & $8440$ & $8809$\\ %4
      $\mathscr{M}_6$  &$\varphi_5$ &16 &4 & $1.01$ & $0.79$ & $0.80$ & $10$ & $2$ & $100$ & $200$ & $109$ & $718$ & $828$\\ %5new
      $\mathscr{M}_7$  &$\varphi_6$ &16 &10 & $4.28$ & $0.82$ & $0.80$ & $25$ & $2$ & $50$ & $100$ & $5865$ & $57833$ & $63699$\\ %6new
      $\mathscr{M}_8$  &$\varphi_1$ &32 &3 & $2.73$ & $0.81$ & $0.85$ & $20$ & $0.02$ & $100$ & $200$ & $370$ & $21676$ & $22046$\\ %Fa1
      $\mathscr{M}_9$  &$\varphi_4$ &32 & 4& $1.68$ & $0.81$ & $0.75$ & $10$ & $0.2$ & $100$ & $200$ & $973$ & $25618$ & $26591$\\ %sequence1 (previous)
      \bottomrule % from booktabs package
    \end{tabular}
\end{table*}
\end{comment}
\end{document}
