


%=================================================================================
\subsection{Reinforcement Learning} \label{sec:rl}

Consider an RL agent interacting with an environment modeled as an episodic \emph{Markov decision process} (MDP), where each learning episode terminates within a finite horizon $H$.
Formally, an MDP is denoted as a tuple $\cM=(S, s_0, A, T, R, \gamma, L)$ where $S$ is a set of states, $s_0 \in S$ is an initial state, $A$ is a set of actions, $T: S \times A \times S \to [0,1]$ is a probabilistic transition function, $R$ is a reward function, $\gamma \in [0,1]$ is a discount factor, and $L: S \to 2^{AP}$ is a labeling function with a set of atomic propositions $AP$. 
The reward function can be Markovian, denoted by $R: S \times A \times S \to \Rset$, 
or non-Markovian (i.e., history dependent), denoted by $R: (S \times A)^* \to \Rset$.
Both the transition function $T$ and the reward function $R$ are unknown to the agent.

At each timestep $t$, the agent selects an action $a_t$ given the current state $s_t$ and reward $r_t$. 
The environment transitions to a subsequent state $s_{t+1}$, determined by the probability distribution $T(\cdot | s_t, a_t)$, and yields a reward $r_{t+1}$.
A (memoryless) policy is defined as a mapping from states to probability distributions over actions,
denoted by $\pi: S \times A \to [0,1]$.
The agent seeks to learn an optimal policy that maximizes the expected return, 
given by $\mathbb{E}[\sum_{t=0}^{H-1} \gamma^t r_{t+1}]$.
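
As a small illustration with assumed values (chosen only for this example), let $H=3$ and $\gamma=0.9$, and suppose that along one episode the agent receives the rewards $r_1=0$, $r_2=0$, and $r_3=1$. The discounted return of that single episode is
\[
\sum_{t=0}^{2} \gamma^t r_{t+1} = 0.9^0 \cdot 0 + 0.9^1 \cdot 0 + 0.9^2 \cdot 1 = 0.81 ,
\]
and the expected return is the average of this quantity over the episodes induced by the policy and the transition function $T$.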

 


%=================================================================================
\subsection{Co-Safe LTL Specifications} \label{sec:logic}

We use Linear Temporal Logic (LTL)~\cite{pnueli1981temporal}, a modal logic that augments propositional logic with temporal operators, to specify complex tasks for the robotic agent.
We focus on the syntactically co-safe LTL fragment, defined as follows. 
\[
\varphi := \alpha \ |\ \neg \alpha \ |\ \varphi_1 \land \varphi_2 \ |\ \varphi_1 \lor \varphi_2 \ |\ 
            \next \varphi \ |\ \varphi_1 \until \varphi_2 \ |\ \eventually \varphi
\]
where $\alpha \in AP$ is an atomic proposition, $\neg$ (negation), $\land$ (conjunction), and $\lor$ (disjunction) are Boolean operators, while $\next$ (next), $\until$ (until), and $\eventually$ (eventually) are temporal operators.  
Intuitively, $\next \varphi$ means that $\varphi$ has to hold in the next step; $\varphi_1 \until \varphi_2$ means that $\varphi_1$ has to hold at least until $\varphi_2$ becomes true; and $\eventually \varphi$ means that $\varphi$ becomes true at some future step. 
%
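A co-safe LTL formula $\varphi$ can be converted into a deterministic finite automaton (DFA) $\cA_\varphi$ that accepts exactly the set of good prefixes for $\varphi$~\cite{kupferman2001model}, i.e., the finite words whose every infinite extension satisfies $\varphi$. 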
Formally, a DFA is denoted as a tuple $\cA_\varphi = (Q, q_0, Q_F, 2^{AP}, \delta)$, where
$Q$ is a finite set of states, $q_0$ is the initial state, $Q_F \subseteq Q$ is a set of accepting states, 
$2^{AP}$ is the alphabet, and $\delta: Q \times 2^{AP} \to Q$ is the transition function. 


\begin{examp} \label{eg:dfa}
Consider a robot aiming to complete a task in a gridworld (\figref{fig:grid}).
The task is to collect an \emph{orange} flag and a \emph{blue} flag (in any order) while avoiding the \emph{yellow} flag. 
We describe this task using a co-safe LTL formula 
$\varphi = (\neg y) \until ((o \land ((\neg y) \until b)) \lor (b \land ((\neg y) \until o)))$, 
where $o$, $b$ and $y$ represent collecting \emph{orange}, \emph{blue} and \emph{yellow} flags, respectively. 
\figref{fig:dfa} shows the corresponding DFA $\cA_\varphi$, which has five states: the initial state $q_0$, depicted with an incoming arrow; a trap state $q_3$, from which no transitions to other states exist; and a single accepting state $q_4$ (so $Q_F =\{q_4\}$), depicted with a double circle.  
A transition is enabled when the Boolean formula labelling it holds. 
Starting from the initial state $q_0$, a path ending in the accepting state $q_4$ represents a good prefix for $\varphi$, indicating that the task has been successfully completed. 
\end{examp}
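
To make the notion of a good prefix concrete, the following sketch traces two label sequences through $\cA_\varphi$. It assumes, as the task description suggests but without relying on the exact state names in \figref{fig:dfa}, that collecting the orange flag moves the DFA to an intermediate state, written here with the placeholder name $q_{\mathrm{mid}}$, and that collecting the yellow flag before the task is complete leads to the trap state $q_3$:
\[
q_0 \xrightarrow{\{o\}} q_{\mathrm{mid}} \xrightarrow{\emptyset} q_{\mathrm{mid}} \xrightarrow{\{b\}} q_4 ,
\qquad
q_0 \xrightarrow{\{o\}} q_{\mathrm{mid}} \xrightarrow{\{y\}} q_3 .
\]
Under these assumptions, the first label sequence is a good prefix for $\varphi$, whereas the second can never be extended to one.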


% ---------------------------------
\begin{figure}[t]
     \centering
     \begin{subfigure}{0.4\columnwidth}
         \includegraphics[width=\textwidth]{figures/flag_env.png}
         \caption{Gridworld}
         \label{fig:grid}
     \end{subfigure}
     \hfill
     \begin{subfigure}{0.55\columnwidth}
         \includegraphics[width=\textwidth]{figures/dfa.png}
         \caption{DFA $\cA_\varphi$}
         \label{fig:dfa}
     \end{subfigure}
\caption{Example gridworld and a DFA $\cA_\varphi$ for a co-safe LTL formula
$\varphi = (\neg y) \until ((o \land ((\neg y) \until b)) \lor (b \land ((\neg y) \until o)))$.}
\end{figure}
% -------------------------------




%=================================================================================
\subsection{Task Progression}\label{sec:task}

We adopt the notion of ``task progression'' introduced in~\cite{lacerda2019probabilistic} to measure the degree to which a robotic task defined by a co-safe LTL formula $\varphi$ is completed. 

Given a DFA $\cA_\varphi = (Q, q_0, Q_F, 2^{AP}, \delta)$, let $\Sucq \subseteq Q$ be the set of successors of state $q$, and let $|\delta_{q,q'}| \in \{0, \dots, 2^{|AP|}\}$ denote the number of letters in $2^{AP}$ that trigger a transition from $q$ to $q'$. We write $q \to^* q'$ if there is a path from $q$ to $q'$, and $q \not \to^* q'$ if $q'$ is not reachable from $q$; by extension, $q \to^* Q_F$ means that some accepting state is reachable from $q$. 


The \emph{distance-to-acceptance function} $d_{\varphi}: Q \to \Rsetgeq$ is defined as:
\begin{equation}\label{eqn:distance}
    d_{\varphi}(q)= 
    \begin{cases}
        0 & \! \text {if } q \in Q_F \\ 
        \displaystyle \min_{q' \in \Sucq} \left\{ d_{\varphi}(q') + h(q,q') \right\} 
            & \! \text {if } q \not \in Q_F \text {, } q \! \to^* Q_F \\ 
        |AP| \cdot |Q| & \! \text {otherwise }
    \end{cases} 
\end{equation}
where $h(q,q') := \log_2 \left( \frac{2^{|AP|}}{|\delta_{q,q'}|} \right)$ represents the difficulty of moving from $q$ to $q'$ in the DFA $\cA_\varphi$.
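
To illustrate $h$ with assumed numbers, take $|AP|=3$, so that the alphabet $2^{AP}$ contains $2^3=8$ letters. Suppose, hypothetically, that one move is enabled by every letter containing a particular proposition ($|\delta_{q,q'}|=4$), while another is enabled by exactly one letter ($|\delta_{q,q'}|=1$). The two costs are then
\[
\log_2 \frac{8}{4} = 1
\qquad \text{and} \qquad
\log_2 \frac{8}{1} = 3 ,
\]
respectively. A move enabled by fewer letters is thus assigned a larger cost.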

The \emph{progression function} $\rho_\varphi: Q \times Q \to \Rsetgeq$ between two states of $\cA_\varphi$ is defined as:
\begin{equation}\label{eqn:progression}
    \rho_\varphi (q, q') =   
    \begin{cases}
        \max \{0, d_\varphi(q) - d_\varphi(q') \} \!
            & \!\! \text{if} \, q' \in \Sucq \text{, } q' \! \not \to^* \! q \\
         0 \! & \!\! \text{otherwise }
    \end{cases}
\end{equation}
The first condition mandates $q' \not \to^* q$ to ensure that there is no cycle in the DFA with a non-zero progression value, which is crucial for the convergence of infinite sums of progression~\cite{lacerda2019probabilistic}. 


% \ingy{clarify issues with cycles}
% We convert LTL formula $\varphi$ into $\cA_{\varphi}$ constructing directed acyclic graph (DAG) except for self-loop using the tool in \cite{mona}. Hence, in practical application, there is no cycle.

\begin{examp}\label{eg:distance}
In the DFA $\cA_\varphi$ (\figref{fig:dfa}), 
the distance-to-acceptance values of the trap state $q_3$ and the accepting state $q_4$ 
are $d_\varphi(q_3)= |AP| \cdot |Q| = 3 \times 5 = 15$ and $d_\varphi(q_4)=0$, respectively. 
Applying \eqnref{eqn:distance} recursively yields 
$d_\varphi(q_0)=2$, $d_\varphi(q_1)=1$, and $d_\varphi(q_2)=1$. 
The progression from the initial state $q_0$ to $q_1$ is 
$\rho_\varphi(q_0, q_1) = \max \{0, d_\varphi(q_0) - d_\varphi(q_1) \} = 1$, indicating that a positive task progression has been made.
% while $\rho_\varphi(q_0, q_3) = \max \{0, d_\varphi(q_0) - d_\varphi(q_3) \} = 0$, since moving to the trap state $q_3$ does not result in any task progression. 
\end{examp}
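
The recursion in Example~\ref{eg:distance} can be unrolled as sketched below. The letter counts are assumptions inferred from $\varphi$ with $|AP|=3$, not values read off \figref{fig:dfa}: the move from $q_1$ to $q_4$ is taken to be enabled by the $4$ letters containing the one remaining flag, the direct move from $q_0$ to $q_4$ by the $2$ letters containing both $o$ and $b$, and the move from $q_0$ to $q_1$ by a single letter. Successors that yield larger values (such as the trap state $q_3$) are omitted from the minimum.
\begin{align*}
d_\varphi(q_1) &= d_\varphi(q_4) + h(q_1,q_4) = 0 + \log_2 \tfrac{8}{4} = 1, \\
d_\varphi(q_0) &= \min \left\{ d_\varphi(q_4) + h(q_0,q_4),\ d_\varphi(q_1) + h(q_0,q_1) \right\}
               = \min \{ 0 + 2,\ 1 + 3 \} = 2 .
\end{align*}
Similarly, assuming that $q_3$ is a successor of $q_0$ (the yellow flag is collected first), moving to the trap state yields no progression:
\[
\rho_\varphi(q_0, q_3) = \max \{0, d_\varphi(q_0) - d_\varphi(q_3) \} = \max \{0, 2 - 15\} = 0 .
\]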















