
The objective of this work is to create reward functions that encourage an RL agent to achieve the best possible progression in accomplishing a task specified by a co-safe LTL formula $\varphi$.
To this end, we define a product MDP $\cM^\otimes$ that augments the environment MDP $\cM$ with information about the task specification $\varphi$. 

%=================================================================================
\startpara{Product MDP}
Given an episodic MDP $\cM=(S, s_0, A, T, R, \gamma, L)$ and a DFA $\cA_\varphi = (Q, q_0, Q_F, 2^{AP}, \delta)$,
the product MDP is defined as
$\cM^\otimes = \cM \otimes \cA_\varphi = (S^\otimes, s_0^\otimes, A, T^\otimes, R^\otimes, \gamma, AP, L^\otimes)$, where $S^\otimes = S \times Q$, $s_0^\otimes = \langle s_0, \delta(q_0, L(s_0))\rangle$,
$L^\otimes(\langle s,q \rangle) = L(s)$, and
\begin{equation*}
    T^{\otimes}\left(\langle s, q\rangle, a,\langle s', q'\rangle\right) = 
    \begin{cases}
        T(s, a, s') & \text{if } q'=\delta(q, L(s'))\\
        0 & \text{otherwise.}
    \end{cases}    
\end{equation*} 
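As a quick sanity check (a short derivation from the definition above, assuming a discrete state space $S$ and a total transition function $\delta$), $T^\otimes$ is a valid transition distribution. Because $\delta$ is deterministic, every successor $s'$ is paired with exactly one DFA state $q' = \delta(q, L(s'))$, so for any $\langle s, q\rangle \in S^\otimes$ and $a \in A$,
\begin{equation*}
    \sum_{\langle s', q'\rangle \in S^\otimes} T^{\otimes}\left(\langle s, q\rangle, a,\langle s', q'\rangle\right)
    = \sum_{s' \in S} T(s, a, s') = 1.
\end{equation*}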
This work focuses on designing Markovian reward functions $R^\otimes: S^\otimes \times A \times S^\otimes \to \Rset$ for the product MDP $\cM^\otimes$, whose projection onto $\cM$ yields non-Markovian reward functions.
% The projected reward function $R$ is Markovian only if $|Q|=1$ (i.e., the DFA has one state only). 

In practice, the product MDP is built on-the-fly during learning.
At each timestep $t$, given the current state $\langle s_t, q_t\rangle$, an RL agent selects an action $a_t$ and transitions to a successor state $\langle s_{t+1}, q_{t+1}\rangle$, 
where $s_{t+1}$ is sampled by the environment from the distribution $T(\cdot \mid s_t, a_t)$, 
and $q_{t+1} = \delta(q_t, L(s_{t+1}))$ is given by the DFA's transition function. 
The agent receives a reward $r_{t+1}$ determined by the reward function 
$R^\otimes\left(\langle s_t, q_t\rangle, a_t,\langle s_{t+1}, q_{t+1}\rangle\right)$.

An RL agent aims to learn an optimal policy that maximizes the expected return in the product MDP $\cM^\otimes$.
A memoryless policy learned for $\cM^\otimes$ corresponds to a finite-memory policy in the environment MDP $\cM$, 
denoted by $\pi: S \times Q \times A \to [0,1]$,
in which the DFA states $Q$ serve as the memory modes.
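
To make the finite-memory claim concrete, the DFA component of the product state can be unrolled along a trajectory. Writing $\hat\delta$ for the usual extension of $\delta$ to finite words (a notation assumed here for illustration only), the definition of $s_0^\otimes$ together with the update $\delta(q, L(s'))$ implies that, after visiting $s_0, s_1, \ldots, s_t$, the DFA component equals
\begin{equation*}
    \hat\delta\left(q_0,\, L(s_0)\, L(s_1) \cdots L(s_t)\right),
\end{equation*}
where $q_0$ is the initial state of $\cA_\varphi$. The DFA state therefore summarizes the label history, and the policy depends on past states of $\cM$ only through this summary.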

%=================================================================================
\startpara{Task progression for a policy}
We define a partition of the state space of the DFA $\cA_\varphi = (Q, q_0, Q_F, 2^{AP}, \delta)$ based on an ordering of distance-to-acceptance values. 
Let $B_0 = Q_F$ and
$B_i = \{q \in Q_i \mid d_{\varphi}(q) = \min_{q' \in Q_i} d_{\varphi}(q') \}$, where $Q_i = Q \setminus \bigcup_{j=0}^{i-1} B_j$,
for $i > 0$. 
The task progression of a policy $\pi$ of the product MDP, denoted by $b(\pi)$, is the lowest index $i$ such that a DFA state in $B_i$ is reachable from the initial state under $\pi$.
A value of $b(\pi)=0$ signifies the task has been successfully completed.
The best possible task progression across all feasible policies $\Pi$ in the product MDP is defined as $b^* = \min\{b(\pi) \,|\, \pi \in \Pi \}$.
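
Equivalently, the partition can be written in closed form. Since $Q$ is finite, $d_\varphi$ takes finitely many distinct values on $Q \setminus Q_F$; list them in increasing order as $d_{(1)} < d_{(2)} < \cdots < d_{(k)}$ (the symbols $d_{(i)}$ and $k$ are introduced here for illustration only). Then
\begin{equation*}
    B_i = \{ q \in Q \setminus Q_F \mid d_{\varphi}(q) = d_{(i)} \}, \qquad 1 \le i \le k.
\end{equation*}
States closer to acceptance thus receive lower indices, so a smaller value of $b(\pi)$ indicates greater progress toward completing the task.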

\begin{examp}\label{eg:policies}
The state space of the DFA $\cA_\varphi$ shown in \figref{fig:dfa} can be partitioned into four sets: $B_0 = \{q_4\}$, $B_1 = \{q_1, q_2\}$, $B_2 = \{q_0\}$, and $B_3 = \{q_3\}$. 

Let $g_{i,j}$ denote a grid cell in row $i$ and column $j$ in the gridworld (\figref{fig:grid}).
The agent's initial location is $g_{8,5}$. 
Consider the following three candidate policies:
\begin{itemize}
    \item $\pi_1$: The agent takes 10 steps to collect the blue flag in $g_{2,1}$, avoiding the yellow flag, but fails to reach the orange flag within the 25-step episode timeout.
    \item $\pi_2$: The agent moves 16 steps to collect the orange flag and then moves 4 more steps to collect the blue flag in $g_{6,5}$. The task is completed. 
    \item $\pi_3$: The agent moves directly to the yellow flag in 5 steps. The task fails and the episode ends. 
\end{itemize}
We have $b(\pi_1) = 1$ as DFA state $q_1 \in B_1$ is reached with policy $\pi_1$,
$b(\pi_2) = 0$ upon task completion, and 
$b(\pi_3) = 2$ due to a direct transition from initial state $q_0 \in B_2$ to trap state $q_3 \in B_3$. 
The best possible task progression across all policies is $b^* = b(\pi_2) = 0$. 
\end{examp}


%=================================================================================
\startpara{Problem} 
This work aims to solve the following problem:
Given an episodic MDP $\cM$ with unknown transition and reward functions, along with a DFA $\cA_\varphi$ representing a co-safe LTL task specification $\varphi$, the objective is to construct a Markovian reward function $R^\otimes$ for the product MDP $\cM^\otimes = \cM \otimes \cA_\varphi$. This reward function should be designed such that an optimal policy $\pi^*$, learned by an RL agent by maximizing the expected return, also achieves the best possible task progression, that is, $b(\pi^*) = b^*$.
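
Stated symbolically, the requirement on $R^\otimes$ can be sketched as follows. The episode length $H$ (possibly random) and the expectation notation $\mathbb{E}^{\pi}$ are assumptions introduced here for illustration, not definitions from this section:
\begin{equation*}
    \pi^* \in \arg\max_{\pi \in \Pi} \; \mathbb{E}^{\pi}\left[\sum_{t=0}^{H-1} \gamma^{t}\, r_{t+1}\right]
    \quad \Longrightarrow \quad b(\pi^*) = b^*,
\end{equation*}
where $r_{t+1} = R^\otimes\left(\langle s_t, q_t\rangle, a_t, \langle s_{t+1}, q_{t+1}\rangle\right)$.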
