

To solve this problem, we design two reward functions that incentivize an RL agent to improve the task progression (cf. \sectref{sec:reward}), and develop an adaptive reward shaping approach that dynamically updates these reward functions during the learning process (cf. \sectref{sec:adapt}).

%=================================================================================
\subsection{Basic Reward Functions} \label{sec:reward}


\startpara{Progression reward function}
First, we propose a \emph{progression reward function} based on the task progression function defined in \eqnref{eqn:progression}, which measures the reduction in distance-to-acceptance values achieved by a transition:

\begin{align} \label{eqn:rp}
    & \Rp \left(\langle s, q\rangle, a,\langle s', q'\rangle\right) 
     = \rho_\varphi (q, q') \nonumber \\
    & =   
    \begin{cases}
        \max \{0, d_\varphi(q) - d_\varphi(q') \} 
            & \text{if} \, q' \in \Sucq \text{, } q' \! \not \to^* \! q \\
         0 & \text{otherwise }
    \end{cases}
\end{align}

\begin{examp} \label{eg:rp}
Consider the gridworld shown in \figref{fig:grid} with a deterministic environment and a discount factor of $\gamma = 0.9$.
Using the progression reward function, the expected returns of the policies from \egref{eg:policies} are
$\Vp{\pi_1}(s_0^\otimes) = 0.9^9 \approx 0.39$,
$\Vp{\pi_2}(s_0^\otimes) = 0.9^{15} + 0.9^{19} \approx 0.34$, and 
$\Vp{\pi_3}(s_0^\otimes) = 0$. 
Among these policies, $\pi_1$ yields the highest expected return, yet it fails to achieve the best possible task progression, as $b(\pi_1) = 1 > b^*=0$.  
\end{examp}
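
For reference, the returns in \egref{eg:rp} can be expanded term by term. The sketch below assumes that each nonzero term is a unit progression reward, that is, a transition lowering the distance-to-acceptance value by one, received at the time step given by the exponent of $\gamma$. This reading is inferred from the stated values and is not spelled out in the example.
\begin{align*}
\Vp{\pi_1}(s_0^\otimes) &= \gamma^{9} \cdot 1 = 0.9^{9} \approx 0.387, \\
\Vp{\pi_2}(s_0^\otimes) &= \gamma^{15} \cdot 1 + \gamma^{19} \cdot 1 \approx 0.206 + 0.135 = 0.341.
\end{align*}
Policy $\pi_2$ collects two rewards but receives them later, so discounting leaves it with a lower return than $\pi_1$.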


%=================================================================================
\startpara{Hybrid reward function}
The progression reward function rewards only transitions that progress toward acceptance, without penalizing those that stay in the same DFA state. 
To address this issue, we define a \emph{hybrid reward function}: 
\begin{equation} \label{eqn:rh}
    \Rh \left(\langle s, q\rangle, a,\langle s', q'\rangle\right) =   
    \begin{cases}
         -\eta \cdot d_\varphi(q) & \text{if } q = q' \\
         (1-\eta) \cdot \rho_\varphi (q, q') & \text {otherwise }
    \end{cases}
\end{equation}
where $\eta \in [0,1]$ balances the trade-offs between penalties and progression rewards. 
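
The two extreme values of $\eta$ illustrate this trade-off. The following is a direct substitution into \eqnref{eqn:rh}, using only the fact that $\rho_\varphi(q, q) = 0$ by \eqnref{eqn:rp}:
\begin{align*}
\eta = 0: \quad & \Rh \left(\langle s, q\rangle, a,\langle s', q'\rangle\right) = \rho_\varphi(q, q'), \\
\eta = 1: \quad & \Rh \left(\langle s, q\rangle, a,\langle s', q'\rangle\right) =
    \begin{cases}
        - d_\varphi(q) & \text{if } q = q' \\
        0 & \text{otherwise}
    \end{cases}
\end{align*}
Hence $\eta = 0$ recovers the progression reward function $\Rp$, whereas $\eta = 1$ keeps only the self-loop penalties.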

\begin{examp} \label{eg:rh}
Using the hybrid reward function with $\eta = 0.1$, the expected returns of the policies in \egref{eg:policies} are
$\Vh{\pi_1}(s_0^\otimes) \approx -1.15$,
$\Vh{\pi_2}(s_0^\otimes) \approx -1.33$, and 
$\Vh{\pi_3}(s_0^\otimes) \approx -0.69$.
Although $\pi_3$ yields the highest expected return, 
it falls short in the task progression with $b(\pi_3) = 2$.
Increasing $\eta$ emphasizes penalties without altering the optimal policy in this example. Conversely, reducing $\eta$ moves the hybrid reward function closer to the progression reward function, and the two coincide when $\eta=0$.
\end{examp}
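
The value reported for $\pi_2$ in \egref{eg:rh} can be reproduced by splitting the return into self-loop penalties and progression rewards. This is a sketch under assumptions inferred from the discount exponents in \egref{eg:rp} rather than stated in the text: $\pi_2$ stays in $q_0$ (distance 2) for steps $t = 0, \dots, 14$, moves to a DFA state of distance 1 at $t = 15$, stays there for $t = 16, 17, 18$, reaches the accepting state at $t = 19$, and collects no reward afterwards. With $\eta = 0.1$ and $\gamma = 0.9$,
\begin{align*}
\Vh{\pi_2}(s_0^\otimes)
    &\approx \underbrace{- 0.1 \cdot 2 \sum_{t=0}^{14} 0.9^t}_{\approx -1.588}
      + \underbrace{0.9 \cdot 1 \cdot 0.9^{15}}_{\approx 0.185} \\
    &\quad \underbrace{- 0.1 \cdot 1 \sum_{t=16}^{18} 0.9^t}_{\approx -0.050}
      + \underbrace{0.9 \cdot 1 \cdot 0.9^{19}}_{\approx 0.122}
      \approx -1.33.
\end{align*}
Under these assumptions, the penalties accumulated while waiting in $q_0$ dominate the return, which is why the policy that completes the task ranks lowest in this example.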


%=================================================================================
\subsection{Adaptive Reward Shaping} \label{sec:adapt}

While the reward functions defined in \sectref{sec:reward} motivate an RL agent to complete a task specified by a co-safe LTL formula, \egegref{eg:rp}{eg:rh} show that policies maximizing the expected return may not achieve the best possible task progression.
A potential reason is that the distance-to-acceptance function $d_\varphi$, as defined in \eqnref{eqn:distance}, may not precisely reflect the difficulty of activating desired DFA transitions within a specific environment.
To tackle this limitation, we develop an adaptive reward shaping approach that dynamically updates distance-to-acceptance values and reward functions during the learning process.

%=================================================================================
\startpara{Updating distance-to-acceptance values}
After every $N$ learning episodes, with $N$ being a hyperparameter, we evaluate the average success rate of task completion. An episode is deemed successful if it concludes in an accepting state of the DFA $\cA_\varphi$.
If the average success rate falls below a predefined threshold $\lambda$, we proceed to update the distance-to-acceptance values accordingly.

We derive initial values $d_\varphi^0(q)$ for each DFA state $q \in Q$ from \eqnref{eqn:distance}.
The distance-to-acceptance values for the $k$-th update round are calculated recursively as follows:
\begin{equation} \label{eqn:updateD}
        d_\varphi^{k}(q) =   
    \begin{cases}
         d_\varphi^{k-1}(q) + \theta & \text{if } q \in B_i \text{ for some } i \ge b_k \\
         d_\varphi^{k-1}(q) & \text {otherwise }
    \end{cases}
\end{equation}
where $b_k$ is the task progression of the optimal policy learned after $k \cdot N$ episodes, and $\theta > 1$ is a hyperparameter that is also used in \eqnref{eqn:rah}.



\begin{examp} \label{eg:updateD}
We have $d^0_{\varphi}(q_0) = 2$, $d^0_{\varphi}(q_1) = d^0_{\varphi}(q_2) = 1$, 
$d^0_{\varphi}(q_3) = 15$, and $d^0_{\varphi}(q_4) = 0$ following \egref{eg:distance}. 
Suppose $\pi_1$ is the optimal policy learned after the first $N$ episodes and thus $b_1 = 1$. 
Let $\theta = 100$. For states in $B_1 \cup B_2 \cup B_3 = \{q_0, q_1, q_2, q_3\}$, we update their distance-to-acceptance values as follows: $d^1_{\varphi}(q_1) = d^1_{\varphi}(q_2) = 101$,
$d^1_{\varphi}(q_0) = 102$, and $d^1_{\varphi}(q_3) = 115$.
For state $q_4 \in B_0$, we retain its distance-to-acceptance value as $d^1_{\varphi}(q_4) = 0$.  
\end{examp} 

%=================================================================================

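Note that \eqnref{eqn:updateD} adds the same constant $\theta$ to every state in the partitions $B_i$ with $i \ge b_k$ and leaves all other states unchanged. It therefore does not alter the order of distance-to-acceptance values, and the DFA state partitions $\{B_i\}$ remain unchanged.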
We present two new reward functions that leverage the updated distance-to-acceptance values as follows.

%=================================================================================
\startpara{Adaptive progression reward function}
Given the updated distance-to-acceptance values $d^k_{\varphi}(q)$, we apply the progression function defined in \eqnref{eqn:progression} and obtain
\begin{equation}\label{eqn:updateP}
    \rho^k_\varphi (q, q') =   
    \begin{cases}
        \max \{0, d^k_\varphi(q) - d^k_\varphi(q') \} \!
            & \!\! \text{if} \, q' \in \Sucq \text{, } q' \! \not \to^* \! q \\
         0 \! & \!\! \text{otherwise }
    \end{cases}
\end{equation}
Then, we define an \emph{adaptive progression reward function} for the $k$-th round of updates as:
\begin{equation} \label{eqn:rap}
    \Rap{k} \left(\langle s, q\rangle, a,\langle s', q'\rangle\right) = 
        \max \{\rho^0_\varphi (q, q'), \rho^k_\varphi (q, q')\} 
\end{equation}
When $k=0$, the adaptive progression reward function $\Rap{0}$ coincides with the progression reward function $\Rp$ defined in \eqnref{eqn:rp}.


\begin{examp} \label{eg:rap}
Using the updated distance-to-acceptance values from \egref{eg:updateD}, we calculate the adaptive progression rewards $\Rap{1}$ for the first round of updates.
For instance, we have 
$\rho^1_\varphi (q_1, q_4) = \max \{0, d^1_\varphi(q_1) - d^1_\varphi(q_4) \} = 101$. 
Recall $\rho^0_\varphi(q_1, q_4)= 1$ from \egref{eg:distance}. 
Thus, 
\[
\Rap{1} \left(\langle g_{6,4}, q_1\rangle, \text{right},\langle g_{6,5}, q_4\rangle\right)
= \max\{1,101\}=101.
\]
The expected returns of policies in \egref{eg:policies} with $\Rap{1}$ are
$\Vap{1}{\pi_1}(s_0^\otimes) \approx 0.39$,
$\Vap{1}{\pi_2}(s_0^\otimes) \approx 13.85$, and 
$\Vap{1}{\pi_3}(s_0^\otimes) = 0$. 
Policy $\pi_2$ yields the highest expected return while completing the task (i.e., $b(\pi_2) = 0$).
\end{examp} 
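
The jump in the return of $\pi_2$ in \egref{eg:rap} comes almost entirely from its final transition. As a sketch, assume (as suggested by the exponents in \egref{eg:rp}, though not stated explicitly) that $\pi_2$ moves from $q_0$ to a state with distance 1 at the step discounted by $\gamma^{15}$ and reaches $q_4$ at the step discounted by $\gamma^{19}$. With the values from \egref{eg:updateD}, the first transition gives $\rho^1_\varphi = 102 - 101 = 1$ and the second gives $\rho^1_\varphi = 101 - 0 = 101$, so
\begin{align*}
\Vap{1}{\pi_2}(s_0^\otimes)
    &= 0.9^{15} \cdot \max\{1, 1\} + 0.9^{19} \cdot \max\{1, 101\} \\
    &\approx 0.206 + 13.644 \approx 13.85.
\end{align*}
The return of $\pi_1$ is unchanged because its only rewarded transition connects two shifted states, whose difference in distance is still 1.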


%=================================================================================
\startpara{Adaptive hybrid reward function}
We define an \emph{adaptive hybrid reward function} for the $k$-th round of updates as:
\begin{align} \label{eqn:rah}
    & \Rah{k} \left(\langle s, q\rangle, a,\langle s', q'\rangle\right) = \nonumber \\
    &\begin{cases}
         -\eta_k \cdot d^k_\varphi(q) & \text { if } q = q' \\
         (1-\eta_k) \cdot \max \{\rho^0_\varphi (q, q'), \rho^k_\varphi (q, q')\} & \text { otherwise }
    \end{cases}
\end{align}
with $\eta_0 \in [0,1]$, and $\eta_k = \frac{\eta_{k-1}}{\theta}$ where $\theta$ is the same hyperparameter used in \eqnref{eqn:updateD}. 
We require $\theta>1$ to ensure that the weight value $\eta_k$ is reduced in each update round, avoiding undesired behavior from increased self-loop penalties.
At $k=0$, the adaptive hybrid reward function $\Rah{0}$ aligns with the hybrid reward function $\Rh$ as defined in \eqnref{eqn:rh}.
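
Unrolling the recursion gives a closed form for the weight. The numbers below use $\eta_0 = 0.1$ and $\theta = 100$ purely as illustrative values:
\[
\eta_k = \frac{\eta_{k-1}}{\theta} = \frac{\eta_0}{\theta^{k}},
\qquad \text{e.g., } \eta_1 = 10^{-3}, \quad \eta_2 = 10^{-5}.
\]
Since each round of \eqnref{eqn:updateD} raises a distance-to-acceptance value by at most $\theta$, we have $d^k_\varphi(q) \le d^0_\varphi(q) + k \theta$, and the magnitude of the self-loop penalty is bounded as
\[
\eta_k \cdot d^k_\varphi(q) \le \frac{\eta_0 \left( d^0_\varphi(q) + k \theta \right)}{\theta^{k}},
\]
which tends to $0$ as $k$ grows because $\theta > 1$.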




\begin{examp} \label{eg:rah}
Let $\eta_0 = 0.1$ and $\theta = 100$.
The initial distance-to-acceptance values $d^0_{\varphi}$ are the same as in \egref{eg:updateD}.
Suppose that the policy $\pi$ learned after the first $N$ episodes satisfies $b(\pi) = 1$, so that $b_1 = 1$.
Following \eqnref{eqn:updateD}, we update the distance-to-acceptance values of states in 
$B_1 \cup B_2 \cup B_3 = \{q_0, q_1, q_2, q_3\}$ to 
$d^1_{\varphi}(q_1) = d^1_{\varphi}(q_2) = 101$,
$d^1_{\varphi}(q_0) = 102$, and $d^1_{\varphi}(q_3) = 115$.
We compute $\Rah{1}$ with $\eta_1 = 0.001$, which yields
$\Vah{1}{\pi_1}(s_0^\otimes) \approx -0.52$,
$\Vah{1}{\pi_2}(s_0^\otimes) \approx 12.97$,
and $\Vah{1}{\pi_3}(s_0^\otimes) \approx -0.35$.
The optimal policy $\pi_2$ not only yields the highest expected return but also completes the task with $b(\pi_2) = 0$.
\end{examp} 
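
The value of $\pi_2$ in \egref{eg:rah} can be traced in the same way as a sum of self-loop penalties and progression rewards. This is a sketch under assumptions inferred from the discount exponents in \egref{eg:rp}, not stated in the text: $\pi_2$ stays in $q_0$ for steps $t = 0, \dots, 14$, moves to a DFA state of initial distance 1 at $t = 15$, stays there for $t = 16, 17, 18$, reaches $q_4$ at $t = 19$, and collects no reward afterwards. With $\eta_1 = 0.001$, $\gamma = 0.9$, and the updated distances from \egref{eg:updateD},
\begin{align*}
\Vah{1}{\pi_2}(s_0^\otimes)
    &\approx \underbrace{- 0.001 \cdot 102 \sum_{t=0}^{14} 0.9^t}_{\approx -0.810}
      + \underbrace{0.999 \cdot \max\{1, 1\} \cdot 0.9^{15}}_{\approx 0.206} \\
    &\quad \underbrace{- 0.001 \cdot 101 \sum_{t=16}^{18} 0.9^t}_{\approx -0.051}
      + \underbrace{0.999 \cdot \max\{1, 101\} \cdot 0.9^{19}}_{\approx 13.630}
      \approx 12.97.
\end{align*}
The enlarged reward for the final transition into $q_4$ outweighs the self-loop penalties, whose weight has shrunk from $\eta_0 = 0.1$ to $\eta_1 = 0.001$.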


% In summary, \egegref{eg:rap}{eg:rah} show that an optimal policy under which the task is completed could be learned within several updates of adaptive progression and adaptive hybrid reward functions. 

%=================================================================================
\startpara{Correctness}
The following theorem states the correctness of the proposed adaptive reward shaping approach with respect to the problem formulated in \sectref{sec:problem}; the proof is provided in \apref{app:proof}.

\begin{theorem}\label{thm:main}
Given an episodic MDP $\cM$ and a DFA $\cA_\varphi$ corresponding to a co-safe LTL formula $\varphi$, there exists an optimal policy $\pi^*$ of the product MDP $\cM^\otimes = \cM \otimes \cA_\varphi$ that maximizes the expected return based on a reward function $R^{\otimes} \in \{\Rap{k}, \Rah{k}\}$ for some $k \in \Nset$, where the task progression for policy $\pi^*$ matches the best possible task progression $b^*$ across all feasible policies in the product MDP $\cM^\otimes$, that is, $b^* = b(\pi^*)$. 
\end{theorem}



