
We evaluate the proposed adaptive reward shaping approach in a variety of benchmark RL domains. 
We describe the experimental setup including environments, RL algorithms, baselines, and evaluation metrics in \sectref{sec:setup}, and analyze the experimental results in \sectref{sec:results}. 

%=================================================================================
\subsection{Experimental Setup} \label{sec:setup}

%-------------------
\begin{figure*}[t]
    \centering
    \includegraphics[width=0.75\textwidth]{figures/normal.png}
    \caption{Results for deterministic environments.}
    \label{fig:normal}
\end{figure*}
%-------------------

%-------------------
\begin{figure*}[h]
    \centering
    \includegraphics[width=0.75\textwidth]{figures/noisy.png}
    \caption{Results for noisy environments.}
    \label{fig:noisy}
\end{figure*}
%-------------------

\startpara{Environments}
The following RL domains are used: the taxi domain from OpenAI Gym~\citep{brockman2016openai}, and three other domains adapted from~\cite{icarte2022reward}.
\begin{itemize}
    \item \emph{Office world}: The agent navigates a 12$\times$9 grid world to get coffee and mail (in any order) and deliver them to the office while avoiding obstacles. The test environment assigns a reward of 1 for each sub-goal: (i) get coffee, (ii) get coffee and mail, and (iii) deliver coffee and mail to the office, all while avoiding obstacles.
    \item \emph{Taxi world}: The agent drives around a 5$\times$5 grid world to pick up and drop off a passenger, starting from a random location. There are five possible pickup locations and four possible destinations. The task is completed when the passenger is dropped off at the target destination. The test environment assigns a reward of 1 for each sub-goal: (i) pick up the passenger, (ii) reach the target destination, and (iii) drop off the passenger. 
    \item \emph{Water world}: The agent moves in a continuous 2D box with six colored floating balls, changing velocity toward one of the four cardinal directions each step. The task is to touch red and green balls in strict order without touching other colors, then touch blue balls. The test environment assigns a reward of 1 for touching each target ball.
    \item \emph{HalfCheetah}: The agent is a cheetah-like robot with a continuous action space, controlling six joints to move. The task is completed by reaching the farthest location. The test environment assigns a reward of 1 for reaching each of the five locations along the way.
\end{itemize}
For each domain, we consider three types of environments: 
(1) \emph{deterministic} environments, where each state-action pair leads to a single successor state; 
(2) \emph{noisy} environments, where each action has a certain control noise; 
and (3) \emph{infeasible} environments, where some sub-goals are impossible to complete (e.g., a blocked office that the agent cannot access, or missing blue balls in the water world).


%=================================================================================
\startpara{Baselines}
We compare the proposed approach with the following methods as baselines: 
\emph{Q-learning for reward machines} (QRM) with reward shaping~\citep{camacho2019ltl}, \emph{counterfactual experiences for reward machines} (CRM) with reward shaping, and \emph{hierarchical RL for reward machines} (HRM) with reward shaping~\citep{icarte2022reward}. We also evaluate RM-based algorithms incorporating partial rewards, which are detailed in Appendix~\ref{app:pr rm}. We use the code released with these publications. 

Moreover, we consider a naive baseline that rewards transitions that decrease the distance to acceptance. For each transition $\left(\langle s, q\rangle, a, \langle s', q'\rangle\right)$ in the product MDP, it assigns a reward of 1 if $d_\varphi(q) > d_\varphi(q')$ and there is a path from $q$ to the accepting states $Q_F$, and a reward of 0 otherwise.
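
For reference, the naive baseline described above can be written compactly as a case distinction. The symbol $R_{\mathrm{naive}}$ is a label introduced here only for illustration, and $q \to^{*} Q_F$ denotes that some accepting state is reachable from $q$:
\[
    R_{\mathrm{naive}}\left(\langle s, q\rangle, a, \langle s', q'\rangle\right) =
    \begin{cases}
        1 & \text{if } d_\varphi(q) > d_\varphi(q') \text{ and } q \to^{*} Q_F, \\
        0 & \text{otherwise.}
    \end{cases}
\]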

% \begin{equation} \label{eqn:rn}
%     \Rn \left(\langle s, q\rangle, a,\langle s', q'\rangle\right) =
%     \begin{cases}
%         1 & \text{if } d_\varphi(q) > d_\varphi(q') \text {, } q \! \to^* Q_F \\
%         0 & \text{otherwise }
%     \end{cases}     
% \end{equation} 

%=================================================================================
\startpara{RL Algorithms} 
We use DQN~\citep{dqn2015} for learning in discrete domains (office world and taxi world), DDQN~\citep{ddqn2016} for water world with continuous state space, and DDPG~\citep{ddpg2016} for HalfCheetah with continuous action space. 
Note that the QRM implementation does not work with DDPG, so we only use HRM and CRM as the baselines for HalfCheetah. 
We also apply PPO~\citep{ppo2017} and A2C~\citep{a2c2016} to HalfCheetah (QRM, CRM and HRM baselines are not compatible with these RL algorithms) and report results in \apref{app:cheetah} due to the page limit. 
Our implementation is built upon Stable-Baselines3~\citep{stable-baselines3}. 


%=================================================================================
\startpara{Metrics}
We pause the learning process every 100 training steps in the office world and every 1,000 training steps in other domains, then evaluate the current policy in the test environment over 5 episodes.
We evaluate the performance using two metrics: \emph{success rate of task completion}, calculated as the fraction of evaluation episodes in which the task is completed, and \emph{normalized expected return}, which is the expected return divided by the maximum possible return for that task.
The only exception is taxi world, where the maximum return varies across initial states; there, we normalize by the maximum return averaged over all initial states.
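
As a sketch of how these two metrics can be computed at each evaluation point, let $K = 5$ be the number of evaluation episodes, $G_k$ the return collected in episode $k$, and $G_{\max}$ the maximum possible return for the task. The symbols $K$, $G_k$, $G_{\max}$, and $\mathcal{S}_0$ are introduced here for illustration only:
\[
    \text{success rate} = \frac{1}{K} \sum_{k=1}^{K} \mathbf{1}\left[\text{task completed in episode } k\right],
    \qquad
    \text{normalized return} = \frac{1}{K \, G_{\max}} \sum_{k=1}^{K} G_k .
\]
For taxi world, under the same assumptions, $G_{\max}$ is replaced by the average $\frac{1}{|\mathcal{S}_0|} \sum_{s_0 \in \mathcal{S}_0} G_{\max}(s_0)$, where $\mathcal{S}_0$ denotes the set of initial states.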



%=================================================================================
\subsection{Results Analysis} \label{sec:results}

%-------------------
\begin{figure*}[t]
    \centering
    \includegraphics[width=0.75\textwidth]{figures/missing.png}
    \caption{Results for infeasible environments.}
    \label{fig:missing}
\end{figure*}
%-------------------

We ran 10 independent trials for each method. 
\figfigfigref{fig:normal}{fig:noisy}{fig:missing} plot the mean performance with a $95\%$ confidence interval (the shaded area) in deterministic, noisy, and infeasible environments, respectively. 
The success rate of task completion is omitted in \figref{fig:missing} because it is zero for all trials (i.e., the task is infeasible to complete).
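One standard way to obtain such an interval, given here as an illustrative assumption rather than a description of the exact estimator used for the plots, is the Student-$t$ interval over the $n = 10$ trials at each evaluation point:
\[
    \bar{x} \pm t_{0.975,\, n-1} \, \frac{s}{\sqrt{n}},
\]
where $\bar{x}$ and $s$ are the sample mean and sample standard deviation of the metric across trials, and $t_{0.975,\, 9} \approx 2.262$.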

\startpara{Performance comparison}
These results show that the proposed approach using adaptive progression or adaptive hybrid reward functions generally outperforms baselines, achieving earlier convergence to policies with a higher success rate of task completion and a higher normalized expected return. 

The advantage of our approach is most evident in \figref{fig:missing}, where baselines fail to learn effectively in environments with infeasible tasks.
Although baselines apply potential-based reward shaping~\citep{ng1999policy} to assign intermediate rewards, they cannot distinguish between good and bad terminal states (e.g., completing a sub-goal and colliding with an obstacle have the same potential value).
In contrast, our approach provides more effective intermediate rewards, encouraging the agent to learn and maximize task progression.
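To make the limitation of the baselines concrete, recall the standard form of potential-based reward shaping~\citep{ng1999policy}, which adds the term
\[
    F(s, a, s') = \gamma \, \Phi(s') - \Phi(s)
\]
to the environment reward, where $\Phi$ is a potential function over states and $\gamma$ is the discount factor (both symbols follow the standard formulation and are not otherwise used in this section). Since $F$ depends on the successor state only through $\Phi(s')$, two terminal outcomes with equal potential receive the same shaping reward from a given state, regardless of whether they correspond to progress or failure.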

The only outlier is the noisy office world, where QRM and CRM outperform the proposed approach. 
One possible reason is that our approach gets stuck with a sub-optimal policy in this environment, one that fetches coffee at a closer location but incurs a longer route to complete the remaining sub-goals (i.e., getting mail and delivering both items to the office).

Comparing the proposed reward functions, we observe that the adaptive hybrid reward function has the best overall performance. 
Comparing different RL environments, the proposed approach can achieve a success rate of 1 and the maximum possible return in most deterministic environments, but its performance is degraded in noisy environments due to control noise and in infeasible environments due to environmental constraints.

% ablation
\startpara{Ablation study}
Additionally, we conduct an ablation study to investigate the sensitivity to the hyperparameters $\theta$ and $N$ used for updating distance-to-acceptance values (cf.\ \sectref{sec:adapt}). 
\figref{fig:ablation1} shows the normalized expected return for two infeasible environments: taxi world and water world. 
The results demonstrate that the proposed approach converges with a sufficiently large value of $\theta \in \{2{,}000, 5{,}000, 10{,}000\}$. Moreover, policy convergence takes more training steps with larger values of $N$, since a larger $N$ means longer intervals between consecutive updates of the reward function.
\figref{fig:ablation2} shows the success rates for the feasible versions of taxi world and water world. These results suggest that feasible environments benefit from longer update intervals $N$, as the agent has more time to explore and gather experience before the reward function is modified.

\startpara{Hyperparameter Selection: Practical Guidance}
We offer the following heuristics for selecting key hyperparameters in our framework. For the reward update interval $N$, a useful starting point is the total training budget divided by the number of distinct task stages (e.g., states in a task-governing DFA), as this aims to provide the agent with sufficient interaction episodes within each task stage before potential reward adjustments. The reward scaling factor $\theta$ can be initially set to the sum of progression rewards, $\sum_{q, q^{\prime}} p_\phi\left(q, q^{\prime}\right)$, which approximates the cumulative effort. Insights from the ablation study further suggest that task feasibility can guide these choices: potentially infeasible tasks may benefit from smaller $\theta$ values and more frequent updates (smaller $N$) to enable regular progress assessment and dynamic reward adjustment in challenging settings. In contrast, feasible tasks often accommodate larger $\theta$ and $N$ to allow for the collection of more meaningful data before reward function modification. When employing hybrid reward functions, we recommend small magnitudes for step-wise penalties (e.g., $10^{-3}$ or $10^{-4}$) to avoid overwhelming the positive shaping signals. These guidelines serve as practical starting points, though optimal settings can be task-dependent. 
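
The two starting points above can be summarized as follows, where $T$ (total training budget in steps), $Q$ (the set of DFA states), and the subscript $0$ (initial choice) are notation introduced here for illustration only:
\[
    N_0 = \frac{T}{|Q|},
    \qquad
    \theta_0 = \sum_{q, q'} p_\phi\left(q, q'\right).
\]
For instance, with a hypothetical budget of $T = 100{,}000$ training steps and a DFA with $|Q| = 5$ states, this heuristic gives $N_0 = 20{,}000$ steps between reward updates.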
%-------------------------
\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{figures/ablation_infeasible.png}
    \caption{Results of the ablation study on the sensitivity of hyperparameters $\theta$ and $N$ for updating distance-to-acceptance values in infeasible environments.} 
    \label{fig:ablation1}
\end{figure}
%-------------------------
%-------------------------
\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{figures/ablation_feasible.png}
    \caption{Results of the ablation study on the sensitivity of hyperparameters $\theta$ and $N$ for updating distance-to-acceptance values in feasible environments.} 
    \label{fig:ablation2}
\end{figure}
%-------------------------
