\newpage
\section{Compatibility with On-Policy Learning}\label{app:cheetah}
%==========================================================================
\startpara{Results for HalfCheetah} 
\figref{fig:cheetah} shows the results of applying three different RL algorithms, DDPG~\citep{ddpg2016}, PPO~\citep{ppo2017}, and A2C~\citep{a2c2016}, to HalfCheetah environments.
The comparison between the proposed approach and all baselines using DDPG has already been discussed in \sectref{sec:exp}.
Since the QRM, CRM, and HRM baselines are not compatible with PPO and A2C, we compare only against the Naive baseline here.

Comparing the three RL algorithms, we observe that DDPG exhibits higher variance than PPO and A2C. This is likely due to its off-policy nature: it relies heavily on a replay buffer and on exploration driven by control noise. In our experiments, we used a replay buffer with a capacity of $10^6$ while sampling only $100$ experiences for each update. Because most samples in such a large buffer do not yield positive rewards, the composition of each minibatch varies considerably, which introduces significant randomness. Exploration noise adds further randomness.
In contrast, PPO and A2C are on-policy algorithms, whose updates depend solely on data collected by the current policy. These algorithms tend to maintain their behavior once the current policy achieves partial task completion. Additionally, PPO's clipped surrogate objective limits the size of each policy update, which helps reduce variance.
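
As a rough illustration of this minibatch effect, suppose a fraction $p$ of the transitions in the replay buffer carry a positive reward; $p$ is a hypothetical quantity that we did not measure. Treating the $100$ sampled transitions as approximately independent draws from the large buffer, and writing $N_{+}$ for the number of positive-reward transitions in one minibatch, we have
\[
  \mathrm{E}[N_{+}] = 100\,p, \qquad \Pr[N_{+} = 0] = (1-p)^{100}.
\]
For example, an assumed value of $p = 0.01$ gives one positive-reward transition per update on average, while roughly $37\%$ of updates contain none, since $0.99^{100} \approx 0.37$.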

Comparing different reward functions, we find that the Naive baseline achieves performance comparable to the proposed reward functions in all HalfCheetah environments. However, as noted in \sectref{sec:exp}, it usually performs the worst in other domains. One possible explanation is that the HalfCheetah task has a unique structure, where each sub-goal requires moving forward by the same distance. The Naive reward function assigns a reward of 1 for each sub-goal, so equal rewards correspond to equally sized sub-goals, which keeps the learning signal consistent.

%-------------------------
\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{figures/cheetah.png}
    \caption{Results of applying various RL algorithms to HalfCheetah environments.}   
    \label{fig:cheetah}
\end{figure}
%-------------------------

\newpage
\section{Partial Rewards in Reward Machines}\label{app:pr rm}
This ablation study investigates the effect of incorporating partial rewards into Reward Machine (RM) structures in combination with potential-based reward shaping. RMs are theoretically capable of representing and utilizing partial rewards; for example, in the Office World environment, the RM transitions through states $u_0$ (initial), $u_1, u_2$ (intermediate), and $u_3$ (goal), as depicted in Figure~2(b) of~\cite{icarte2022reward}. However, our empirical evaluation reveals that including partial rewards does not consistently enhance performance and can even degrade it.

To evaluate the impact of partial rewards on RM-based algorithms, we conducted experiments in two deterministic environments: Office World and Taxi World. We define ``Partial Reward Q-learning Reward Machine'' (PR QRM) as the variant of the QRM algorithm that incorporates partial rewards. For consistency, we similarly introduce PR CRM and PR HRM, denoting the CRM and HRM variants that use partial rewards. All baselines in this study are equipped with potential-based reward shaping. As shown in \figref{fig:PR RM}, all RM-based algorithms suffered a performance degradation in both Office World and Taxi World when supplemented with a partial reward of $1$ for each intermediate step.

These findings suggest that the RM-based algorithms, at least in their current configurations, may not be designed to leverage partial rewards effectively in conjunction with potential-based reward shaping. One plausible explanation for the observed performance degradation is that, as discussed in~\cite{icarte2022reward}, potential-based reward shaping can assign positive rewards to actions that lead to undesirable ``violation'' states within the RM, potentially exacerbating the negative effects of partial rewards. Therefore, careful consideration, and possibly algorithmic modifications, are necessary to harness the benefits of partial rewards within RM frameworks, especially when they are combined with reward shaping techniques. In contrast, our proposed algorithms are designed to incorporate partial rewards across diverse environments without performance degradation, while ensuring both task completion and reward maximization.
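
For reference, the standard form of potential-based reward shaping over RM states can be written as follows. The notation is assumed here for illustration and is not taken from the main text: $\Phi$ denotes a potential function over RM states, $\gamma$ the discount factor, and $r$ the reward emitted by the RM on the transition from $u$ to $u'$. The shaped reward $r'$ is then
\[
  r' = r + \gamma\,\Phi(u') - \Phi(u).
\]
In the PR variants, $r$ additionally includes the partial reward of $1$ on each intermediate transition, so on those transitions the agent receives the sum of the partial reward and the shaping term $\gamma\,\Phi(u') - \Phi(u)$.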

%-------------------------
\begin{figure}[h]
    \centering
    \includegraphics[width=\columnwidth]{figures/ps_reward_ablation.png}
    \caption{Results in Office World and Taxi World for RM algorithms using partial rewards.}   
    \label{fig:PR RM}
\end{figure}
%-------------------------
