
\cite{li2017reinforcement} is among the first works to apply temporal logic to reward function design, assigning rewards based on the robustness degree of satisfying truncated LTL formulas. \cite{de2019foundations} uses a fragment of LTL for finite traces (called LTL$_f$) to encode RL rewards. Several methods seek to learn optimal policies that maximize the probability of satisfying an LTL formula~\cite{hasanbeig2019reinforcement, bozkurt2020control, hasanbeig2020deep}. However, these methods assign sparse rewards only upon task completion and do not provide intermediate rewards for task progression.


There is a line of work on \emph{reward machines} (RMs), a type of finite state machine that takes labels representing environment abstractions as input and outputs reward functions. \cite{camacho2019ltl} shows that LTL and other regular languages can be automatically translated into RMs via the construction of DFAs. \cite{icarte2022reward} describes a collection of RL methods that exploit the RM structure, including \emph{Q-learning for reward machines} (QRM), \emph{counterfactual experiences for reward machines} (CRM), and \emph{hierarchical RL for reward machines} (HRM). These methods are augmented with potential-based reward shaping~\cite{ng1999policy}, where a potential function over RM states is computed to assign intermediate rewards. We adopt these methods (with reward shaping) as baselines for comparison in our experiments. As we will show in \sectref{sec:exp}, our approach generally outperforms these baselines, providing a more effective design of intermediate rewards for task progression.
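
For reference, potential-based reward shaping in the sense of \cite{ng1999policy} adds a term derived from a potential function to the environment reward. As an illustrative sketch (the symbols $u$, $u'$, $\Phi$, $r$, $r'$, and $\gamma$ are introduced here only for exposition and are not the notation of the cited works), a transition from RM state $u$ to RM state $u'$ with reward $r$ would receive the shaped reward
\[
r' = r + \gamma\,\Phi(u') - \Phi(u),
\]
where $\gamma$ is the discount factor and $\Phi$ is the potential function over RM states. In the standard MDP setting, shaping terms of this form are known to leave the set of optimal policies unchanged~\cite{ng1999policy}.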


\cite{jothimurugan2019composable} proposes a new specification language that can be translated into reward functions; the same authors later apply it to compositional RL in \cite{jothimurugan2021compositional}. These methods use a task monitor to track the degree of specification satisfaction and assign intermediate rewards. However, they require users to encode atomic predicates into quantitative values for reward assignment. In contrast, our approach automatically assigns intermediate rewards using the distance-to-acceptance values of DFA states, eliminating the need for user-provided functions.


\cite{jiang2021temporal} presents a reward shaping framework for average-reward learning in continuing tasks. Their method automatically translates an LTL formula encoding domain knowledge into a function that provides additional reward throughout the learning process.
This work has a different problem setup and is thus not directly comparable with our approach.


\cite{cai2023overcoming} proposes an approach that decomposes an LTL mission into sub-goal-reaching tasks solved in a distributed manner. The same authors also present a model-free RL method for minimally violating an infeasible LTL specification in \cite{cai2023learning}. Both works consider the assignment of intermediate rewards, but their definition of task progression requires additional information about the environment (e.g., geometric distance from each waypoint to the destination). In contrast, we define task progression based solely on the task specification, following~\cite{lacerda2019probabilistic}, which addresses robotic planning with MDPs rather than RL.
