
In reinforcement learning (RL), an agent's behavior is guided by reward functions, which are often difficult to specify manually when representing complex tasks. 
Alternatively, an RL agent can infer the intended reward from demonstrations~\cite{ng2000algorithms}, trajectory comparisons~\cite{wirth2017survey}, or human instructions~\cite{fu2018language}.
Recent years have seen a surge of interest in using formal languages such as Linear Temporal Logic (LTL) and finite automata to specify complex tasks and derive reward functions for RL (see the extensive list of related work in \sectref{sec:related}).
Nevertheless, existing methods often assign sparse rewards (e.g., giving a reward of 1 only if a task is completed and 0 otherwise).
Sparse rewards may necessitate hundreds of thousands of exploratory episodes for convergence to a quality policy.
Furthermore, many prior works are only compatible with specific RL algorithms tailored to their proposed reward structures, such as Q-learning for reward machines~\cite{camacho2019ltl}, modular DDPG~\cite{hasanbeig2020deep}, and 
hierarchical RL for reward machines~\cite{icarte2022reward}.

\emph{Reward shaping}~\cite{ng1999policy} is a paradigm in which an agent receives intermediate rewards as it gets closer to the goal; it has been shown to help RL algorithms converge more quickly.
Inspired by this idea, we develop a logic-based adaptive reward shaping approach in this work. 
We use the syntactically co-safe fragment of LTL to specify complex RL tasks, such as 
``the task is to touch red and green balls in strict order without touching other colors, then touch blue balls''. 
We then translate a co-safe LTL task specification into a deterministic finite automaton (DFA) and design reward functions that keep track of the task completion status (i.e., a task is completed once an accepting state of the DFA has been reached).

The principle underlying our approach is to assign intermediate rewards to an agent as it makes progress toward completing a task. A key challenge is how to measure the closeness to task completion. 
We adopt the notion of \emph{task progression} defined by~\cite{lacerda2019probabilistic}, which measures each DFA state's distance to accepting states. 
The smaller the distance, the higher the degree of task progression.
The distance is zero when the task is fully completed. 
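
As an illustrative sketch only (the notation below is assumed here for exposition and is not necessarily the formal definition adopted later in the paper), consider a DFA with state set $Q$, alphabet $\Sigma$, accepting states $F \subseteq Q$, and extended transition function $\hat{\delta}$ over finite words. A distance of this kind can be written as the length of a shortest word leading from a state $q \in Q$ to an accepting state:
\[
d(q) \;=\; \min \bigl\{\, n \ge 0 \;:\; \exists\, w \in \Sigma^{n} \mbox{ such that } \hat{\delta}(q, w) \in F \,\bigr\},
\]
with $d(q) = \infty$ if no accepting state is reachable from $q$. Under this sketch, $d(q) = 0$ exactly when $q \in F$, and a transition from $q$ to $q'$ counts as progress when $d(q') < d(q)$.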

Another challenge is what reward values to assign for various degrees of task progression. 
To this end, we design two different reward functions. 
The \emph{progression} reward function assigns rewards based on the reduced distance-to-acceptance values. 
The \emph{hybrid} reward function balances the progression reward and the penalty for self-loops (i.e., staying in the same DFA state). 
However, we find that optimal policies maximizing the expected return under these reward functions do not necessarily achieve the best possible task progression.

To address this limitation, we develop an adaptive reward shaping approach that dynamically updates distance-to-acceptance values to reflect the actual difficulty of activating DFA transitions during the learning process. 
We then design two new reward functions, namely \emph{adaptive progression} and \emph{adaptive hybrid}, leveraging the updated distance-to-acceptance values.
We show that our approach can learn an optimal policy with the highest expected return and the best task progression within a finite number of updates. 

Finally, we evaluate the proposed approach on various discrete and continuous RL environments. 
Computational experiments show that our approach is compatible with a wide range of RL algorithms.
Results indicate that our approach generally outperforms baselines, converging earlier to a better policy with a higher expected return and task completion rate.