TL;DR: We provide the first regret-free reinforcement learning algorithm for satisfying high-level temporal specifications.
Abstract: Learning to control an unknown dynamical system with respect to high-level temporal specifications is an important problem in control theory.
We present the first regret-free online algorithm for learning a controller that satisfies linear temporal logic (LTL) specifications in systems with unknown dynamics.
We assume that the underlying (unknown) dynamics are modeled by a finite-state, finite-action Markov decision process (MDP).
Our core technical result is a regret-free learning algorithm for infinite-horizon reach-avoid problems on MDPs.
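For concreteness, a standard way to write an infinite-horizon reach-avoid objective on an MDP is sketched below; the set names $\mathrm{Reach}$ and $\mathrm{Avoid}$ and the notation are illustrative and not necessarily those used in the paper:

$$\max_{\pi}\ \Pr_{s_0}^{\pi}\Big[\exists\, t \ge 0:\ s_t \in \mathrm{Reach}\ \wedge\ \forall\, t' \le t:\ s_{t'} \notin \mathrm{Avoid}\Big].$$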
For general LTL specifications, we show that the synthesis problem can be reduced to a reach-avoid problem once the graph structure is known.
Additionally, we provide an algorithm, operating independently of the main regret-free algorithm, that learns the graph structure under the assumption that a lower bound on the minimum transition probability is known. Our LTL controller synthesis algorithm provides sharp bounds on how close the learned policy is to optimal behavior after a finite number of learning episodes.
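To illustrate why a known minimum transition probability makes the graph structure learnable, the following sketch samples each state-action pair enough times that any transition with probability at least `p_min` is observed with high probability. The function `sample_transition` and the assumed generative-sampling access are hypothetical simplifications for illustration; the paper's algorithm operates online and may differ.

```python
import math
from collections import defaultdict

def estimate_graph(sample_transition, states, actions, p_min, delta=0.05):
    """Estimate the support (graph structure) of an unknown MDP's transitions.

    Illustrative sketch only: assumes every transition probability is either
    0 or at least p_min, and that next-states can be sampled from any
    (state, action) pair on demand (a generative-model simplification).
    """
    # A transition with probability >= p_min is missed in n i.i.d. samples
    # with probability at most (1 - p_min)^n, which is <= delta once
    # n >= ln(1/delta) / p_min.
    n = math.ceil(math.log(1.0 / delta) / p_min)
    support = defaultdict(set)
    for s in states:
        for a in actions:
            for _ in range(n):
                support[(s, a)].add(sample_transition(s, a))
    return support
```

The failure probability `delta` here is per state-action pair; a union bound over all pairs would be needed for a simultaneous guarantee on the whole graph.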
In contrast, previous algorithms for LTL synthesis only provide asymptotic guarantees, which give no insight into the transient performance during the learning phase.
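As a hedged illustration of the performance measure involved, regret after $K$ learning episodes is commonly defined as the cumulative gap between the optimal satisfaction probability and that achieved by the executed policies $\pi_k$; the paper's precise definition may differ:

$$\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big(V^{*}(s_0) - V^{\pi_k}(s_0)\Big),$$

where $V^{\pi}(s_0)$ denotes the probability of satisfying the specification from the initial state $s_0$ under policy $\pi$, and $V^{*}(s_0)$ is its optimal value; roughly speaking, an algorithm is regret-free when this quantity grows sublinearly in $K$.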
Lay Summary: Many real-world systems, such as robots and autonomous vehicles, are expected to perform complex temporal tasks, such as reaching a goal while avoiding obstacles. These tasks can be described using temporal logic specifications. The standard approach is reinforcement learning (RL), in which a control policy is learned from the agent's interactions with the environment; however, existing RL methods only guarantee success in the limit of infinite time and give no information about how close the agent is to optimal behavior at any particular point in time.
This paper guarantees that the RL agent learns to perform the temporal tasks, using **regret**, the gap between the agent's achieved success and the optimal success over time, as a clear measure of progress. We assume the underlying dynamics are modeled as a Markov decision process with finite sets of states and actions, and that only a lower bound on the minimum transition probability is known.
In short, we propose the first regret-free algorithm for policy synthesis with infinite-horizon temporal tasks. Our algorithm enables computing an upper bound on the number of learning episodes required for the learned policy to get close to the optimal one.
Primary Area: Theory->Online Learning and Bandits
Keywords: Markov decision process, regret-free algorithm, reinforcement learning, temporal logic tasks, formal specifications
Submission Number: 10544