Keywords: Koopman operator theory, reinforcement learning, machine learning, control theory
TL;DR: We introduce Koopman operator methods into reinforcement learning algorithms and show that they achieve SOTA while still maintaining interpretability.
Abstract: The Bellman equation and its continuous-time counterpart, the Hamilton-Jacobi-Bellman (HJB) equation, are ubiquitous in reinforcement learning and control theory due, in part, to their guaranteed convergence towards a system’s optimal value function. However, their application faces severe practical limitations. This paper explores the connection between the data-driven Koopman operator and Markov decision processes, resulting in two new reinforcement learning algorithms that alleviate these limitations. In particular, we focus on Koopman operator methods that reformulate a nonlinear system by lifting it into a new coordinate system where the dynamics become linear and HJB-based methods are more tractable. These transformations enable the estimation, prediction, and control of strongly nonlinear dynamics. Viewing the Bellman equation as a controlled dynamical system, the Koopman operator describes the expected time evolution of the value function via linear dynamics in the lifted coordinates. By parameterizing the Koopman operator with control actions and making an assumption about the feature space of the time evolution of the value function, we construct a new “Koopman tensor” that facilitates the estimation of the optimal value function. Finally, recasting Bellman’s framework in terms of the Koopman tensor enables us to reformulate two maximum-entropy reinforcement learning algorithms: soft value iteration and soft actor-critic (SAC). This framework is flexible and applies to deterministic or stochastic systems as well as to discrete- or continuous-time dynamics. We show that these algorithms attain state-of-the-art (SOTA) performance with respect to traditional neural-network-based SAC and linear quadratic regulator baselines, while retaining interpretability, on three controlled dynamical systems: the Lorenz system, fluid flow past a cylinder, and a double-well potential with non-isotropic stochastic forcing.
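To make the lifting idea concrete, below is a minimal, hypothetical sketch of how a control-parameterized Koopman operator (a “Koopman tensor”) could be estimated from transition data with an EDMD-style least-squares fit. This is not the paper’s implementation: the dictionaries `phi` and `psi`, the function names, and the fitting procedure are illustrative assumptions only.

```python
# Hypothetical sketch (not the paper's code): fit a Koopman tensor K so that
# the lifted next state satisfies phi(x') ~= (K @ psi(u)) @ phi(x),
# i.e. the action-conditioned dynamics are linear in the lifted coordinates.
import numpy as np

def phi(x):
    # Assumed state dictionary: constant, linear, and quadratic monomials.
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

def psi(u):
    # Assumed action dictionary: constant and linear terms.
    return np.concatenate(([1.0], u))

def fit_koopman_tensor(X, U, Xp):
    """Least-squares fit of K with shape (d_phi, d_phi, d_psi).

    X, Xp : (N, n) arrays of states and next states
    U     : (N, m) array of actions
    """
    Phi  = np.stack([phi(x) for x in X])       # (N, d_phi)
    Psi  = np.stack([psi(u) for u in U])       # (N, d_psi)
    PhiP = np.stack([phi(xp) for xp in Xp])    # (N, d_phi)

    # Regressors are Kronecker products psi(u) (x) phi(x), one row per sample.
    Z = np.einsum('nk,nj->nkj', Psi, Phi).reshape(len(X), -1)   # (N, d_psi*d_phi)
    # Solve PhiP ~= Z @ M, then reshape M into the tensor K[i, j, k].
    M, *_ = np.linalg.lstsq(Z, PhiP, rcond=None)                # (d_psi*d_phi, d_phi)
    d_phi, d_psi = Phi.shape[1], Psi.shape[1]
    return M.reshape(d_psi, d_phi, d_phi).transpose(2, 1, 0)

def predict_lifted(K, x, u):
    # Expected next lifted state: K(u) @ phi(x), with K(u) = K @ psi(u).
    Ku = K @ psi(u)     # (d_phi, d_phi) action-conditioned Koopman matrix
    return Ku @ phi(x)
```

In this sketch, expressing the value function in the span of `phi` would let the Bellman backup act linearly on the lifted coordinates through `K(u)`, which is the property the abstract attributes to the Koopman tensor.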
Submission Track: Original Research
Submission Number: 183