Keywords: Q-Learning, Finite Horizon MDPs, Gaussian Process, Chemical Process Control
TL;DR: A Gaussian process Q-learning algorithm for finite-horizon MDPs that approximates state-action value functions with GPs, together with a theoretical analysis showing convergence guarantees for convex MDPs.
Abstract: Many real-world control and optimization problems require making decisions over a finite time horizon to maximize performance. This paper proposes a reinforcement learning framework that approximately solves the finite-horizon Markov Decision Process (MDP) by combining Gaussian Processes (GPs) with Q-learning. The method addresses two key challenges: the intractability of exact dynamic programming in continuous state-control spaces, and the need for sample-efficient state-action value function approximation in systems where data collection is expensive. Using GPs and backward induction, we construct state-action value function approximations that enable efficient policy learning with limited data. To handle the computational burden of GPs as data accumulate across iterations, we propose a subset selection mechanism that uses M-determinantal point processes to draw diverse, high-performing subsets. The proposed method is evaluated on a linear quadratic regulator problem and the online optimization of a non-isothermal semi-batch reactor. Improved learning efficiency is shown relative to deep Q-networks and exact GPs built with all available data.
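To make the two main ingredients concrete, the sketch below illustrates (not the authors' implementation) how backward induction with GP state-action value approximations and a diversity-based subset selection could look in code. The horizon length, action grid, environment data layout, and the greedy log-determinant selection (used here as a simple stand-in for M-DPP sampling) are all illustrative assumptions.

```python
# Minimal sketch: GP Q-learning by backward induction for a finite-horizon MDP,
# plus a greedy diverse-subset selection as a stand-in for M-DPP sampling.
# All problem-specific choices below (horizon, action grid, data shapes) are assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

T = 5                                      # assumed horizon length
action_grid = np.linspace(-1.0, 1.0, 21)   # candidate controls for the max step

def fit_q_functions(transitions):
    """transitions[t]: list of (state, action, reward, next_state) tuples
    collected at stage t; states and actions are scalar here for brevity."""
    q_models = [None] * T
    for t in reversed(range(T)):           # backward induction over stages
        X, y = [], []
        for s, a, r, s_next in transitions[t]:
            if t == T - 1:
                target = r                  # terminal stage: no bootstrap
            else:
                # Bootstrap with max_a' Q_{t+1}(s', a') over the action grid.
                sa_next = np.column_stack(
                    [np.full_like(action_grid, s_next), action_grid])
                target = r + np.max(q_models[t + 1].predict(sa_next))
            X.append([s, a])
            y.append(target)
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                      alpha=1e-4, normalize_y=True)
        gp.fit(np.asarray(X), np.asarray(y))
        q_models[t] = gp                    # GP approximation of Q_t(s, a)
    return q_models

def greedy_policy(q_models, t, state):
    """Greedy control at stage t under the fitted GP Q-function."""
    sa = np.column_stack([np.full_like(action_grid, state), action_grid])
    return action_grid[np.argmax(q_models[t].predict(sa))]

def greedy_diverse_subset(X, M, length_scale=1.0):
    """Greedily pick M points maximizing the log-determinant of the RBF kernel
    submatrix (a MAP-style surrogate for sampling from an M-DPP), so the GP
    training set stays small and diverse as data accumulate."""
    K = RBF(length_scale)(np.asarray(X))
    selected = []
    for _ in range(M):
        best_i, best_logdet = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            idx = selected + [i]
            _, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)]
                                          + 1e-8 * np.eye(len(idx)))
            if logdet > best_logdet:
                best_i, best_logdet = i, logdet
        selected.append(best_i)
    return selected
```

In this sketch the maximization over actions is done by enumerating a candidate grid, which is a simplification; the diverse-subset step would be applied per stage before refitting each GP so that its cost stays bounded as more episodes are collected.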
Publication Agreement Form: pdf
Submission Number: 203