Abstract: Among the most canonical systems are linear time-invariant dynamics governed by differential equations and driven by stochastic disturbances. Since in many applications the true dynamics are unknown, an interesting problem in this class of systems is learning to minimize a quadratic cost function when the system matrices are unknown. This work initiates the theoretical analysis of implementable reinforcement learning policies for balancing exploration against exploitation in such systems. We present an online policy that quickly learns the optimal control actions by carefully randomizing the parameter estimates for exploration. More precisely, we establish performance guarantees for the presented policy, showing that the regret grows as the square root of time multiplied by the number of parameters. An implementation of the policy for a flight control task demonstrates its efficacy. Further, we prove tight results that ensure stability under inexact system matrices and fully characterize the unavoidable performance degradation caused by suboptimal policies. To obtain these results, we conduct a novel matrix perturbation analysis, bound comparative ratios of stochastic integrals, and introduce a new method of policy differentiation. These technical contributions are expected to provide a useful cornerstone for similar continuous-time reinforcement learning problems.
TL;DR: This work studies online learning of optimal control for minimizing quadratic cost functions in continuous-time stochastic linear systems.
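The randomized certainty-equivalence idea described in the abstract can be sketched in a few lines: fit least-squares estimates of the unknown system matrices from observed trajectories, perturb them with random noise whose scale shrinks over time, and apply the LQR gain computed from the perturbed estimates. Everything below (the example system, the episode length, the perturbation schedule) is an illustrative assumption for exposition, not the paper's exact algorithm.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

rng = np.random.default_rng(0)

# True (unknown to the learner) dynamics: dx = (A x + B u) dt + dW.
# This 2-dimensional example system is purely illustrative.
A_true = np.array([[0.0, 1.0], [-1.0, 0.5]])
B_true = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)  # quadratic cost matrices


def lqr_gain(A_hat, B_hat):
    """Certainty-equivalent gain from the continuous-time Riccati equation."""
    P = solve_continuous_are(A_hat, B_hat, Q, R)
    return np.linalg.solve(R, B_hat.T @ P)


dt, T = 0.01, 2000
x = np.zeros((2, 1))
regressors, targets = [], []

# Crude initial guess: true matrices plus noise (stands in for a warm-up phase).
K = lqr_gain(A_true + 0.3 * rng.standard_normal((2, 2)), B_true)

for t in range(T):
    u = -K @ x
    # Euler-Maruyama step of the stochastic differential equation.
    dx = (A_true @ x + B_true @ u) * dt + np.sqrt(dt) * rng.standard_normal((2, 1))
    regressors.append(np.vstack([x, u]).ravel())
    targets.append((dx / dt).ravel())
    x = x + dx

    if (t + 1) % 500 == 0:  # episodic re-estimation (schedule is an assumption)
        Z, Y = np.array(regressors), np.array(targets)
        theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # estimates [A B]^T
        A_ls, B_ls = theta.T[:, :2], theta.T[:, 2:]
        # Randomized exploration: perturb the estimates before computing
        # the gain, with a scale that decays as time grows.
        scale = 1.0 / np.sqrt(t + 1)
        K = lqr_gain(A_ls + scale * rng.standard_normal(A_ls.shape),
                     B_ls + scale * rng.standard_normal(B_ls.shape))

print("final state norm:", float(np.linalg.norm(x)))
```

The open-loop system above is unstable, so keeping the state bounded requires the learned gain to be stabilizing; the shrinking perturbation scale mirrors the abstract's point that exploration must be injected carefully to avoid destabilization while still gathering enough information.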