Thompson Sampling for Partially Observable Linear-Quadratic Control

Published: 01 Jan 2023, Last Modified: 02 Sept 2023, ACC 2023
Abstract: Thompson Sampling (TS) is a popular method for decision-making under uncertainty, where an action is sampled from a carefully constructed distribution based on the data collected. In this work, we study the problem of adaptive control of partially observable linear quadratic Gaussian (LQG) systems using TS, when the model dynamics are unknown. Prior works have established an $\tilde{O}(\sqrt{T})$ regret upper bound for the adaptive control of such systems after $T$ time steps. However, the algorithms that achieve this result employ computationally intractable policies. We propose an efficient TS-based adaptive control algorithm, Thompson Sampling under Partial Observability (TSPO), which operates in epochs to balance the exploration vs. exploitation trade-off and minimize the overall control cost. TSPO uses closed-loop system identification to estimate the underlying model parameters together with their confidence intervals. It then deploys the optimal policy of a sampled system, drawn at random from the distribution constructed from the model estimates and their confidence intervals. We show that, using only logarithmically many policy updates, TSPO attains $\tilde{O}(\sqrt{T})$ regret against the optimal control policy that knows the system dynamics. To the best of our knowledge, TSPO is the first computationally efficient algorithm that achieves $\tilde{O}(\sqrt{T})$ regret in adaptive control of unknown partially observable LQG systems with convex cost. Further, we empirically study the performance of TSPO in an adaptive measurement-feedback control problem.
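The sketch below illustrates, under simplifying assumptions, the kind of epoch-based Thompson-sampling loop the abstract describes: sample a model around current estimates, compute its certainty-equivalent LQG controller, deploy it for an epoch, then refine. All names (`A_hat`, `sigma_conf`, the doubling epoch schedule, the shrinking confidence width) are illustrative assumptions, not the paper's implementation.

```python
# Minimal TS-style adaptive LQG loop (illustrative sketch, not the authors' TSPO code).
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(0)

# True (unknown) system: x_{t+1} = A x_t + B u_t + w_t,  y_t = C x_t + v_t
n, m, p = 2, 1, 1
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])
C_true = np.array([[1.0, 0.0]])
Q, R = np.eye(n), np.eye(m)                 # quadratic cost weights
W, V = 0.01 * np.eye(n), 0.01 * np.eye(p)   # process / measurement noise covariances

def lqg_gains(A, B, C):
    """Certainty-equivalent LQG controller for a sampled model (A, B, C)."""
    P = solve_discrete_are(A, B, Q, R)                  # control Riccati equation
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # feedback gain: u = -K x_hat
    S = solve_discrete_are(A.T, C.T, W, V)              # filter Riccati equation (dual)
    L = A @ S @ C.T @ np.linalg.inv(C @ S @ C.T + V)    # steady-state Kalman predictor gain
    return K, L

# Coarse initial estimates and confidence width (in the real algorithm these would
# come from closed-loop system identification).
A_hat = A_true + 0.05 * rng.standard_normal((n, n))
B_hat, C_hat = B_true.copy(), C_true.copy()
sigma_conf = 0.05

x, x_hat = np.zeros((n, 1)), np.zeros((n, 1))
total_cost, t, epoch_len = 0.0, 0, 8

while t < 512:
    # Thompson sampling step: draw a model from a distribution centred at the
    # estimates with spread given by the confidence width, then act optimally for it.
    A_s = A_hat + sigma_conf * rng.standard_normal((n, n))
    B_s = B_hat + sigma_conf * rng.standard_normal((n, m))
    K, L = lqg_gains(A_s, B_s, C_hat)

    # Deploy the sampled policy for one epoch (doubling epochs -> O(log T) policy updates).
    for _ in range(epoch_len):
        u = -K @ x_hat
        y = C_true @ x + np.sqrt(V) @ rng.standard_normal((p, 1))
        x_hat = A_s @ x_hat + B_s @ u + L @ (y - C_hat @ x_hat)   # observer update
        total_cost += (x.T @ Q @ x + u.T @ R @ u).item()
        x = A_true @ x + B_true @ u + np.sqrt(W) @ rng.standard_normal((n, 1))
        t += 1

    # The real algorithm would re-estimate parameters and confidence intervals from the
    # closed-loop data here; we only shrink the width to mimic that refinement.
    sigma_conf *= 0.9
    epoch_len *= 2

print(f"average cost after {t} steps: {total_cost / t:.3f}")
```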