Title: Distributionally Robust Linear Quadratic Control

Abstract: Linear-Quadratic-Gaussian (LQG) control is a fundamental control paradigm that has been studied and applied in various fields such as engineering, computer science, economics, and neuroscience. It involves controlling a system with linear dynamics and imperfect observations, subject to additive noise, with the goal of minimizing a quadratic cost function depending on the state and control variables. In this work, we consider a generalization of the discrete-time, finite-horizon LQG problem, where the noise distributions are unknown and belong to Wasserstein ambiguity sets centered at nominal (Gaussian) distributions. The objective is to minimize a worst-case cost across all distributions in the ambiguity set, including non-Gaussian distributions. Despite the added complexity, we prove that a control policy that is linear in the observations is optimal, as in the classic LQG problem. We propose a numerical solution method that efficiently characterizes this optimal control policy. Our method uses the Frank-Wolfe algorithm to identify the least-favorable distributions within the Wasserstein ambiguity sets and computes the controller's optimal policy using Kalman filter estimation under these distributions.

Section: Introduction
The Linear Quadratic Regulator (LQR) and its stochastic counterpart, Linear-Quadratic-Gaussian (LQG) control, are foundational paradigms in control theory, finding widespread application across diverse fields such as engineering, computer science, economics, and neuroscience [3,12,29,47]. These methods provide optimal control policies for systems with linear dynamics, imperfect observations, and quadratic cost functions, under the critical assumption that additive noise terms are independent and normally distributed. In this classical setting, the optimal control policy is well-known to be linear in the observations and can be efficiently derived using Kalman filtering and dynamic programming [8]. However, real-world systems frequently operate under conditions where noise distributions are uncertain, non-Gaussian, or subject to adversarial perturbations, rendering the strict Gaussian assumption overly restrictive and potentially leading to suboptimal or brittle control performance.
Motivated by practical settings where noise distributions may not be readily available or may not be Gaussian, this paper considers a discrete-time, finite-horizon generalization of the LQG setting where noise distributions are unknown and are chosen adversarially from ambiguity sets characterized by a Wasserstein distance and centered around nominal (Gaussian) distributions.
We show that, even under distributional ambiguity, the optimal control policy remains linear in the system's observations. Our proof is novel and does not rely on traditional recursive dynamic programming arguments. Instead, we re-parametrize the control policy in terms of the purified state observations and we derive an upper bound for the resulting minimax formulation by relaxing the ambiguity set (from a Wasserstein ball into a Gelbrich ball) while simultaneously restricting the controller to linear dependencies. We then use convex duality to prove that this upper bound matches a lower bound obtained by restricting the ambiguity set in the dual of the minimax formulation. This implies the optimality of linear output feedback controllers, thus generalizing the classic results to a distributionally robust setting.
We also find that the worst-case distribution is actually Gaussian, which leads to a very efficient algorithm for finding optimal controllers. Specifically, we propose an algorithm based on the Frank-Wolfe first-order method that at every step solves sub-problems corresponding to classic LQG control problems, using Kalman filtering and dynamic programming. We show that this algorithm enjoys a sublinear convergence rate and is amenable to parallelization. Lastly, we implement the algorithm leveraging PyTorch's automatic differentiation module and we find that it yields uniformly lower runtimes than a direct method (based on solving semidefinite programs) across all problem horizons.

Section: Literature Review
This paper is related to the ample literature in control theory and engineering aimed at designing controllers that are robust to noise. The classic LQR/LQG theory, developed in the 1960s, examined linear dynamical systems in either time or frequency domain, seeking to minimize a combination of quadratic state and control costs (in time-domain) or the $\mathcal{H}_2$ norm of the system's transfer function (in frequency domain). Motivated by findings that LQG controllers do not provide the guaranteed robust stability properties of LQR controllers [15], much effort has been devoted subsequently to designing controllers that are robust to worst-case perturbations, typically evaluated in terms of the $\mathcal{H}_\infty$ norm of the system's transfer function (see, e.g., [16,53] for a comprehensive review of $\mathcal{H}_\infty$ and $\mathcal{H}_2$ controllers). Because $\mathcal{H}_\infty$ controllers tend to be overly conservative [32], various approaches have been proposed to balance the performance of nominal and robust controllers, e.g., by combining $\mathcal{H}_2$ and $\mathcal{H}_\infty$ approaches [7,17]. A parallel stream of literature has considered risk-sensitive control [51], which minimizes an entropic risk measure instead of the expected quadratic cost. Although risk-sensitive control has a distributionally robust flavor (as the entropic risk measure is equivalent to a distributionally robust quadratic objective penalized via Kullback-Leibler divergence), risk-sensitive control models do not admit a distributionally robust formulation because the entropic risk measure is convex, but not coherent [22]. In contrast, our distributionally robust model provides a direct interpretation of the exact set of noise distributions against which the controller provides safeguards, and leads to a computationally tractable framework for finding the optimal controller.
In this sense, our work is more directly related to the literature on distributionally robust control, which seeks controllers that minimize expected costs under worst-case noise distributions [11,33,34,41,50,52]. Closest to our work are [28,33]. [33] proves the optimality of linear state-feedback control policies for a related minimax LQR model with a Wasserstein distance but with perfect state observations. With perfect observations, the optimal policies in the classic LQR formulation are independent of the noise distribution and are thus inherently already robust, so considering imperfect observations is what makes the problem significantly more challenging in our case. [28] studies a minimax formulation based on the Wasserstein distance with both state and observation noise but without any control policy, and focuses solely on the problem of estimating the states. Several papers have considered robust formulations with imperfect observations but for constrained systems [5,6,34], which are more challenging; the common approach is to restrict attention to linear feedback policies for computational tractability, and without proving their optimality. Also related is the recent literature stream on distributionally robust optimization using the Wasserstein distance [36]. Within this stream, the closest work is [38,44], which consider the problem of minimax mean-squared-error estimation when ambiguity is modeled with a Wasserstein distance from a nominal Gaussian distribution. Our proof builds on some ideas from these papers (e.g., relying on the Gelbrich distance in the construction of the upper bound), which it combines with ideas from control theory on purified output-feedback to obtain the overall construction. Also related is [2], which studies multistage distributionally robust problems with ambiguity sets given by a nested Wasserstein distance for stochastic processes and identifies computationally tractable cases. 
For a broader overview of developments related to optimal transport and Wasserstein distance with an emphasis on computational tractability and applications in machine learning, we refer to [42].
Finally, our paper is also related to literature that documents the optimality of linear/affine policies in (distributionally) robust dynamic optimization models. [10,30] prove optimality for one-dimensional linear systems affected by additive noise and with perfect state observations, but with general (convex) state and/or control costs, [27,49] provide computationally tractable approaches to quantifying the suboptimality of affine controllers in finite or infinite-horizon settings, and [9,21,25] characterize the performance of affine policies in two-stage (distributionally) robust dynamic models.
Notation. All random objects are defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Thus, the distribution of any random vector $\xi : \Omega \to \mathbb{R}^d$ is given by the pushforward distribution $\mathbb{P}_\xi = \mathbb{P} \circ \xi^{-1}$ of $\mathbb{P}$ with respect to $\xi$. The expectation under $\mathbb{P}$ is denoted by $\mathbb{E}_{\mathbb{P}}[\cdot]$. For any $t \in \mathbb{Z}_+$, we set $[t] = \{0, \ldots, t\}$.

Section: Problem Definition
We consider a discrete-time linear dynamical system, a common model in control theory, described by
$$x_{t+1} = A_t x_t + B_t u_t + w_t \quad \forall t \in [T-1]. \tag{1}$$
Here, $x_t \in \mathbb{R}^n$ represents the system state, $u_t \in \mathbb{R}^m$ are the control inputs, and $w_t \in \mathbb{R}^n$ denotes the process noise. The system dynamics are governed by matrices $A_t \in \mathbb{R}^{n \times n}$ and $B_t \in \mathbb{R}^{n \times m}$. The controller's information is limited to imperfect state measurements
$$y_t = C_t x_t + v_t \quad \forall t \in [T-1]. \tag{2}$$
These measurements are corrupted by observation noise $v_t \in \mathbb{R}^p$, with $C_t \in \mathbb{R}^{p \times n}$. Typically, $p \le n$, implying that direct state reconstruction from measurements is not straightforward, even without noise. The control inputs $u_t$ are causal, meaning they depend solely on past and current observations $y_0, \ldots, y_t$, but not on future information. Formally, the set of feasible control inputs $\mathcal{U}_y$ comprises random vectors $(u_0, u_1, \ldots, u_{T-1})$ where each $u_t$ is generated by a measurable control policy $\varphi_t : \mathbb{R}^{p(t+1)} \to \mathbb{R}^m$ such that $u_t = \varphi_t(y_0, \ldots, y_t)$. The objective is to minimize a quadratic cost function that penalizes deviations in states and control efforts:
$$J = \sum_{t=0}^{T-1} \left( x_t^\top Q_t x_t + u_t^\top R_t u_t \right) + x_T^\top Q_T x_T, \tag{3}$$
where $Q_t \in \mathbb{S}^n_+$ and $R_t \in \mathbb{S}^m_{++}$ are positive semidefinite and positive definite matrices, respectively, representing state and input costs. The exogenous random vectors $x_0$ (initial state), $\{w_t\}_{t=0}^{T-1}$ (process noise), and $\{v_t\}_{t=0}^{T-1}$ (observation noise) are assumed to be mutually independent, following probability distributions $\mathbb{P}_{x_0}$, $\{\mathbb{P}_{w_t}\}_{t=0}^{T-1}$, and $\{\mathbb{P}_{v_t}\}_{t=0}^{T-1}$. Given the causality of control inputs, $x_t$, $u_t$, and $y_t$ can be expressed as measurable functions of the exogenous uncertainties up to time $t$. Without loss of generality, we define the probability space $\Omega = \mathbb{R}^n \times \mathbb{R}^{n \times T} \times \mathbb{R}^{p \times T}$ as the space of realizations of these uncertainties, with $\mathcal{F}$ as the Borel σ-algebra and $\mathbb{P} = \mathbb{P}_{x_0} \otimes (\otimes_{t=0}^{T-1} \mathbb{P}_{w_t}) \otimes (\otimes_{t=0}^{T-1} \mathbb{P}_{v_t})$, where $\otimes$ denotes independent coupling.
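The dynamics (1), the observations (2), and the cost (3) can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions that are not part of the model above: the matrices are time-invariant, and the function name `simulate_lqg` and the `policy(t, ys)` interface are hypothetical.

```python
import numpy as np

def simulate_lqg(A, B, C, Q, R, Q_T, policy, x0, w, v):
    """Roll out (1)-(2) under a causal policy and return the cost J in (3).

    policy(t, ys) must return u_t given the list of observations y_0..y_t.
    w and v have shapes (T, n) and (T, p) and hold the noise realizations.
    Time-invariant A, B, C, Q, R are an assumption made for brevity.
    """
    T = w.shape[0]
    x, ys, J = x0, [], 0.0
    for t in range(T):
        ys.append(C @ x + v[t])        # observation (2)
        u = policy(t, ys)              # causal input: sees y_0..y_t only
        J += x @ Q @ x + u @ R @ u     # stage cost in (3)
        x = A @ x + B @ u + w[t]       # dynamics (1)
    return J + x @ Q_T @ x             # terminal cost

# Noise-free scalar example with x_0 = 1 and horizon T = 2.
one = np.eye(1)
x0 = np.array([1.0])
w = np.zeros((2, 1))
v = np.zeros((2, 1))
J_zero = simulate_lqg(one, one, one, one, one, one, lambda t, ys: np.zeros(1), x0, w, v)
J_feedback = simulate_lqg(one, one, one, one, one, one, lambda t, ys: -ys[-1], x0, w, v)
```

In this toy instance the zero input leaves the state at 1 for the whole horizon, whereas the output feedback $u_t = -y_t$ drives it to 0 after one step at the price of one unit of control cost.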

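In the classic LQG model, $\mathbb{P}$ is assumed to be known and Gaussian, and the problem aims to find $u \in \mathcal{U}_y$ that minimizes $\mathbb{E}_{\mathbb{P}}[J]$. Appendix §A details the standard approach using Kalman filtering and dynamic programming. However, this paper addresses a more realistic and challenging scenario: the noise distributions are unknown. We model this uncertainty by assuming $\mathbb{P}$ belongs to an ambiguity set $\mathcal{W}$, and we formulate a distributionally robust LQG problem that seeks $u \in \mathcal{U}_y$ to minimize the worst-case expected cost:
$$\max_{\mathbb{P} \in \mathcal{W}} \ \mathbb{E}_{\mathbb{P}} \left[ \sum_{t=0}^{T-1} \left( x_t^\top Q_t x_t + u_t^\top R_t u_t \right) + x_T^\top Q_T x_T \right]. \tag{4}$$
We construct the ambiguity set $\mathcal{W}$ as a ball based on the Wasserstein distance, centered around a nominal Gaussian distribution. Specifically, we assume a nominal distribution $\hat{\mathbb{P}} = \hat{\mathbb{P}}_{x_0} \otimes (\otimes_{t=0}^{T-1} \hat{\mathbb{P}}_{w_t}) \otimes (\otimes_{t=0}^{T-1} \hat{\mathbb{P}}_{v_t})$ is available, where $\hat{\mathbb{P}}_{x_0} = \mathcal{N}(0, \hat{X}_0)$, $\hat{\mathbb{P}}_{w_t} = \mathcal{N}(0, \hat{W}_t)$, and $\hat{\mathbb{P}}_{v_t} = \mathcal{N}(0, \hat{V}_t)$ for all $t \in [T-1]$. The ambiguity set is then defined as $\mathcal{W} = \mathcal{W}_{x_0} \otimes (\otimes_{t=0}^{T-1} \mathcal{W}_{w_t}) \otimes (\otimes_{t=0}^{T-1} \mathcal{W}_{v_t})$, where
$$\begin{aligned}
\mathcal{W}_{x_0} &= \{ \mathbb{P}_{x_0} \in \mathcal{P}(\mathbb{R}^n) : \mathbb{W}(\hat{\mathbb{P}}_{x_0}, \mathbb{P}_{x_0}) \le \rho_{x_0},\ \mathbb{E}_{\mathbb{P}_{x_0}}[x_0] = 0 \}, \\
\mathcal{W}_{w_t} &= \{ \mathbb{P}_{w_t} \in \mathcal{P}(\mathbb{R}^n) : \mathbb{W}(\hat{\mathbb{P}}_{w_t}, \mathbb{P}_{w_t}) \le \rho_{w_t},\ \mathbb{E}_{\mathbb{P}_{w_t}}[w_t] = 0 \}, \\
\mathcal{W}_{v_t} &= \{ \mathbb{P}_{v_t} \in \mathcal{P}(\mathbb{R}^p) : \mathbb{W}(\hat{\mathbb{P}}_{v_t}, \mathbb{P}_{v_t}) \le \rho_{v_t},\ \mathbb{E}_{\mathbb{P}_{v_t}}[v_t] = 0 \},
\end{aligned}$$
and $\mathbb{W}$ is the 2-Wasserstein distance. This construction ensures that all exogenous random variables $x_0, w_0, \ldots, w_{T-1}, v_0, \ldots, v_{T-1}$ remain independent under any distribution within $\mathcal{W}$.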
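Definition 1 (2-Wasserstein distance). The 2-Wasserstein distance between two distributions $\mathbb{P}_1$ and $\mathbb{P}_2$ on $\mathbb{R}^d$ with finite second moments is given by
$$\mathbb{W}(\mathbb{P}_1, \mathbb{P}_2) = \left( \inf_{\pi \in \Pi(\mathbb{P}_1, \mathbb{P}_2)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|\xi_1 - \xi_2\|_2^2 \, \pi(\mathrm{d}\xi_1, \mathrm{d}\xi_2) \right)^{1/2},$$
where $\Pi(\mathbb{P}_1, \mathbb{P}_2)$ denotes the set of all couplings, that is, all joint distributions of the random variables $\xi_1$ and $\xi_2$ with marginal distributions $\mathbb{P}_1$ and $\mathbb{P}_2$, respectively.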
Our model strictly generalizes the classic LQG setting, which is recovered when all Wasserstein radii $\rho_{x_0}$, $\rho_{w_t}$, and $\rho_{v_t}$ are set to 0. These parameters $\rho$ thus quantify the level of uncertainty about the nominal model, enabling the construction of robust controllers against model misspecification. A key challenge is that the Wasserstein ambiguity set $\mathcal{W}$ encompasses many non-Gaussian distributions, making it non-trivial to ascertain if the worst-case distribution in (4) is Gaussian. Furthermore, the non-convex nature of $\mathcal{W}$, due to the independence assumption of exogenous uncertainties, adds significant complexity to solving the distributionally robust LQG problem.

Section: Nash Equilibrium and Optimality of Linear Output Feedback Controllers
We henceforth view the distributionally robust LQG problem as a zero-sum game between the controller, who chooses causal control inputs, and nature, who chooses a distribution P ∈ W. In this section we show that this game admits a Nash equilibrium, where nature's Nash strategy is a Gaussian distribution P ⋆ ∈ W and the controller's Nash strategy is a linear output feedback policy based on the Kalman filter evaluated under P ⋆ .
Purified Observations. Before outlining our proof strategy, we first simplify the problem formulation by re-parametrizing the control inputs in a more convenient form (following [5,6,27]). Note that the control inputs in the LQG formulation are subject to cyclic dependencies, as $u_t$ depends on $y_t$, while $y_t$ depends on $x_t$ through (2), and $x_t$ depends again on $u_t$ through (1), etc. Because these dependencies make the problem hard to analyze, it is preferable to consider the controls as functions of a new set of so-called purified observations instead of the actual observations $y_t$.
Specifically, we first introduce a fictitious noise-free system
$$\hat{x}_{t+1} = A_t \hat{x}_t + B_t u_t \quad \forall t \in [T-1] \qquad \text{and} \qquad \hat{y}_t = C_t \hat{x}_t \quad \forall t \in [T-1]$$
with states $\hat{x}_t \in \mathbb{R}^n$ and outputs $\hat{y}_t \in \mathbb{R}^p$, which is initialized by $\hat{x}_0 = 0$ and controlled by the same inputs $u_t$ as the original system (1). We then define the purified observation at time $t$ as $\eta_t = y_t - \hat{y}_t$ and we use $\eta = (\eta_0, \ldots, \eta_{T-1})$ to denote the trajectory of all purified observations. As the inputs $u_t$ are causal, the controller can compute the fictitious state $\hat{x}_t$ and output $\hat{y}_t$ from the observations $y_0, \ldots, y_t$. Thus, $\eta_t$ is representable as a function of $y_0, \ldots, y_t$. Conversely, one can show by induction that $y_t$ can also be represented as a function of $\eta_0, \ldots, \eta_t$. Moreover, any measurable function of $y_0, \ldots, y_t$ can be expressed as a measurable function of $\eta_0, \ldots, \eta_t$ and vice versa [27, Proposition II.1]. So if we define $\mathcal{U}_\eta$ as the set of all control inputs $(u_0, u_1, \ldots, u_{T-1})$ so that $u_t = \psi_t(\eta_0, \ldots, \eta_t)$ for some measurable function $\psi_t : \mathbb{R}^{p(t+1)} \to \mathbb{R}^m$ for every $t \in [T-1]$, the above reasoning implies that $\mathcal{U}_\eta = \mathcal{U}_y$.
In view of this, we can rewrite the distributionally robust LQG problem equivalently as
$$\begin{aligned}
p^\star &= \min_{x,u,y} \ \max_{\mathbb{P} \in \mathcal{W}} \ \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \quad \text{s.t.} \quad u \in \mathcal{U}_y,\ x = Hu + Gw,\ y = Cx + v \\
&= \min_{x,u} \ \max_{\mathbb{P} \in \mathcal{W}} \ \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \quad \text{s.t.} \quad u \in \mathcal{U}_\eta,\ x = Hu + Gw,
\end{aligned} \tag{5}$$
where $x = (x_0, \ldots, x_T)$, $u = (u_0, \ldots, u_{T-1})$, $y = (y_0, \ldots, y_{T-1})$, $w = (x_0, w_0, \ldots, w_{T-1})$, $v = (v_0, \ldots, v_{T-1})$, $\eta = (\eta_0, \ldots, \eta_{T-1})$, and $R$, $Q$, $H$, $G$ and $C$ are suitable block matrices (see Appendix §B for their precise definitions). The latter reformulation involving the purified observations $\eta$ is useful because these are independent of the inputs. Indeed, by recursively combining the equations of the original and the noise-free systems, one can show that $\eta = Dw + v$ for some block triangular matrix $D$ (see Appendix §B for its construction). This shows that the purified observations depend (linearly) on the exogenous uncertainties but not on the control inputs. Hence, the cyclic dependencies complicating the original system are eliminated in (5).
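The claim that the purified observations do not depend on the inputs can be checked numerically. The sketch below runs the original and the noise-free system side by side for two unrelated input sequences and the same noise realization. It assumes time-invariant matrices and open-loop inputs for simplicity, and the function name `purified_observations` is hypothetical.

```python
import numpy as np

def purified_observations(A, B, C, u, x0, w, v):
    """Return eta_t = y_t - yhat_t for t = 0..T-1 under the input sequence u."""
    T, n = w.shape
    x, xhat, eta = x0, np.zeros(n), []
    for t in range(T):
        y = C @ x + v[t]               # true observation (2)
        yhat = C @ xhat                # output of the noise-free system
        eta.append(y - yhat)
        x = A @ x + B @ u[t] + w[t]    # true dynamics (1)
        xhat = A @ xhat + B @ u[t]     # noise-free dynamics, same inputs
    return np.array(eta)

rng = np.random.default_rng(1)
n, m, p, T = 3, 2, 2, 6
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
C = rng.standard_normal((p, n))
x0 = rng.standard_normal(n)
w = rng.standard_normal((T, n))
v = rng.standard_normal((T, p))

# Two unrelated input sequences, identical noise realization.
u_a = rng.standard_normal((T, m))
u_b = 10.0 * rng.standard_normal((T, m))
eta_a = purified_observations(A, B, C, u_a, x0, w, v)
eta_b = purified_observations(A, B, C, u_b, x0, w, v)
```

Up to floating-point rounding, `eta_a` and `eta_b` coincide, because the difference $x_t - \hat{x}_t$ evolves according to the noise alone.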
Subsequently, we also study the dual of (5), defined as
$$d^\star = \max_{\mathbb{P} \in \mathcal{W}} \ \min_{x,u} \ \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \quad \text{s.t.} \quad u \in \mathcal{U}_\eta,\ x = Hu + Gw. \tag{6}$$
The classic minimax inequality implies that $p^\star \ge d^\star$. If we can prove that $p^\star = d^\star$, that (5) has a solution $u^\star$ and that (6) has a solution $\mathbb{P}^\star$, then $(u^\star, \mathbb{P}^\star)$ must be a Nash equilibrium of the zero-sum game at hand [43, Theorem 2]. However, because $\mathcal{U}_\eta$ is an infinite-dimensional function space and $\mathcal{W}$ is an infinite-dimensional, non-convex set of non-parametric distributions, the existence of a Nash equilibrium (in pure strategies) is not at all evident. Instead, our proof strategy will rely on constructing an upper bound for $p^\star$ and a lower bound for $d^\star$, and showing that these match.
Upper Bound for $p^\star$. We obtain an upper bound for $p^\star$ by suitably enlarging the ambiguity set $\mathcal{W}$ and restricting the controllers $u_t$ to linear dependencies. We enlarge $\mathcal{W}$ by ignoring all information about the distributions in $\mathcal{W}$ except for their covariance matrices, and by replacing the Wasserstein distance with the Gelbrich distance. To that end, we first define the Gelbrich distance on the space of covariance matrices.
Definition 2 (Gelbrich distance). The Gelbrich distance between the two covariance matrices $\Sigma_1, \Sigma_2 \in \mathbb{S}^d_+$ is given by
$$\mathbb{G}(\Sigma_1, \Sigma_2) = \sqrt{\operatorname{Tr}\left[ \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2} \right)^{1/2} \right]}.$$
We are interested in the Gelbrich distance because of its close connection to the 2-Wasserstein distance. Indeed, it is known that the 2-Wasserstein distance between two distributions with zero means is bounded below by the Gelbrich distance between the respective covariance matrices.
Proposition 3.1. For any two distributions $\mathbb{P}_1$ and $\mathbb{P}_2$ on $\mathbb{R}^d$ with zero means and covariance matrices $\Sigma_1, \Sigma_2 \in \mathbb{S}^d_+$, respectively, we have
$$\mathbb{W}(\mathbb{P}_1, \mathbb{P}_2) \ge \mathbb{G}(\Sigma_1, \Sigma_2).$$
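Recalling that $\hat{X}_0$, $\hat{W}_t$ and $\hat{V}_t$ respectively denote the covariance matrices for $x_0$, $w_t$ and $v_t$ under the nominal distribution $\hat{\mathbb{P}}$, we can then define the following Gelbrich ambiguity set for the exogenous uncertainties: $\mathcal{G} = \mathcal{G}_{x_0} \otimes (\otimes_{t=0}^{T-1} \mathcal{G}_{w_t}) \otimes (\otimes_{t=0}^{T-1} \mathcal{G}_{v_t})$, where
$$\begin{aligned}
\mathcal{G}_{x_0} &= \{ \mathbb{P}_{x_0} \in \mathcal{P}(\mathbb{R}^n) : \mathbb{E}_{\mathbb{P}_{x_0}}[x_0] = 0,\ \mathbb{E}_{\mathbb{P}_{x_0}}[x_0 x_0^\top] = X_0,\ \mathbb{G}(X_0, \hat{X}_0) \le \rho_{x_0} \}, \\
\mathcal{G}_{w_t} &= \{ \mathbb{P}_{w_t} \in \mathcal{P}(\mathbb{R}^n) : \mathbb{E}_{\mathbb{P}_{w_t}}[w_t] = 0,\ \mathbb{E}_{\mathbb{P}_{w_t}}[w_t w_t^\top] = W_t,\ \mathbb{G}(W_t, \hat{W}_t) \le \rho_{w_t} \}, \\
\mathcal{G}_{v_t} &= \{ \mathbb{P}_{v_t} \in \mathcal{P}(\mathbb{R}^p) : \mathbb{E}_{\mathbb{P}_{v_t}}[v_t] = 0,\ \mathbb{E}_{\mathbb{P}_{v_t}}[v_t v_t^\top] = V_t,\ \mathbb{G}(V_t, \hat{V}_t) \le \rho_{v_t} \}.
\end{aligned}$$
By construction, the random vectors $x_0$, $\{w_t\}_{t=0}^{T-1}$ and $\{v_t\}_{t=0}^{T-1}$ are thus mutually independent under any $\mathbb{P} \in \mathcal{G}$. In addition and as a direct consequence of Proposition 3.1, $\mathcal{G}$ constitutes an outer approximation for the Wasserstein ambiguity set $\mathcal{W}$, as summarized in the next result.
Corollary 1 (Gelbrich hull). We have $\mathcal{W} \subseteq \mathcal{G}$.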
Because $\mathcal{G}$ covers $\mathcal{W}$, we henceforth refer to it as the Gelbrich hull of the Wasserstein ambiguity set $\mathcal{W}$. To finalize our construction of the upper bound on $p^\star$, we focus on linear policies of the form $u = q + U\eta = q + U(Dw + v)$, where $q = (q_0, \ldots, q_{T-1})$, and $U$ is a block lower triangular matrix
$$U = \begin{bmatrix} U_{0,0} & & & \\ U_{1,0} & U_{1,1} & & \\ \vdots & & \ddots & \\ U_{T-1,0} & \cdots & \cdots & U_{T-1,T-1} \end{bmatrix}. \tag{7}$$
The block lower triangularity of $U$ ensures that the corresponding controller is causal, which in turn ensures that $u \in \mathcal{U}_\eta$. In the following, we denote by $\mathcal{U}$ the set of all block lower triangular matrices of the form (7). An upper bound on problem (5) can now be obtained by restricting the controller's feasible set to causal controllers that are linear in the purified observations $\eta$ and by relaxing nature's feasible set to the Gelbrich hull $\mathcal{G}$ of $\mathcal{W}$. The resulting bounding problem is given by
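The link between block lower triangularity and causality can be illustrated with a small NumPy sketch. The helper `causal_mask` and the random test data are hypothetical: the mask zeroes every block $U_{t,s}$ with $s > t$, and perturbing only future purified observations then leaves the earlier inputs unchanged.

```python
import numpy as np

def causal_mask(T, m, p):
    """0/1 mask of shape (m*T, p*T) that keeps the blocks U[t, s] with s <= t."""
    block_pattern = np.tril(np.ones((T, T)))
    return np.kron(block_pattern, np.ones((m, p)))

T, m, p = 4, 2, 3
rng = np.random.default_rng(2)
U = rng.standard_normal((m * T, p * T)) * causal_mask(T, m, p)
q = rng.standard_normal(m * T)

eta = rng.standard_normal(p * T)
eta_changed = eta.copy()
eta_changed[2 * p:] += 1.0             # perturb eta_2 and eta_3 only

u = q + U @ eta
u_changed = q + U @ eta_changed        # u_0 and u_1 must be unaffected
```

Only the entries of `u` belonging to $u_2$ and $u_3$ react to the perturbation, which is exactly the causality requirement $u_t = \psi_t(\eta_0, \ldots, \eta_t)$.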
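$$\bar{p}^\star = \min_{U,q,x,u} \ \max_{\mathbb{P} \in \mathcal{G}} \ \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \quad \text{s.t.} \quad U \in \mathcal{U},\ u = q + U(Dw + v),\ x = Hu + Gw. \tag{8}$$
As we obtained (8) by restricting the feasible set of the outer minimization problem and relaxing the feasible set of the inner maximization problem in (5), it is clear that $\bar{p}^\star \ge p^\star$. Recall also that problem (5) constitutes an infinite-dimensional zero-sum game, where the agents optimize over measurable policies and non-parametric distributions, respectively. In contrast, the next proposition shows that problem (8) is equivalent to a finite-dimensional zero-sum game.
Proposition 3.2. Problem (8) is equivalent to the optimization problem
$$\bar{p}^\star = \min_{\substack{q \in \mathbb{R}^{mT} \\ U \in \mathcal{U}}} \ \max_{\substack{W \in \mathcal{G}_W \\ V \in \mathcal{G}_V}} \ \operatorname{Tr}\left( \left( D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G \right) W \right) + \operatorname{Tr}\left( U^\top (R + H^\top Q H) U V \right) + q^\top (R + H^\top Q H) q, \tag{9}$$
where
$$\begin{aligned}
\mathcal{G}_W &= \left\{ W \in \mathbb{S}^{n(T+1)}_+ : W = \operatorname{diag}(X_0, W_0, \ldots, W_{T-1}),\ X_0 \in \mathbb{S}^n_+,\ W_t \in \mathbb{S}^n_+,\ \mathbb{G}(X_0, \hat{X}_0)^2 \le \rho_{x_0}^2,\ \mathbb{G}(W_t, \hat{W}_t)^2 \le \rho_{w_t}^2 \ \forall t \in [T-1] \right\}, \\
\mathcal{G}_V &= \left\{ V \in \mathbb{S}^{pT}_+ : V = \operatorname{diag}(V_0, \ldots, V_{T-1}),\ V_t \in \mathbb{S}^p_+,\ \mathbb{G}(V_t, \hat{V}_t)^2 \le \rho_{v_t}^2 \ \forall t \in [T-1] \right\}.
\end{aligned}$$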
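The objective of the finite-dimensional problem can be recovered by direct substitution. The following derivation is a sketch of that step, not the proof from the source. It uses only the facts stated above: $w$ and $v$ are independent with zero means, and their second-moment matrices are $W$ and $V$.

```latex
% Substituting u = q + U(Dw + v) into x = Hu + Gw gives
%   x = Hq + (HUD + G) w + HU v .
\begin{align*}
\mathbb{E}\big[u^\top R u\big]
  &= q^\top R q + \operatorname{Tr}\big(D^\top U^\top R\, U D\, W\big)
     + \operatorname{Tr}\big(U^\top R\, U V\big), \\
\mathbb{E}\big[x^\top Q x\big]
  &= q^\top H^\top Q H q
     + \operatorname{Tr}\big((HUD + G)^\top Q\, (HUD + G)\, W\big)
     + \operatorname{Tr}\big(U^\top H^\top Q H U V\big).
\end{align*}
% Expanding the quadratic term and using the symmetry of W and Q,
%   Tr(D^T U^T H^T Q G W) = Tr(G^T Q H U D W),
% the two cross terms combine into 2 Tr(G^T Q H U D W). Summing both lines:
\begin{align*}
\mathbb{E}\big[u^\top R u + x^\top Q x\big]
  &= \operatorname{Tr}\Big(\big(D^\top U^\top (R + H^\top Q H) U D
     + 2\, G^\top Q H U D + G^\top Q G\big) W\Big) \\
  &\quad + \operatorname{Tr}\big(U^\top (R + H^\top Q H) U V\big)
     + q^\top (R + H^\top Q H) q .
\end{align*}
```

The expected cost of a linear policy therefore depends on the distribution only through $W$ and $V$, which is why nature's problem collapses to an optimization over covariance matrices.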
We emphasize that Proposition 3.2 remains valid even if the nominal distribution $\hat{\mathbb{P}}$ fails to be normal. Note also that, while nature's feasible set in (8) is non-convex due to the independence conditions, the sets $\mathcal{G}_W$ and $\mathcal{G}_V$ are convex and even semidefinite representable thanks to the properties of the squared Gelbrich distance. By dualizing the inner maximization problem, one can therefore reformulate the minimax problem (9) as a convex semidefinite program (SDP). Even though this SDP is computationally tractable in theory, it involves $O(T(mp + n^2 + p^2))$ decision variables. For practically interesting problem dimensions, it thus quickly exceeds the capabilities of existing solvers.
Lower Bound for $d^\star$. To derive a tractable lower bound on $d^\star$, we restrict nature's feasible set to the family $\mathcal{W}_{\mathcal{N}}$ of all normal distributions in the Wasserstein ambiguity set $\mathcal{W}$. The resulting bounding problem is thus given by
$$\underline{d}^\star = \max_{\mathbb{P} \in \mathcal{W}_{\mathcal{N}}} \ \min_{x,u} \ \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \quad \text{s.t.} \quad u \in \mathcal{U}_\eta,\ x = Hu + Gw. \tag{10}$$
As we obtained (10) by restricting the feasible set of the outer maximization problem in (6), it is clear that $\underline{d}^\star \le d^\star$. Next, we show that (10) can be recast as a finite-dimensional zero-sum game. This result critically relies on the following known fact regarding the 2-Wasserstein distance between two normal distributions, which coincides with the Gelbrich distance between their covariance matrices.
Proposition 3.3 (Tightness for normal distributions [26, Proposition 7]). For any two normal distributions $\mathbb{P}_1 = \mathcal{N}(0, \Sigma_1)$ and $\mathbb{P}_2 = \mathcal{N}(0, \Sigma_2)$ with zero means we have
$$\mathbb{W}(\mathbb{P}_1, \mathbb{P}_2) = \mathbb{G}(\Sigma_1, \Sigma_2).$$
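For concreteness, the Gelbrich distance of Definition 2 can be evaluated with a few lines of NumPy. By the tightness result above, the same number is the 2-Wasserstein distance between the zero-mean normal distributions with these covariance matrices. This is a minimal sketch; the helper names `psd_sqrt` and `gelbrich` are hypothetical, and matrix square roots are taken via an eigendecomposition.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh((S + S.T) / 2)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gelbrich(S1, S2):
    """Gelbrich distance between the covariance matrices S1 and S2."""
    root2 = psd_sqrt(S2)
    cross = psd_sqrt(root2 @ S1 @ root2)
    squared = np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(max(squared, 0.0)))   # guard against tiny negative rounding

# Scalar case: the distance reduces to |sigma_1 - sigma_2| = |2 - 1|.
d_scalar = gelbrich(np.array([[4.0]]), np.array([[1.0]]))

# Commuting (diagonal) case: Frobenius norm of the difference of square roots.
S1 = np.diag([4.0, 9.0])
S2 = np.diag([1.0, 1.0])
d_diag = gelbrich(S1, S2)
```

For the diagonal pair the result is $\sqrt{(2-1)^2 + (3-1)^2} = \sqrt{5}$, which matches the closed form for commuting covariance matrices.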
With this, we can provide a finite-dimensional reformulation, as summarized in the next result.
Proposition 3.4. Problem (10) is equivalent to the optimization problem
$$\underline{d}^\star = \max_{\substack{W \in \mathcal{G}_W \\ V \in \mathcal{G}_V}} \ \min_{\substack{q \in \mathbb{R}^{mT} \\ U \in \mathcal{U}}} \ \operatorname{Tr}\left( \left( D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G \right) W \right) + \operatorname{Tr}\left( U^\top (R + H^\top Q H) U V \right) + q^\top (R + H^\top Q H) q, \tag{11}$$
where $\mathcal{G}_W$ and $\mathcal{G}_V$ are defined exactly as in Proposition 3.2.
Proposition 3.4 relies on Proposition 3.3 and thus fails to hold unless the nominal distribution $\hat{\mathbb{P}}$ is normal. Also, one can again reformulate (11) as a tractable SDP by dualizing the inner minimization problem.
Conclusions. Propositions 3.2 and 3.4 reveal that problems (9) and (11) are dual to each other, that is, they can be transformed into one another by interchanging minimization and maximization. The following main theorem shows that strong duality holds irrespective of the problem data.
Theorem 3.5 (Strong duality of (9) and (11)). We have $\bar{p}^\star = \underline{d}^\star$.
Theorem 3.5 follows immediately from Sion's classic minimax theorem [45], which applies because G W and G V are convex as well as compact thanks to [38,Lemma A.6].
By weak duality and the construction of the bounding problems (9) and (11), we trivially have $\underline{d}^\star \le d^\star \le p^\star \le \bar{p}^\star$. Theorem 3.5 reveals that all of these inequalities are in fact equalities, each of which gives rise to a non-trivial insight. The first key insight is that (5) and (6) are strong duals.
Corollary 2 (Strong duality of (5) and (6)). We have $p^\star = d^\star$.
We stress that, unlike Theorem 3.5, Corollary 2 establishes strong duality between two infinite-dimensional zero-sum games. The second key implication of Theorem 3.5 is that the distributionally robust LQG problem (5) is solved by a linear output-feedback controller.
Corollary 3 (The controller's Nash strategy is linear in the observations). There exist $U^\star \in \mathcal{U}$ and $q^\star \in \mathbb{R}^{mT}$ such that the distributionally robust LQG problem (5) is solved by $u^\star = q^\star + U^\star y$.
The identity $p^\star = \bar{p}^\star$ readily implies that (5) is solved by a causal controller that is linear in the purified observations. However, any causal controller that is linear in the purified observations $\eta$ can be reformulated exactly as a causal controller that is linear in the original observations $y$ and vice versa [6, Proposition 3]. Thus, Corollary 3 follows. The third key implication of Theorem 3.5 is that the dual distributionally robust LQG problem is solved by a normal distribution.
Corollary 4 (Nature's Nash strategy is a normal distribution). The dual distributionally robust LQG problem (6) is solved by a distribution $\mathbb{P}^\star \in \mathcal{W}_{\mathcal{N}}$.
Corollary 4 is a direct consequence of the identity $d^\star = \underline{d}^\star$. Note that the optimal normal distribution $\mathbb{P}^\star$ is uniquely determined by the covariance matrices $W^\star$ and $V^\star$ of the exogenous uncertain parameters, which can be computed by solving problem (11). That the worst-case distribution is Gaussian is not expected a priori and is surprising given that the Wasserstein ball contains many non-Gaussian distributions.

Section: Efficient Numerical Solution of Distributionally Robust LQG Problems
Having proven these structural results, we next turn attention to the problem of finding the optimal strategies. Our next result shows that, under a mild regularity condition, the optimal controller $u^\star$ of the distributionally robust LQG problem (5) can be computed efficiently from $\mathbb{P}^\star$.
Proposition 4.1. [...] then problem (6) is solved by a Gaussian distribution $\mathbb{P}^\star$ under which $v_t$ has a covariance matrix $V_t^\star \succ 0$ for every $t \in [T-1]$, and (5) is solved by the optimal LQG controller corresponding to $\mathbb{P}^\star$. Additionally, the optimal value of problem (9) and its strong dual (11) does not change if we restrict $\mathcal{G}_W$ and $\mathcal{G}_V$ to $\mathcal{G}_W^+$ and $\mathcal{G}_V^+$, respectively, where
$$\begin{aligned}
\mathcal{G}_W^+ &= \left\{ W \in \mathcal{G}_W : X_0 \succeq \lambda_{\min}(\hat{X}_0) I,\ W_t \succeq \lambda_{\min}(\hat{W}_t) I \ \forall t \in [T-1] \right\}, \\
\mathcal{G}_V^+ &= \left\{ V \in \mathcal{G}_V : V_t \succeq \lambda_{\min}(\hat{V}_t) I \ \forall t \in [T-1] \right\}.
\end{aligned}$$
This implies that the optimal controller can be computed by solving a classic LQG problem corresponding to nature's optimal strategy $\mathbb{P}^\star$, which can be done very efficiently through Kalman filtering and dynamic programming (see Appendix §A for details). It thus suffices to design an efficient algorithm for computing $\mathbb{P}^\star$, which is uniquely determined by the covariance matrices $(W^\star, V^\star)$ that solve problem (11). To this end, we first reformulate (11) as
$$\max_{W \in \mathcal{G}_W^+,\ V \in \mathcal{G}_V^+} f(W, V), \tag{12}$$
where we restrict $\mathcal{G}_W$ and $\mathcal{G}_V$ to $\mathcal{G}_W^+$ and $\mathcal{G}_V^+$, respectively, due to Proposition 4.1, and where $f(W, V)$ denotes the optimal value function of the inner minimization problem in (11). As (11) is a reformulation of (10) and as the family of all causal purified output-feedback controllers matches the family of causal output-feedback controllers, $f(W, V)$ can also be viewed as the optimal value of the classic LQG problem corresponding to the normal distribution $\mathbb{P}$ determined by the covariance matrices $W$ and $V$. These insights lead to the following structural result.
Proposition 4.2. $f(W, V)$ is concave and $\beta$-smooth in $(W, V) \in \mathcal{G}_W^+ \times \mathcal{G}_V^+$ for some $\beta > 0$.
By Proposition 4.2, it is possible to address problem (12) with a Frank-Wolfe algorithm [13,18,19,20,23,35]. Each iteration of this algorithm solves a direction-finding subproblem, that is, a variant of problem (12) that maximizes the first-order Taylor expansion of $f(W, V)$ around the current iterates:
$$\max_{L_W \in \mathcal{G}_W^+,\ L_V \in \mathcal{G}_V^+} \ \langle \nabla_W f(W, V), L_W - W \rangle + \langle \nabla_V f(W, V), L_V - V \rangle. \tag{13}$$
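The next iterates are then obtained by moving towards a maximizer $(L_W^\star, L_V^\star)$ of (13), i.e., we update $(W, V) \leftarrow (W, V) + \alpha \cdot (L_W^\star - W, L_V^\star - V)$, where $\alpha$ is an appropriate step size. The proposed Frank-Wolfe algorithm enjoys a very low per-iteration complexity because problem (13) is separable. To see this, we reformulate (13) as
$$\begin{aligned}
\max_{L_W, L_V} \quad & \langle \nabla_{X_0} f(W, V), L_{X_0} - X_0 \rangle + \sum_{t=0}^{T-1} \left( \langle \nabla_{W_t} f(W, V), L_{W_t} - W_t \rangle + \langle \nabla_{V_t} f(W, V), L_{V_t} - V_t \rangle \right) \\
\text{s.t.} \quad & \mathbb{G}(L_{X_0}, \hat{X}_0)^2 \le \rho_{x_0}^2,\ \mathbb{G}(L_{W_t}, \hat{W}_t)^2 \le \rho_{w_t}^2,\ \mathbb{G}(L_{V_t}, \hat{V}_t)^2 \le \rho_{v_t}^2 \quad \forall t \in [T-1], \\
& L_{X_0} \succeq \lambda_{\min}(\hat{X}_0) I,\ L_{W_t} \succeq \lambda_{\min}(\hat{W}_t) I,\ L_{V_t} \succeq \lambda_{\min}(\hat{V}_t) I \quad \forall t \in [T-1].
\end{aligned}$$
Hence, (13) decomposes into $2T + 1$ separate subproblems that can be solved in parallel. That is, for any matrix $Z \in \{X_0, W_0, \ldots, W_{T-1}, V_0, \ldots, V_{T-1}\}$ we solve a separate subproblem of the form
$$\max_{L_Z \succeq \lambda_{\min}(\hat{Z}) I} \ \left\{ \langle \nabla_Z f(W, V), L_Z - Z \rangle : \mathbb{G}(L_Z, \hat{Z})^2 \le \rho_z^2 \right\}. \tag{14}$$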
These subproblems can be reformulated as tractable SDPs and are thus amenable to efficient off-the-shelf solvers. By [38, Theorem 6.2], however, one can exploit the structure of the Gelbrich distance in order to reduce (14) to a univariate algebraic equation that can be solved to any desired accuracy by a highly efficient bisection algorithm. We say that $L_Z^\delta$ is a $\delta$-approximate solution of problem (14) for some $\delta \in (0, 1)$ if $L_Z^\delta$ is feasible in (14) and if
$$\langle \nabla_Z f(W, V), L_Z^\delta - Z \rangle \ge \delta \, \langle \nabla_Z f(W, V), L_Z^\star - Z \rangle,$$
where $L_Z^\star$ is an exact maximizer of (14). Note that, by the concavity of $f(W, V)$, the inner product on the right-hand side is nonnegative and vanishes if and only if $Z$ maximizes $f(W, V)$ over the feasible set of (14). For further details we refer to Appendix §E in the supplementary material.
Remark 1 (Automatic differentiation). Recall that $f(W, V)$ coincides with the optimal value of the LQG problem corresponding to the normal distribution $\mathbb{P}$ determined by the covariance matrices $W$ and $V$. By using the underlying dynamic programming equations, $f(W, V)$ can thus be expressed in closed form as a serial composition of $O(T)$ rational functions (see Appendix §A for details). Hence, $\nabla_Z f(W, V)$ can be calculated symbolically for any $Z \in \{X_0, W_0, \ldots, W_{T-1}, V_0, \ldots, V_{T-1}\}$ by repeatedly applying the chain and product rules. However, the resulting formulas are lengthy and cumbersome. We thus compute the gradients numerically using backpropagation. The cost of evaluating $\nabla_Z f(W, V)$ is then of the same order of magnitude as the cost of evaluating $f(W, V)$.
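To make the structure of $f(W, V)$ concrete, the sketch below evaluates the optimal cost of a classic LQG problem through a backward Riccati recursion and a forward Kalman covariance recursion. It is an illustration under assumptions, not the implementation from the source: the matrices and covariances are taken to be time-invariant, the function name `lqg_value` is hypothetical, and the formula is the textbook decomposition of the LQG cost rather than the one from Appendix §A, which is not reproduced here. The sketch uses NumPy and only evaluates $f$; gradients are not computed.

```python
import numpy as np

def lqg_value(A, B, C, Q, R, Q_T, X0, W, V, T):
    """Optimal expected cost of a classic LQG problem with zero-mean Gaussian noise.

    Assumes time-invariant A, B, C, Q, R and covariances X0, W, V for brevity.
    """
    # Backward Riccati recursion for the cost-to-go matrices P_t.
    P = [None] * (T + 1)
    P[T] = Q_T
    for t in range(T - 1, -1, -1):
        BtP = B.T @ P[t + 1]
        gain_term = A.T @ P[t + 1] @ B @ np.linalg.solve(R + BtP @ B, BtP @ A)
        P[t] = Q + A.T @ P[t + 1] @ A - gain_term

    # Forward Kalman recursion for the error covariances, accumulating the cost.
    value = np.trace(P[0] @ X0)
    Sigma_prior = X0
    for t in range(T):
        S = C @ Sigma_prior @ C.T + V
        Sigma_post = Sigma_prior - Sigma_prior @ C.T @ np.linalg.solve(S, C @ Sigma_prior)
        BtP = B.T @ P[t + 1]
        gain_term = A.T @ P[t + 1] @ B @ np.linalg.solve(R + BtP @ B, BtP @ A)
        value += np.trace(P[t + 1] @ W)             # cost of process noise
        value += np.trace(gain_term @ Sigma_post)   # cost of estimation error
        Sigma_prior = A @ Sigma_post @ A.T + W
    return float(value)

# Scalar sanity check with all data equal to 1 and horizon T = 1.
one = np.eye(1)
f_small = lqg_value(one, one, one, one, one, one, one, one, one, T=1)
f_noisier = lqg_value(one, one, one, one, one, one, one, one, 2 * one, T=1)
```

In the scalar check the value is 2.75, which can be confirmed by hand from the conditional mean $\mathbb{E}[x_0 \mid y_0] = y_0/2$ and the optimal input $u_0 = -y_0/4$. Every operation is a matrix product, sum, or linear solve, so the same recursion written in an automatic differentiation framework would yield the gradients by backpropagation.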
A detailed description of the proposed Frank-Wolfe method is given in Algorithm 1 below.
By [31,Theorem 1 and Lemma 7], which applies thanks to the structural properties of f (W, V ) established in Proposition 4.2, Algorithm 1 attains a suboptimality gap of ϵ within O(1/ϵ) iterations.
Algorithm 1 Frank-Wolfe algorithm for solving (12)
Input: initial iterates $W$, $V$, nominal covariance matrices $\hat{W}$, $\hat{V}$, oracle precision $\delta \in (0, 1)$
1: set initial iteration counter $k = 0$
2: while stopping criterion is not met do
3:   for $Z \in \{X_0, W_0, \ldots, W_{T-1}, V_0, \ldots, V_{T-1}\}$ do in parallel
4:     compute $\nabla_Z f(W, V)$
5:     find a $\delta$-approximate solution $L_Z^\delta$ of (14)
6:   end
7:   $g \leftarrow \langle \nabla_W f(W, V), L_W^\delta - W \rangle + \langle \nabla_V f(W, V), L_V^\delta - V \rangle$
8:   $(W, V) \leftarrow (W, V) + 2/(2 + k) \cdot (L_W^\delta - W, L_V^\delta - V)$
9:   $k \leftarrow k + 1$
10: end while
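The update rule of Algorithm 1 can be illustrated on a toy problem. The sketch below is a generic Frank-Wolfe loop with the step size $2/(2+k)$; it does not solve (12). The objective, the simplex feasible set, and the names `frank_wolfe_max`, `grad`, and `lmo` are hypothetical stand-ins for $f(W, V)$, the Gelbrich-constrained sets, and the oracle for (14).

```python
import numpy as np

def frank_wolfe_max(grad, lmo, z0, num_iters):
    """Maximize a concave smooth function with the step size 2 / (2 + k).

    grad(z) returns the gradient at z; lmo(g) returns a maximizer of <g, s>
    over the feasible set (the direction-finding oracle).
    """
    z = z0
    gap = np.inf
    for k in range(num_iters):
        g = grad(z)
        s = lmo(g)                          # direction-finding subproblem
        gap = g @ (s - z)                   # surrogate duality gap at z
        z = z + 2.0 / (2.0 + k) * (s - z)   # move towards the oracle's answer
    return z, gap

# Toy instance: maximize -||z - c||^2 over the unit simplex.
c = np.array([0.2, 0.3, 0.5])
grad = lambda z: -2.0 * (z - c)
lmo = lambda g: np.eye(len(g))[np.argmax(g)]   # best vertex of the simplex
z_fw, gap_fw = frank_wolfe_max(grad, lmo, np.array([1.0, 0.0, 0.0]), 2000)
```

Because every iterate is a convex combination of feasible points, no projection is needed, and the surrogate gap computed from the oracle's answer provides a natural stopping criterion.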

Section: Numerical Experiments
All experiments are run on an Intel i7-8700 CPU (3.2 GHz) machine with 16GB RAM. All linear SDP problems are modeled in Python 3.8.6 using CVXPY [1,14] and solved with MOSEK [37]. The gradients of $f(W, V)$ are computed via Pymanopt [48] with PyTorch's automatic differentiation module [39,40].
Consider a class of distributionally robust LQG problems with $n = m = p = 10$. We set $A_t = 0.1 \times A$, where $A$ has ones on the main diagonal and the superdiagonal and zeroes everywhere else ($A_{i,j} = 1$ if $i = j$ or $i = j - 1$ and $A_{i,j} = 0$ otherwise), and the other matrices to
$B_t = C_t = Q_t = R_t = I_d$.
The Wasserstein radii are set to $\rho_{x_0} = \rho_{w_t} = \rho_{v_t} = 10^{-1}$. The nominal covariance matrices of the exogenous uncertainties are constructed randomly and with eigenvalues in the interval [1,2] (so as to ensure they are positive definite). The code is publicly available in the GitHub repository https://github.com/RAO-EPFL/DR-Control.
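As a minimal sketch of this setup (not taken from the authors' repository), the time-invariant system matrices can be assembled as follows. The function name `experiment_matrices` is hypothetical.

```python
import numpy as np

def experiment_matrices(d=10):
    """System matrices of the experiment: A_t = 0.1 * A, B_t = C_t = Q_t = R_t = I_d."""
    A_base = np.eye(d) + np.eye(d, k=1)   # ones on the main diagonal and superdiagonal
    A_t = 0.1 * A_base
    identity = np.eye(d)
    return A_t, identity.copy(), identity.copy(), identity.copy(), identity.copy()

A_t, B_t, C_t, Q_t, R_t = experiment_matrices(10)
rho = 1e-1   # common Wasserstein radius for x_0, w_t and v_t
```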
The optimal value of the distributionally robust LQG problem (5) can be computed by directly solving the SDP reformulation of (11) with MOSEK or by solving the nonlinear SDP (12) with our Frank-Wolfe method detailed in Algorithm 1. We next compare these two approaches in 10 independent simulation runs, where we set a stopping criterion corresponding to an optimality gap below $10^{-3}$ and we run the Frank-Wolfe method with $\delta = 0.95$. Figure 1a illustrates the execution time for both approaches as a function of the planning horizon $T$; runs where MOSEK exceeds 100s are not reported. Figure 1b visualizes the empirical convergence behavior of the Frank-Wolfe algorithm. The results highlight that the Frank-Wolfe algorithm achieves running times that are uniformly lower than MOSEK across all problem horizons and is able to find highly accurate solutions already after a small number of iterations (50 iterations for problem instances of time horizon $T = 10$).

Section: Concluding Remarks and Limitations
In view of the popularity of LQG models, the results in this work carry important theoretical and practical implications. Despite considering a generalization of the classic LQG setting where the noise affecting the system dynamics and the observations follows unknown (and potentially non-Gaussian) distributions, our findings suggest that certain classic structural results continue to hold and that highly efficient methods can be adapted to tackle this more realistic (and more challenging) problem. Specifically, the facts that control policies depending linearly on observations continue to be optimal and that the worst-case distribution turns out to be Gaussian are surprising from a theoretical angle. They also have direct practical implications, because they allow leveraging the highly efficient Kalman filter in conjunction with dynamic programming and a Frank-Wolfe method to design an efficient computational procedure for solving the problem.
The results also raise several important questions that warrant future exploration. First, it would be highly relevant to consider extensions where the system matrices are also affected by uncertainty, as this captures many applications of practical interest in, e.g., reinforcement learning or revenue management. Second, it would be worth exploring an infinite horizon setting or relaxing the assumption that the nominal distribution is Gaussian, as both assumptions may be limiting the practical appeal of the framework. Third, one could also attempt to prove structural optimality results or design novel algorithms for generating high-quality suboptimal solutions for the more general setting involving constraints on states and/or control inputs. Lastly, one could improve the present algorithmic proposal by exploiting topological properties of the objective so as to guarantee linear convergence rates in the Frank-Wolfe procedure.

respectively. Similarly, the stacked matrices appearing in the linear dynamics and the measurement equations, $C \in \mathbb{R}^{pT \times n(T+1)}$, $G \in \mathbb{R}^{n(T+1) \times n(T+1)}$ and $H \in \mathbb{R}^{n(T+1) \times mT}$, are defined as
\[
C = \begin{bmatrix} C_0 & & & & 0 \\ & C_1 & & & 0 \\ & & \ddots & & \vdots \\ & & & C_{T-1} & 0 \end{bmatrix}, \quad
G = \begin{bmatrix} A^0_0 & & & \\ A^1_0 & A^1_1 & & \\ \vdots & & \ddots & \\ A^T_0 & A^T_1 & \cdots & A^T_T \end{bmatrix} \quad \text{and} \quad
H = \begin{bmatrix} 0 & & & \\ A^1_1 B_0 & 0 & & \\ A^2_1 B_0 & A^2_2 B_1 & 0 & \\ \vdots & & \ddots & 0 \\ A^T_1 B_0 & A^T_2 B_1 & \cdots & A^T_T B_{T-1} \end{bmatrix},
\]
respectively, where $A^t_s = \prod_{k=s}^{t-1} A_k$ for every $s < t$ and $A^t_s = I_n$ for $s = t$. Using the stacked system matrices, we can now express the purified observation process $\eta$ as a linear function of the exogenous uncertainties $w$ and $v$ that is not impacted by $u$; see also [5,46].

Lemma B.1. We have $\eta = Dw + v$, where $D = CG$.
Proof of Lemma B.1. The purified observation process is defined as $\eta = y - \hat{y}$. Recall now that the observations of the original system satisfy $y = Cx + v$. Similarly, one readily verifies that the observations of the fictitious noise-free system satisfy $\hat{y} = C\hat{x}$. Thus, we have $\eta = C(x - \hat{x}) + v$. Next, recall that the state of the original system satisfies $x = Hu + Gw$, and note that the state of the fictitious noise-free system satisfies $\hat{x} = Hu$. Combining all of these linear equations finally shows that $u$ cancels out and that $\eta = CGw + v = Dw + v$.
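Lemma B.1 can be checked numerically. The sketch below builds the stacked matrices for randomly generated stage matrices, simulates the original and the noise-free systems, and compares $y - \hat{y}$ with $Dw + v$. It assumes that the product defining $A^t_s$ is ordered as $A_{t-1} \cdots A_s$ and that the stacked noise vector is $w = (x_0, w_0, \dots, w_{T-1})$; the helper names are hypothetical.

```python
import numpy as np

def build_stacked_matrices(A, B, C):
    """A, B, C are lists of length T holding the stage matrices A_t, B_t, C_t."""
    T = len(A)
    n, m, p = A[0].shape[0], B[0].shape[1], C[0].shape[0]

    def transition(t, s):
        # A^t_s = A_{t-1} ... A_s for s < t, and the identity for s = t
        out = np.eye(n)
        for k in range(s, t):
            out = A[k] @ out
        return out

    G = np.zeros((n * (T + 1), n * (T + 1)))
    H = np.zeros((n * (T + 1), m * T))
    C_stack = np.zeros((p * T, n * (T + 1)))
    for t in range(T + 1):
        for s in range(t + 1):
            G[t*n:(t+1)*n, s*n:(s+1)*n] = transition(t, s)
        for j in range(t):
            H[t*n:(t+1)*n, j*m:(j+1)*m] = transition(t, j + 1) @ B[j]
    for t in range(T):
        C_stack[t*p:(t+1)*p, t*n:(t+1)*n] = C[t]
    return C_stack, G, H

def purified_observations(A, B, C, u, x0, w, v):
    """Simulate both systems and return eta = y - y_hat, stacked over time."""
    x, x_hat = x0, np.zeros_like(x0)
    eta = []
    for t in range(len(A)):
        eta.append(C[t] @ x + v[t] - C[t] @ x_hat)
        x = A[t] @ x + B[t] @ u[t] + w[t]
        x_hat = A[t] @ x_hat + B[t] @ u[t]
    return np.concatenate(eta)

rng = np.random.default_rng(0)
T, n, m, p = 4, 3, 2, 2
A = [rng.normal(size=(n, n)) for _ in range(T)]
B = [rng.normal(size=(n, m)) for _ in range(T)]
C = [rng.normal(size=(p, n)) for _ in range(T)]
u = [rng.normal(size=m) for _ in range(T)]
w = [rng.normal(size=n) for _ in range(T)]
v = [rng.normal(size=p) for _ in range(T)]
x0 = rng.normal(size=n)

C_stack, G, H = build_stacked_matrices(A, B, C)
D = C_stack @ G
eta_simulated = purified_observations(A, B, C, u, x0, w, v)
eta_linear = D @ np.concatenate([x0] + w) + np.concatenate(v)
```

The two vectors agree regardless of the inputs `u`, which is the content of the lemma.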

Section: C. Proofs


Section: C.1. Additional Technical Results
It is well known that every causal controller that is linear in the original observations $y$ can be reformulated as a causal controller that is linear in the purified observations $\eta$ and vice versa [5,46]. Perhaps surprisingly, however, the one-to-one transformation between the respective coefficients of $y$ and $\eta$ is not linear. To keep this paper self-contained, we review these insights in the next lemma.

Lemma C.1. If $u = U\eta + q$ for some $U \in \mathcal{U}$ and $q \in \mathbb{R}^{pT}$, then $u = U'y + q'$ for $U' = (I + UCH)^{-1}U$ and $q' = (I + UCH)^{-1}q$. Conversely, if $u = U'y + q'$ for some $U' \in \mathcal{U}$ and $q' \in \mathbb{R}^{pT}$, then $u = U\eta + q$ for $U = (I - U'CH)^{-1}U'$ and $q = (I - U'CH)^{-1}q'$.

Proof of Lemma C.1. If $u = U\eta + q$ for some $U \in \mathcal{U}$ and $q \in \mathbb{R}^{pT}$, then we have
\[
u = U\eta + q = U(y - \hat{y}) + q = Uy - UC\hat{x} + q = Uy - UCHu + q,
\]
where the second equality follows from the definition of $\eta$, the third equality holds because $\hat{y} = C\hat{x}$, and the last equality exploits our earlier insight that $\hat{x} = Hu$. The last expression depends only on $y$ and $u$. Solving for $u$ yields $u = U'y + q'$, where $U' = (I + UCH)^{-1}U$ and $q' = (I + UCH)^{-1}q$. Note that $I + UCH$ is indeed invertible because it is a lower triangular matrix with all diagonal entries equal to one, which ensures a determinant of one.
Similarly, if $u = U'y + q'$ for some $U' \in \mathcal{U}$ and $q' \in \mathbb{R}^{pT}$, then we have
\[
u = U'y + q' = U'(\eta + \hat{y}) + q' = U'\eta + U'C\hat{x} + q' = U'\eta + U'CHu + q'.
\]
Solving for $u$ yields $u = U\eta + q$, where $U = (I - U'CH)^{-1}U'$ and $q = (I - U'CH)^{-1}q'$. Note again that $I - U'CH$ is indeed invertible because it is a lower triangular matrix with all diagonal entries equal to one.
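The two transformations of Lemma C.1 are mutually inverse, which the following sketch verifies numerically for scalar inputs and outputs ($m = p = 1$). The strictly lower triangular matrix `CH` is a random stand-in for the product $CH$ (an assumption made only to keep the example short), and the function names are hypothetical.

```python
import numpy as np

def eta_to_y(U, q, CH):
    """Coefficients of u = U' y + q' from those of u = U eta + q."""
    M = np.eye(U.shape[0]) + U @ CH
    return np.linalg.solve(M, U), np.linalg.solve(M, q)

def y_to_eta(U_prime, q_prime, CH):
    """Coefficients of u = U eta + q from those of u = U' y + q'."""
    M = np.eye(U_prime.shape[0]) - U_prime @ CH
    return np.linalg.solve(M, U_prime), np.linalg.solve(M, q_prime)

rng = np.random.default_rng(1)
T = 5
U = np.tril(rng.normal(size=(T, T)))           # causal controller: lower triangular
CH = np.tril(rng.normal(size=(T, T)), k=-1)    # stand-in for C @ H: strictly lower triangular
q = rng.normal(size=T)

U_prime, q_prime = eta_to_y(U, q, CH)
U_back, q_back = y_to_eta(U_prime, q_prime, CH)
det_forward = np.linalg.det(np.eye(T) + U @ CH)
```

The map $U \mapsto (I + UCH)^{-1}U$ is nonlinear in $U$, which is why optimizing over purified-output controllers is more convenient.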

Section: C.2. Proofs of Section 3
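Proof of Proposition 3.2. In problem (8), both $u$ and $x$ are linear in $w$ and $v$, i.e., $u = q + UDw + Uv$ and $x = Hu + Gw = Hq + HUDw + HUv + Gw$. By substituting the linear representations of $u$ and $x$ into the objective function of problem (8), we obtain the following equivalent reformulation.
\[
\min_{\substack{q \in \mathbb{R}^{pT} \\ U \in \mathcal{U}}} \ \max_{\mathbb{P} \in \mathcal{G}} \ \mathbb{E}_{\mathbb{P}}\left[ w^\top \left( D^\top U^\top (R + H^\top Q H) U D + 2 D^\top U^\top H^\top Q G + G^\top Q G \right) w \right] + \mathbb{E}_{\mathbb{P}}\left[ v^\top U^\top (R + H^\top Q H) U v \right] + q^\top (R + H^\top Q H) q
\]
For any fixed $\mathbb{P} \in \mathcal{G}$, we can express the expectation in the objective function of the above problem in terms of the covariance matrices $W = \mathbb{E}_{\mathbb{P}}[ww^\top]$ and $V = \mathbb{E}_{\mathbb{P}}[vv^\top]$. Thus, the problem becomes
\[
\begin{array}{cl}
\displaystyle \min_{\substack{q \in \mathbb{R}^{pT} \\ U \in \mathcal{U}}} \ \max_{W, V, \mathbb{P}} & \operatorname{Tr}\left( \left( D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G \right) W \right) + \operatorname{Tr}\left( U^\top (R + H^\top Q H) U V \right) + q^\top (R + H^\top Q H) q \\
\text{s.t.} & \mathbb{P} \in \mathcal{G}, \quad W = \mathbb{E}_{\mathbb{P}}[ww^\top], \quad V = \mathbb{E}_{\mathbb{P}}[vv^\top].
\end{array}
\tag{A.18}
\]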
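Recall now the definition of $\mathcal{G}$, and note that the requirements $\mathbb{G}(X_0, \hat{X}_0) \leq \rho_{x_0}$, $\mathbb{G}(W_t, \hat{W}_t) \leq \rho_{w_t}$ and $\mathbb{G}(V_t, \hat{V}_t) \leq \rho_{v_t}$ are equivalent to the convex constraints $\mathbb{G}(X_0, \hat{X}_0)^2 \leq \rho_{x_0}^2$, $\mathbb{G}(W_t, \hat{W}_t)^2 \leq \rho_{w_t}^2$ and $\mathbb{G}(V_t, \hat{V}_t)^2 \leq \rho_{v_t}^2$, respectively, for all $t \in [T-1]$.
The definition of $\mathcal{G}$ also implies that
\[
W = \mathbb{E}_{\mathbb{P}}[ww^\top] = \operatorname{diag}(X_0, W_0, \dots, W_{T-1}) \quad \text{and} \quad V = \mathbb{E}_{\mathbb{P}}[vv^\top] = \operatorname{diag}(V_0, \dots, V_{T-1}).
\]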
Problem (9) thus constitutes a relaxation of problem (A.18). Indeed, the projection of the feasible set of the inner maximization problem in (A.18) onto $(W, V)$ is a subset of the feasible set of the inner maximization problem in (9). Moreover, for any $W$ and $V$ feasible in the inner maximization problem in (9), the distribution
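\[
\mathbb{P} = \mathbb{P}_{x_0} \otimes \left( \otimes_{t=0}^{T-1} \mathbb{P}_{w_t} \right) \otimes \left( \otimes_{t=0}^{T-1} \mathbb{P}_{v_t} \right) \quad \text{defined through} \quad \mathbb{P}_{x_0} = \mathcal{N}(0, X_0), \ \mathbb{P}_{w_t} = \mathcal{N}(0, W_t) \ \text{and} \ \mathbb{P}_{v_t} = \mathcal{N}(0, V_t), \ t \in [T-1],
\]
is feasible in the inner maximization problem in (A.18) with the same objective value. The relaxation is thus exact, and the optimal values of (8), (9) and (A.18) coincide.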
Proof of Proposition 3.4. Recall that the space U y of all causal output-feedback controllers coincides with the space U η of all causal purified output-feedback controllers. We can thus replace the feasible set U η of the inner minimization problem in (10) with U y . Hence, for any fixed P ∈ W N , the inner minimization problem in (10) constitutes a classic LQG problem. By standard LQG theory [8], it is solved by a linear output-feedback controller of the form u = U ′ y+q ′ for some U ′ ∈ U and q ′ ∈ R pT ; see also Appendix §A. Lemma C.1 shows, however, that any linear output-feedback controller can be equivalently expressed as a linear purified-output feedback controller of the form u = U η + q for some U ∈ U and q ∈ R pT . In summary, the above reasoning shows that the feasible set of the inner minimization problem in (10) can be reduced to the family of all linear purified-output feedback controllers without sacrificing optimality. Thus, problem (10) 
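is equivalent to
\[
\begin{array}{cl}
\displaystyle \max_{\mathbb{P} \in \mathcal{W}_{\mathcal{N}}} \ \min_{q, U, x, u} & \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \\
\text{s.t.} & U \in \mathcal{U}, \quad u = q + U\eta, \quad x = Hu + Gw.
\end{array}
\]
Using a similar reasoning as in the proof of Proposition 3.2, we can now substitute the linear representations of $u$ and $x$ into the objective function and reformulate the above problem as
\[
\begin{array}{cl}
\displaystyle \max_{W, V, \mathbb{P}} \ \min_{\substack{q \in \mathbb{R}^{pT} \\ U \in \mathcal{U}}} & \operatorname{Tr}\left( \left( D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G \right) W \right) + \operatorname{Tr}\left( U^\top (R + H^\top Q H) U V \right) + q^\top (R + H^\top Q H) q \\
\text{s.t.} & \mathbb{P} \in \mathcal{W}_{\mathcal{N}}, \quad W = \mathbb{E}_{\mathbb{P}}[ww^\top], \quad V = \mathbb{E}_{\mathbb{P}}[vv^\top].
\end{array}
\]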
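As $\mathcal{W}_{\mathcal{N}}$ contains only normal distributions, Proposition 3.3 implies that $\mathbb{W}(\mathbb{P}_{x_0}, \hat{\mathbb{P}}_{x_0}) = \mathbb{G}(X_0, \hat{X}_0)$, $\mathbb{W}(\mathbb{P}_{w_t}, \hat{\mathbb{P}}_{w_t}) = \mathbb{G}(W_t, \hat{W}_t)$ and $\mathbb{W}(\mathbb{P}_{v_t}, \hat{\mathbb{P}}_{v_t}) = \mathbb{G}(V_t, \hat{V}_t)$ for all $t \in [T-1]$. We may thus replace the requirement $\mathbb{W}(\mathbb{P}_{x_0}, \hat{\mathbb{P}}_{x_0}) \leq \rho_{x_0}$ in the definition of $\mathcal{W}_{\mathcal{N}}$ by $\mathbb{G}(X_0, \hat{X}_0) \leq \rho_{x_0}$, which is equivalent to the convex constraint $\mathbb{G}(X_0, \hat{X}_0)^2 \leq \rho_{x_0}^2$. The conditions on the marginal distributions of $w_t$ and $v_t$, $t \in [T-1]$, admit similar reformulations. The definition of $\mathcal{W}_{\mathcal{N}}$ also implies that
\[
W = \mathbb{E}_{\mathbb{P}}[ww^\top] = \operatorname{diag}(X_0, W_0, \dots, W_{T-1}) \quad \text{and} \quad V = \mathbb{E}_{\mathbb{P}}[vv^\top] = \operatorname{diag}(V_0, \dots, V_{T-1}).
\]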

Section: D. SDP Reformulation of the Lower Problem (11)
Instead of solving the dual problem (11) with the customized Frank-Wolfe algorithm of Section 4, it can be reformulated as an SDP amenable to off-the-shelf solvers. This reformulation is obtained by dualizing the inner minimization problem and by exploiting the following preliminary lemma. Lemma D.1. For any Ẑ ∈ S d + and ρ z ≥ 0, the set
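\[
\mathcal{G}_Z = \{ Z \in \mathbb{S}^d_+ : \mathbb{G}(Z, \hat{Z}) \leq \rho_z \}
\]
coincides with
\[
\left\{ Z \in \mathbb{S}^d_+ : \exists E_z \in \mathbb{S}^d_+ \ \text{with} \ \operatorname{Tr}(Z + \hat{Z} - 2E_z) \leq \rho_z^2, \ \begin{bmatrix} \hat{Z}^{\frac{1}{2}} Z \hat{Z}^{\frac{1}{2}} & E_z \\ E_z & I \end{bmatrix} \succeq 0 \right\}.
\]
Proof of Lemma D.1. By Definition 2, we have
\[
\mathcal{G}_Z = \left\{ Z \in \mathbb{S}^d_+ : \operatorname{Tr}\left( Z + \hat{Z} - 2 \left( \hat{Z}^{\frac{1}{2}} Z \hat{Z}^{\frac{1}{2}} \right)^{\frac{1}{2}} \right) \leq \rho_z^2 \right\}.
\]
Next, introduce an auxiliary variable $E_z \in \mathbb{S}^d_+$ subject to the matrix inequality $E_z^2 \preceq \hat{Z}^{\frac{1}{2}} Z \hat{Z}^{\frac{1}{2}}$. By [4, Theorem 1], this inequality can be recast as
\[
E_z \preceq \left( \hat{Z}^{\frac{1}{2}} Z \hat{Z}^{\frac{1}{2}} \right)^{\frac{1}{2}}.
\]
Hence, we can reformulate the nonlinear matrix inequality in the above representation of $\mathcal{G}_Z$ as $\operatorname{Tr}(Z + \hat{Z} - 2E_z) \leq \rho_z^2$. A standard Schur complement argument reveals that the inequality $E_z^2 \preceq \hat{Z}^{\frac{1}{2}} Z \hat{Z}^{\frac{1}{2}}$ is also equivalent to
\[
\begin{bmatrix} \hat{Z}^{\frac{1}{2}} Z \hat{Z}^{\frac{1}{2}} & E_z \\ E_z & I \end{bmatrix} \succeq 0.
\]
The claim then follows by combining all of these insights.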
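The construction in Lemma D.1 can be sanity-checked numerically. The sketch below evaluates the Gelbrich distance directly and then verifies that the choice $E_z = (\hat{Z}^{1/2} Z \hat{Z}^{1/2})^{1/2}$ satisfies the linear matrix inequality while making the trace constraint tight. The function names are hypothetical, and the matrix square root is computed by an eigendecomposition rather than by a dedicated routine.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh((S + S.T) / 2)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gelbrich(Z, Z_hat):
    """Gelbrich distance between two covariance matrices."""
    root = psd_sqrt(Z_hat)
    cross = psd_sqrt(root @ Z @ root)
    return np.sqrt(max(np.trace(Z + Z_hat - 2 * cross), 0.0))

def lmi_certificate(Z, Z_hat):
    """Return E_z = (Z_hat^{1/2} Z Z_hat^{1/2})^{1/2} and the block matrix of Lemma D.1."""
    root = psd_sqrt(Z_hat)
    inner = root @ Z @ root
    E = psd_sqrt(inner)
    block = np.block([[inner, E], [E, np.eye(Z.shape[0])]])
    return E, block

rng = np.random.default_rng(2)
d = 3
F1, F2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Z, Z_hat = F1 @ F1.T + np.eye(d), F2 @ F2.T + np.eye(d)

E_z, block = lmi_certificate(Z, Z_hat)
trace_value = np.trace(Z + Z_hat - 2 * E_z)
min_eig = np.linalg.eigvalsh(block).min()
```

For commuting matrices the distance reduces to a comparison of square roots, e.g. $\mathbb{G}(4I, I)^2 = d(2 - 1)^2$.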
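We are now ready to derive the desired SDP reformulation of problem (11).

where $M_{t,s} \in \mathbb{R}^{m \times p}$ for every $t, s \in \mathbb{Z}$ with $1 \leq t < s \leq T$.

Proposition D.2. If $\hat{V} \succ 0$, then problem (11) is equivalent to the SDP
\[
\begin{array}{cl}
\max & \operatorname{Tr}(G^\top Q G W) - \operatorname{Tr}\left( F (R + H^\top Q H)^{-1} \right) \\[1ex]
\text{s.t.} & W \in \mathbb{S}^{n(T+1)}_+, \ V \in \mathbb{S}^{pT}_+, \ M \in \mathcal{M}, \ F \in \mathbb{S}^{Tm}_+ \\[1ex]
& E_{x_0} \in \mathbb{S}^n_+, \ E_{w_t} \in \mathbb{S}^n_+, \ E_{v_t} \in \mathbb{S}^p_+ \quad \forall t \in [T-1] \\[1ex]
& \operatorname{Tr}(W_0 + \hat{X}_0 - 2E_{x_0}) \leq \rho_{x_0}^2, \ \operatorname{Tr}(W_{t+1} + \hat{W}_t - 2E_{w_t}) \leq \rho_{w_t}^2, \ \operatorname{Tr}(V_t + \hat{V}_t - 2E_{v_t}) \leq \rho_{v_t}^2 \quad \forall t \in [T-1] \\[1ex]
& \begin{bmatrix} \hat{X}_0^{\frac{1}{2}} X_0 \hat{X}_0^{\frac{1}{2}} & E_{x_0} \\ E_{x_0} & I_n \end{bmatrix} \succeq 0, \ \begin{bmatrix} \hat{W}_t^{\frac{1}{2}} W_{t+1} \hat{W}_t^{\frac{1}{2}} & E_{w_t} \\ E_{w_t} & I_n \end{bmatrix} \succeq 0, \ \begin{bmatrix} \hat{V}_t^{\frac{1}{2}} V_t \hat{V}_t^{\frac{1}{2}} & E_{v_t} \\ E_{v_t} & I_p \end{bmatrix} \succeq 0 \quad \forall t \in [T-1] \\[2ex]
& \begin{bmatrix} F & H^\top Q G W D^\top + M/2 \\ (H^\top Q G W D^\top + M/2)^\top & D W D^\top + V \end{bmatrix} \succeq 0 \\[2ex]
& W_0 \succeq \lambda_{\min}(\hat{X}_0) I, \ W_{t+1} \succeq \lambda_{\min}(\hat{W}_t) I, \ V_t \succeq \lambda_{\min}(\hat{V}_t) I \quad \forall t \in [T-1].
\end{array}
\tag{A.20}
\]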
Proof of Proposition D.2. The proof relies on dualizing the inner minimization problem in (11). Note that strong duality holds because the primal problem is trivially feasible and involves only equality constraints, which implies that any feasible point is in fact a Slater point. In the following we use M ∈ M to denote the Lagrange multiplier of the constraint U ∈ U, which requires all blocks of
In practice, we need to solve the algebraic equation (A.24) numerically. The numerical error in approximating $\gamma^\star$ should be contained to ensure that $L^\star$ approximates the exact maximizer of problem (A.23). The next proposition shows that, for any tolerance $\delta \in (0, 1)$, a $\delta$-approximate solution of (A.23) can be computed with an efficient bisection algorithm.

   if $\frac{\mathrm{d}\phi}{\mathrm{d}\gamma}(\gamma) < 0$ then set $\underline{\gamma} \leftarrow \gamma$ else $\overline{\gamma} \leftarrow \gamma$ endif
6: until $\frac{\mathrm{d}\phi}{\mathrm{d}\gamma}(\gamma) > 0$ and $\langle L - Z, \Gamma_Z \rangle \geq \delta \phi(\gamma)$
Output: $L$

In summary, for any $Z \in \{X_0, W_0, \dots, W_{T-1}, V_0, \dots, V_{T-1}\}$, Algorithm A.2 computes a $\delta$-approximate solution to the direction-finding subproblem (14) with $\Gamma_Z = \nabla_Z f(W, V)$.
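The bisection scheme can be sketched as follows. This is a reconstruction from the extracted fragments of Algorithm A.2 and Proposition E.2, not the authors' code. It assumes that the initial bracket is $\underline{\gamma} = \lambda_1 (1 + (p_1^\top \hat{Z} p_1)^{1/2} / \rho)$ and $\overline{\gamma} = \lambda_1 (1 + \operatorname{Tr}(\hat{Z})^{1/2} / \rho)$, and it uses the derivative $\phi'(\gamma) = \rho^2 - \langle \hat{Z}, (I - \gamma(\gamma I - \Gamma_Z)^{-1})^2 \rangle$ obtained by differentiating $\phi$. The function name and the iteration cap are hypothetical.

```python
import numpy as np

def bisection_oracle(Z_hat, rho, Z_ref, Gamma, delta=0.95, max_iter=200):
    """Approximately maximize <Gamma, L - Z_ref> over the Gelbrich ball around Z_hat."""
    d = Z_hat.shape[0]
    I = np.eye(d)
    eigvals, eigvecs = np.linalg.eigh(Gamma)
    lam1, p1 = eigvals[-1], eigvecs[:, -1]
    lower = lam1 * (1.0 + np.sqrt(p1 @ Z_hat @ p1) / rho)      # assumed lower bracket
    upper = lam1 * (1.0 + np.sqrt(np.trace(Z_hat)) / rho)      # assumed upper bracket

    for _ in range(max_iter):
        gamma = 0.5 * (lower + upper)
        inv = np.linalg.inv(gamma * I - Gamma)
        L = gamma ** 2 * inv @ Z_hat @ inv
        # dual objective phi(gamma) and its derivative
        phi = gamma * (rho ** 2 + np.trace((gamma * inv - I) @ Z_hat)) - np.trace(Z_ref @ Gamma)
        S = I - gamma * inv
        slope = rho ** 2 - np.trace(S @ S @ Z_hat)
        if slope < 0:
            lower = gamma
        else:
            upper = gamma
        if slope > 0 and np.trace((L - Z_ref) @ Gamma) >= delta * phi:
            return L
    raise RuntimeError("bisection did not reach the requested precision")

rng = np.random.default_rng(3)
d, rho = 4, 0.5
F1, F2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Z_hat = F1 @ F1.T + np.eye(d)     # nominal covariance, positive definite
Gamma = F2 @ F2.T                 # stand-in for a gradient matrix, positive semidefinite
L_delta = bisection_oracle(Z_hat, rho, Z_ref=Z_hat, Gamma=Gamma)
```

A positive slope certifies that the returned matrix lies strictly inside the Gelbrich ball, because $\phi'(\gamma) = \rho^2 - \mathbb{G}(L, \hat{Z})^2$ for this choice of $L$.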

Section: F. Additional Information on Experiments
Generation of Nominal Covariance Matrices. The nominal covariance matrices of the exogenous uncertainties are constructed randomly using the following procedure. For each exogenous uncertainty z ∈ {x 0 , w 0 , . . . , w T -1 , v 0 , . . . , v T -1 }, we denote the dimension of z by d and sample a matrix M Z ∈ R d×d from the uniform distribution on the hypercube [0, 1] d×d . Next, we define Ξ Z ∈ R d×d as the orthogonal matrix whose columns represent the orthonormal eigenvectors of the symmetric matrix M Z + M ⊤ Z . Finally, we set Ẑ = Ξ Z Λ Z Ξ ⊤ Z , where Λ Z is a diagonal matrix whose main diagonal is sampled uniformly from the interval [1,2] d . The rationale for adopting this cumbersome procedure is to ensure that the covariance matrix Ẑ is positive definite.
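The procedure above can be sketched in a few lines. The function name `nominal_covariance` and the use of a NumPy random generator are assumptions made for illustration; the authors' repository may implement the sampling differently.

```python
import numpy as np

def nominal_covariance(d, rng):
    """Random covariance matrix with eigenvalues drawn uniformly from [1, 2]."""
    M = rng.uniform(0.0, 1.0, size=(d, d))     # entries uniform on [0, 1]
    _, Xi = np.linalg.eigh(M + M.T)            # orthonormal eigenvectors of M + M^T
    lam = rng.uniform(1.0, 2.0, size=d)        # diagonal of Lambda_Z
    return (Xi * lam) @ Xi.T                   # Xi @ diag(lam) @ Xi.T

rng = np.random.default_rng(4)
Z_nominal = nominal_covariance(10, rng)
eigs_nominal = np.linalg.eigvalsh(Z_nominal)
```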
Optimality Gap. The optimality gap of the Frank-Wolfe algorithm visualized in Figure 1b is calculated as the sum of the surrogate optimality gaps ⟨L δ Z -Z, ∇ Z f (W, V )⟩ across all Z ∈ {X 0 , W 0 . . . , W T -1 , V 0 , . . . , V T -1 }. For more information on the surrogate optimality gaps see [31].

Section: 
Acknowledgements. This research was supported by the Swiss National Science Foundation under the NCCR Automation, grant agreement 51NF40_180545. Dan A. Iancu would like to acknowledge INSEAD for financial support during the duration of the project.

Section: Appendix
The supplementary material is structured as follows. Appendix §A presents the well-known solution to the classic LQG problem using dynamic programming and Kalman filter estimation. Appendix §B provides the definitions of the stacked system matrices utilized in the compact formulation (5) of the distributionally robust LQG problem. Appendix §C contains the proofs of the formal statements in the main text and provides additional technical results. Appendix §D derives the SDP reformulation of the dual problem (11). Appendix §E, finally, elaborates on the bisection algorithm used for solving the linearization oracle of the Frank-Wolfe algorithm.

Section: A. Solution of the LQG Problem
The classic LQG problem can be solved efficiently via dynamic programming; see, e.g., [8]. That is, the unique optimal control inputs satisfy $u^\star_t = K_t \hat{x}_t$ for every $t \in [T-1]$, where $K_t \in \mathbb{R}^{m \times n}$ is the optimal feedback gain matrix, and $\hat{x}_t = \mathbb{E}_{\mathbb{P}}[x_t \mid y_0, \dots, y_t]$ is the minimum mean-squared-error estimator of $x_t$ given the observation history up to time $t$. Thanks to the celebrated separation principle, $K_t$ can be computed by pretending that the system is deterministic and allows for perfect state observations, and $\hat{x}_t$ can be computed while ignoring the control problem.
To compute K t , one first solves the deterministic LQR problem corresponding to the LQG problem at hand. Its value function x ⊤ t P t x t at time t is quadratic in x t , and P t obeys the backward recursion
The optimal feedback gain matrix K t can then be computed from P t+1 as
Since x t and (y 0 , . . . , y t ) are jointly normally distributed, the minimum mean-squared-error estimator xt can be calculated directly using the formula for the mean of a conditional normal distribution.
Alternatively, however, one can use the Kalman filter to compute xt recursively, which is significantly more insightful and efficient. The Kalman filter also recursively computes the covariance matrix Σ t of x t conditional on y 0 , . . . , y t and the covariance matrix Σ t+1|t of x t+1 conditional on y 0 , . . . , y t evaluated under P. Specifically, these covariance matrices obey the forward recursion
initialized by Σ 0|-1 = X 0 . Using Σ t|t-1 , we then define the Kalman filter gain as
which allows us to compute the minimum mean-squared-error estimator via the forward recursion
initialized by x0 = L 0 y 0 . One can also show that the optimal value of the LQG problem amounts to
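The recursions referenced in this appendix can be sketched in their standard textbook form. The code below is an illustration under stated assumptions, not a transcription of the paper's equations: it uses the usual Riccati recursion with terminal condition $P_T = Q_T$, the sign convention $u_t = K_t \hat{x}_t$, the standard Kalman covariance updates initialized by $\Sigma_{0|-1} = X_0$, and the well-known closed-form expression for the optimal cost. The function name `lqg_value` is hypothetical.

```python
import numpy as np

def lqg_value(A, B, C, Q, R, W, V, X0):
    """Optimal LQG cost and feedback gains; Q has length T + 1, all other lists length T."""
    T = len(A)
    # Backward Riccati recursion of the deterministic LQR problem.
    P = [None] * (T + 1)
    K = [None] * T
    P[T] = Q[T]
    for t in reversed(range(T)):
        S = R[t] + B[t].T @ P[t + 1] @ B[t]
        K[t] = -np.linalg.solve(S, B[t].T @ P[t + 1] @ A[t])
        P[t] = Q[t] + A[t].T @ P[t + 1] @ A[t] + A[t].T @ P[t + 1] @ B[t] @ K[t]
    # Forward Kalman covariance recursion and accumulation of the cost.
    Sigma_pred = X0                                   # Sigma_{0|-1}
    value = np.trace(P[0] @ X0)
    for t in range(T):
        innovation_cov = C[t] @ Sigma_pred @ C[t].T + V[t]
        gain = Sigma_pred @ C[t].T @ np.linalg.inv(innovation_cov)   # Kalman gain
        Sigma = Sigma_pred - gain @ C[t] @ Sigma_pred                # Sigma_t
        Sigma_pred = A[t] @ Sigma @ A[t].T + W[t]                    # Sigma_{t+1|t}
        value += np.trace((Q[t] - P[t]) @ Sigma) + np.trace(P[t + 1] @ Sigma_pred)
    return value, K

# Scalar one-step example with all system matrices and covariances equal to one.
one = np.eye(1)
value_demo, K_demo = lqg_value([one], [one], [one], [one, one], [one], [one], [one], one)
```

In the scalar example the cost can be verified by hand: the gain is $-1/2$, the estimator is $\hat{x}_0 = y_0 / 2$, and the expected cost evaluates to $2.75$. Since the value is a composition of rational functions of the covariance matrices, it is amenable to automatic differentiation.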

Section: B. Definitions of Stacked System Matrices
The stacked system matrices appearing in the distributionally robust LQG problem (5) are defined as follows. First, the stacked state and input cost matrices Q ∈ S n(T +1) and R ∈ S mT are set to
Thus, the feasible set of the outer maximization problem in (11) constitutes a relaxation of that in (10). One readily verifies that the relaxation is exact by using similar arguments as in the proof of Proposition 3.2. Thus, the claim follows.
Proof of Theorem 3.5. By Proposition 3.2, $p^\star$ coincides with the minimum of (9). Similarly, by Proposition 3.4, $d^\star$ coincides with the maximum of (11). Note that problems (9) and (11) only differ by the order of minimization and maximization. Note also that $\mathcal{U}$ is convex and closed, $\mathcal{G}_W$ and $\mathcal{G}_V$ are convex and compact by virtue of [38, Lemma A.6], and the (identical) trace terms in (9) and (11) are bilinear in $(W, V)$ and $(U, q)$. The claim thus follows from Sion's minimax theorem [45].

Section: C.3. Proofs of Section 4
Note that Proposition 4.1 is consistent with Corollary 3 because the optimal LQG controller corresponding to P ⋆ is linear in the past observations.
Proof of Proposition 4.1. By [38, Lemma A.3], the inner problem in (9) admits a maximizer
. Thus, the optimal value of problem (9) and its strong dual (11) does not change if we restrict $\mathcal{G}_W$ and $\mathcal{G}_V$ to $\mathcal{G}^+_W$ and $\mathcal{G}^+_V$, respectively. We may thus conclude that problem (11) has a maximizer
. This in turn implies that problem (6) is solved by a normal distribution $\mathbb{P}^\star$ under which the covariance matrix of the observation noise $v_t$ satisfies $V^\star_t \succ 0$ for every $t \in [T-1]$. As (5) and (6) are strong duals, the optimal solution $u^\star$ of problem (5) forms a Nash equilibrium with $\mathbb{P}^\star$, i.e., $u^\star$ is a best response to $\mathbb{P}^\star$ and thus solves the classic LQG problem corresponding to $\mathbb{P}^\star$. As $R_t \succ 0$ for every $t \in [T-1]$, this best response $u^\star$ is unique, and as $V^\star_t \succ 0$ for every $t \in [T-1]$, $u^\star$ is in fact the Kalman filter-based optimal output-feedback strategy corresponding to $\mathbb{P}^\star$ (which can be obtained using the techniques highlighted in Appendix §A).

Before proving Proposition 4.2, recall that
Proof of Proposition 4.2. The function $f(W, V)$ is concave because the objective function of the inner minimization problem in (11) is linear (and hence concave) in $W$ and $V$ and because concavity is preserved under minimization. To prove that $f(W, V)$ is $\beta$-smooth, we first recall from Proposition 3.3 that it coincides with the optimal value of the inner minimization problem in (10). As $\mathcal{U}_\eta = \mathcal{U}_y$, $f(W, V)$ can thus be viewed as the optimal value of the classic LQG problem corresponding to the normal distribution $\mathbb{P}$ determined by the covariance matrices $W$ and $V$. Hence, $f(W, V)$ coincides with (A.17), where $\Sigma_t$, for $t \in [T-1]$, is a function of $(W, V)$ defined recursively through the Kalman filter equations (A.16). Note that all inverse matrices in (A.16) are well-defined because any $V \in \mathcal{G}^+_V$ is strictly positive definite. Therefore, $\Sigma_t$ constitutes a proper rational function (that is, a ratio of two polynomials with the polynomial in the denominator being strictly positive) for every
if and only if the largest eigenvalue of the Hessian matrix of -f (W, V ) is bounded above by β throughout G + W × G + V . Also, the largest eigenvalue of the positive semidefinite Hessian matrix ∇ 2 (-f (W, V )) coincides with the spectral norm of ∇ 2 f (W, V ). We may thus set
where $\| \cdot \|_2$ denotes the spectral norm. The supremum in the above maximization problem is finite and attained thanks to Weierstrass' theorem, which applies because $f(W, V)$ is twice continuously differentiable and the spectral norm is continuous, while the sets $\mathcal{G}^+_W$ and $\mathcal{G}^+_V$ are compact by virtue of [38, Lemma A.6]. This observation completes the proof.

the matrix $U$ above the main diagonal to vanish. The Lagrangian function of the inner minimization problem in (11) can therefore be represented as
Recall now that $R \succ 0$ and $Q \succeq 0$, and thus $R + H^\top Q H \succ 0$. Consequently, $L$ is minimized by $q^\star = 0$ for any fixed $U$ and $M$. In addition, the partial gradient of $L$ with respect to $U$ is given by
Recall also that V ∈ G + V is strictly positive, which implies that DW D ⊤ + V ≻ 0 is invertible. As we already know that R + H ⊤ QH ≻ 0 is invertible, as well, L is minimized by
for any fixed M . Substituting both q ⋆ and U ⋆ into L yields the dual objective function
The dual of the inner minimization problem in (11) is thus given by max M ∈M g(M ). To linearize the dual objective function, we next introduce an auxiliary variable F ∈ S mT + subject to the matrix inequality F ⪰ (H ⊤ QGW D ⊤ + M/2)(DW D ⊤ + V ) -1 (H ⊤ QGW D ⊤ + M/2) ⊤ . By using a standard Schur complement reformulation, we can then rewrite the dual problem as
Next, by replacing the inner problem in (11) with its strong dual (A.21), we can reformulate (11) as 

Section: E. Bisection Algorithm for the Linearization Oracle
We now show that the direction-finding subproblem (14) can be solved efficiently via bisection. To this end, we first establish that (14) can be reduced to the solution of a univariate algebraic equation.

in the interval $(\lambda_{\max}(\Gamma_Z), \infty)$.


References:
[b0] Akshay Agrawal; Robin Verschueren; Steven Diamond; Stephen Boyd (2018). A rewriting system for convex optimization problems. Journal of Control and Decision
[b1] Rohit Arora; Rui Gao (2022). Data-driven multistage distributionally robust optimization with nested distance: Time consistency and tractable dynamic reformulations. Available at Optimization Online
[b2] François Auger; Mickael Hilairet; Josep M Guerrero; Eric Monmasson; Teresa Orlowska-Kowalska; Seiichiro Katsura (2013). Industrial applications of the Kalman filter: A review. IEEE Transactions on Industrial Electronics
[b3] Richard Bellman (1968). Some inequalities for the square root of a positive definite matrix. Linear Algebra and its Applications
[b4] Aharon Ben-Tal; Stephen Boyd; Arkadi Nemirovski (2005). Control of uncertainty-affected discrete time linear systems via convex programming. Available at Optimization Online
[b5] Aharon Ben-Tal; Stephen Boyd; Arkadi Nemirovski (2006). Extending scope of robust optimization: Comprehensive robust counterparts of uncertain problems. Mathematical Programming
[b6] Denis S Bernstein; Wassim M Haddad (1988). LQG control with an H ∞ performance bound: A Riccati equation approach. 
[b7] Dimitri Bertsekas (2017). Dynamic Programming and Optimal Control. Athena Scientific
[b8] Dimitris Bertsimas; Vineet Goyal (2012). On the power and limitations of affine policies in two-stage adaptive optimization. Mathematical Programming
[b9] Dimitris Bertsimas; Dan A Iancu; Pablo A Parrilo (2010). Optimality of affine policies in multistage robust optimization. Mathematics of Operations Research
[b10] Dimitris Bertsimas; Dan A Iancu; Pablo A Parrilo (2011). A hierarchy of near-optimal policies for multistage adaptive optimization. IEEE Transactions on Automatic Control
[b11] Shen-Yong Chen (2012). Kalman filter for robot vision: A survey. IEEE Transactions on Industrial Electronics
[b12] Vladimir F Demyanov; Aleksandr M Rubinov (1970). Approximate Methods in Optimization Problems. Elsevier
[b13] Steven Diamond; Stephen Boyd (2016). CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research
[b14] John Doyle (1978). Guaranteed margins for LQG regulators. IEEE Transactions on Automatic Control
[b15] John Doyle; Keith Glover; Pramod Khargonekar; Bruce Francis (1988). State-space solutions to standard H 2 and H ∞ control problems. 
[b16] John Doyle; Kemin Zhou; Bobby Bodenheimer (1989). Optimal control with mixed H 2 and H ∞ performance objectives. 
[b17] Joseph C Dunn (1979). Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization
[b18] Joseph C Dunn (1980). Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM Journal on Control and Optimization
[b19] Joseph C Dunn; S Harshbarger (1978). Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications
[b20] Omar El Housni; Vineet Goyal (2021). On the optimality of affine policies for budgeted uncertainty sets. Mathematics of Operations Research
[b21] Hans Föllmer; Alexander Schied (2011). Stochastic Finance: An Introduction in Discrete Time. de Gruyter
[b22] Marguerite Frank; Philip Wolfe (1956). An algorithm for quadratic programming. Naval Research Logistics
[b23] Matthias Gelbrich (1990). On a formula for the L 2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten
[b24] Angelos Georghiou; Angelos Tsoukalas; Wolfram Wiesemann (2021). On the optimality of affine decision rules in robust and distributionally robust optimization. Available at Optimization Online
[b25] Clark R Givens; Rae M Shortt (1984). A class of Wasserstein metrics for probability distributions. Michigan Mathematical Journal
[b26] Michael J Hadjiyiannis; Paul J Goulart; Daniel Kuhn (2011). An efficient method to estimate the suboptimality of affine controllers. IEEE Transactions on Automatic Control
[b27] Bingyan Han (2023). Distributionally robust Kalman filtering with volatility uncertainty. 
[b28] Lars Peter Hansen; Thomas J Sargent (2005). Robust estimation and control under commitment. Journal of Economic Theory
[b29] Dan A Iancu; Mayank Sharma; Maxim Sviridenko (2013). Supermodularity and affine policies in dynamic robust optimization. Operations Research
[b30] Martin Jaggi (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. 
[b31] Aren Karapetyan; Andrea Iannelli; John Lygeros (2022). On the regret of H ∞ control. 
[b32] Kihyun Kim; Insoon Yang (2023). Distributional robustness in minimax linear quadratic control with Wasserstein distance. SIAM Journal on Control and Optimization
[b33] Georgios Kotsalis; Guanghui Lan; Arkadi S Nemirovski (2021). Convex optimization for finite-horizon robust covariance control of linear stochastic systems. SIAM Journal on Control and Optimization
[b34] Evgenii S Levitin; Boris T Polyak (1966). Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics
[b35] Peyman Mohajerin Esfahani; Daniel Kuhn (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming
[b36] MOSEK ApS (2019). The MOSEK Optimization Toolbox. 
[b37] Viet Anh Nguyen; Soroosh Shafieezadeh-Abadeh; Daniel Kuhn; Peyman Mohajerin Esfahani (2023). Bridging Bayesian and minimax mean square error estimation via Wasserstein distributionally robust optimization. Mathematics of Operations Research
[b38] Adam Paszke; Sam Gross; Soumith Chintala; Gregory Chanan; Edward Yang; Zachary Devito; Zeming Lin; Alban Desmaison; Luca Antiga; Adam Lerer (2017). Automatic differentiation in PyTorch. 
[b39] Adam Paszke; Sam Gross; Francisco Massa; Adam Lerer; James Bradbury; Gregory Chanan; Trevor Killeen; Zeming Lin; Natalia Gimelshein; Luca Antiga; Alban Desmaison; Andreas Kopf; Edward Yang; Zachary Devito; Martin Raison; Alykhan Tejani; Sasank Chilamkurthy; Benoit Steiner; Lu Fang; Junjie Bai; Soumith Chintala (2019). PyTorch: An imperative style, high-performance deep learning library. 
[b40] Ian R Petersen; Matthew R James; Paul Dupuis (2000). Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control
[b41] Gabriel Peyré; Marco Cuturi (2019). Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning
[b42] R Tyrrell Rockafellar (1974). Conjugate Duality and Optimization. SIAM
[b43] Soroosh Shafieezadeh-Abadeh; Viet Anh Nguyen; Daniel Kuhn; Peyman Mohajerin Esfahani (2018). Wasserstein distributionally robust Kalman filtering. 
[b44] Maurice Sion (1958). On general minimax theorems. Pacific Journal of Mathematics
[b45] Joelle Skaf; Stephen P Boyd (2010). Design of affine controllers via convex optimization. IEEE Transactions on Automatic Control
[b46] Emanuel Todorov; Michael I Jordan (2002). Optimal feedback control as a theory of motor coordination. Nature Neuroscience
[b47] James Townsend; Niklas Koep; Sebastian Weichwald (2016). Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research
[b48] Bart P G Van Parys; Paul J Goulart; Manfred Morari (2013). Infinite horizon performance bounds for uncertain constrained systems. IEEE Transactions on Automatic Control
[b49] Bart P G Van Parys; Daniel Kuhn; Paul J Goulart; Manfred Morari (2016). Distributionally robust control of constrained stochastic systems. IEEE Transactions on Automatic Control
[b50] Peter Whittle (1981). Risk-sensitive linear/quadratic/Gaussian control. Advances in Applied Probability
[b51] Insoon Yang (2021). Wasserstein distributionally robust stochastic control: A data-driven approach. IEEE Transactions on Automatic Control
[b52] Kemin Zhou; John C Doyle (1998). Essentials of Robust Control. Prentice Hall

Figures:
Figure fig_0: 41
Type: figure
Caption: Proposition 4.1 (Optimality of Kalman filter-based feedback controllers). If $\hat{V}_t \succ 0$ for all $t \in [T-1]$,
Data: 

Figure fig_1: 101
Type: figure
Caption: Figure 1: (a) Execution time for MOSEK and Frank-Wolfe algorithm over 10 simulation runs as a function of the horizon T (solid lines show the mean and the shaded areas correspond to 1 standard deviation). (b) Convergence of optimality gap for Frank-Wolfe algorithm with horizon T = 10.
Data: 

Figure fig_2: 
Type: figure
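Caption: $W_0 \succeq \lambda_{\min}(\hat{X}_0) I$, $W_{t+1} \succeq \lambda_{\min}(\hat{W}_t) I$, $V_t \succeq \lambda_{\min}(\hat{V}_t) I$ $\forall t \in [T-1]$. (A.20) Here, $\mathcal{M}$ denotes the set of all strictly upper block triangular matrices of the form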
Data: 

Figure fig_3: 23811
Type: figure
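Caption: Proposition E.2 ([38, Theorem 6.4]). For any fixed $\rho_z \in \mathbb{R}_{++}$, $\hat{Z} \in \mathbb{S}^d_{++}$ and $\Gamma_Z \in \mathbb{S}^d_+$, $\Gamma_Z \neq 0$, define $\mathcal{G}^+_Z = \{ Z \in \mathbb{S}^d_+ : \mathbb{G}(Z, \hat{Z}) \leq \rho_z, \ Z \succeq \lambda_{\min}(\hat{Z}) I \}$ as the feasible set of problem (A.23), and let $Z \in \mathcal{G}^+_Z$ be any reference covariance matrix. Additionally, let $\delta \in (0, 1)$ be the desired oracle precision, and define $\phi(\gamma) = \gamma \left( \rho^2 + \langle \gamma (\gamma I - \Gamma_Z)^{-1} - I, \hat{Z} \rangle \right) - \langle Z, \Gamma_Z \rangle$ for any $\gamma > \lambda_{\max}(\Gamma_Z)$. Then, Algorithm A.2 returns in finite time a matrix $L^\delta_Z \in \mathbb{S}^d_+$ with the following properties. (i) Feasibility: $L^\delta_Z \in \mathcal{G}^+_Z$. (ii) $\delta$-Suboptimality: $\langle L^\delta_Z - Z, \Gamma_Z \rangle \geq \delta \max_{L \in \mathcal{G}^+_Z} \langle \Gamma_Z, L - Z \rangle$.
Algorithm A.2 Bisection algorithm to compute $L^\delta_Z$
Input: nominal covariance matrix $\hat{Z} \in \mathbb{S}^d_{++}$, radius $\rho \in \mathbb{R}_{++}$, reference covariance matrix $Z \in \mathcal{G}^+_Z$, gradient matrix $\Gamma_Z \in \mathbb{S}^d_+$, $\Gamma_Z \neq 0$, precision $\delta \in (0, 1)$, dual objective function $\phi(\gamma)$ defined in Proposition E.2
1: set $\lambda_1 \leftarrow \lambda_{\max}(\Gamma_Z)$, and let $p_1$ be an eigenvector for $\lambda_1$
2: set $\underline{\gamma} \leftarrow \lambda_1 (1 + (p_1^\top \hat{Z} p_1)^{\frac{1}{2}} / \rho)$ and $\overline{\gamma} \leftarrow \lambda_1 (1 + \operatorname{Tr}(\hat{Z})^{\frac{1}{2}} / \rho)$
3: repeat
4:   set $\gamma \leftarrow (\overline{\gamma} + \underline{\gamma}) / 2$ and $L \leftarrow \gamma^2 (\gamma I - \Gamma_Z)^{-1} \hat{Z} (\gamma I - \Gamma_Z)^{-1}$
5: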
Data: 


Formulas:
Formula formula_0: $x_{t+1} = A_t x_t + B_t u_t + w_t \quad \forall t \in [T-1]$ (1)

Formula formula_1: $y_t = C_t x_t + v_t \quad \forall t \in [T-1]$ (2)

Formula formula_2: $J = \sum_{t=0}^{T-1} \big(x_t^\top Q_t x_t + u_t^\top R_t u_t\big) + x_T^\top Q_T x_T,$ (3)
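
To make the roles of (1), (2) and (3) concrete, the following minimal sketch rolls out the dynamics for one noise realization and accumulates the cost J. The function name `rollout_cost`, the `policy(t, ys)` calling convention and the toy matrices are illustrative assumptions, not part of the paper.

```python
import numpy as np

def rollout_cost(A, B, C, Q, R, Q_T, policy, x0, w, v):
    """Simulate eqs. (1)-(2) for one noise realization and return the cost J of eq. (3).

    A, B, C, Q, R are lists of length T; policy(t, ys) is a hypothetical interface
    that maps the observations y_0..y_t (list ys) to the input u_t.
    """
    T = len(A)
    x, ys, J = np.asarray(x0, dtype=float), [], 0.0
    for t in range(T):
        ys.append(C[t] @ x + v[t])          # observation y_t, eq. (2)
        u = policy(t, ys)                   # causal: only y_0..y_t are available
        J += x @ Q[t] @ x + u @ R[t] @ u    # stage cost in eq. (3)
        x = A[t] @ x + B[t] @ u + w[t]      # state update, eq. (1)
    return float(J + x @ Q_T @ x)           # add the terminal cost

# Toy instance (assumed): A_t = 0.5 I, B_t = C_t = Q_t = R_t = I, no noise.
n, T = 2, 3
I2 = np.eye(n)
A, B, C, Q, R = [0.5 * I2] * T, [I2] * T, [I2] * T, [I2] * T, [I2] * T
zeros = [np.zeros(n)] * T
x0 = np.array([1.0, 0.0])

J_open = rollout_cost(A, B, C, Q, R, I2, lambda t, ys: np.zeros(n), x0, zeros, zeros)
J_fb = rollout_cost(A, B, C, Q, R, I2, lambda t, ys: -0.5 * ys[-1], x0, zeros, zeros)
```

In this noise-free toy case the output feedback u_t = -0.5 y_t cancels the state after one step, so `J_fb` is smaller than the open-loop value `J_open`.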

Formula formula_3: $\max_{\mathbb P \in \mathcal W} \mathbb E_{\mathbb P}\Big[\sum_{t=0}^{T-1} \big(x_t^\top Q_t x_t + u_t^\top R_t u_t\big) + x_T^\top Q_T x_T\Big].$ (4)

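Formula formula_4: $\hat{\mathbb P} = \hat{\mathbb P}_{x_0} \otimes \big(\otimes_{t=0}^{T-1} \hat{\mathbb P}_{w_t}\big) \otimes \big(\otimes_{t=0}^{T} \hat{\mathbb P}_{v_t}\big)$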

Formula formula_5: $\mathcal W = \mathcal W_{x_0} \otimes \big(\otimes_{t=0}^{T-1} \mathcal W_{w_t}\big) \otimes \big(\otimes_{t=0}^{T-1} \mathcal W_{v_t}\big)$

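Formula formula_6: $\mathcal W_{x_0} = \{\mathbb P_{x_0} \in \mathcal P(\mathbb R^n) : \mathbb W(\hat{\mathbb P}_{x_0}, \mathbb P_{x_0}) \le \rho_{x_0},\ \mathbb E_{\mathbb P_{x_0}}[x_0] = 0\}$
$\mathcal W_{w_t} = \{\mathbb P_{w_t} \in \mathcal P(\mathbb R^n) : \mathbb W(\hat{\mathbb P}_{w_t}, \mathbb P_{w_t}) \le \rho_{w_t},\ \mathbb E_{\mathbb P_{w_t}}[w_t] = 0\}$
$\mathcal W_{v_t} = \{\mathbb P_{v_t} \in \mathcal P(\mathbb R^m) : \mathbb W(\hat{\mathbb P}_{v_t}, \mathbb P_{v_t}) \le \rho_{v_t},\ \mathbb E_{\mathbb P_{v_t}}[v_t] = 0\},$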

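Formula formula_7: $\mathbb W(\mathbb P_1, \mathbb P_2) = \Big(\inf_{\pi \in \Pi(\mathbb P_1, \mathbb P_2)} \int_{\mathbb R^d \times \mathbb R^d} \|\xi_1 - \xi_2\|_2^2 \, \pi(\mathrm d\xi_1, \mathrm d\xi_2)\Big)^{1/2}$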

Formula formula_8: $\hat x_{t+1} = A_t \hat x_t + B_t u_t \quad \forall t \in [T-1]$ and $\hat y_t = C_t \hat x_t \quad \forall t \in [T-1]$

Formula formula_9: $\psi_t : \mathbb R^{p(t+1)} \to \mathbb R^m$ for every $t \in [T-1]$,

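Formula formula_10: $p^\star = \left\{\begin{array}{cl} \min_{x,u,y} & \max_{\mathbb P \in \mathcal W} \mathbb E_{\mathbb P}\big[u^\top R u + x^\top Q x\big] \\ \text{s.t.} & u \in \mathcal U_y,\ x = Hu + Gw,\ y = Cx + v \end{array}\right. = \left\{\begin{array}{cl} \min_{x,u} & \max_{\mathbb P \in \mathcal W} \mathbb E_{\mathbb P}\big[u^\top R u + x^\top Q x\big] \\ \text{s.t.} & u \in \mathcal U_\eta,\ x = Hu + Gw, \end{array}\right.$ (5)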

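Formula formula_11: $d^\star = \left\{\begin{array}{cl} \max_{\mathbb P \in \mathcal W} \min_{x,u} & \mathbb E_{\mathbb P}\big[u^\top R u + x^\top Q x\big] \\ \text{s.t.} & u \in \mathcal U_\eta,\ x = Hu + Gw. \end{array}\right.$ (6)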

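Formula formula_12: $\Sigma_1, \Sigma_2 \in \mathbb S^d_+$ is given by $\mathbb G(\Sigma_1, \Sigma_2) = \sqrt{\operatorname{Tr}\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2}\big)^{1/2}\Big)}$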
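
As a small numerical illustration of the Gelbrich distance defined above, the sketch below evaluates it with an eigendecomposition-based matrix square root. The helper names `psd_sqrt` and `gelbrich` and the test matrices are assumptions chosen for illustration only.

```python
import numpy as np

def psd_sqrt(S):
    """Square root of a symmetric PSD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh((S + S.T) / 2)          # symmetrize against round-off
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gelbrich(S1, S2):
    """Gelbrich distance sqrt(Tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2}))."""
    r2 = psd_sqrt(S2)
    cross = psd_sqrt(r2 @ S1 @ r2)
    val = np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(max(val, 0.0)))                # clip tiny negative round-off

# Commuting example: the distance reduces to ||S1^{1/2} - S2^{1/2}||_F,
# here sqrt((2 - 1)^2 + (1 - 1)^2) = 1.
S1 = np.diag([4.0, 1.0])
S2 = np.diag([1.0, 1.0])
d12 = gelbrich(S1, S2)
```

For zero-mean Gaussian distributions with covariances S1 and S2, this quantity coincides with the 2-Wasserstein distance, which is the property the reformulation over covariance matrices relies on.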

Formula formula_13: $) \ge \mathbb G(\Sigma_1, \Sigma_2)$.

Formula formula_14: $\mathcal G = \mathcal G_{x_0} \otimes \big(\otimes_{t=0}^{T-1} \mathcal G_{w_t}\big) \otimes \big(\otimes_{t=0}^{T-1} \mathcal G_{v_t}\big)$

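Formula formula_15: $\mathcal G_{x_0} = \{\mathbb P_{x_0} \in \mathcal P(\mathbb R^n) : \mathbb E_{\mathbb P_{x_0}}[x_0] = 0,\ \mathbb E_{\mathbb P_{x_0}}[x_0 x_0^\top] = X_0,\ \mathbb G(X_0, \hat X_0) \le \rho_{x_0}\}$
$\mathcal G_{w_t} = \{\mathbb P_{w_t} \in \mathcal P(\mathbb R^n) : \mathbb E_{\mathbb P_{w_t}}[w_t] = 0,\ \mathbb E_{\mathbb P_{w_t}}[w_t w_t^\top] = W_t,\ \mathbb G(W_t, \hat W_t) \le \rho_{w_t}\}$
$\mathcal G_{v_t} = \{\mathbb P_{v_t} \in \mathcal P(\mathbb R^m) : \mathbb E_{\mathbb P_{v_t}}[v_t] = 0,\ \mathbb E_{\mathbb P_{v_t}}[v_t v_t^\top] = V_t,\ \mathbb G(V_t, \hat V_t) \le \rho_{v_t}\}.$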

Formula formula_16: $U = \begin{bmatrix} U_{0,0} & & & \\ U_{1,0} & U_{1,1} & & \\ \vdots & & \ddots & \\ U_{T-1,0} & \cdots & \cdots & U_{T-1,T-1} \end{bmatrix}.$ (7)

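Formula formula_17: $p^\star = \left\{\begin{array}{cl} \min_{U,q,x,u} & \max_{\mathbb P \in \mathcal G} \mathbb E_{\mathbb P}\big[u^\top R u + x^\top Q x\big] \\ \text{s.t.} & U \in \mathcal U,\ u = q + U(Dw + v),\ x = Hu + Gw. \end{array}\right.$ (8)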

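Formula formula_18: $p^\star = \min_{q \in \mathbb R^{pT},\, U \in \mathcal U}\ \max_{W \in \mathcal G_W,\, V \in \mathcal G_V}\ \operatorname{Tr}\Big(\big(D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G\big) W\Big) + \operatorname{Tr}\big(U^\top (R + H^\top Q H) U V\big) + q^\top (R + H^\top Q H) q,$ (9)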
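
The trace expression in (9) can be sanity-checked numerically. The sketch below, under the assumption that w and v are zero-mean and independent with covariances W and V, compares the objective of (9) with the expected cost obtained by propagating covariances through u = q + U(Dw + v) and x = Hu + Gw. All dimensions and random matrices are hypothetical placeholders and carry no structure from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nu, ny, nw = 6, 4, 4, 6            # stacked dimensions (hypothetical)

def rand_psd(d):
    M = rng.standard_normal((d, d))
    return M @ M.T

H = rng.standard_normal((nx, nu))
G = rng.standard_normal((nx, nw))
D = rng.standard_normal((ny, nw))
U = np.tril(rng.standard_normal((nu, ny)))   # lower triangular stand-in for a causal gain
q = rng.standard_normal(nu)
Q, R, W, V = rand_psd(nx), rand_psd(nu), rand_psd(nw), rand_psd(ny)

def objective_9(U, q, W, V):
    """Objective of (9) for fixed policy (U, q) and covariances (W, V)."""
    K = R + H.T @ Q @ H
    Mw = D.T @ U.T @ K @ U @ D + 2.0 * G.T @ Q @ H @ U @ D + G.T @ Q @ G
    return np.trace(Mw @ W) + np.trace(U.T @ K @ U @ V) + q @ K @ q

def direct_expected_cost(U, q, W, V):
    """E[u'Ru + x'Qx] computed from the covariances of u and x directly."""
    Su = U @ (D @ W @ D.T + V) @ U.T                   # Cov(u)
    Xw = H @ U @ D + G                                 # x = Hq + Xw w + HU v
    Sx = Xw @ W @ Xw.T + H @ U @ V @ U.T @ H.T         # Cov(x)
    mean_x = H @ q
    return np.trace(R @ Su) + q @ R @ q + np.trace(Q @ Sx) + mean_x @ Q @ mean_x

val_9 = objective_9(U, q, W, V)
val_direct = direct_expected_cost(U, q, W, V)
```

The two values agree up to round-off, which reflects that the worst-case expectation depends on the noise distributions only through their second moments.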

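Formula formula_19: $\mathcal G_W = \big\{W \in \mathbb S_+^{n(T+1)} : W = \operatorname{diag}(X_0, W_0, \dots, W_{T-1}),\ X_0 \in \mathbb S^n_+,\ W_t \in \mathbb S^n_+\ \forall t \in [T-1],\ \mathbb G(X_0, \hat X_0)^2 \le \rho_{x_0}^2,\ \mathbb G(W_t, \hat W_t)^2 \le \rho_{w_t}^2\ \forall t \in [T-1]\big\}$
$\mathcal G_V = \big\{V \in \mathbb S_+^{pT} : V = \operatorname{diag}(V_0, \dots, V_{T-1}),\ V_t \in \mathbb S^p_+,\ \mathbb G(V_t, \hat V_t)^2 \le \rho_{v_t}^2\ \forall t \in [T-1]\big\}.$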

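Formula formula_20: $d^\star = \left\{\begin{array}{cl} \max_{\mathbb P \in \mathcal W_{\mathcal N}} \min_{x,u} & \mathbb E_{\mathbb P}\big[u^\top R u + x^\top Q x\big] \\ \text{s.t.} & u \in \mathcal U_\eta,\ x = Hu + Gw. \end{array}\right.$ (10)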

Formula formula_22: $\mathbb W(\mathbb P_1, \mathbb P_2) = \mathbb G(\Sigma_1, \Sigma_2)$.

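Formula formula_23: $d^\star = \max_{W \in \mathcal G_W,\, V \in \mathcal G_V}\ \min_{q \in \mathbb R^{pT},\, U \in \mathcal U}\ \operatorname{Tr}\Big(\big(D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G\big) W\Big) + \operatorname{Tr}\big(U^\top (R + H^\top Q H) U V\big) + q^\top (R + H^\top Q H) q,$ (11)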

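Formula formula_24: $\mathcal G_W^+ = \big\{W \in \mathcal G_W : X_0 \succeq \lambda_{\min}(\hat X_0) I,\ W_t \succeq \lambda_{\min}(\hat W_t) I\ \forall t \in [T-1]\big\}$, $\mathcal G_V^+ = \big\{V \in \mathcal G_V : V_t \succeq \lambda_{\min}(\hat V_t) I\ \forall t \in [T-1]\big\}.$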

Formula formula_25: $\max_{W \in \mathcal G_W^+,\, V \in \mathcal G_V^+} f(W, V),$ (12)

Formula formula_26: $\max_{L_W \in \mathcal G_W^+,\, L_V \in \mathcal G_V^+} \langle \nabla_W f(W, V), L_W - W\rangle + \langle \nabla_V f(W, V), L_V - V\rangle$ (13)

Formula formula_27: $(W, V) \leftarrow (W, V) + \alpha \cdot (L_W^\star - W,\ L_V^\star - V)$

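Formula formula_28: $\begin{array}{cl} \max_{L_W, L_V} & \langle \nabla_{X_0} f(W, V), L_{X_0} - X_0\rangle + \sum_{t=0}^{T-1} \big(\langle \nabla_{W_t} f(W, V), L_{W_t} - W_t\rangle + \langle \nabla_{V_t} f(W, V), L_{V_t} - V_t\rangle\big) \\ \text{s.t.} & \mathbb G(L_{X_0}, \hat X_0)^2 \le \rho_{x_0}^2,\ \mathbb G(L_{W_t}, \hat W_t)^2 \le \rho_{w_t}^2,\ \mathbb G(L_{V_t}, \hat V_t)^2 \le \rho_{v_t}^2 \quad \forall t \in [T-1] \\ & L_{X_0} \succeq \lambda_{\min}(\hat X_0) I,\ L_{W_t} \succeq \lambda_{\min}(\hat W_t) I,\ L_{V_t} \succeq \lambda_{\min}(\hat V_t) I \quad \forall t \in [T-1]. \end{array}$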

Formula formula_29: $\max_{L_Z \succeq \lambda_{\min}(\hat Z) I} \big\{\langle \nabla_Z f(W, V), L_Z - Z\rangle : \mathbb G(L_Z, \hat Z)^2 \le \rho_z^2\big\}.$ (14)
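
The linearized subproblem (14) can be approached by a one-dimensional search over a scalar γ, in the spirit of the bisection routine of Algorithm A.2. The sketch below is a simplified variant and not the paper's algorithm: it uses the candidate L(γ) = γ²(γI − Γ)⁻¹ Ẑ (γI − Γ)⁻¹ and a bracket of the form λ₁(1 + √(p₁ᵀẐp₁)/ρ) and λ₁(1 + √Tr(Ẑ)/ρ), where the square roots are this sketch's reading of the bracket. It stops on a bracket-width tolerance rather than on the δ-suboptimality test of Proposition E.2. The names `linear_oracle`, `gelbrich_sq` and the test data are assumptions.

```python
import numpy as np

def psd_sqrt(S):
    """Square root of a symmetric PSD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh((S + S.T) / 2)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gelbrich_sq(S1, S2):
    """Squared Gelbrich distance between covariance matrices S1 and S2."""
    r2 = psd_sqrt(S2)
    return float(np.trace(S1 + S2 - 2.0 * psd_sqrt(r2 @ S1 @ r2)))

def linear_oracle(Gamma, Z_hat, rho, rel_tol=1e-10, max_iter=200):
    """Approximately maximize <Gamma, L> subject to G(L, Z_hat) <= rho (simplified bisection)."""
    d = Gamma.shape[0]
    lam, vecs = np.linalg.eigh(Gamma)
    lam1, p1 = lam[-1], vecs[:, -1]                       # largest eigenpair of Gamma
    lo = lam1 * (1.0 + np.sqrt(p1 @ Z_hat @ p1) / rho)    # candidate at lo is on or outside the ball
    hi = lam1 * (1.0 + np.sqrt(np.trace(Z_hat)) / rho)    # candidate at hi is inside the ball

    def candidate(gamma):
        A = gamma * np.linalg.inv(gamma * np.eye(d) - Gamma)
        return A @ Z_hat @ A

    for _ in range(max_iter):
        if hi - lo <= rel_tol * hi:
            break
        mid = 0.5 * (lo + hi)
        if gelbrich_sq(candidate(mid), Z_hat) > rho ** 2:
            lo = mid                                      # infeasible: gamma must grow
        else:
            hi = mid                                      # feasible: tighten from above
    return candidate(hi)                                  # hi is feasible throughout

Gamma = np.array([[2.0, 0.5], [0.5, 1.0]])                # PSD gradient matrix (assumed)
Z_hat = np.array([[1.0, 0.2], [0.2, 0.5]])                # nominal covariance (assumed)
rho = 0.5
L_opt = linear_oracle(Gamma, Z_hat, rho)
```

Because γ(γI − Γ)⁻¹ ⪰ I whenever Γ ⪰ 0 and γ > λ_max(Γ), the returned matrix also satisfies the eigenvalue lower bound L ⪰ λ_min(Ẑ)I without enforcing it explicitly.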

Formula formula_31: $\langle \nabla_Z f(W, V), L_Z^\delta - Z\rangle \ge \delta \langle \nabla_Z f(W, V), L_Z^\star - Z\rangle$, where $L^\star$

Formula formula_32: for $Z \in \{X_0, W_0, \dots, W_{T-1}, V_0, \dots, V_{T-1}\}$ do in parallel
4: compute $\nabla_Z f(W, V)$
5:

Formula formula_33: end
7: $g \leftarrow \langle \nabla_W f(W, V), L_W^\delta - W\rangle + \langle \nabla_V f(W, V), L_V^\delta - V\rangle$
8: $(W, V) \leftarrow (W, V) + \frac{2}{2 + k} \cdot (L_W^\delta - W,\ L_V^\delta - V)$
9:
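
The update pattern in the steps above (gradient, linear maximization oracle, surrogate gap g, step size 2/(2 + k)) can be illustrated on a stand-in problem. The sketch below is not the paper's problem: it maximizes an assumed concave function over the set of PSD matrices with unit trace, for which the linear maximization oracle is a top-eigenvector outer product. The names `frank_wolfe` and `top_eigvec_oracle`, the tolerance and the target matrix are illustrative assumptions.

```python
import numpy as np

def frank_wolfe(grad, oracle, Z0, max_iter=500, tol=1e-6):
    """Frank-Wolfe for maximizing a concave function with step size 2 / (2 + k)."""
    Z, gap = Z0.copy(), np.inf
    for k in range(max_iter):
        Gm = grad(Z)                          # gradient at the current iterate
        L = oracle(Gm)                        # maximizer of <Gm, L> over the feasible set
        gap = float(np.sum(Gm * (L - Z)))     # surrogate gap g, an upper bound on f* - f(Z)
        if gap <= tol:
            break
        Z = Z + 2.0 / (2.0 + k) * (L - Z)     # convex combination, so Z stays feasible
    return Z, gap

# Stand-in problem (assumed): maximize f(Z) = -0.5 ||Z - target||_F^2
# over {Z PSD, Tr(Z) = 1}; target is feasible, so the optimal value is 0.
target = np.diag([0.7, 0.3])
f = lambda Z: -0.5 * np.sum((Z - target) ** 2)
grad = lambda Z: target - Z

def top_eigvec_oracle(Gm):
    """Linear maximization over the unit-trace PSD set: v v^T for a top eigenvector v."""
    _, vecs = np.linalg.eigh(Gm)
    v = vecs[:, -1]
    return np.outer(v, v)

Z_fw, gap_fw = frank_wolfe(grad, top_eigvec_oracle, np.eye(2) / 2)
```

In the setting of the paper, the oracle would instead act blockwise on each covariance matrix, which is why the per-block computations can run in parallel.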

Formula formula_34: $B_t = C_t = Q_t = R_t = I_d$.

Formula formula_35: $C = \begin{bmatrix} C_0 & 0 & & & \\ & C_1 & 0 & & \\ & & \ddots & \ddots & \\ & & & C_{T-1} & 0 \end{bmatrix}$, $G = \begin{bmatrix} A^0_0 & & & \\ A^1_0 & A^1_1 & & \\ \vdots & & \ddots & \\ A^T_0 & A^T_1 & \cdots & A^T_T \end{bmatrix}$ and $H = \begin{bmatrix} 0 & & & \\ A^1_1 B_0 & 0 & & \\ A^2_1 B_0 & A^2_2 B_1 & 0 & \\ \vdots & & \ddots & 0 \\ A^T_1 B_0 & A^T_2 B_1 & \cdots & A^T_T B_{T-1} \end{bmatrix}$
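
The stacked matrices above can be assembled programmatically and checked against a direct simulation of (1). The sketch assumes the convention A^t_s = A_{t-1} ⋯ A_s with A^t_t = I and the stacking w = (x_0, w_0, …, w_{T-1}); both are inferred from the block structure rather than stated in this excerpt. The function name `stacked_matrices` and the random test system are illustrative.

```python
import numpy as np

def stacked_matrices(A, B, C):
    """Build G, H and the stacked C such that x = H u + G w and y = C x + v."""
    T, n, m, p = len(A), A[0].shape[0], B[0].shape[1], C[0].shape[0]

    def Phi(t, s):
        # Assumed convention: A^t_s = A_{t-1} ... A_s, and A^t_t = I.
        M = np.eye(n)
        for k in range(s, t):
            M = A[k] @ M
        return M

    G = np.zeros((n * (T + 1), n * (T + 1)))
    H = np.zeros((n * (T + 1), m * T))
    Cs = np.zeros((p * T, n * (T + 1)))
    for t in range(T + 1):
        for s in range(t + 1):                 # block (t, s) of G is A^t_s
            G[t * n:(t + 1) * n, s * n:(s + 1) * n] = Phi(t, s)
        for s in range(t):                     # u_s reaches x_t only for s < t
            H[t * n:(t + 1) * n, s * m:(s + 1) * m] = Phi(t, s + 1) @ B[s]
    for t in range(T):                         # block diagonal with a trailing zero column
        Cs[t * p:(t + 1) * p, t * n:(t + 1) * n] = C[t]
    return G, H, Cs

rng = np.random.default_rng(1)
T, n, m, p = 4, 3, 2, 2
A = [rng.standard_normal((n, n)) for _ in range(T)]
B = [rng.standard_normal((n, m)) for _ in range(T)]
C = [rng.standard_normal((p, n)) for _ in range(T)]
G, H, Cs = stacked_matrices(A, B, C)

w = rng.standard_normal(n * (T + 1))           # stacked (x_0, w_0, ..., w_{T-1})
u = rng.standard_normal(m * T)
xs = [w[:n]]
for t in range(T):                             # direct simulation of eq. (1)
    xs.append(A[t] @ xs[t] + B[t] @ u[t * m:(t + 1) * m] + w[(t + 1) * n:(t + 2) * n])
x_sim = np.concatenate(xs)
```

The strictly lower block triangular shape of H encodes causality: the input u_s cannot influence states x_t with t ≤ s.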

Formula formula_36: $u = U\eta + q = U(y - \hat y) + q = Uy - UC\hat x + q = Uy - UCHu + q,$

Formula formula_37: $u = U'y + q' = U'(\eta + \hat y) + q' = U'\eta + U'C\hat x + q' = U'\eta + U'CHu + q'.$

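Formula formula_38: $\min_{q \in \mathbb R^{pT},\, U \in \mathcal U}\ \max_{\mathbb P \in \mathcal G}\ \mathbb E_{\mathbb P}\Big[w^\top \big(D^\top U^\top (R + H^\top Q H) U D + 2 D^\top U^\top H^\top Q G + G^\top Q G\big) w\Big] + \mathbb E_{\mathbb P}\big[v^\top U^\top (R + H^\top Q H) U v\big] + q^\top (R + H^\top Q H) q$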

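Formula formula_39: $\begin{array}{cl} \min_{q \in \mathbb R^{pT},\, U \in \mathcal U}\ \max_{W, V, \mathbb P} & \operatorname{Tr}\Big(\big(D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G\big) W\Big) + \operatorname{Tr}\big(U^\top (R + H^\top Q H) U V\big) + q^\top (R + H^\top Q H) q \\ \text{s.t.} & \mathbb P \in \mathcal G,\ W = \mathbb E_{\mathbb P}[w w^\top],\ V = \mathbb E_{\mathbb P}[v v^\top]. \end{array}$ (A.18)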

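Formula formula_40: $\le \rho_{x_0}$, $\mathbb G(W_t, \hat W_t) \le \rho_{w_t}$ and $\mathbb G(V_t, \hat V_t) \le \rho_{v_t}$ are equivalent to the convex constraints $\mathbb G(X_0, \hat X_0)^2 \le \rho_{x_0}^2$, $\mathbb G(W_t, \hat W_t)^2 \le \rho_{w_t}^2$ and $\mathbb G(V_t, \hat V_t)^2 \le \rho_{v_t}^2$, respectively, for all $t \in [T-1]$.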

Formula formula_41: $W = \mathbb E_{\mathbb P}[w w^\top] = \operatorname{diag}(X_0, W_0, \dots, W_{T-1})$ and $V = \mathbb E_{\mathbb P}[v v^\top] = \operatorname{diag}(V_0, \dots, V_{T-1})$.

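Formula formula_42: $\mathbb P = \mathbb P_{x_0} \otimes \big(\otimes_{t=0}^{T-1} \mathbb P_{w_t}\big) \otimes \big(\otimes_{t=0}^{T} \mathbb P_{v_t}\big)$ defined through $\mathbb P_{x_0} = \mathcal N(0, X_0)$, $\mathbb P_{w_t} = \mathcal N(0, W_t)$ and $\mathbb P_{v_t} = \mathcal N(0, V_t)$, $t \in [T-1]$,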

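Formula formula_43: is equivalent to $\left\{\begin{array}{cl} \max_{\mathbb P \in \mathcal W_{\mathcal N}} \min_{q,U,x,u} & \mathbb E_{\mathbb P}\big[u^\top R u + x^\top Q x\big] \\ \text{s.t.} & U \in \mathcal U,\ u = q + U\eta,\ x = Hu + Gw. \end{array}\right.$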

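Formula formula_44: $\begin{array}{cl} \max_{W, V, \mathbb P}\ \min_{q \in \mathbb R^{pT},\, U \in \mathcal U} & \operatorname{Tr}\Big(\big(D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G\big) W\Big) + \operatorname{Tr}\big(U^\top (R + H^\top Q H) U V\big) + q^\top (R + H^\top Q H) q \\ \text{s.t.} & \mathbb P \in \mathcal W_{\mathcal N},\ W = \mathbb E_{\mathbb P}[w w^\top],\ V = \mathbb E_{\mathbb P}[v v^\top]. \end{array}$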

Formula formula_45: $W = \mathbb E_{\mathbb P}[w w^\top] = \operatorname{diag}(X_0, W_0, \dots, W_{T-1})$ and $V = \mathbb E_{\mathbb P}[v v^\top] = \operatorname{diag}(V_0, \dots, V_{T-1})$.

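Formula formula_46: $\mathcal G_Z = \{Z \in \mathbb S^d_+ : \mathbb G(Z, \hat Z) \le \rho_z\}$ coincides with $\Big\{Z \in \mathbb S^d_+ : \exists E_z \in \mathbb S^d_+ \text{ with } \operatorname{Tr}(Z + \hat Z - 2E_z) \le \rho_z^2,\ \begin{bmatrix} \hat Z^{1/2} Z \hat Z^{1/2} & E_z \\ E_z & I \end{bmatrix} \succeq 0\Big\}.$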

Formula formula_47: $\mathcal G_Z = \big\{Z \in \mathbb S^d_+ : \operatorname{Tr}\big(Z + \hat Z - 2(\hat Z^{1/2} Z \hat Z^{1/2})^{1/2}\big) \le \rho_z^2\big\}.$

Formula formula_48: $E_z^2 \preceq \hat Z^{1/2} Z \hat Z^{1/2}$

Formula formula_49: $E_z \preceq (\hat Z^{1/2} Z \hat Z^{1/2})^{1/2}$

Formula formula_50: $\operatorname{Tr}(Z + \hat Z - 2E_z) \le \rho_z^2$. A standard Schur complement argument reveals that the inequality $E_z^2 \preceq \hat Z^{1/2} Z \hat Z^{1/2}$

Formula formula_51: $\begin{bmatrix} \hat Z^{1/2} Z \hat Z^{1/2} & E_z \\ E_z & I \end{bmatrix} \succeq 0.$
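
The Schur complement step can be checked numerically: the block matrix above is positive semidefinite exactly when E_z² ⪯ Ẑ^{1/2} Z Ẑ^{1/2}, and the choice E_z = (Ẑ^{1/2} Z Ẑ^{1/2})^{1/2} sits on the boundary. The following sketch uses randomly generated covariance matrices and the helper names `psd_sqrt` and `lmi_min_eig`, all of which are assumptions for illustration.

```python
import numpy as np

def psd_sqrt(S):
    """Square root of a symmetric PSD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh((S + S.T) / 2)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

rng = np.random.default_rng(2)
d = 3
A1, A2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Z = A1 @ A1.T + np.eye(d)            # candidate covariance (assumed data)
Z_hat = A2 @ A2.T + np.eye(d)        # nominal covariance (assumed data)

r = psd_sqrt(Z_hat)
M = r @ Z @ r                        # Z_hat^{1/2} Z Z_hat^{1/2}
E_tight = psd_sqrt(M)                # largest E_z allowed by the matrix inequality

def lmi_min_eig(E):
    """Smallest eigenvalue of the block matrix [[M, E], [E, I]]."""
    block = np.block([[M, E], [E, np.eye(d)]])
    return float(np.linalg.eigvalsh(block).min())

eig_tight = lmi_min_eig(E_tight)         # approximately 0: boundary of the LMI
eig_inside = lmi_min_eig(0.9 * E_tight)  # nonnegative: E_z^2 = 0.81 M is below M
eig_outside = lmi_min_eig(1.1 * E_tight) # negative: E_z^2 = 1.21 M exceeds M

# With the tight choice, Tr(Z + Z_hat - 2 E_z) equals the squared Gelbrich distance.
gelbrich_sq_value = float(np.trace(Z + Z_hat - 2.0 * E_tight))
```

Since the trace constraint rewards large E_z, the tight choice is the one that matters, which is how the lifted description recovers the original Gelbrich ball.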

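Formula formula_52: Proposition D.2. If $\hat V \succ 0$, then problem (11) is equivalent to the SDP
$\begin{array}{cl} \max & \operatorname{Tr}(G^\top Q G W) - \operatorname{Tr}\big(F (R + H^\top Q H)^{-1}\big) \\ \text{s.t.} & W \in \mathbb S_+^{n(T+1)},\ V \in \mathbb S_+^{pT},\ M \in \mathcal M,\ F \in \mathbb S_+^{Tm} \\ & E_{x_0} \in \mathbb S^n_+,\ E_{w_t} \in \mathbb S^n_+,\ E_{v_t} \in \mathbb S^p_+ \quad \forall t \in [T-1] \\ & \operatorname{Tr}(W_0 + \hat X_0 - 2E_{x_0}) \le \rho_{x_0}^2,\ \operatorname{Tr}(W_{t+1} + \hat W_t - 2E_{w_t}) \le \rho_{w_t}^2,\ \operatorname{Tr}(V_t + \hat V_t - 2E_{v_t}) \le \rho_{v_t}^2 \quad \forall t \in [T-1] \\ & \begin{bmatrix} \hat X_0^{1/2} W_0 \hat X_0^{1/2} & E_{x_0} \\ E_{x_0} & I_n \end{bmatrix} \succeq 0,\ \begin{bmatrix} \hat W_t^{1/2} W_{t+1} \hat W_t^{1/2} & E_{w_t} \\ E_{w_t} & I_n \end{bmatrix} \succeq 0,\ \begin{bmatrix} \hat V_t^{1/2} V_t \hat V_t^{1/2} & E_{v_t} \\ E_{v_t} & I_p \end{bmatrix} \succeq 0 \quad \forall t \in [T-1] \\ & \begin{bmatrix} F & H^\top Q G W D^\top + M/2 \\ (H^\top Q G W D^\top + M/2)^\top & D W D^\top + V \end{bmatrix} \succeq 0 \\ & W_0 \succeq \lambda_{\min}( \end{array}$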
