Title: AMORTIZED CONTROL OF CONTINUOUS STATE SPACE FEYNMAN-KAC MODEL FOR IRREGULAR TIME SERIES

Abstract: Many real-world datasets, such as those from healthcare, climate, and economics, are often collected as irregular time series, which poses challenges for accurate modeling. In this paper, we propose the Amortized Control of continuous State Space Model (ACSSM) for continuous dynamical modeling of time series with irregular and discrete observations. We first present a multi-marginal Doob's h-transform to construct a continuous dynamical system conditioned on these irregular observations. Following this, we introduce a variational inference algorithm with a tight evidence lower bound (ELBO), leveraging stochastic optimal control (SOC) theory to approximate the intractable Doob's h-transform and simulate the conditioned dynamics. To improve efficiency and scalability during both training and inference, ACSSM leverages an auxiliary variable to flexibly parameterize the latent dynamics and amortized control. Additionally, it incorporates a simulation-free latent dynamics framework and a transformer-based data assimilation scheme, facilitating parallel inference of the latent states and ELBO computation. Through empirical evaluations across a variety of real-world datasets, ACSSM demonstrates superior performance in tasks such as classification, regression, interpolation, and extrapolation, while maintaining computational efficiency. Code is available at https://github.com/bw-park/ACSSM.

Section: 
conditioned SDEs by approximating the Doob's h-transform. It allows us to propose a tight evidence lower bound (ELBO) for the aforementioned VI algorithm by establishing a fundamental connection between the partial differential equations (PDEs) associated with Doob's h-transform and SOC. The Doob's h-transform is often referred to as the twist function in the Sequential Monte Carlo (SMC) literature (Guarniero et al., 2017), where it is used to approximate the smoothing distributions. Building on this, Heng et al. (2020) introduced an algorithm to approximate the twisted transition kernel directly, while a recent concurrent study (Lu & Wang, 2024) extended this approach to continuous-time settings. However, both studies primarily emphasize approximation methods rather than practical applications.
In practical situations, computing the ELBO for a VI algorithm might be impractical due to the instability and high memory demands associated with gradient computation through the approximated stochastic dynamics over the entire sequence interval (Liu et al., 2024; Park et al., 2024). To address this issue, we propose two efficient modeling approaches: 1) We establish amortized inference by introducing an auxiliary variable to the latent space, generated by a neural network encoder-decoder. It maps the high-dimensional time series into a suitable low-dimensional space, allowing more flexible parameterization of the latent dynamics. Moreover, by incorporating the learned control function, amortization allows the inference of the posterior distribution for a novel time-series sequence without relying on Bayesian recursion. 2) We leverage the simulation-free property, which enables closed-form sampling from intermediate latent marginal distributions that can be computed in a temporally parallel way. Additionally, we explore a more flexible linear approximation of the drift function in controlled SDEs to enhance the efficiency of the proposed controlled dynamics.
We evaluated ACSSM on several time-series tasks across various real-world datasets. Our experiments show that ACSSM consistently outperforms existing baseline models in each task, demonstrating its effectiveness in capturing the underlying dynamics of irregular time series. Additionally, ACSSM achieves significant computational efficiency, enabling faster training than dynamics-based models that rely on numerical simulations. A summary of the key concepts of ACSSM, along with related works, is provided in Appendix A. We summarize our contributions as follows:
• We extend the theory of Doob's h-transform to the multi-marginal case. This indicates the existence of a class of conditioned SDEs that depend on future observations, where the solutions of these SDEs lead to the true posterior path measure within the framework of CD-SSM.
• We reformulate the simulation of conditioned SDEs as a SOC problem in order to approximate the intractable Doob's h-transform. By leveraging the connection between SOC theory and Doob's h-transform, we propose a variational inference algorithm with a tight ELBO.
• For practical real-world applications, we introduce an efficient and scalable modeling approach that enables parallelization of latent dynamics simulation and ELBO computation.
• We demonstrate its superior performance across various real-world irregularly sampled time-series tasks, including per-time-point classification, regression, and sequence interpolation and extrapolation, all with computational efficiency.
Notation Throughout this paper, we denote a path measure by $\mathbb{P}^{(\cdot)}$, defined on the space of continuous functions $\Omega = C([0, T], \mathbb{R}^d)$. We sometimes write the expectation under $\mathbb{P}$ as $\mathbb{E}^{t,x}_{\mathbb{P}}[\cdot] = \mathbb{E}_{\mathbb{P}}[\cdot \mid X_t = x]$, where the stochastic processes corresponding to $\mathbb{P}^{(\cdot)}$ are represented as $X^{(\cdot)}$ and their time-marginal distribution at time $t \in [0, T]$ is given by the push-forward measure $\mu^{(\cdot)}_t := (X^{(\cdot)}_t)_{\#} \mathbb{P}^{(\cdot)}$ with marginal density $p^{(\cdot)}_t$. This marginal density represents the Radon-Nikodym derivative $d\mu^{(\cdot)}_t(x) = p^{(\cdot)}_t(x)\,dL(x)$, where $L$ denotes the Lebesgue measure. Additionally, for a function $\mathcal{V} : [0, T] \times \mathbb{R}^d \to \mathbb{R}$, we define the first and second derivatives with respect to $x \in \mathbb{R}^d$ as $\nabla_x \mathcal{V}$ and $\nabla_{xx} \mathcal{V}$, respectively, and the derivative with respect to time $t \in [0, T]$ as $\partial_t \mathcal{V}$. For a sequence of functions $\{\mathcal{V}_i\}_{i \in [1:k]}$, we will denote $\mathcal{V}_i(t, x) := \mathcal{V}_{i,t}$ and $[1 : k] = \{1, \dots, k\}$. Finally, the Kullback-Leibler (KL) divergence between two probability measures $\mu$ and $\nu$ is defined as $D_{\mathrm{KL}}(\mu|\nu) = \int_{\mathbb{R}^d} \log \frac{d\mu}{d\nu}(x)\,d\mu(x)$ when $\mu$ is absolutely continuous with respect to $\nu$, and $D_{\mathrm{KL}}(\mu|\nu) = +\infty$ otherwise.

The continuous-time Markov state trajectory $X_{0:T}$ in the latent space $\mathbb{R}^d$ is given as a solution of the SDE:
$$\text{(Prior State)} \quad dX_t = b(t, X_t)\,dt + dW_t, \tag{1}$$
where $X_0 \sim \mu_0$ and $\{W_t\}_{t \in [0,T]}$ is an $\mathbb{R}^d$-valued Wiener process that is independent of $\mu_0$. Since $X_t$ is a Markov process, the time evolution of the marginal distribution $\mu_t$ is governed by a transition density, which is the solution to the Fokker-Planck equation associated with $X_t$. This allows us to define a path measure $\mathbb{P}$ that represents the weak solutions of the SDE in (1) over the interval $[0, T]$.
For a measurement model $g_i(y_{t_i}|X_{t_i})$, we consider the case in which we only have access to realizations of the (latent) observation process at the discrete time stamps $\{t_i\}_{i \in [1:k]}$, i.e., $y_{t_i} \sim g_i(y_{t_i}|X_{t_i})$ for all $i \in [1 : k]$. In this paper, our goal is to infer the class of SDEs that induce the filtering/smoothing path measure $\mathbb{P}^\star := \mathbb{P}^\star(\cdot|\mathcal{H}_{t_k})$, the conditional distribution over the interval $[0, T]$ for a given $\mathbb{P}$ and a set of observations up to time $t_k$, $\mathcal{H}_{t_k} = \{y_{t_i} \mid i \le k\}$:
$$\text{(Posterior Dist.)} \quad d\mathbb{P}^\star(X_{0:T}|\mathcal{H}_{t_k}) = \frac{1}{Z(\mathcal{H}_{t_k})} \prod_{i=1}^{k} g_i(y_{t_i}|X_{t_i})\,d\mathbb{P}(X_{0:T}), \tag{2}$$
where the normalizing constant $Z(\mathcal{H}_{t_k}) = \mathbb{E}_{\mathbb{P}}\left[\prod_{i=1}^{k} g_i(y_{t_i}|X_{t_i})\right]$ serves as the likelihood of the observations. The path measure formulation of the posterior distribution described in (2) is referred to as a Feynman-Kac model. See (Del Moral, 2011; Chopin et al., 2020) for a more comprehensive treatment.
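To make the Feynman-Kac form in (2) concrete, the following minimal sketch reweights simulated prior paths by the product of likelihoods and estimates the normalizing constant by a plain Monte Carlo average. It is an illustration only, not the inference method of this paper: the scalar state, the drift b(t, x) = -x, the Gaussian measurement model with standard deviation `sigma`, and all function names are assumptions made for the example.

```python
import numpy as np

def simulate_prior(b, x0, ts, n_paths, rng, substeps=20):
    """Euler-Maruyama paths of dX = b(t, X) dt + dW (scalar state), recorded at times ts."""
    x = np.full(n_paths, x0, dtype=float)
    t, recorded = 0.0, []
    for t_next in ts:
        dt = (t_next - t) / substeps
        for _ in range(substeps):
            x = x + b(t, x) * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
            t += dt
        recorded.append(x.copy())
    return np.stack(recorded, axis=1)  # shape (n_paths, k)

def gaussian_loglik(y, x, sigma):
    """log g(y | x) for the assumed Gaussian measurement model."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - x) / sigma) ** 2

def feynman_kac_weights(paths, ys, sigma):
    """Self-normalized weights proportional to prod_i g_i(y_i | X_{t_i}) and an estimate of log Z."""
    log_w = gaussian_loglik(ys[None, :], paths, sigma).sum(axis=1)
    shift = log_w.max()                      # subtract the max for numerical stability
    w = np.exp(log_w - shift)
    log_Z = shift + np.log(w.mean())
    return w / w.sum(), log_Z

rng = np.random.default_rng(0)
ts = np.array([0.3, 0.7, 1.0])               # observation time stamps (hypothetical)
ys = np.array([0.2, 0.5, 0.4])               # observations (hypothetical)
paths = simulate_prior(lambda t, x: -x, 0.0, ts, n_paths=50_000, rng=rng)
weights, log_Z = feynman_kac_weights(paths, ys, sigma=0.5)
posterior_mean = weights @ paths             # weighted estimate of E[X_{t_i} | observations]
```

Such importance weighting degrades quickly as the number of observations grows, which is one reason to look for dynamics that sample from the posterior directly.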

Section: CONTROLLED CONTINUOUS-DISCRETE STATE SPACE MODEL
In this section, we introduce our proposed model, ACSSM. First, we present the multi-marginal Doob's h-transform, outlining the continuous dynamics for $\mathbb{P}^\star$, in Section 3.1. Then, in Section 3.2, we frame the VI for approximating $\mathbb{P}^\star$ using SOC. To support scalable real-world applications, we discuss efficient modeling and amortized inference in Section 3.3.

Section: MULTI MARGINAL DOOB'S h-TRANSFORM
Before applying VI to approximate the posterior distribution $\mathbb{P}^\star$ in SOC, we first show that a class of SDEs exists whose solutions induce a path measure equivalent to $\mathbb{P}^\star$ in (2). This formulation provides valuable insight for defining an appropriate objective function for the SOC problem in the next section. To do so, we first define a sequence of normalized potential functions $\{f_i\}_{i \in [1:k]}$, where each $f_i : \mathbb{R}^d \to \mathbb{R}_+$, for all $i \in [1 : k]$,
$$f_i(y_{t_i}|x_{t_i}) = \frac{g_i(y_{t_i}|x_{t_i})}{L_i(g_i)}, \tag{3}$$
where $L_i(g_i) = \int_{\mathbb{R}^d} g_i(y_{t_i}|x_{t_i})\,d\mathbb{P}(x_{0:T})$, for all $i \in [1 : k]$, is the normalization constant. Then, we can observe that the potential functions $\{f_i\}_{i \in [1:k]}$ defined in (3) satisfy the normalizing property, i.e., $\mathbb{E}_{\mathbb{P}}\left[\prod_{i=1}^{k} f_i(y_{t_i}|x_{t_i})\right] = 1$ and $d\mathbb{P}^\star(x_{0:T}|\mathcal{H}_{t_k}) = \prod_{i=1}^{k} f_i(y_{t_i}|x_{t_i})\,d\mathbb{P}(x_{0:T})$ from (2). Now, with the choice of the reference measure $\mathbb{P}$ induced by the Markov process in (1), we can define the SDEs conditioned on $\mathcal{H}_{t_k}$ that induce the desired path measure $\mathbb{P}^\star$. Note that this is an extension of the original Doob's h-transform (Doob, 1957), incorporating multiple marginal constraints. Below, we summarize the relevant result.

Theorem 3.1 (Multi-marginal Doob's h-transform). Let us define a sequence of functions
$\{h_i\}_{i \in [1:k]}$, where each $h_i : [t_{i-1}, t_i) \times \mathbb{R}^d \to \mathbb{R}_+$, for all $i \in [1 : k]$, is the conditional expectation $h_i(t, x_t) := \mathbb{E}_{\mathbb{P}}\left[\prod_{j \ge i}^{k} f_j(y_{t_j}|X_{t_j}) \,\middle|\, X_t = x_t\right]$, with $\{f_i\}_{i \in [1:k]}$ defined in (3). Now, we define a function $h : [0, T] \times \mathbb{R}^d \to \mathbb{R}_+$ by combining the functions $\{h_i\}_{i \in [1:k]}$,
$$h(t, x) := \sum_{i=1}^{k} h_i(t, x)\,\mathbf{1}_{[t_{i-1}, t_i)}(t). \tag{4}$$
Then, with the initial condition $\mu^\star_0(dx_0) = h_1(t_0, x_0)\,\mu_0(dx_0)$, the solution of the following conditional SDE induces the posterior path measure $\mathbb{P}^\star$ in (2):
$$\text{(Conditioned State)} \quad dX^\star_t = \left[b(t, X^\star_t) + \nabla_x \log h(t, X^\star_t)\right] dt + dW_t. \tag{5}$$
Theorem 3.1 demonstrates that we can obtain sample trajectories from P ⋆ in (2) by simulating the dynamics in (5). However, estimating the functions {h i } i∈[1:k] requires both the estimation of the sequence of normalization constants {L i } i∈[1:k] and the computation of conditional expectations, which is infeasible in general. For these reasons, we propose a VI algorithm to approximate the functions {h i } i∈[1:k] and derive the variational bound for the VI by exploiting the theory of SOC.
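As a sanity check on Theorem 3.1, it may help to look at one special case in which the h-function is available in closed form. The setting below is an assumption chosen for illustration and is not taken from the experiments: zero prior drift $b \equiv 0$, a single observation $y$ at time $t_1 = T$, and a Gaussian measurement model $g(y|x) = \mathcal{N}(y \mid x, \sigma^2)$. Since $X_T \mid X_t = x \sim \mathcal{N}(x, T - t)$ under the prior, the conditional expectation is a Gaussian convolution:

```latex
\begin{aligned}
h(t, x) &= \frac{1}{L_1}\,\mathbb{E}_{\mathbb{P}}\!\left[\mathcal{N}(y \mid X_T, \sigma^2) \,\middle|\, X_t = x\right]
         = \frac{1}{L_1}\,\mathcal{N}\!\left(y \mid x,\ \sigma^2 + T - t\right), \\
\nabla_x \log h(t, x) &= \frac{y - x}{\sigma^2 + T - t}, \\
dX^\star_t &= \frac{y - X^\star_t}{\sigma^2 + T - t}\,dt + dW_t .
\end{aligned}
```

The normalization constant $L_1$ cancels in $\nabla_x \log h$, and letting $\sigma \to 0$ recovers the drift of a Brownian bridge pinned at $y$. With several observations or a nonlinear drift, no such closed form is available in general, which is the difficulty addressed next.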

Section: STOCHASTIC OPTIMAL CONTROL
SOC (Fleming & Soner, 2006; Carmona, 2016) is a mathematical framework that deals with the problem of finding control policies in order to achieve a certain objective. In this paper, we define the following control-affine SDE, adjusting the prior dynamics in (1) with a Markov control $\alpha : [0, T] \times \mathbb{R}^d \to \mathbb{R}^d$:
$$\text{(Controlled State)} \quad dX^\alpha_t = \left[b(t, X^\alpha_t) + \alpha(t, X^\alpha_t)\right] dt + d\tilde{W}_t, \tag{6}$$
where $X^\alpha_0 \sim \mu_0$. We refer to the SDE in (6) as the controlled SDE. We can expect that, for a well-defined function set $\alpha \in \mathbb{A}$, the class of controlled SDEs (6) encompasses the SDE in (5). This implies that the desired path measure $\mathbb{P}^\star$ can be achieved through the SOC formulation. In general, the goal of SOC is to find the optimal control policy $\alpha^\star$ that minimizes a given cost function $\mathcal{J}(t, x_t, \alpha)$, i.e., $\alpha^\star(t, x_t) = \arg\min_{\alpha \in \mathbb{A}} \mathcal{J}(t, x_t, \alpha)$, and to determine the value function $\mathcal{V}(t, x_t) = \min_{\alpha \in \mathbb{A}} \mathcal{J}(t, x_t, \alpha)$, where $\mathcal{V}(t, x_t) := \mathcal{J}(t, x_t, \alpha^\star) \le \mathcal{J}(t, x_t, \alpha)$ holds for any $\alpha \in \mathbb{A}$. Below, we demonstrate how, with a carefully chosen cost function, the theory of SOC establishes a connection between the two classes of SDEs (5) and (6). This connection enables the development of a variational inference algorithm with a tight evidence lower bound (ELBO). To this end, we consider the following cost function:
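$$\mathcal{J}(t, x_t, \alpha) = \mathbb{E}_{\mathbb{P}^\alpha}\left[\int_t^T \frac{1}{2}\|\alpha(s, X^\alpha_s)\|^2\,ds - \sum_{i:\{t \le t_i\}} \log f_i(y_{t_i}|X^\alpha_{t_i}) \,\middle|\, X^\alpha_t = x_t\right]. \tag{7}$$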
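Then, the value function $\mathcal{V}$ for (7) satisfies the dynamic programming principle (Carmona, 2016):

Theorem 3.2 (Dynamic Programming Principle). Let us consider a sequence of left-continuous functions $\{\mathcal{V}_i\}_{i \in [1:k+1]}$, where each $\mathcal{V}_i \in C^{1,2}([t_{i-1}, t_i) \times \mathbb{R}^d)$ is given by
$$\mathcal{V}_i(t, x_t) := \min_{\alpha \in \mathbb{A}} \mathbb{E}_{\mathbb{P}^\alpha}\left[\int_{t_{i-1}}^{t_i} \frac{1}{2}\|\alpha_s\|^2\,ds - \log f_i(y_{t_i}|X^\alpha_{t_i}) + \mathcal{V}_{i+1}(t_i, X^\alpha_{t_i}) \,\middle|\, X_t = x_t\right], \tag{8}$$
for all $i \in [1 : k]$ and $\mathcal{V}_{k+1} = 0$. Then, for any $0 \le t \le u \le T$, the value function $\mathcal{V}$ for the cost function in (7) satisfies the following recursion:
$$\mathcal{V}(t, x_t) = \min_{\alpha \in \mathbb{A}} \mathbb{E}_{\mathbb{P}^\alpha}\left[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha_s\|^2\,ds - \sum_{i:\{t \le t_i \le t_{I(u)}\}} \log f_i + \mathcal{V}_{I(u)+1}(t_{I(u)}, X^\alpha_{t_{I(u)}}) \,\middle|\, X^\alpha_t = x_t\right], \tag{9}$$
with the indexing function $I(u) = \max\{i \in [1 : k] \mid t_i \le u\}$ and $f_i = f_i(y_{t_i}|X^\alpha_{t_i})$.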
The value functions presented in Theorem 3.2 suggest that the objective of the optimal control policy $\alpha$ for the interval $[t_{i-1}, t_i)$ is not just to minimize the negative log-potential $-\log f_i(y_{t_i}|\cdot)$ for the immediate observation $y_{t_i}$. Instead, it also involves considering the future costs $\{\mathcal{V}_j\}_{j \in [i+1:k]}$ and the corresponding future observations $\{y_{t_j}\}_{j \in [i+1:k]}$, since $\mathcal{V}_i$ follows a recursive structure. Since our goal is to approximate $\mathbb{P}^\star$ induced by (5), it is natural that the optimal control policy $\alpha^\star$ should reflect the future observations $\{y_{t_j}\}_{j \in [i+1:k]}$, as the h-function inherently does. Next, we derive the explicit form of the optimal control policy.

Theorem 3.3 (Verification Theorem). Suppose there exists a sequence of left-continuous functions
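$\mathcal{V}_i(t, x) \in C^{1,2}([t_{i-1}, t_i), \mathbb{R}^d)$, for all $i \in [1 : k]$, satisfying the following Hamilton-Jacobi-Bellman (HJB) equation:
$$\partial_t \mathcal{V}_{i,t} + \mathcal{A}_t \mathcal{V}_{i,t} + \min_{\alpha \in \mathbb{A}}\left[(\nabla_x \mathcal{V}_{i,t})^\top \alpha_{i,t} + \frac{1}{2}\|\alpha_{i,t}\|^2\right] = 0, \quad t_{i-1} \le t < t_i, \tag{10}$$
$$\mathcal{V}_i(t_i, x) = -\log f_i(y_{t_i}|x) + \mathcal{V}_{i+1}(t_i, x), \quad t = t_i, \ \forall i \in [1 : k], \tag{11}$$
where the minimum is attained by $\alpha^\star_i(t, x) = -\nabla_x \mathcal{V}_i(t, x)$. Now, define a function $\alpha^\star : [0, T] \times \mathbb{R}^d \to \mathbb{R}^d$ by combining the optimal controls $\{\alpha^\star_i\}_{i \in [1:k]}$,
$$\alpha^\star(t, x) := \sum_{i=1}^{k} \alpha^\star_i(t, x)\,\mathbf{1}_{[t_{i-1}, t_i)}(t). \tag{12}$$
Then, $\mathcal{V}(t, x_t) = \mathcal{J}(t, x_t, \alpha^\star) \le \mathcal{J}(t, x_t, \alpha)$ holds for any $(t, x_t) \in [0, T] \times \mathbb{R}^d$ and $\alpha \in \mathbb{A}$.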
Note that the optimal control (12) for the cost function (7) shares a similar structure with (4). Since the theory of PDEs establishes a fundamental link between various classes of PDEs and SDEs (Richter & Berner, 2022; Berner et al., 2024), it allows us to reveal the inherent connection between (12) and (4), thereby enabling us to simulate the conditional SDE (5) in an alternative way.

Lemma 3.4 (Hopf-Cole Transformation). The h-function satisfies the following linear PDE:
$$\partial_t h_{i,t} + \mathcal{A}_t h_{i,t} = 0, \quad t_{i-1} \le t < t_i, \tag{13}$$
$$h_i(t_i, x) = f_i(y_{t_i}|x)\,h_{i+1}(t_i, x), \quad t = t_i, \ \forall i \in [1 : k]. \tag{14}$$
Moreover, under the logarithmic transformation $\mathcal{V} = -\log h$, $\mathcal{V}$ satisfies the HJB equation in (10)-(11).

According to Lemma 3.4, the solution of the linear PDE in (13)-(14) is the exponential of the negative solution of the HJB equation in (10)-(11), i.e., $h = e^{-\mathcal{V}}$. This leads to the following corollary:

Corollary 3.5 (Optimal Control). The optimal control $\alpha^\star$ induced by the cost function (7) with dynamics (6) satisfies $\alpha^\star = \nabla_x \log h$. In other words, we can simulate the conditional SDE in (5) by simulating the controlled SDE (6) with the optimal control $\alpha^\star$.
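The substitution behind Lemma 3.4 and Corollary 3.5 can be spelled out in a few lines. The sketch below assumes that $\mathcal{A}_t$ is the generator of the prior SDE (1) with unit diffusion, $\mathcal{A}_t \varphi = b^\top \nabla_x \varphi + \tfrac{1}{2}\Delta_x \varphi$; this explicit form is an assumption made here, since the generator is not written out in this section.

```latex
\begin{aligned}
h = e^{-\mathcal{V}} \ \Longrightarrow\ 
&\partial_t h = -h\,\partial_t \mathcal{V}, \qquad
\nabla_x h = -h\,\nabla_x \mathcal{V}, \qquad
\Delta_x h = h\left(\|\nabla_x \mathcal{V}\|^2 - \Delta_x \mathcal{V}\right), \\
\partial_t h + \mathcal{A}_t h
&= -h\left[\partial_t \mathcal{V} + b^\top \nabla_x \mathcal{V} + \tfrac{1}{2}\Delta_x \mathcal{V} - \tfrac{1}{2}\|\nabla_x \mathcal{V}\|^2\right]
 = -h\left[\partial_t \mathcal{V} + \mathcal{A}_t \mathcal{V} - \tfrac{1}{2}\|\nabla_x \mathcal{V}\|^2\right], \\
\min_{\alpha}\left[(\nabla_x \mathcal{V})^\top \alpha + \tfrac{1}{2}\|\alpha\|^2\right]
&= -\tfrac{1}{2}\|\nabla_x \mathcal{V}\|^2
\quad \text{attained at } \alpha^\star = -\nabla_x \mathcal{V} = \nabla_x \log h .
\end{aligned}
```

Since $h > 0$, the linear PDE (13) holds exactly when the bracket vanishes, which is the HJB equation (10) after the pointwise minimization over $\alpha$. Taking $-\log$ of the jump condition (14) gives (11).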
Corollary 3.5 states that the Markov processes induced by $\alpha^\star$ and $\nabla_x \log h$ are equivalent under the same initial condition. However, whereas $\mathbb{P}^\alpha$ is induced by the controlled dynamics in (6) with the initial condition $\mu_0$, the conditioned dynamics (5) require the initial condition $\mu^\star_0$ to induce the desired path measure $\mathbb{P}^\star$. In other words, even if we find the optimal control $\alpha^\star$, the discrepancy between $\mu_0$ and $\mu^\star_0$ still remains, thereby keeping $\mathbb{P}^\star$ and $\mathbb{P}^{\alpha^\star}$ misaligned. Fortunately, a surrogate cost function can be derived from the variational representation under the KL divergence, allowing us to find the optimal control while minimizing the discrepancy.

Theorem 3.6 (Tight Variational Bound). Let us assume that the path measure $\mathbb{P}^\alpha$ induced by (6) for any $\alpha \in \mathbb{A}$ satisfies $D_{\mathrm{KL}}(\mathbb{P}^\alpha|\mathbb{P}^\star) < \infty$. Then, for the cost function $\mathcal{J}$ in (7) and $\mu^\star_0$ in Theorem 3.1, it holds:
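$$D_{\mathrm{KL}}(\mathbb{P}^\alpha|\mathbb{P}^\star) = D_{\mathrm{KL}}(\mu_0|\mu^\star_0) + \mathbb{E}_{x_0 \sim \mu_0}\left[\mathcal{J}(0, x_0, \alpha)\right] = \mathcal{L}(\alpha) + \log Z(\mathcal{H}_{t_k}) \ge 0, \tag{15}$$
where the objective function $\mathcal{L}(\alpha)$ (a negative ELBO) is given by:
$$\mathcal{L}(\alpha) = \mathbb{E}_{\mathbb{P}^\alpha}\left[\int_0^T \frac{1}{2}\|\alpha(s, X^\alpha_s)\|^2\,ds - \sum_{i=1}^{k} \log g_i(y_{t_i}|X^\alpha_{t_i})\right] \ge -\log Z(\mathcal{H}_{t_k}). \tag{16}$$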
Moreover, assume that L(α) has a global minimum α ⋆ = arg min α∈A L(α). Then the equality holds in (16) i.e., L(α ⋆ ) = -log Z(H t k ) and µ ⋆ 0 = µ 0 almost everywhere with respect to µ 0 .
Theorem 3.6 suggests that the optimal control α ⋆ = arg min α L(α) provides the tight variational bound for the likelihood functions {g i } i∈[1:k] and the prior path measure P induced by (1). Furthermore, α ⋆ ensures that µ ⋆ 0 = µ 0 , indicating that simulating the controlled SDE in (6) with α ⋆ and initial condition µ 0 generates trajectories from the posterior path measure P ⋆ in (2). In practice, the optimal control α ⋆ can be approximated using a highly flexible neural network, which serves as a function approximator (i.e., α := α(•; θ)), optimized through gradient descent-based optimization (Li et al., 2020;Zhang & Chen, 2022;Vargas et al., 2023).
However, applying gradient descent-based optimization necessitates computing gradients through the simulated diffusion process over the interval $[0, T]$ to estimate the objective function (16) for neural network training. This approach can become slow, unstable, and memory-intensive as the time horizon or the dimension of the latent space increases (Iakovlev et al., 2023; Park et al., 2024). It also contrasts with the philosophy of many recent generative models (Ho et al., 2020; Song et al., 2020), which aim to decompose the generative process and solve the sub-problems jointly. Additionally, inference requires numerical simulation with solvers such as Euler-Maruyama (Kloeden & Platen, 2013), which can be time-consuming for long time series. This motivates us to propose an efficient and scalable modeling approach for real-world applications, described in the next section.
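The simulation-based estimate of (16) discussed above can be sketched on a toy problem where the answer is known. The setting is an assumption made for illustration: scalar state, zero prior drift, a deterministic start at `x0`, and one Gaussian observation at the final time, for which the optimal control is the closed-form drift (y - x) / (sigma^2 + T - t). The function name `neg_elbo` and all parameter values are hypothetical. With the optimal control the estimate should approach -log Z, while any other control, here the zero control, gives a strictly larger value.

```python
import numpy as np

def neg_elbo(alpha, x0, t_obs, y, sigma, n_paths=20_000, n_steps=400, seed=0):
    """Monte Carlo estimate of L(alpha) in (16) for b = 0 and one Gaussian observation at t_obs."""
    rng = np.random.default_rng(seed)
    dt = t_obs / n_steps
    x = np.full(n_paths, x0, dtype=float)
    running_cost = np.zeros(n_paths)
    for n in range(n_steps):
        a = alpha(n * dt, x)
        running_cost += 0.5 * a**2 * dt                      # accumulates (1/2)|alpha|^2 ds
        x = x + a * dt + np.sqrt(dt) * rng.standard_normal(n_paths)  # Euler-Maruyama step
    log_g = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - x) / sigma) ** 2
    return float(np.mean(running_cost - log_g))

x0, T, y, sigma = 0.0, 1.0, 1.0, 0.5

def optimal_control(t, x):
    # Closed-form grad log h for this toy setting (assumption: b = 0, Gaussian g).
    return (y - x) / (sigma**2 + T - t)

def zero_control(t, x):
    return np.zeros_like(x)

loss_opt = neg_elbo(optimal_control, x0, T, y, sigma)
loss_zero = neg_elbo(zero_control, x0, T, y, sigma)
# Exact value of -log Z, since y ~ N(x0, sigma^2 + T) under the prior.
neg_log_Z = 0.5 * np.log(2 * np.pi * (sigma**2 + T)) + (y - x0) ** 2 / (2 * (sigma**2 + T))
```

Every evaluation requires a sequential loop over `n_steps`, and training a neural control would additionally backpropagate through that loop, which is the cost the following section aims to avoid.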

Section: EFFICIENT MODELING OF THE LATENT SYSTEM
The linear approximation of the general drift function provides a simulation-free property for the dynamics and significantly enhances scalability while ensuring high-quality performance (Deng et al., 2024). Motivated by this property, we investigate a class of linear SDEs to improve the efficiency of the proposed controlled dynamics while retaining strong performance relative to the baselines. We introduce the following affine linear SDE:
$$dX_t = [-A X_t + \alpha]\,dt + dW_t, \quad \text{where } X_0 \sim \mathcal{N}(m_0, \Sigma_0), \tag{17}$$
with a matrix $A \in \mathbb{R}^{d \times d}$ and a vector $\alpha \in \mathbb{R}^d$. The solution $X_t$ of (17) has a closed-form Gaussian distribution for any $t \in [0, T]$, where the mean $m_t$ and covariance $\Sigma_t$ can be explicitly computed by solving the ODEs (Särkkä & Solin, 2019):
$$m_t = e^{-At} m_0 - A^{-1}(e^{-At} - I)\alpha, \tag{18}$$
$$\Sigma_t = e^{-At} \Sigma_0 e^{-A^\top t} + \int_0^t e^{-A(t-s)} e^{-A^\top (t-s)}\,ds. \tag{19}$$
However, calculating the moments $m_t$ and $\Sigma_t$ in (18)-(19) involves computing matrix exponentials, inversions, and numerical integration. These operations can be computationally intensive, especially for large matrices or when high precision is required. The computations can be simplified by restricting the matrix $A$ to be diagonal or symmetric positive semi-definite (SPD).

Remark 3.7 (Diagonalization). Since an SPD matrix $A$ admits the eigen-decomposition $A = E D E^\top$ with $E \in \mathbb{R}^{d \times d}$ and $D \in \mathrm{diag}(\mathbb{R}^d)$, the process $X_t$ expressed in the standard basis can be transformed to a process $\hat{X}_t$ that has a diagonalized drift function. In the space spanned by the eigen-basis $E$, the dynamics in (17) can be rewritten as:
$$d\hat{X}_t = \left[-D \hat{X}_t + \hat{\alpha}\right] dt + d\hat{W}_t, \quad \text{where } \hat{X}_0 \sim \mathcal{N}(\hat{m}_0, \hat{\Sigma}_0), \tag{20}$$
with $\hat{X}_t = E^\top X_t$, $\hat{\alpha} = E^\top \alpha$, $\hat{W}_t = E^\top W_t$, $\hat{m}_t = E^\top m_t$ and $\hat{\Sigma}_t = E^\top \Sigma_t E$. Note that $E^\top W_t \overset{d}{=} W_t$ for any $t \in [0, T]$ due to the orthonormality of $E$, so $\hat{W}_t$ can be regarded as a standard Wiener process. Because $D = \mathrm{diag}(\lambda)$, where $\lambda = \{\lambda_1, \dots, \lambda_d\}$ and $\lambda_i \ge 0$ for all $i \in [1 : d]$, we can obtain the state distributions of $\hat{X}_t$ for any $t \in [0, T]$ by solving the ODEs in (18)-(19) analytically, without the need for numerical computation. The results are then transformed back to the standard basis, i.e., $m_t = E \hat{m}_t$ and $\Sigma_t = E \hat{\Sigma}_t E^\top$.
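The diagonalization in Remark 3.7 can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes a symmetric matrix with strictly positive eigenvalues (so that the division by the eigenvalues is well defined) and unit diffusion as in (17), and the function name and example values are hypothetical.

```python
import numpy as np

def linear_sde_moments(A, alpha, m0, S0, t):
    """Closed-form mean and covariance of dX = (-A X + alpha) dt + dW at time t.

    Assumes A is symmetric with strictly positive eigenvalues.
    """
    lam, E = np.linalg.eigh(A)                 # A = E diag(lam) E^T with orthonormal E
    m0_hat, alpha_hat = E.T @ m0, E.T @ alpha  # move to the eigenbasis
    S0_hat = E.T @ S0 @ E
    decay = np.exp(-lam * t)
    # Mean: elementwise version of (18) for a diagonal drift.
    m_hat = decay * m0_hat + (1.0 - decay) / lam * alpha_hat
    # Covariance: e^{-Dt} S0_hat e^{-Dt} plus the integral of e^{-2D(t-s)} over [0, t].
    S_hat = np.outer(decay, decay) * S0_hat + np.diag((1.0 - decay**2) / (2.0 * lam))
    return E @ m_hat, E @ S_hat @ E.T          # transform back to the standard basis

A = np.array([[2.0, 0.5], [0.5, 1.0]])
alpha = np.array([1.0, -1.0])
m0 = np.array([0.5, 0.0])
S0 = np.array([[1.0, 0.2], [0.2, 0.5]])
m_t, S_t = linear_sde_moments(A, alpha, m0, S0, t=0.8)
```

No matrix exponential, matrix inverse, or quadrature is needed once the eigen-decomposition is available; only elementwise exponentials of the eigenvalues appear.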
Locally Linear Approximation To leverage the simulation-free property of the linear SDEs in (17), we aim to linearize the drift function in (6). However, the naïve formulation described in (17) may limit the expressiveness of the latent dynamics for real-world applications. Hence, we introduce a parameterization strategy inspired by (Becker et al., 2019; Klushyn et al., 2021a), which leverages neural networks to enhance flexibility by fully incorporating an attentive structure over the given observations $y_{0:T}$ while maintaining a linear formulation:
$$dX^\alpha_t = \left[-A_t X^\alpha_t + \alpha_t\right] dt + d\tilde{W}_t, \tag{21}$$
where the approximated drift function is constructed in affine form with the following components:
$$A_t = \sum_{l=1}^{L} w^{(l)}_\theta(z_t)\,A^{(l)}, \qquad \alpha_t = B_\theta z_t. \tag{22}$$
The matrix $A_t$ is given by a convex combination of $L$ trainable base matrices $\{A^{(l)}\}_{l \in [1:L]}$, where the weights $w_\theta = \mathrm{softmax}(f_\theta(z_t))$ are produced by the neural network $f_\theta$. Additionally, $B_\theta \in \mathbb{R}^{d \times d}$ is a trainable matrix. The latent variable $z_t$ is produced by the transformer $T_\theta$, which encodes the given observations $y_{0:T}$, depending on the task-specific information assimilation scheme.

Figure 2: Two types of information assimilation schemes.
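The parameterization in (22) can be sketched as follows. Several pieces are assumptions made for the example and are not specified in the text: the network $f_\theta$ is replaced by a single linear map `W_f`, the latent codes `z` are random stand-ins for the transformer output, and the base matrices are assumed to share one orthonormal eigenbasis `E`, so that each base matrix is `E @ diag(lam_bases[l]) @ E.T`.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over the last axis."""
    u = u - u.max(axis=-1, keepdims=True)
    e = np.exp(u)
    return e / e.sum(axis=-1, keepdims=True)

def locally_linear_drift(z, W_f, lam_bases, E, B):
    """Build A_t and alpha_t of (22) for latent codes z of shape (K, d)."""
    w = softmax(z @ W_f)                             # (K, L) convex weights; stand-in for softmax(f_theta(z_t))
    lam_t = w @ lam_bases                            # (K, d) eigenvalues of A_t in the shared basis
    A_t = np.einsum("ij,kj,lj->kil", E, lam_t, E)    # E diag(lam_t) E^T for every time step
    alpha_t = z @ B.T                                # alpha_t = B z_t
    return w, A_t, alpha_t

rng = np.random.default_rng(0)
K, d, L = 5, 3, 4                                    # time steps, latent dimension, number of base matrices
E = np.linalg.qr(rng.standard_normal((d, d)))[0]     # orthonormal eigenbasis (hypothetical)
lam_bases = rng.uniform(0.1, 2.0, size=(L, d))       # positive eigenvalues of the base matrices
W_f = rng.standard_normal((d, L))
B = rng.standard_normal((d, d))
z = rng.standard_normal((K, d))
w, A_t, alpha_t = locally_linear_drift(z, W_f, lam_bases, E, B)
```

Because the weights are convex and the base eigenvalues are positive, every `A_t` in this sketch stays symmetric positive definite, so the closed-form moment computations remain applicable at each time step.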
Figure 2 illustrates two assimilation schemes that use a masked attention mechanism. In the history assimilation scheme, the transformer $T_\theta$ encodes information up to the current time $t$ and outputs $z_t = T_\theta(\mathcal{H}_t)$. In the full assimilation scheme, the transformer $T_\theta$ encodes information over the entire interval $[0, T]$ and outputs $z_t = T_\theta(\mathcal{H}_T)$. Note that this general formulation, which follows from the control formulation, enables a more flexible use of the information encoded from observations. In contrast, the previous Kalman-filtering-based CD-SSM method (Schirmer et al., 2022) relies on recurrent updates, which typically limits it to using historical information only.

Theorem 3.8. Consider a sequence of matrices $\{A_i\}_{i \in [1:k]}$ that admit the eigen-decomposition $A_i = E D_i E^\top$ with $E \in \mathbb{R}^{d \times d}$ and $D_i \in \mathrm{diag}(\mathbb{R}^d) \succeq 0$ for all $i \in [1 : k]$, control vectors $\{\alpha_i\}_{i \in [1:k]}$, and the following control-affine SDEs for all $i \in [1 : k]$:
$$dX_t = [-A_i X_t + \alpha_i]\,dt + \sigma\,dW_t, \quad t \in [t_{i-1}, t_i). \tag{23}$$
Then, with $X_0 \sim \mathcal{N}(m_0, \Sigma_0)$, the solution of (23) is a Gaussian process $\mathcal{N}(m_{t_i}, \Sigma_{t_i})$ with:
$$m_{t_i} = E\left(e^{-\sum_{j=1}^{i}(t_j - t_{j-1}) D_j}\,\hat{m}_{t_0} - \sum_{k=1}^{i} e^{-\sum_{j=k}^{i}(t_j - t_{j-1}) D_j}\,D_k^{-1}\left(I - e^{(t_k - t_{k-1}) D_k}\right)\hat{\alpha}_k\right),$$
$$\Sigma_{t_i} = E\left(e^{-2\sum_{j=1}^{i}(t_j - t_{j-1}) D_j}\,\hat{\Sigma}_{t_0} - \frac{1}{2}\sum_{k=1}^{i} e^{-2\sum_{j=k}^{i}(t_j - t_{j-1}) D_j}\,D_k^{-1}\left(I - e^{2(t_k - t_{k-1}) D_k}\right)\right)E^\top,$$
where $\hat{m}_{t_i} = E^\top m_{t_i}$, $\hat{\Sigma}_{t_i} = E^\top \Sigma_{t_i} E$ and $\hat{\alpha}_i = E^\top \alpha_i$ for all $i \in [1 : k]$.
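The summed expressions of Theorem 3.8 can be checked against a step-by-step propagation of the moments over the observation intervals. The sketch below works directly in the eigenbasis and rests on assumptions made for the example: strictly positive eigenvalues, a diagonal initial covariance stored as a vector of variances, and unit diffusion ($\sigma = 1$). Function names and the example values are hypothetical.

```python
import numpy as np

def moments_recursive(ts, lam, alpha_hat, m0_hat, s0_hat):
    """Propagate the eigenbasis mean and variance one observation interval at a time."""
    m, s = m0_hat.astype(float), s0_hat.astype(float)
    means, variances = [], []
    for i in range(1, len(ts)):
        dt = ts[i] - ts[i - 1]
        decay = np.exp(-lam[i - 1] * dt)
        m = decay * m + (1.0 - decay) / lam[i - 1] * alpha_hat[i - 1]
        s = decay**2 * s + (1.0 - decay**2) / (2.0 * lam[i - 1])
        means.append(m)
        variances.append(s)
    return np.array(means), np.array(variances)

def moments_closed_form(ts, lam, alpha_hat, m0_hat, s0_hat):
    """Evaluate the summed expressions of Theorem 3.8 directly (unit diffusion assumed)."""
    dts = np.diff(ts)
    means, variances = [], []
    for i in range(1, len(ts)):
        total = (dts[:i, None] * lam[:i]).sum(axis=0)
        m = np.exp(-total) * m0_hat
        s = np.exp(-2.0 * total) * s0_hat
        for n in range(1, i + 1):
            tail = (dts[n - 1:i, None] * lam[n - 1:i]).sum(axis=0)
            growth = dts[n - 1] * lam[n - 1]
            m = m - np.exp(-tail) / lam[n - 1] * (1.0 - np.exp(growth)) * alpha_hat[n - 1]
            s = s - 0.5 * np.exp(-2.0 * tail) / lam[n - 1] * (1.0 - np.exp(2.0 * growth))
        means.append(m)
        variances.append(s)
    return np.array(means), np.array(variances)

ts = np.array([0.0, 0.4, 1.0, 1.5])                      # irregular time stamps t_0..t_3
lam = np.array([[0.5, 1.0], [1.5, 0.3], [0.8, 2.0]])     # diagonal of D_i per interval
alpha_hat = np.array([[1.0, -1.0], [0.2, 0.4], [-0.5, 0.0]])
m0_hat, s0_hat = np.array([0.3, -0.2]), np.array([1.0, 0.5])
m_rec, s_rec = moments_recursive(ts, lam, alpha_hat, m0_hat, s0_hat)
m_cf, s_cf = moments_closed_form(ts, lam, alpha_hat, m0_hat, s0_hat)
```

The recursive form makes explicit that each interval applies an affine map to the previous moments, which is the structure exploited for parallel computation.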
Parallel Computation Given an associative operator $\otimes$ and a sequence of elements $[s_{t_1}, \dots, s_{t_K}]$, the parallel scan algorithm computes the all-prefix-sum, which returns the sequence
$$[s_{t_1}, (s_{t_1} \otimes s_{t_2}), \dots, (s_{t_1} \otimes s_{t_2} \otimes \cdots \otimes s_{t_K})] \tag{24}$$
in $O(\log K)$ time. Leveraging the linear formulation described in Theorem 3.8 and the inherently parallel nature of the transformer architecture for sequential structure, our method can be integrated with the parallel scan algorithm (Blelloch, 1990), resulting in efficient computation of the marginal Gaussian distributions by computing both moments $\{m_t\}_{t \in [0,T]}$ and $\{\Sigma_t\}_{t \in [0,T]}$ in parallel.

Remark 3.9 (Non-Markov Control). Note that (21)-(22) involve approximating the Markov control by a non-Markov control $\alpha(\mathcal{H}_t) := \alpha_\theta$, parameterized by a neural network with parameters $\theta$. However, Theorem 3.3 establishes that the optimal control should be Markov, as it is verified by the HJB equation (Van Handel, 2007). In our case, we expect that, with a high-capacity neural network, the local minimum $\theta_M$ obtained after $M$ gradient descent steps $\theta_{m+1} = \theta_m - \nabla_\theta \mathcal{L}(\alpha_{\theta_m})$ yields $\mathcal{L}(\alpha^\star) \approx \mathcal{L}(\alpha_{\theta_m \to \theta_M})$.
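The parallel scan in (24) can be illustrated on the mean recursion of the piecewise-linear SDE. The choice of operator is an assumption made for this sketch, since the operator is not spelled out in the text: each interval contributes an affine map x -> a x + b in the eigenbasis, and composition of affine maps is associative. The scan below is a simple Hillis-Steele variant written with NumPy; it needs about log2(K) vectorized rounds, although an actual speed-up requires parallel hardware.

```python
import numpy as np

def combine(first, second):
    """Compose affine maps x -> a x + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return a2 * a1, a2 * b1 + b2

def inclusive_scan(a, b):
    """Hillis-Steele inclusive scan over the leading axis using `combine`."""
    a, b = a.copy(), b.copy()
    shift = 1
    while shift < a.shape[0]:
        # Each position i >= shift absorbs the partial result ending at i - shift.
        new_a, new_b = combine((a[:-shift], b[:-shift]), (a[shift:], b[shift:]))
        a = np.concatenate([a[:shift], new_a])
        b = np.concatenate([b[:shift], new_b])
        shift *= 2
    return a, b

rng = np.random.default_rng(0)
K, d = 7, 3
dts = rng.uniform(0.1, 0.5, size=K)              # irregular interval lengths (hypothetical)
lam = rng.uniform(0.2, 2.0, size=(K, d))         # eigenvalues per interval
alpha_hat = rng.standard_normal((K, d))
m0_hat = rng.standard_normal(d)

a = np.exp(-lam * dts[:, None])                  # per-interval decay
b = (1.0 - a) / lam * alpha_hat                  # per-interval offset
a_prefix, b_prefix = inclusive_scan(a, b)
m_parallel = a_prefix * m0_hat + b_prefix        # means at t_1..t_K from the prefix compositions

# Sequential reference for comparison.
m, m_sequential = m0_hat.copy(), []
for i in range(K):
    m = a[i] * m + b[i]
    m_sequential.append(m)
m_sequential = np.array(m_sequential)
```

The same pattern applies to the variances, with decay `a**2` and offset `(1 - a**2) / (2 * lam)` under unit diffusion.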
Auxiliary Variable Moreover, to enhance flexibility, we treat $y_{0:T}$ as an auxiliary variable in the latent space, produced by a neural network encoder $q_\phi$ applied to the given time series data $o_{0:T}$, i.e., $y_{0:T} \sim q_\phi(y_{0:T}|o_{0:T})$, which is factorized as
$$q_\phi(y_{0:T}|o_{0:T}) = \prod_{i=1}^{k} q_\phi(y_{t_i}|o_{t_i}) = \prod_{i=1}^{k} \mathcal{N}(y_{t_i} \mid q_\phi(o_{t_i}), \Sigma_q) \tag{25}$$
with a fixed variance $\Sigma_q$. Additionally, it enables the modeling of nonlinear emission distributions through a neural network decoder $p_\psi(o_{0:T}|y_{0:T})$, which is factorized as
$$p_\psi(o_{0:T} \mid y_{0:T}) = \prod_{i=1}^{k} p_\psi(o_{t_i} \mid y_{t_i}), \tag{26}$$
with the likelihood function $p_\psi$ depending on the task at hand. This formulation first maps the time series $o_{0:T}$ into a suitable low-dimensional space $y_{0:T}$, allowing more efficient modeling of the latent dynamics $X^\alpha_{0:T}$. The information capturing the underlying physical dynamics resides in a much lower-dimensional space than the original sequence (Fraccaro et al., 2017). Therefore, performing generative modeling in this reduced latent space (rather than directly in the high-dimensional domain, e.g., pixel values in an image sequence) offers greater flexibility in parameterization.
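The factorized encoder in (25) amounts to an independent Gaussian draw per time stamp. The sketch below is illustrative only: the encoder network is replaced by a single `tanh` layer with weights `W_enc`, and the fixed covariance is assumed isotropic, $\Sigma_q = \sigma_q^2 I$. All names and sizes are hypothetical.

```python
import numpy as np

def encode(o, W_enc, sigma_q, rng):
    """Sample y_{t_i} ~ N(q_phi(o_{t_i}), sigma_q^2 I) independently for every time stamp."""
    mean = np.tanh(o @ W_enc)                   # stand-in for the encoder network q_phi
    eps = rng.standard_normal(mean.shape)
    return mean + sigma_q * eps, mean           # reparameterized sample and its mean

def log_q(y, mean, sigma_q):
    """log q_phi(y_{0:T} | o_{0:T}) under the factorization (25) with Sigma_q = sigma_q^2 I."""
    d = y.shape[-1]
    per_step = (-0.5 * d * np.log(2 * np.pi * sigma_q**2)
                - 0.5 * np.sum((y - mean) ** 2, axis=-1) / sigma_q**2)
    return float(per_step.sum())                # the product over time stamps becomes a sum of logs

rng = np.random.default_rng(0)
o = rng.standard_normal((6, 10))                # 6 time stamps, 10 observed features
W_enc = rng.standard_normal((10, 4))            # maps to a 4-dimensional auxiliary variable
y, mean = encode(o, W_enc, sigma_q=0.1, rng=rng)
log_density = log_q(y, mean, sigma_q=0.1)
```

Because each time stamp is encoded independently, the encoder can process all observations of a sequence in a single batched call.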

Section: TRAINING AND INFERENCE
Training Objective Function We jointly train, in an end-to-end manner, the amortization parameters $\{\phi, \psi\}$ for the encoder-decoder pair, along with the parameters of the latent dynamics $\theta = \{f_\theta, B_\theta, T_\theta, m_0, \Sigma_0, \{A^{(l)}\}_{l \in [1:L]}\}$, which include the parameters required for the controlled latent dynamics. Training is achieved by maximizing the evidence lower bound (ELBO) of the observation log-likelihood for a given time series $o_{0:T}$:
$$\log p_\psi(o_{0:T}) \ge \mathbb{E}_{\mathcal{H}_T \sim q_\phi(y_{0:T}|o_{0:T})}\left[\log \frac{\prod_{i=1}^{k} p_\psi(o_{t_i}|y_{t_i})\,g(y_{0:T})}{\prod_{i=1}^{k} q_\phi(y_{t_i}|o_{t_i})}\right] \tag{27}$$
$$\ge \mathbb{E}_{\mathcal{H}_T \sim q_\phi(y_{0:T}|o_{0:T})}\left[\sum_{i=1}^{k} \log p_\psi(o_{t_i}|y_{t_i}) - \mathcal{L}(\theta)\right] = \mathrm{ELBO}(\psi, \phi, \theta). \tag{28}$$
Since $Z(\mathcal{H}_{t_k}) = g(y_{0:T})$, the prior over the auxiliary variable $g(y_{0:T})$ can be bounded using the objective $\mathcal{L}(\theta)$ proposed in (16) as part of our variational inference procedure for the latent posterior $\mathbb{P}^\star$ introduced in Sec. 2, with $\theta$ collecting all latent parameters, including those of the control $\alpha_\theta$. Note that our modeling is computationally favorable for both training and inference, since estimating marginal distributions in the latent space can be parallelized. This allows for the generation of a latent trajectory over the entire interval without the need for numerical simulations. The overall training and inference processes are summarized in Algorithm 3 and Algorithm 4 in the Appendix, respectively.

Section: EXPERIMENT
In this section, we present empirical results demonstrating the effectiveness of ACSSM in modeling real-world irregular time-series data. The primary objective was to evaluate its capability to capture the underlying dynamics across various datasets. To demonstrate the applicability of ACSSM, we conducted experiments on four tasks: per-time regression/classification and sequence interpolation/extrapolation, using four datasets: Human Activity, Pendulum (Becker et al., 2019), USHCN (Menne et al., 2015), and Physionet (Silva et al., 2012). We compare our approach against various baselines, including RNN architectures (RKN-∆ t (Becker et al., 2019), GRU-∆ t (Chung et al., 2014), GRU-D (Che et al., 2018)), dynamics-based models (Latent-ODE (Chen et al., 2018; Rubanova et al., 2019), ODE-RNN (Rubanova et al., 2019), GRU-ODE-B (De Brouwer et al., 2019), CRU (Schirmer et al., 2022), Latent-SDE H (Zeng et al., 2023)), and attention-based models (mTAND (Shukla & Marlin, 2021)), all of which have been developed for modeling irregular time series data. We report results averaged over five runs with different seeds. The best results are highlighted in bold, while the second-best results are shown in blue. Additional experimental details can be found in Appendix D.

Section: PER TIME POINT CLASSIFICATION & REGRESSION

Table 1: Test Accuracy (%).

| Model | Acc |
| --- | --- |
| Latent-ODE † | 87.0 ± 2.8 |
| Latent-SDE H ‡ | 90.6 ± 0.4 |
| mTAND † | 91.1 ± 0.2 |
| ACSSM (Ours) | 91.4 ± 0.4 |

† result from (Shukla & Marlin, 2021). ‡ result from (Zeng et al., 2023).
Human Activity Classification We investigated the classification performance of our proposed model. For this purpose, we trained the model on the Human Activity dataset, which contains time-series data from five individuals performing various activities such as walking, sitting, lying, and standing. The dataset includes 12 features in total, representing 3D positions captured by four sensors attached to the belt, chest, and both ankles. Following the pre-processing approach proposed by Rubanova et al. (2019), the dataset comprises 6,554 sequences, each with 211 time points. The task is to classify each time point into one of seven activities.
Table 1 reports the test accuracy, showing that ACSSM outperforms all baseline models. We employed the full assimilation scheme to maintain consistency with the other baselines, which infer the latent state using the full set of observations. It is important to note that the two dynamical models, Latent-ODE and Latent-SDE H, incorporate a parameterized vector field in their latent dynamics and therefore rely on numerical solvers to infer intermediate states. In contrast, mTAND is an attention-based method that does not depend on dynamical or state-space models, thus avoiding numerical simulation. We believe the significant performance improvement of our model comes from its integration of an attention mechanism into dynamical models. Simulation-free dynamics avoid numerical simulations while maintaining temporal structure, leading to more stable learning.

[Table 2: regression MSE on the pendulum task; the table body is missing from this text. Footnotes: result from (Schirmer et al., 2022). ‡ result from (Zeng et al., 2023). ⋆ result from (Smith et al., 2023).]
Pendulum Regression Next, we explored the problem of sequence regression using the pendulum experiment (Becker et al., 2019), where the goal is to infer the sine and cosine of the pendulum angle from irregularly observed, noisy pendulum images (Schirmer et al., 2022).
To assess our performance, we compared it with previous dynamics-based models, reporting the regression MSE on a held-out test set as shown in Table 2. We employed the full assimilation scheme. The experimental results demonstrate that our proposed method outperforms existing models. These findings highlight that, even when linearizing the drift function, the amortization and the proposed neural-network-based locally linear dynamics in Sec. 3.3 preserve the expressivity of our approach, enabling more accurate inference of non-linear systems.
Moreover, particularly in comparison to CRU, we believe that the significant performance improvement stems from fundamental differences in how information is leveraged. To infer intermediate angular values, utilizing not only past information but also future positions of the pendulum can enhance the accuracy of the predictions. In this regard, while CRU relies solely on the past positions of the pendulum for its predictions, our model is capable of utilizing both past and future positions.

Interpolation In the interpolation task, the goal is to infer all time points $t \in T'$ based on a subset of observations $o_{t \in T}$, where $T \subseteq T'$. For the interpolation task, the encoded observations $y_{t \in T}$ were assimilated using the full assimilation scheme in order to construct an accurate smoothing distribution. The interpolation results are presented in Table 3, where we report the test MSE evaluated on all time points $T'$. For all datasets, ACSSM outperforms the other baselines in terms of test MSE. This indicates the expressiveness of ACSSM: the trajectories $X^\alpha_{t \in T'}$ sampled from the approximated path measure over the entire interval $T'$ contain sufficient information for generating accurate predictions.
Extrapolation We evaluated ACSSM's performance on the extrapolation task following the experimental setup of Schirmer et al. (2022). Each model infers values for all time stamps $t \in T'$, where $T'$ denotes the union of the observed time stamps $T = \{t_i\}_{i \in [1:k]}$ and the unseen time stamps $T_u = \{t_i\}_{i \in [k+1:N]}$, i.e., $T' = T \cup T_u$. For the Physionet dataset, the input time stamps $T$ covered the first 24 hours, while the target time stamps $T'$ spanned the remaining hours. In the USHCN dataset, the timeline was split evenly, with $t_k = N/2$. We report the test MSE for the unseen time stamps $T_u = T' \setminus T$ based on the observations at time stamps $T$. To model an accurate filtering distribution, we employed the history assimilation scheme for this task. As shown in Table 3, ACSSM consistently outperformed all baseline models in terms of MSE on the USHCN dataset, achieving a significant performance gain over the second-best model. For the Physionet dataset, ACSSM exhibited comparable performance.
Computational Efficiency To evaluate the training costs in comparison to dynamics-based models that depend on numerical simulations, we re-ran the CRU model on the same hardware used for training our model, indicated by * in Table 3. Specifically, we utilized a single NVIDIA RTX A6000 GPU. As illustrated in Table 3, ACSSM significantly lowers training costs compared to dynamics-based models. Notably, ACSSM demonstrated a runtime that was 16-25× faster than CRU, even though both models aim to approximate P ⋆ . This highlights that the latent modelling approach discussed in Sec 3.3 improves efficiency while achieving an accurate approximation of P ⋆ .

Section: CONCLUSION AND LIMITATION
In this work, we proposed ACSSM, a method for modeling time series with irregular and discrete observations. By combining a multi-marginal Doob's h-transform with a variational inference algorithm that exploits the theory of SOC, ACSSM efficiently simulates conditioned dynamics. It leverages amortized inference, a simulation-free latent dynamics framework, and a transformer-based data assimilation scheme for scalable and parallel inference. Empirical results show that ACSSM outperforms baselines on various tasks such as classification, regression, and extrapolation while maintaining computational efficiency across real-world datasets.
Although we present the theoretical basis of our method, a thorough analysis of the following points remains an open challenge. The variational gap due to the linear approximation may lead to cumulative errors over time, which requires further examination, especially for long time series such as those arising in LLM settings. Unlike SMC variants (Heng et al., 2020; Lu & Wang, 2024) that use particle-based importance weighting to mitigate approximation errors, ACSSM depends on high-capacity neural networks to accurately approximate the optimal control. Incorporating multiple controls, akin to a multi-agent dynamics approach (Han & Hu, 2020), may alleviate these challenges by enhancing flexibility and robustness.

Section: REPRODUCIBILITY STATEMENT
On the theoretical side, all proofs and assumptions are deferred to Appendix B due to space constraints.
The training and inference algorithms are detailed in Algorithm 3 and Algorithm 4, respectively. Additional implementation details, such as data preprocessing, are included in Appendix D. We believe these details are sufficient for interested readers to reproduce the results.

Section: ETHICS STATEMENT
In this work, we proposed a method for modeling irregular time series for practical applications, suggesting that ACSSM does not directly influence ethical or societal issues in a positive or negative way. However, because ACSSM can be applied to health-care datasets, we believe it has the potential to benefit society by improving health and well-being of people.

Section: A BRIEF REVIEW OF KEY CONCEPTS AND RELATED WORKS
In this section, we provide a brief overview of key concepts to help clarify the foundation of our proposed method. Additionally, we review related works to offer a deeper understanding.
Probabilistic SSMs Bayesian filtering and smoothing (Särkkä, 2013) serve as fundamental state estimation techniques for probabilistic SSMs. Formally, SSMs can be defined as follows:
$$\text{(Latent transition)}\quad X_{t_i} \sim p_i(x_{t_{i-1}}, dx_{t_i}),\quad X_0 \sim p_0(X_0), \qquad \text{(Observation)}\quad y_{t_i} \sim g_i(y_{t_i} \mid X_{t_i}). \tag{29}$$
The SSMs consist of the $\mathbb{R}^d$-valued latent variables $\{X_t\}_{t \ge 0}$, which are assumed to follow a time-inhomogeneous Markov process with Markov transition densities $\{p_i\}_{i\in[1:k]}$, and the observations $\{y_{t_i}\}_{i\in[1:k]}$ are assumed to be generated from an observation (emission) density $g_i(\cdot \mid X_{t_i})$. Then, the goal is to obtain the filtering/smoothing distribution, for given observations $\{y_{t_i}\}_{i\in[1:k]}$, given by:
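$$p(X_{0:T} \mid \mathcal{H}_{t_k}) = \frac{1}{Z(\mathcal{H}_{t_k})}\, p(\mathcal{H}_{t_k} \mid X_{0:T})\, p(X_{0:T}), \tag{30}$$
where $p(X_{0:T})$ is the prior distribution and $Z(\mathcal{H}_{t_k})$ is a normalization constant, defined as:
$$Z(\mathcal{H}_{t_k}) = \int p(\mathcal{H}_{t_k} \mid X_{0:T})\, p_0(X_0) \prod_{i=1}^{k} p_i(X_{i-1}, X_i)\, dX_{0:T}. \tag{31}$$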
Obtaining the filtering/smoothing distribution in (30) typically relies on recursive Bayesian updates, whose cost scales proportionally with the number of observations, resulting in a computational complexity of O(k) in this context. Previous works (Doerr et al., 2018; Becker et al., 2019; Klushyn et al., 2021b) have proposed RNN-based approximate Bayesian inference methods. However, these models generally assume evenly spaced observations, making it challenging to accurately model irregular time series. On the other hand, deterministic linear SSMs such as S4 (Gu et al., 2021), S5 (Smith et al., 2023), and Mamba (Gu & Dao, 2023) have been introduced, demonstrating improved inference efficiency through parallel computing algorithms.
In contrast, we propose a probabilistic SSM that enables efficient and powerful modeling of latent systems. Our approach supports parallel computation, leading to significant gains in both training and inference efficiency. Specifically, we incorporate a parallel scan algorithm (Blelloch, 1990) into probabilistic inference, effectively reducing the computational complexity from O(k) to O(log k).
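To illustrate why a scan reduces the sequential depth from O(k) to O(log k), the sketch below evaluates a scalar linear recurrence x_i = a_i x_{i-1} + b_i in two ways. It is only an illustration under stated assumptions: the names `sequential_scan` and `doubling_scan` are hypothetical, the recurrence is scalar rather than the Gaussian-moment recursion of ACSSM, and the scan uses simple recursive doubling (Hillis-Steele style) instead of the work-efficient variant of Blelloch (1990).

```python
import random

def sequential_scan(a, b, x0):
    """Reference O(k) recursion x_i = a_i * x_{i-1} + b_i."""
    xs, x = [], x0
    for a_i, b_i in zip(a, b):
        x = a_i * x + b_i
        xs.append(x)
    return xs

def doubling_scan(a, b, x0):
    """Same outputs via an inclusive scan over affine maps x -> A * x + B."""
    A, B = list(a), list(b)
    k, shift = len(A), 1
    while shift < k:
        # Compose map i with the already-composed map i - shift:
        # (A_i, B_i) o (A_j, B_j) = (A_i * A_j, A_i * B_j + B_i).
        # Each entry reads only values from the previous round, so all
        # entries of a round are independent and could run in parallel.
        new_A = A[:shift] + [A[i] * A[i - shift] for i in range(shift, k)]
        new_B = B[:shift] + [A[i] * B[i - shift] + B[i] for i in range(shift, k)]
        A, B = new_A, new_B
        shift *= 2  # ceil(log2 k) rounds in total
    return [A_i * x0 + B_i for A_i, B_i in zip(A, B)]

# Toy data (illustrative assumption): 37 random affine transitions.
rng = random.Random(0)
a = [rng.uniform(0.5, 1.0) for _ in range(37)]
b = [rng.gauss(0.0, 1.0) for _ in range(37)]
seq = sequential_scan(a, b, 0.3)
par = doubling_scan(a, b, 0.3)
max_gap = max(abs(s - p) for s, p in zip(seq, par))
```

The key design point is that composition of affine maps is associative, which is what allows the k sequential updates to be regrouped into logarithmically many rounds.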

Section: Twist function for Conditioned SSMs
To sequentially sample from the smoothing distribution (30), i.e., $p(X_t \mid \mathcal{H}_{t_k})$ over all $t \in [0, T]$ (referred to here as the conditioned sampling problem), it is necessary to compute the marginal distribution $p(X_t \mid \mathcal{H}_{t_k})$. However, this involves an expectation over the marginalized distribution $p(X_t \mid \mathcal{H}_{t_k}) = \int p(X_{0:T} \mid \mathcal{H}_{t_k})\, dX_{t:T}$, which is generally intractable. Fortunately, the distribution $p(X_t \mid \mathcal{H}_{t_k})$ can be factorized as (Chopin et al., 2020):
\begin{align}
p(X_t \mid \mathcal{H}_{t_k}) &\propto p(X_t \mid \mathcal{H}_t) \int p(\mathcal{H}_{t:t_k} \mid X_{t:T})\, dX_{t:T} \tag{32}\\
&= p(\mathcal{H}_t \mid X_t)\, p(X_t) \int p(\mathcal{H}_{t:t_k} \mid X_{t:T})\, dX_{t:T}. \tag{33}
\end{align}
Thus, by approximating the term $\psi(X_t) := \int p(\mathcal{H}_{t:t_k} \mid X_{t:T})\, dX_{t:T}$, the conditioned sampling problem can be effectively addressed. Here, the intractable term $\psi$, often referred to as a twist function, has been the focus of various approximation algorithms proposed in the Sequential Monte Carlo (SMC) literature (Guarniero et al., 2017; Heng et al., 2020; Lawson et al., 2022; Lu & Wang, 2024). Recently, these methods have also been adapted to large language models (LLMs) (Zhao et al., 2024), further demonstrating their versatility in solving controlled language generation problems.
It is worth noting that our h-function defined in (4) serves as an instance of such a twist function within the Feynman-Kac representation. Essentially, it enables us to establish a connection between twisted SSMs and the multi-marginal Doob's h-transform.

Section: Feynman-Kac Models
The Feynman-Kac model provides substantial expressive advantages for analyzing SSMs (Chopin et al., 2020). By offering a flexible measure-theoretic framework, it allows for efficient representation of conditioned SSMs in (30) through the corresponding Feynman-Kac formulae (Del Moral, 2011). Notably, the posterior distribution in (30) can be expressed via a Feynman-Kac model, as described in (2). In the machine learning literature, the Feynman-Kac model has been utilized for abstracting processes such as LLM fine-tuning (Lew et al., 2023) and diffusion-based samplers (Phillips et al., 2024). In this work, we extend the Feynman-Kac model to continuous settings for time-series modeling. For a more in-depth treatment, refer to Chopin et al. (2020).
Continuous Dynamical Models To accurately model irregular time-series data, neural differential equation families have been proposed as an effective paradigm. This is because the latent, continuous dynamics underlying physical time-series can be well-approximated using data-driven methods with parameterized vector fields. Specifically, Neural ODE (Chen et al., 2018) introduced neural network parameterized vector fields for continuous-time dynamical modeling of time-series data.
Building on this, Latent-ODE (Rubanova et al., 2019) proposed latent dynamics by encoding the entire dataset into an initial state, with ODE-RNN as a continuous encoder alternative to standard RNNs. GRU-ODE-B (De Brouwer et al., 2019) incorporated Bayesian principles into Neural ODEs to enable online updates for new observations. On the stochastic dynamics side, Latent-SDE (Li et al., 2020) introduced a variational bound for posterior inference in SDEs, while Latent-SDE H (Zeng et al., 2023) focused on stochastic dynamics evolving within a homogeneous latent space. These neural differential equations generally rely on numerical simulations with dynamic solvers, which can result in significant computation times for both training and inference.
To model continuous time-series using probabilistic SSMs, CD-SSM (Jazwinski, 2007) extends the discrete transitions of latent variables to stochastic transitions governed by SDEs. This approach has inspired neural network-based CD-SSM methods (Schirmer et al., 2022; Ansari et al., 2023) for handling irregularly sampled real-world time-series datasets. Compared to our approach, while Schirmer et al. (2022) and Ansari et al. (2023) also leverage locally linear dynamics to improve scalability, they still require numerical approximations to infer Gaussian moments, which limits their ability to fully utilize parallel computation. In contrast, our method successfully leverages parallel computation, as demonstrated by Theorem 3.8, significantly reducing both training and inference costs in conditioned state-space modeling. Moreover, our SOC formulation with approximated control α offers flexibility in sequential modeling, allowing for various information assimilation schemes utilizing powerful neural network architectures, such as transformers.
Doob's h-transform for Conditioned SDEs In contrast to conditioned SSMs that use a twist function for discrete transitions, we propose conditioned SDEs for continuous transitions to model irregular time-series data. This involves extending the traditional Doob's h-transform to multi-marginal settings. Specifically, Doob's h-transform (Doob, 1957; Rogers & Williams, 2000; Chetrite & Touchette, 2015) is a technique in probability theory that modifies the behavior of a Markov process, effectively conditioning it to reach a desired state or outcome. It can be understood as reweighting the paths of a stochastic process to make certain events more probable.
In the machine learning context, Doob's h-transform has been applied to tasks such as diffusion models (Ye et al., 2022; Liu et al., 2023; Peluchetti, 2023; Shi et al., 2024; Denker et al., 2024; Park et al., 2024), simulating diffusion bridges (Heng et al., 2021; Baker et al., 2024), posterior approximations (Park et al., 2024), and online filtering (Chopin et al., 2023). However, prior works typically focus on single-marginal cases, where for a finite time horizon [0, T ], the conditioning is solely on a terminal constraint P(X T ∈ A) for some set A ∈ B(X ).
One of our contributions is to extend this concept to multi-marginal cases, capturing the continuous dynamics of the posterior distribution conditioned on a collection of observations. Specifically, in Theorem 3.1, we define (multi-marginal) conditional SDEs where the constraint is given by a set of marginals, $\mathbb{P}(X_{t_1} \in A_1, \dots, X_{t_k} \in A_k)$ for any $A_i \in \mathcal{B}(\mathbb{R}^d)$ and $i \in [1:k]$.
Since inferring the corresponding h-function in these multi-marginal settings is more complex, we reformulate this challenge as a SOC problem. This reformulation also requires extensions of existing theoretical results, such as Theorems 3.2 and 3.3. Additionally, we establish a tight variational bound in Theorem 3.6 demonstrating that our proposed multi-marginal Doob's h-transform can be efficiently approximated using the proposed SOC objective.

Section: B PROOFS AND DERIVATIONS
In this section, we present the proofs and derivations for all relevant theorems, lemmas, and corollaries. We first restate the core concepts of stochastic calculus, which will be used without further explanation.
Assumptions. Throughout the paper, we work with a probability space (Ω, F, {F t } t∈[0,T ] , P), where the filtration {F t } t∈[0,T ] supports an R d -valued F t -adapted Wiener process W t for all t ∈ [0, T ]. It is important to note that the P-null set is included in F 0 , indicating that any event with probability zero at time 0 is measurable in the initial σ-algebra.
We assume that $b$ and $\alpha$ satisfy the following conditions:
• (Lipschitz condition): For any $t \in [0, T]$, $\omega \in \Omega$, and $x, x' \in \mathbb{R}^d$, the drift satisfies, for a Lipschitz constant $c_0 > 0$,
$$|b(t, \omega, x) - b(t, \omega, x')| \le c_0 |x - x'|.$$
• (Linear growth condition): For every $x \in \mathbb{R}^d$, the $\mathcal{F}_t$-progressively measurable process $\{b(t, x)\}_{t\in[0,T]}$ satisfies $\mathbb{E}\big[\int_0^T |b_s|^2\, ds\big] < \infty$ and $|b(t, x)| \le c_1 (1 + |x|)$ for $t \in [0, T]$ and $c_1 > 0$.
• (Control function): For any $t \in [0, T]$, $\omega \in \Omega$, $x \in \mathbb{R}^d$, and $\theta, \theta' \in \Theta$, the control function $\alpha$ is $L$-Lipschitz in its parameters, $|\alpha(t, x, \theta) - \alpha(t, x, \theta')| \le L |\theta - \theta'|$. Moreover, it satisfies $\mathbb{E}\big[\int_0^T |\alpha_s|^2\, ds\big] < \infty$.
Definition B.1 (Infinitesimal Generator). Let us consider an Itô diffusion process of the form:
$$dX_t = b(t, X_t)\, dt + \sigma(t)^\top dW_t. \tag{34}$$
Then, the infinitesimal generator of the above diffusion process is given by:
$$\mathcal{A}_t f = \lim_{t \downarrow 0^+} \frac{\mathbb{E}[f(X_t)] - f(x)}{t} = \nabla_x f^\top b + \frac{1}{2} \mathrm{Trace}\big[\sigma\sigma^\top \nabla_{xx} f\big]. \tag{35}$$
$$\mathbb{P}(X_{t_0} \in dx_{t_0}, \cdots, X_{t_N} \in dx_{t_N}) = \mathbb{P}(dx_0) \prod_{i=1}^{N} \mathbf{P}_i(x_{t_{i-1}}, dx_{t_i}), \tag{36}$$
where $\{\mathbf{P}_i\}_{i=0}^{N}$ is a sequence of probability kernels from $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ to $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$: for any event $A \in \mathcal{B}(\mathbb{R}^d)$, $\mathbf{P}_i(x_{t_{i-1}}, A) = \int_A p_i(x_{t_{i-1}}, x_{t_i})\, dx_{t_i}$, where $p_i(x_{t_{i-1}}, x_{t_i}) := p(t_i, x_{t_i} \mid t_{i-1}, x_{t_{i-1}})$ is a transition density obtained as a solution of the Fokker-Planck equation (Risken & Frank, 2012):
$$\partial_t p_t(x_t) = \mathcal{A}^\star_t p_t = -\nabla_x \cdot (b\, p_t) + \frac{1}{2} \mathrm{Trace}\big[\sigma\sigma^\top \nabla_{xx} p_t\big], \tag{37}$$
where $\mathcal{A}^\star$ is the adjoint operator of the generator in (35) and $p_t$ is the Radon-Nikodym density of $\mu_t$ with respect to the Lebesgue measure. By taking $N \to \infty$, we get the path measure $\mathbb{P}(X_{0:T} \in dx_{0:T})$, which describes the weak solutions of the SDE of the form in equation (34).
Lemma B.3 (Itô's formula). Let v(t, x) be C 1 in t and C 2 in x and let X t be the Itô diffusion process of the form in equation (34). Then, the stochastic process v(t, X t ) is also an Itô diffusion process satisfying:
$$dv(t, X_t) = \big[\partial_t v(t, X_t) + \mathcal{A}_t v(t, X_t)\big]\, dt + \nabla_x v(t, X_t)^\top \sigma(t)\, dW_t. \tag{38}$$
Theorem B.4 (Girsanov Theorem). Consider the two Itô diffusion processes of the form
\begin{align}
dX_t &= b(t, X_t)\, dt + \sigma(t, X_t)^\top dW_t, \quad t \in [0, T], \tag{39}\\
dY_t &= \tilde{b}(t, Y_t)\, dt + \sigma(t, Y_t)^\top dW_t, \quad t \in [0, T], \tag{40}
\end{align}
where both drift functions $b, \tilde{b}$ and the diffusion function $\sigma$, assumed to be invertible, are adapted to $\mathcal{F}_t$ and $W_{[0,T]}$ is a $\mathbb{P}$-Wiener process. Moreover, consider $\mathbb{P}$ as the path measure induced by (39). Let us define $H_t := \sigma^{-1}(\tilde{b} - b)$, which is assumed to satisfy Novikov's condition (i.e., $\mathbb{E}_{\mathbb{P}}\big[\exp\big(\tfrac{1}{2}\int_0^T \|H_s\|^2\, ds\big)\big] < \infty$), so that the $\mathbb{P}$-martingale process
$$M_t := \exp\Big(\int_0^t H_s^\top dW_s - \frac{1}{2}\int_0^t \|H_s\|^2\, ds\Big) \tag{41}$$
satisfies $\mathbb{E}_{\mathbb{P}}[M_T] = 1$.
Then, for the path measure $\mathbb{Q}$ given by $d\mathbb{Q} = M_T\, d\mathbb{P}$, the process $\tilde{W}_t = W_t - \int_0^t H_s\, ds$ is a $\mathbb{Q}$-Wiener process and $Y_t$ can be represented as
$$dY_t = b(t, Y_t)\, dt + \sigma(t, Y_t)^\top d\tilde{W}_t, \quad t \in [0, T]. \tag{42}$$
Therefore, the $\mathbb{Q}$-law of the process $Y_t$ is the same as the $\mathbb{P}$-law of the process $X_t$.
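The change of measure in (41) can be checked numerically in the simplest possible case. The sketch below is an illustration under assumptions not made in the paper: a scalar Wiener process, a constant $H$, and a hypothetical helper name `girsanov_check`. It estimates $\mathbb{E}_{\mathbb{P}}[M_T]$, which should be close to 1, and $\mathbb{E}_{\mathbb{P}}[M_T W_T] = \mathbb{E}_{\mathbb{Q}}[W_T]$, which should be close to $HT$ because $W_t - Ht$ is a Wiener process under $\mathbb{Q}$.

```python
import math
import random

def girsanov_check(H=0.5, T=1.0, n=200_000, seed=0):
    """Monte Carlo estimates of E_P[M_T] and E_P[M_T * W_T] for constant H."""
    rng = random.Random(seed)
    sum_M, sum_MW = 0.0, 0.0
    for _ in range(n):
        W_T = rng.gauss(0.0, math.sqrt(T))          # W_T ~ N(0, T) under P
        M_T = math.exp(H * W_T - 0.5 * H * H * T)   # exponential martingale at T
        sum_M += M_T
        sum_MW += M_T * W_T                         # reweighting = expectation under Q
    return sum_M / n, sum_MW / n

mean_M, mean_W_under_Q = girsanov_check()
```

With the default arguments, `mean_M` should be near 1 and `mean_W_under_Q` near $HT = 0.5$, up to Monte Carlo error.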
B.1 PROOF OF THEOREM 3.1
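We start the section by showing the normalizing property of $\{f_i\}_{i\in[1:k]}$ in (3). By definition, it holds that
$$\prod_{i=1}^{k} L_i(g_i) = \prod_{i=1}^{k} \int_{\mathbb{R}^d} g_i(y_{t_i} \mid x_{t_i})\, d\mathbb{P}(x_{0:T}) \overset{(i)}{=} \mathbb{E}_{\mathbb{P}}\Big[\prod_{i=1}^{k} g_i(y_{t_i} \mid x_{t_i})\Big] = Z(\mathcal{H}_{t_k}), \tag{43}$$
where (i) follows from the conditional independence of $y_{t_i}$ given $x_{t_i}$ for all $i \in [1:k]$. Hence, we get the normalizing property:
$$\mathbb{E}_{\mathbb{P}}\Big[\prod_{i=1}^{k} f_i(x_{t_i})\Big] = \mathbb{E}_{\mathbb{P}}\Big[\frac{\prod_{i=1}^{k} g_i(y_{t_i} \mid x_{t_i})}{\prod_{i=1}^{k} L_{t_i}(g_{t_i})}\Big] = \frac{1}{Z(\mathcal{H}_{t_k})}\, \mathbb{E}_{\mathbb{P}}\Big[\prod_{i=1}^{k} g_i(y_{t_i} \mid x_{t_i})\Big] = 1. \tag{44}$$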
Theorem 3.1 (Multi-marginal Doob's h-transform). Let us define a sequence of functions $\{h_i\}_{i\in[1:k]}$, where each $h_i : [t_{i-1}, t_i) \times \mathbb{R}^d \to \mathbb{R}_+$, for all $i \in [1:k]$, is a conditional expectation $h_i(t, x_t) := \mathbb{E}_{\mathbb{P}}\big[\prod_{j \ge i}^{k} f_j(y_{t_j} \mid X_{t_j}) \,\big|\, X_t = x_t\big]$, where $\{f_i\}_{i\in[1:k]}$ is defined in (3). Now, we define a function $h : [0, T] \times \mathbb{R}^d \to \mathbb{R}_+$ by combining the functions $\{h_i\}_{i\in[1:k]}$,
$$h(t, x) := \sum_{i=1}^{k} h_i(t, x)\, \mathbf{1}_{[t_{i-1}, t_i)}(t). \tag{45}$$
Then, with the initial condition $\mu^\star_0(dx_0) = h_1(t_0, x_0)\, \mu_0(dx_0)$, the solution of the following conditional SDE induces the posterior path measure $\mathbb{P}^\star$ in (2):
$$\text{(Conditioned State)}\quad dX^\star_t = \big[b(t, X^\star_t) + \nabla_x \log h(t, X^\star_t)\big]\, dt + dW_t. \tag{46}$$
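Proof. We start with the interval $[t_{i-1}, t_i)$ without loss of generality. For all $t \in [t_{i-1}, t_i)$ and for any $A \in \mathcal{B}(\mathbb{R}^d)$, the transition kernel of the conditioned process is defined through the transition kernel $\mathbf{P}_i$ of the original Markov process and the h-function defined in Theorem 3.1:
$$\mathbf{P}^{h_i}_i(X_{t_i} \in A \mid X_t = x) := \mathbf{P}^{h_i}_i(x_t, A) = \int_A \frac{h_i(t_i, x_{t_i})}{h_i(t, x_t)}\, \mathbf{P}_i(x_t, dx_{t_i}). \tag{47}$$
By the definition of $h$, the transition kernel $\mathbf{P}^{h_i}_i(x_t, \cdot)$ is a probability kernel:
\begin{align}
\int_{\mathbb{R}^d} \mathbf{P}^{h_i}_i(x_t, dx_{t_i}) &= \int_{\mathbb{R}^d} \frac{h_i(t_i, x_{t_i})}{h_i(t, x_t)}\, \mathbf{P}_i(x_t, dx_{t_i}) \tag{48}\\
&= \frac{\int_{\mathbb{R}^d} h_i(t_i, x_{t_i})\, \mathbf{P}_i(x_t, dx_{t_i})}{h_i(t, x_t)} \tag{49}\\
&\overset{(i)}{=} \frac{\int_{\mathbb{R}^d} f_i(y_{t_i} \mid x_{t_i})\, h_{i+1}(t_i, x_{t_i})\, \mathbf{P}_i(x_t, dx_{t_i})}{\mathbb{E}_{\mathbb{P}}\big[\prod_{j=i}^{k} f_j(y_{t_j} \mid X_{t_j}) \,\big|\, X_t = x_t\big]} \tag{50}\\
&= \frac{\mathbb{E}_{\mathbb{P}}\big[\prod_{j=i}^{k} f_j(y_{t_j} \mid X_{t_j}) \,\big|\, X_t = x_t\big]}{\mathbb{E}_{\mathbb{P}}\big[\prod_{j=i}^{k} f_j(y_{t_j} \mid X_{t_j}) \,\big|\, X_t = x_t\big]} = 1, \tag{51}
\end{align}
where (i) follows from the recursion established by the definition of $h$ in Theorem 3.1:
$$h_i(t_i, x_{t_i}) = f_i(y_{t_i} \mid x_{t_i})\, h_{i+1}(t_i, x_{t_i}), \quad \forall i \in [1 : k-1]. \tag{52}$$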
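Now, the infinitesimal generator for $\mathbb{P}^h$ can be computed for any $\varphi \in C^{1,2}([0, T] \times \mathbb{R}^d)$ and for $\mathbb{P}$-almost all $x \in \mathbb{R}^d$:
\begin{align}
\mathcal{A}^{h_i}_t \varphi_t &= \lim_{s \downarrow 0} \frac{\mathbb{E}_{\mathbb{P}^h}[\varphi(t+s, X_{t+s}) \mid X_t = x] - \varphi(t, x)}{s} \tag{53}\\
&= \lim_{s \downarrow 0} \frac{\mathbb{E}_{\mathbb{P}}\Big[[\varphi(t+s, X_{t+s}) - \varphi(t, x)]\, \frac{\mathbf{P}^{h_i}_i(x, dx_{t+s})}{\mathbf{P}_i(x, dx_{t+s})} \,\Big|\, X_t = x\Big]}{s} \tag{54}\\
&= \lim_{s \downarrow 0} \frac{\mathbb{E}_{\mathbb{P}}\Big[[\varphi(t+s, X_{t+s}) - \varphi(t, x)]\, \frac{h_i(t+s, X_{t+s})}{h_i(t, x)} \,\Big|\, X_t = x\Big]}{s} \tag{55}\\
&= \lim_{s \downarrow 0} \frac{\mathbb{E}_{\mathbb{P}}\Big[[\varphi(t+s, X_{t+s}) - \varphi(t, x)] \Big(\frac{h_i(t+s, X_{t+s}) - h_i(t, x)}{h_i(t, x)} + 1\Big) \,\Big|\, X_t = x\Big]}{s} \tag{56}\\
&\overset{(i)}{=} \mathcal{A}_t \varphi_t + \lim_{s \downarrow 0} \frac{\mathbb{E}_{\mathbb{P}}\big[[\varphi(t+s, X_{t+s}) - \varphi(t, x)]\, [h_i(t+s, X_{t+s}) - h_i(t, x)] \,\big|\, X_t = x\big]}{s\, h_i(t, x)}, \tag{57}
\end{align}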
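where (i) follows from the definition of the infinitesimal generator. Now, we shall compute the second term on the RHS of equation (57), adapting the argument of Léonard (2011). By employing basic stochastic calculus, for the stochastic processes $\varphi_t = \varphi(t, X_t)$ and $h_{i,t} = h_i(t, X_t)$,
\begin{align}
\varphi_{t+s} h_{i,t+s} &= \varphi_0 h_{i,0} + \int_0^{t+s} \varphi_u\, dh_{i,u} + \int_0^{t+s} h_{i,u}\, d\varphi_u + [\varphi, h_i]_{t+s}, \tag{58}\\
\varphi_t h_{i,t} &= \varphi_0 h_{i,0} + \int_0^{t} \varphi_u\, dh_{i,u} + \int_0^{t} h_{i,u}\, d\varphi_u + [\varphi, h_i]_t, \tag{59}
\end{align}
where $[\varphi, h_i]_t = \int_0^t d\varphi_u\, dh_{i,u}$ is the quadratic covariation of $\varphi$ and $h_i$. By subtracting equation (59) from equation (58),
$$\varphi_{t+s} h_{i,t+s} - \varphi_t h_{i,t} = \int_t^{t+s} \varphi_u\, dh_{i,u} + \int_t^{t+s} h_{i,u}\, d\varphi_u + [\varphi, h_i]_{t+s} - [\varphi, h_i]_t. \tag{60}$$
Applying integration by parts leads to the following equation:
\begin{align}
(\varphi_{t+s} - \varphi_t)(h_{i,t+s} - h_{i,t}) &= \varphi_{t+s} h_{i,t+s} - \varphi_t h_{i,t} - \varphi_t (h_{i,t+s} - h_{i,t}) - h_{i,t}(\varphi_{t+s} - \varphi_t) \tag{61}\\
&= \int_t^{t+s} (\varphi_u - \varphi_t)\, dh_{i,u} + \int_t^{t+s} (h_{i,u} - h_{i,t})\, d\varphi_u + [\varphi, h_i]_{t+s} - [\varphi, h_i]_t, \notag
\end{align}
where $\varphi_t (h_{i,t+s} - h_{i,t}) = \int_t^{t+s} \varphi_t\, dh_{i,u}$. Moreover, by applying Itô's formula, we get
$$d\varphi_t = \mathcal{A}_t \varphi_t\, dt + (\nabla_x \varphi)^\top dW_t, \qquad dh_{i,t} = \mathcal{A}_t h_{i,t}\, dt + (\nabla_x h_{i,t})^\top dW_t. \tag{62}$$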
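Therefore, since $W_t$ is a $\mathbb{P}$-martingale,
\begin{align}
&\mathbb{E}_{\mathbb{P}}\big[(\varphi_{t+s} - \varphi_t)(h_{i,t+s} - h_{i,t}) \,\big|\, X_t = x\big] \tag{63}\\
&= \mathbb{E}_{\mathbb{P}}\Big[\int_t^{t+s} (\varphi_u - \varphi_t)\, \mathcal{A}_u h_{i,u}\, du + \int_t^{t+s} (h_{i,u} - h_{i,t})\, \mathcal{A}_u \varphi_u\, du + [\varphi, h_i]_{t+s} - [\varphi, h_i]_t \,\Big|\, X_t = x\Big] \tag{64}\\
&= \underbrace{\mathbb{E}^{t,x}_{\mathbb{P}}\Big[\int_t^{t+s} (\varphi_u - \varphi_t)\, \mathcal{A}_u h_{i,u}\, du\Big]}_{(A)} + \underbrace{\mathbb{E}^{t,x}_{\mathbb{P}}\Big[\int_t^{t+s} (h_{i,u} - h_{i,t})\, \mathcal{A}_u \varphi_u\, du\Big]}_{(B)} + \underbrace{\mathbb{E}^{t,x}_{\mathbb{P}}\big[[\varphi, h_i]_{t+s} - [\varphi, h_i]_t\big]}_{(C)}. \notag
\end{align}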
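For the first term (A), Hölder's inequality with $1/p + 1/q = 1$ and $p, q \ge 1$ yields
\begin{align}
\mathbb{E}^{t,x}_{\mathbb{P}}\Big[\int_t^{t+s} (\varphi_u - \varphi_t)\, \mathcal{A}_u h_{i,u}\, du\Big] &\le \Big(\mathbb{E}^{t,x}_{\mathbb{P}}\Big[\int_t^{t+s} |\varphi_u - \varphi_t|^q\, du\Big]\Big)^{1/q} \Big(\mathbb{E}^{t,x}_{\mathbb{P}}\Big[\int_t^{t+s} |\mathcal{A}_u h_{i,u}|^p\, du\Big]\Big)^{1/p} \tag{65}\\
&= \Big(\mathbb{E}^{t,x}_{\mathbb{P}}\Big[\int_t^{t+s} |\varphi_u - \varphi_t|^q\, du\Big]\Big)^{1/q} \Big(\int_t^{t+s} \mathbb{E}^{t,x}_{\mathbb{P}}\big[|\mathcal{A}_u h_{i,u}|^p\big]\, du\Big)^{1/p}. \tag{66}
\end{align}
Given the bounded and continuous function $\varphi$, the dominated convergence theorem yields
$$\Big(\lim_{s \downarrow 0} \mathbb{E}^{t,x}_{\mathbb{P}}\Big[\int_t^{t+s} |\varphi_u - \varphi_t|^q\, du\Big]\Big)^{1/q} = \Big(\mathbb{E}^{t,x}_{\mathbb{P}}\Big[\lim_{s \downarrow 0} \int_t^{t+s} |\varphi_u - \varphi_t|^q\, du\Big]\Big)^{1/q} = 0,$$
and since $h_i \in C^{1,2}([t_{i-1}, t_i), \mathbb{R}^d)$ and by the boundedness of $b$ under the assumptions of Appendix B, the following inequality holds:
$$|\mathcal{A}_u h_{i,u}|^p \le |\partial_t h_{i,u}|^p + |(\nabla_x h_{i,u})^\top b|^p + \Big|\frac{1}{2} \mathrm{Trace}[\nabla_{xx} h_{i,u}]\Big|^p < \infty, \tag{67}$$
for any $u \in [t_{i-1}, t_i)$, $\mathbb{P}$-almost surely. In other words, $\sup_{u \in [t, t+s]} \mathbb{E}^{t,x}_{\mathbb{P}}\big[|\mathcal{A}_u h_{i,u}|^p\big] < \infty$ for all $t \in [0, T]$, $s > 0$, and $p > 1$. Therefore, we get $\lim_{s \downarrow 0} \mathbb{E}^{t,x}_{\mathbb{P}}\big[\int_t^{t+s} (\varphi_u - \varphi_t)\, \mathcal{A}_u h_{i,u}\, du\big] = 0$, and a similar result holds for the second term (B), i.e., $\lim_{s \downarrow 0} \mathbb{E}^{t,x}_{\mathbb{P}}\big[\int_t^{t+s} (h_{i,u} - h_{i,t})\, \mathcal{A}_u \varphi_u\, du\big] = 0$.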
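For the last term (C), the definition of the quadratic covariation of $\varphi$ and $h_i$ yields:
$$\mathbb{E}^{t,x}_{\mathbb{P}}\big[[\varphi, h_i]_{t+s} - [\varphi, h_i]_t\big] = \mathbb{E}^{t,x}_{\mathbb{P}}\Big[\int_t^{t+s} d\varphi_u\, dh_{i,u}\Big] = \mathbb{E}^{t,x}_{\mathbb{P}}\Big[\int_t^{t+s} (\nabla_x \varphi_u)^\top \nabla_x h_{i,u}\, du\Big]. \tag{68}$$
Subsequently, by taking the limit in (57), we get the following result:
\begin{align}
&\lim_{s \downarrow 0} \frac{\mathbb{E}_{\mathbb{P}}\big[[\varphi(t+s, X_{t+s}) - \varphi(t, x)]\, [h_i(t+s, X_{t+s}) - h_i(t, x)] \,\big|\, X_t = x\big]}{s\, h_i(t, x)} \tag{69}\\
&= \lim_{s \downarrow 0} \frac{\mathbb{E}_{\mathbb{P}}\big[\int_t^{t+s} (\nabla_x \varphi(u, X_u))^\top \nabla_x h_i(u, X_u)\, du \,\big|\, X_t = x\big]}{s\, h_i(t, x)} \tag{70}\\
&= (\nabla_x \varphi(t, X_t))^\top \nabla_x \log h_i(t, X_t). \tag{71}
\end{align}
Therefore, the infinitesimal generator for $\mathbf{P}^{h_i}_i$ is given by:
$$\mathcal{A}^{h_i}_t \varphi_t = \mathcal{A}_t \varphi_t + (\nabla_x \varphi_t)^\top \nabla_x \log h_{i,t}, \tag{72}$$
which shows that the conditioned SDE on an interval $[t_{i-1}, t_i)$ is given by
$$dX^h_t = \big[b(t, X_t) + \nabla_x \log h_i(t, X_t)\big]\, dt + dW_t. \tag{73}$$
Hence, combining the generators over the entire interval yields:
\begin{align}
\mathcal{A}^h_t \varphi &= \mathcal{A}_t \varphi_t + \sum_{i=1}^{k} (\nabla_x \varphi_t)^\top \nabla_x \log h_{i,t}\, \mathbf{1}_{[t_{i-1}, t_i)}(t) \tag{74}\\
&= \mathcal{A}_t \varphi_t + (\nabla_x \varphi_t)^\top \nabla_x \log h_t. \tag{75}
\end{align}
It implies that the conditioned SDE for the entire interval $[0, T]$ is given by:
$$dX^h_t = \big[b(t, X_t) + \nabla_x \log h(t, X_t)\big]\, dt + dW_t. \tag{76}$$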
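Now, assume that $X^h_0 \sim \mu^\star_0(x)$, where $\mu^\star_0(x)$ is absolutely continuous with respect to $\mu_0(x)$. Then, the path measure induced by $X^h_t$ can be computed as:
\begin{align}
d\mathbb{P}^h(x_{0:T}) &= d\mu^\star_0(x_0) \prod_{i=1}^{k} \Big[\prod_{j=1}^{N} \mathbf{P}^{h_i}_{i(j)}(x_{t_{i(j-1)}}, dx_{t_{i(j)}})\Big] \tag{77}\\
&= d\mu^\star_0(x_0) \prod_{i=1}^{k} \frac{h_i(t_i, x_{t_i})}{h_i(t_{i-1}, x_{t_{i-1}})} \Big[\prod_{j=1}^{N} \mathbf{P}_{i(j)}(x_{t_{i(j-1)}}, dx_{t_{i(j)}})\Big] \tag{78}\\
&= d\mu^\star_0(x_0) \prod_{i=1}^{k} \frac{h_{i+1}(t_i, x_{t_i})\, f_i(y_{t_i} \mid x_{t_i})}{h_i(t_{i-1}, x_{t_{i-1}})} \Big[\prod_{j=1}^{N} \mathbf{P}_{i(j)}(x_{t_{i(j-1)}}, dx_{t_{i(j)}})\Big] \tag{79}\\
&\overset{N \uparrow \infty}{=} \frac{d\mu^\star_0}{d\mu_0}(x_0)\, \frac{h_{k+1}(t_k, x_{t_k})}{h_1(t_0, x_0)} \prod_{i=1}^{k} f_i(y_{t_i} \mid x_{t_i})\, d\mathbb{P}(x_{0:T}), \tag{80}
\end{align}
where, for all $i \in [1:k]$, we define an increasing sequence $\{i(j)\}_{j\in[0:N]}$ with $i(0) = i - 1$, $i(1) = i - 1 + \frac{1}{N}$ and $i(N) = i$. Hence, taking $d\mu^\star_0(x_0) = h_1(t_0, x_0)\, d\mu_0(x_0)$ and $h_{k+1} = 1$ yields
\begin{align}
d\mathbb{P}^h(x_{0:T}) &= \prod_{i=1}^{k} f_i(y_{t_i} \mid x_{t_i})\, d\mathbb{P}(x_{0:T}) \tag{81}\\
&= \frac{1}{Z(\mathcal{H}_{t_k})} \prod_{i=1}^{k} g_i(y_{t_i} \mid x_{t_i})\, d\mathbb{P}(x_{0:T}) \tag{82}\\
&= d\mathbb{P}^\star(x_{0:T}). \tag{83}
\end{align}
This concludes the proof.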
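A closed-form special case makes the conditioned SDE (46) concrete. The sketch below is an illustration under assumptions that are not part of the paper: scalar Brownian motion started at 0 (so $b = 0$), a single observation $y$ at time $T$ with Gaussian likelihood of variance `obs_var`, and a hypothetical function name `simulate_conditioned_bm`. In this case $h(t, x) \propto \mathcal{N}(y;\, x,\, T - t + \texttt{obs\_var})$, hence $\nabla_x \log h(t, x) = (y - x)/(T - t + \texttt{obs\_var})$, and the terminal law of the conditioned process should match the Gaussian posterior with mean $yT/(T + \texttt{obs\_var})$ and variance $T\,\texttt{obs\_var}/(T + \texttt{obs\_var})$.

```python
import math
import random

def simulate_conditioned_bm(y=1.5, obs_var=0.25, T=1.0,
                            n_steps=100, n_paths=4000, seed=0):
    """Euler-Maruyama simulation of dX = grad_x log h(t, X) dt + dW, X_0 = 0."""
    rng = random.Random(seed)
    dt = T / n_steps
    terminals = []
    for _ in range(n_paths):
        x = 0.0
        for step in range(n_steps):
            t = step * dt
            # Score of the h-function for a Gaussian observation at time T.
            drift = (y - x) / (T - t + obs_var)
            x += drift * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        terminals.append(x)
    mean = sum(terminals) / n_paths
    var = sum((v - mean) ** 2 for v in terminals) / (n_paths - 1)
    return mean, var

post_mean, post_var = simulate_conditioned_bm()
# Analytic Gaussian posterior of X_T given y for the default arguments.
exact_mean = 1.5 * 1.0 / (1.0 + 0.25)
exact_var = 1.0 * 0.25 / (1.0 + 0.25)
```

Up to discretization and Monte Carlo error, `post_mean` and `post_var` should agree with `exact_mean` and `exact_var`, which is the single-observation instance of $\mathbb{P}^h = \mathbb{P}^\star$.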
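B.2 PROOF OF THEOREM 3.2
Theorem 3.2 (Dynamic Programming Principle). Let us consider a sequence of left continuous functions $\{\mathcal{V}_i\}_{i\in[1:k+1]}$, where each $\mathcal{V}_i \in C^{1,2}([t_{i-1}, t_i) \times \mathbb{R}^d)$ is given by
$$\mathcal{V}_i(t, x_t) := \min_{\alpha \in \mathbb{A}} \mathbb{E}_{\mathbb{P}^\alpha}\Big[\int_{t_{i-1}}^{t_i} \frac{1}{2}\|\alpha_s\|^2\, ds - \log f_i(y_{t_i} \mid X^\alpha_{t_i}) + \mathcal{V}_{i+1}(t_i, X^\alpha_{t_i}) \,\Big|\, X_t = x_t\Big], \tag{84}$$
for all $i \in [1:k]$, with $\mathcal{V}_{k+1} = 0$. Then, for any $0 \le t \le u \le T$, the value function $\mathcal{V}$ for the cost function in (7) satisfies the following recursion:
$$\mathcal{V}(t, x_t) = \min_{\alpha \in \mathbb{A}} \mathbb{E}_{\mathbb{P}^\alpha}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^\alpha_{t_i}) + \mathcal{V}_{I(u)+1}(t_{I(u)}, X^\alpha_{t_{I(u)}}) \,\Big|\, X^\alpha_t = x_t\Big], \tag{85}$$
with the indexing function $I(u) = \max\{i \in [1:k] \mid t_i \le u\}$.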
Proof. Following the approach used in the proof of the standard dynamic programming principle with the flow property induced by Markov control (Van Handel, 2007), we can apply similar methods to our cost function. We start the proof by establishing the recursion of $\mathcal{J}$. Let us define the sequence of left continuous cost functions $\{\mathcal{J}_i\}_{i\in[1:k+1]}$:
$$\mathcal{J}_i(t, x_t, \alpha) := \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_{t_{i-1}}^{t_i} \frac{1}{2}\|\alpha_s\|^2\, ds - \log f_i(y_{t_i} \mid X^\alpha_{t_i}) + \mathcal{J}_{i+1}(t_i, X^\alpha_{t_i}, \alpha)\Big], \tag{86}$$
where we denote $\mathbb{E}^{t,x}_{\mathbb{P}}[\cdot] = \mathbb{E}_{\mathbb{P}}[\cdot \mid X_t = x]$, for all $i \in [1:k]$, and $\mathcal{J}_{k+1} = 0$. Since $\mathbb{P}^\alpha$ is the law of a Markov process, $\mathcal{J}$ satisfies the following recursion for any $0 \le t \le u \le T$:
$$\mathcal{J}(t, x_t, \alpha) = \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^\alpha_{t_i}) + \mathcal{J}_{I(u)+1}(t_{I(u)}, X^\alpha_{t_{I(u)}}, \alpha)\Big], \tag{87}$$
with the indexing function $I(u) = \max\{i \in [1:k] \mid t_i \le u\}$.
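For any $\epsilon > 0$, there exists a control $\alpha' \in \mathbb{A}[t, T]$ such that
\begin{align}
\mathcal{V}(t, x) + \epsilon &\ge \mathcal{J}(t, x, \alpha') \tag{88}\\
&= \mathbb{E}^{t,x_t}_{\mathbb{P}^{\alpha'}}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha'_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^{\alpha'}_{t_i}) + \mathcal{J}_{I(u)+1}(t_{I(u)}, X^{\alpha'}_{t_{I(u)}}, \alpha')\Big] \tag{89}\\
&\ge \mathbb{E}^{t,x_t}_{\mathbb{P}^{\alpha'}}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha'_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^{\alpha'}_{t_i}) + \mathcal{V}_{I(u)+1}(t_{I(u)}, X^{\alpha'}_{t_{I(u)}})\Big] \tag{90}\\
&\ge \min_{\alpha' \in \mathbb{A}[t,T]} \mathbb{E}^{t,x_t}_{\mathbb{P}^{\alpha'}}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha'_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^{\alpha'}_{t_i}) + \mathcal{V}_{I(u)+1}(t_{I(u)}, X^{\alpha'}_{t_{I(u)}})\Big]. \tag{91}
\end{align}
Since $\epsilon$ was arbitrary, letting $\epsilon \to 0$ we get:
$$\mathcal{V}(t, x) \ge \min_{\alpha' \in \mathbb{A}[t,T]} \mathbb{E}^{t,x_t}_{\mathbb{P}^{\alpha'}}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha'_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^{\alpha'}_{t_i}) + \mathcal{V}_{I(u)+1}(t_{I(u)}, X^{\alpha'}_{t_{I(u)}})\Big]. \tag{92}$$
For the reverse direction, consider the control $\tilde{\alpha} \in \mathbb{A}[t, T]$ obtained by combining two controls:
$$\tilde{\alpha}_s := \begin{cases} \alpha^1_s, & s \in [t, t_{I(u)}), \\ \alpha^2_s, & s \in [t_{I(u)}, T]. \end{cases} \tag{93}$$
Then, by following the definition of the value function,
\begin{align}
\mathcal{J}(t, x, \tilde{\alpha}) &\ge \min_{\alpha^2 \in \mathbb{A}[t_{I(u)}, T]} \mathcal{J}(t, x, \tilde{\alpha}) \tag{94}\\
&= \mathbb{E}^{t,x_t}_{\mathbb{P}^{\alpha^1}}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha^1_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^{\alpha^1}_{t_i}) + \mathcal{V}_{I(u)+1}(t_{I(u)}, X^{\alpha^1}_{t_{I(u)}})\Big] \tag{95}\\
&\ge \min_{\alpha^1 \in \mathbb{A}[t, t_{I(u)})} \mathbb{E}^{t,x_t}_{\mathbb{P}^{\alpha^1}}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha^1_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^{\alpha^1}_{t_i}) + \mathcal{V}_{I(u)+1}(t_{I(u)}, X^{\alpha^1}_{t_{I(u)}})\Big] \tag{96}\\
&= \min_{\alpha \in \mathbb{A}[t, T]} \mathbb{E}^{t,x_t}_{\mathbb{P}^{\alpha}}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^{\alpha}_{t_i}) + \mathcal{V}_{I(u)+1}(t_{I(u)}, X^{\alpha}_{t_{I(u)}})\Big] \tag{97}\\
&\ge \mathcal{V}(t, x). \tag{98}
\end{align}
Combining both inequalities in (92) and (97)-(98), we arrive at the desired result:
$$\mathcal{V}(t, x) = \min_{\alpha \in \mathbb{A}} \mathbb{E}_{\mathbb{P}^\alpha}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^\alpha_{t_i}) + \mathcal{V}_{I(u)+1}(t_{I(u)}, X^\alpha_{t_{I(u)}}) \,\Big|\, X^\alpha_t = x_t\Big]. \tag{99}$$
This concludes the proof.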

Section: B.3 PROOF OF THEOREM 3.3
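Theorem 3.3 (Verification Theorem). Suppose there exists a sequence of left continuous functions $\mathcal{V}_i(t, x) \in C^{1,2}([t_{i-1}, t_i), \mathbb{R}^d)$, for all $i \in [1:k]$, satisfying the following Hamilton-Jacobi-Bellman (HJB) equation:
\begin{align}
&\partial_t \mathcal{V}_{i,t} + \mathcal{A}_t \mathcal{V}_{i,t} + \min_{\alpha \in \mathbb{A}}\Big[(\nabla_x \mathcal{V}_{i,t})^\top \alpha_{i,t} + \frac{1}{2}\|\alpha_{i,t}\|^2\Big] = 0, \quad t_{i-1} \le t < t_i, \tag{100}\\
&\mathcal{V}_i(t_i, x) = -\log f_i(y_{t_i} \mid x) + \mathcal{V}_{i+1}(t_i, x), \quad t = t_i, \quad \forall i \in [1:k], \tag{101}
\end{align}
where the minimum is attained by $\alpha^\star_i(t, x) = -\nabla_x \mathcal{V}_i(t, x)$. Now, define a function $\alpha^\star : [0, t_k] \times \mathbb{R}^d \to \mathbb{R}^d$ by combining the optimal controls $\{\alpha^\star_i\}_{i\in\{1,\cdots,k\}}$,
$$\alpha^\star(t, x) := \sum_{i=1}^{k} \alpha^\star_i(t, x)\, \mathbf{1}_{[t_{i-1}, t_i)}(t). \tag{102}$$
Then, $\mathcal{J}(t, x_t, \alpha^\star) \le \mathcal{J}(t, x_t, \alpha)$ holds for any $(t, x_t) \in [0, T] \times \mathbb{R}^d$ and $\alpha \in \mathbb{A}$. In other words, $\alpha^\star$ is the optimal control for $\mathcal{V}$ in (9).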
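Proof. Without loss of generality, consider $t \in [t_{i-1}, t_i)$. By applying Itô's formula to the value function $\mathcal{V}$ and taking the expectation with respect to $\mathbb{P}^\alpha$, we obtain
$$\mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\big[\mathcal{V}_i(t_i, X^\alpha_{t_i})\big] = \mathcal{V}_i(t, x) + \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_i} \big(\partial_t \mathcal{V}_{i,s} + \mathcal{A}_s \mathcal{V}_{i,s} + (\nabla_x \mathcal{V}_{i,s})^\top \alpha_{i,s}\big)\, ds\Big], \tag{103}$$
where we denote $\mathbb{E}^{t,x}_{\mathbb{P}}[\cdot] = \mathbb{E}_{\mathbb{P}}[\cdot \mid X_t = x]$. By adding the Lagrangian term $\mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\big[\int_t^{t_i} \frac{1}{2}\|\alpha_{i,s}\|^2\, ds\big]$ to both sides of (103), we get for the LHS of (103)
\begin{align}
\text{LHS} &= \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\big[\mathcal{V}_i(t_i, X^\alpha_{t_i})\big] + \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_i} \frac{1}{2}\|\alpha_{i,s}\|^2\, ds\Big] \tag{104}\\
&= \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\mathcal{V}_i(t_i, X^\alpha_{t_i}) + \int_t^{t_i} \frac{1}{2}\|\alpha_{i,s}\|^2\, ds\Big] \tag{105}\\
&\overset{(i)}{=} \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_i} \frac{1}{2}\|\alpha_{i,s}\|^2\, ds - \log f_i(y_{t_i} \mid X^\alpha_{t_i}) + \mathcal{V}_{i+1}(t_i, X^\alpha_{t_i})\Big] \tag{106}\\
&= \mathcal{J}_i(t, x, \alpha), \tag{107}
\end{align}
where (i) follows from the terminal condition of the HJB equation in (11). Now, for the RHS of (103), we have:
\begin{align}
\text{RHS} &= \mathcal{V}_i(t, x) + \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_i} \big(\partial_t \mathcal{V}_{i,s} + \mathcal{A}_s \mathcal{V}_{i,s} + (\nabla_x \mathcal{V}_{i,s})^\top \alpha_{i,s}\big)\, ds\Big] + \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_i} \frac{1}{2}\|\alpha_{i,s}\|^2\, ds\Big] \tag{108}\\
&= \mathcal{V}_i(t, x) + \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_i} \Big(\partial_t \mathcal{V}_{i,s} + \mathcal{A}_s \mathcal{V}_{i,s} + (\nabla_x \mathcal{V}_{i,s})^\top \alpha_{i,s} + \frac{1}{2}\|\alpha_{i,s}\|^2\Big)\, ds\Big] \tag{109}\\
&\overset{(i)}{=} \mathcal{V}_i(t, x) + \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_i} \Big((\nabla_x \mathcal{V}_{i,s})^\top \alpha_{i,s} + \frac{1}{2}\|\alpha_{i,s}\|^2 - \min_{\alpha \in \mathbb{A}}\Big[(\nabla_x \mathcal{V}_{i,s})^\top \alpha_{i,s} + \frac{1}{2}\|\alpha_{i,s}\|^2\Big]\Big)\, ds\Big], \tag{110}
\end{align}
where (i) follows from the HJB equation in (10). Therefore, we get the following result:
$$\mathcal{J}_i(t, x, \alpha) = \mathcal{V}_i(t, x) + \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_i} \Big((\nabla_x \mathcal{V}_{i,s})^\top \alpha_{i,s} + \frac{1}{2}\|\alpha_{i,s}\|^2 - \min_{\alpha \in \mathbb{A}}\Big[(\nabla_x \mathcal{V}_{i,s})^\top \alpha_{i,s} + \frac{1}{2}\|\alpha_{i,s}\|^2\Big]\Big)\, ds\Big]. \tag{111}$$
Due to the fact that
$$\int_t^{t_i} \Big((\nabla_x \mathcal{V}_{i,s})^\top \alpha_{i,s} + \frac{1}{2}\|\alpha_{i,s}\|^2 - \min_{\alpha \in \mathbb{A}}\Big[(\nabla_x \mathcal{V}_{i,s})^\top \alpha_{i,s} + \frac{1}{2}\|\alpha_{i,s}\|^2\Big]\Big)\, ds \ge 0$$
for all $t \in [t_{i-1}, t_i)$ and $\mathbb{P}^\alpha$-almost all $x$, we conclude that $\mathcal{J}_i(t, x, \alpha) \ge \mathcal{V}_i(t, x)$, where the equality holds for $\alpha^\star_{i,t} = \arg\min_{\alpha \in \mathbb{A}}\big[(\nabla_x \mathcal{V}_{i,t})^\top \alpha_{i,t} + \frac{1}{2}\|\alpha_{i,t}\|^2\big] = -\nabla_x \mathcal{V}_{i,t}$. Additionally, it implies that $\mathcal{J}_i(t, x, \alpha^\star) = \mathcal{V}_i(t, x)$.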
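Subsequently, for any $t \in [t_{i-1}, t_i)$, the recursion in (9) yields:
\begin{align}
\mathcal{V}(t, x_t) &= \min_{\alpha \in \mathbb{A}} \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha_s\|^2\, ds - \sum_{i : \{t \le t_i \le t_{I(u)}\}} \log f_i(y_{t_i} \mid X^\alpha_{t_i}) + \mathcal{V}_{I(u)+1}(t_{I(u)}, X^\alpha_{t_{I(u)}})\Big] \tag{112}\\
&= \min_{\alpha \in \mathbb{A}} \mathbb{E}^{t,x_t}_{\mathbb{P}^\alpha}\Big[\int_t^{t_i} \frac{1}{2}\|\alpha_{i,s}\|^2\, ds - \log f_i(y_{t_i} \mid X^\alpha_{t_i}) + \mathcal{V}_{i+1}(t_i, X^\alpha_{t_i})\Big] \tag{113}\\
&= \mathcal{V}_i(t, x_t). \tag{114}
\end{align}
This implies that the optimal control for the value function $\mathcal{V}$ in (9) over the interval $t \in [t_{i-1}, t_i)$ is $\alpha^\star_i$. Finally, $\mathcal{V}$ in (9) can be represented in the combined form:
$$\mathcal{V}(t, x_t) = \sum_{i=1}^{k} \mathcal{V}_i(t, x_t)\, \mathbf{1}_{[t_{i-1}, t_i)}(t). \tag{115}$$
Therefore, the optimal control for $\mathcal{V}$ in (9) becomes $\alpha^\star(t, x) = \sum_{i=1}^{k} \alpha^\star_i(t, x)\, \mathbf{1}_{[t_{i-1}, t_i)}(t)$.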
B.4 PROOF OF LEMMA 3.4
We first restate the Feynman-Kac formula (Oksendal, 1992; Baldi, 2017), which gives a probabilistic representation of the solution to certain PDEs using expectations of stochastic processes. It relies on the fact that the conditional expectation is a martingale process.
Lemma B.5 (The Feynman-Kac formula). Let us define $f \in C^2(\mathbb{R}^d)$ and $g \in C(\mathbb{R}^d)$. Then, the function $h(t, x_t) = \mathbb{E}_{\mathbb{P}}\big[e^{-\int_t^T f(s, X_s)\, ds}\, g(X_T) \,\big|\, X_t = x_t\big]$ is a solution of the following linear PDE:
\begin{align}
&\partial_t h_t + \mathcal{A}_t h_t - f h_t = 0, \quad 0 \le t < T, \tag{116}\\
&h(t, x) = g(x), \quad t = T. \tag{117}
\end{align}
Proof. Define the process $Y_t = e^{-\int_0^t f(s, X_s)\, ds}\, h(t, X_t)$. Since $h$ is a conditional expectation with respect to $\mathbb{P}$, $Y_t = \mathbb{E}_{\mathbb{P}}\big[e^{-\int_0^T f(s, X_s)\, ds}\, g(X_T) \,\big|\, \mathcal{F}_t\big]$ is a martingale process. By applying Itô's formula, we have:
$$dY_t = -f(t, X_t)\, e^{-\int_0^t f(s, X_s)\, ds}\, h(t, X_t)\, dt + e^{-\int_0^t f(s, X_s)\, ds}\, dh(t, X_t). \tag{118}$$
Next, we apply Itô's formula to $h(t, X_t)$:
$$dh(t, X_t) = \Big(\frac{\partial h}{\partial t} + \mathcal{A}_t h_t\Big)\, dt + \nabla_x h(t, X_t)^\top dW_t. \tag{119}$$
Thus, by substituting equation (119) into equation (118), we get
$$dY_t = e^{-\int_0^t f(s, X_s)\, ds}\Big[\frac{\partial h}{\partial t} + \mathcal{A}_t h_t - f(t, X_t)\, h(t, X_t)\Big]\, dt + e^{-\int_0^t f(s, X_s)\, ds}\, \nabla_x h(t, X_t)^\top dW_t. \tag{120}$$
For $Y_t$ to be a martingale process, it must have zero drift. Therefore, it implies that
$$\frac{\partial h}{\partial t}(t, X_t) + \mathcal{A}_t h(t, X_t) - f(t, X_t)\, h(t, X_t) = 0, \tag{121}$$
where $h(T, X_T) = g(X_T)$ by definition. This concludes the proof. Now let us provide the proof of Lemma 3.4.
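Lemma 3.4 (Hopf-Cole Transformation). The $h$ function satisfies the following linear PDE:
\begin{align}
&\partial_t h_{i,t} + \mathcal{A}_t h_{i,t} = 0, \quad t_{i-1} \le t < t_i, \tag{122}\\
&h_i(t_i, x) = f_i(y_{t_i} \mid x)\, h_{i+1}(t_i, x), \quad t = t_i, \quad \forall i \in [1:k]. \tag{123}
\end{align}
Moreover, under the logarithmic transformation $\mathcal{V} = -\log h$, $\mathcal{V}$ satisfies the HJB equation in (10)-(11).
Proof. The linear PDE presented in (122)-(123) can be directly derived from the function $h_i(t, x_t) = \mathbb{E}_{\mathbb{P}}\big[\prod_{j \ge i}^{k} f_j(X_{t_j}) \,\big|\, X_t = x_t\big]$. This is done by applying the Feynman-Kac formula in Lemma B.5 with $f := 0$ and $g = \prod_{j=i}^{k} f_j(X_{t_j})$. Now, let us consider the function $\mathcal{V}_i(t, x) = -\log h_i(t, x)$ (or $h_i(t, x) = e^{-\mathcal{V}_i(t, x)}$) for all $i \in [1:k]$ and compute
$$\partial_t h_{i,t} = -h_{i,t}\, \partial_t \mathcal{V}_{i,t}, \qquad \nabla_x h_{i,t} = -h_{i,t}\, \nabla_x \mathcal{V}_{i,t}, \qquad \nabla_{xx} h_{i,t} = h_{i,t}\big(\|\nabla_x \mathcal{V}_{i,t}\|^2 - \nabla_{xx} \mathcal{V}_{i,t}\big). \tag{124}$$
Then, it is straightforward to compute:
\begin{align}
h_{i,t}\, \partial_t \mathcal{V}_{i,t} = -\partial_t h_{i,t} &\overset{(i)}{=} \mathcal{A}_t h_{i,t} \tag{125}\\
&= (\nabla_x h_{i,t})^\top b_t + \frac{1}{2}\mathrm{Trace}[\nabla_{xx} h_{i,t}] \tag{126}\\
&= (-h_{i,t} \nabla_x \mathcal{V}_{i,t})^\top b_t + \frac{1}{2}\mathrm{Trace}\big[h_{i,t}\big(\|\nabla_x \mathcal{V}_{i,t}\|^2 - \nabla_{xx} \mathcal{V}_{i,t}\big)\big] \tag{127}\\
&= (-h_{i,t} \nabla_x \mathcal{V}_{i,t})^\top b_t + \frac{1}{2}\mathrm{Trace}\big[h_{i,t} \|\nabla_x \mathcal{V}_{i,t}\|^2\big] - \frac{1}{2}\mathrm{Trace}[h_{i,t} \nabla_{xx} \mathcal{V}_{i,t}], \tag{128}
\end{align}
where (i) follows from (122). Now, we can simplify (128) by dividing both sides by $h > 0$:
\begin{align}
\partial_t \mathcal{V}_{i,t} &= (-\nabla_x \mathcal{V}_{i,t})^\top b_t + \frac{1}{2}\mathrm{Trace}\big[\|\nabla_x \mathcal{V}_{i,t}\|^2\big] - \frac{1}{2}\mathrm{Trace}[\nabla_{xx} \mathcal{V}_{i,t}] \tag{129}\\
&= -\mathcal{A}_t \mathcal{V}_{i,t} + \frac{1}{2}\|\nabla_x \mathcal{V}_{i,t}\|^2. \tag{130}
\end{align}
Therefore, combining the above results, we have
$$\partial_t \mathcal{V}_{i,t} + \mathcal{A}_t \mathcal{V}_{i,t} - \frac{1}{2}\|\nabla_x \mathcal{V}_{i,t}\|^2 = 0, \qquad \mathcal{V}_i(t_i, x) = -\log f_i(y_{t_i} \mid x) + \mathcal{V}_{i+1}(t_i, x). \tag{131}$$
Since $\min_{\alpha \in \mathbb{A}}\big[(\nabla_x \mathcal{V}_{i,t})^\top \alpha_{i,t} + \frac{1}{2}\|\alpha_{i,t}\|^2\big] = -\frac{1}{2}\|\nabla_x \mathcal{V}_{i,t}\|^2$, this concludes the proof.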
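As a sanity check of Lemma 3.4, the transformation can be verified by hand in a scalar setting. The worked example below is an illustrative assumption that does not appear in the paper: $d = 1$, $b = 0$, unit diffusion (so that $\mathcal{A}_t = \tfrac{1}{2}\partial_{xx}$), and a single observation $y$ at time $T$ with a Gaussian likelihood of variance $s^2$. Normalization constants are dropped, since they only shift $\mathcal{V}$ by a constant and do not affect either PDE.

```latex
\begin{align*}
h(t,x) &= \frac{1}{\sqrt{2\pi\tau_t}}\exp\Big(-\frac{(y-x)^2}{2\tau_t}\Big),
  \qquad \tau_t := T - t + s^2,\\
\partial_t h &= h\Big[\frac{1}{2\tau_t} - \frac{(y-x)^2}{2\tau_t^2}\Big],
  \qquad
  \tfrac12\partial_{xx} h = h\Big[\frac{(y-x)^2}{2\tau_t^2} - \frac{1}{2\tau_t}\Big]
  \;\Longrightarrow\; \partial_t h + \tfrac12\partial_{xx} h = 0,\\
\mathcal{V}(t,x) &= -\log h(t,x) = \frac{(y-x)^2}{2\tau_t} + \tfrac12\log(2\pi\tau_t),\\
\partial_t \mathcal{V} &= \frac{(y-x)^2}{2\tau_t^2} - \frac{1}{2\tau_t},
  \qquad
  \partial_x \mathcal{V} = -\frac{y-x}{\tau_t},
  \qquad
  \partial_{xx} \mathcal{V} = \frac{1}{\tau_t},\\
\partial_t \mathcal{V} + \tfrac12\partial_{xx}\mathcal{V} - \tfrac12(\partial_x \mathcal{V})^2
  &= \frac{(y-x)^2}{2\tau_t^2} - \frac{1}{2\tau_t} + \frac{1}{2\tau_t} - \frac{(y-x)^2}{2\tau_t^2} = 0.
\end{align*}
```

In this example $-\partial_x \mathcal{V} = (y - x)/\tau_t = \partial_x \log h$, a drift correction that pulls the state toward the observation, in line with Corollary 3.5.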
B.5 PROOF OF COROLLARY 3.5
Corollary 3.5 (Optimal Control). The optimal control $\alpha^\star$ induced by the cost function (7) with dynamics (6) satisfies $\alpha^\star = \nabla_x \log h$. In other words, we can simulate the conditional SDEs in (5) by simulating the controlled SDE (6) with the optimal control $\alpha^\star$.
Proof. Lemma 3.4 implies the relation $-V_i(t, x) = \log h_i(t, x)$. Moreover, the definition of the optimal control $\alpha^\star$ in (12) and of the value function $V$ in (115) gives $\alpha^\star(t, x) = -\nabla_x V(t, x)$ for all $t \in [0, T]$. Finally, combining the definition of the function $h$ in (4) with Lemma 3.4, we conclude that $\alpha^\star(t, x) = -\nabla_x V(t, x) = \nabla_x \log h(t, x)$ for all $t \in [0, T]$.
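As a sanity check of Lemma 3.4 and Corollary 3.5, the sketch below works through a one-dimensional special case that is not taken from the paper: a Brownian prior ($b_t = 0$, unit diffusion) with a single Gaussian observation at the terminal time. Under these assumptions $h(t, x) = \mathcal{N}(y;\, x,\, r + T - t)$ in closed form, so the linear PDE, the HJB equation, and the control $\alpha^\star = \nabla_x \log h$ can all be verified by finite differences. The constants `T`, `R`, `Y` and the helper names are illustrative.

```python
import math

# Illustrative setting (assumption): dX = dW on [0, T], one observation Y at time T
# with Gaussian potential g(x) = N(Y; x, R). Then h(t, x) = N(Y; x, R + T - t).
T, R, Y = 1.0, 0.5, 0.8

def h(t, x):
    s = R + (T - t)  # variance of Y given X_t = x
    return math.exp(-(Y - x) ** 2 / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)

def V(t, x):
    return -math.log(h(t, x))  # Hopf-Cole transform V = -log h

def d_t(f, t, x, eps=1e-3):
    return (f(t + eps, x) - f(t - eps, x)) / (2.0 * eps)

def d_x(f, t, x, eps=1e-3):
    return (f(t, x + eps) - f(t, x - eps)) / (2.0 * eps)

def d_xx(f, t, x, eps=1e-3):
    return (f(t, x + eps) - 2.0 * f(t, x) + f(t, x - eps)) / eps ** 2

t, x = 0.3, -0.4
# Linear PDE (122) with generator A = (1/2) d^2/dx^2.
linear_pde_residual = d_t(h, t, x) + 0.5 * d_xx(h, t, x)
# HJB form (131): dV/dt + A V - (1/2) |dV/dx|^2 = 0.
hjb_residual = d_t(V, t, x) + 0.5 * d_xx(V, t, x) - 0.5 * d_x(V, t, x) ** 2
# Optimal control from Corollary 3.5 versus the closed form (Y - x) / (R + T - t).
alpha_star = -d_x(V, t, x)
alpha_closed_form = (Y - x) / (R + T - t)
```

In this toy case the optimal control is a linear pull toward the observation whose strength grows as $t$ approaches $T$, which is the familiar drift of a Brownian bridge with a noisy endpoint.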
B.6 PROOF OF THEOREM 3.6
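The Donsker-Varadhan variational principle provides a variational formula for the large deviations of functionals of Brownian motion, and is often related to free-energy minimization problems (Boué & Dupuis, 1998). Moreover, through Girsanov's theorem, this principle extends to a wide range of Markov processes, including Itô diffusion processes (Hartmann et al., 2017; Tzen & Raginsky, 2019).
Lemma B.6 (Donsker-Varadhan Variational Principle). For a bounded and measurable function $\mathcal{W} : C([0, T], \mathbb{R}^d) \to \mathbb{R}$, the following relation holds:
$$-\log \mathbb{E}_{X \sim \mathbb{P}}\big[e^{-\mathcal{W}(X_{0:T})}\big] = \min_{\mathbb{Q} \ll \mathbb{P}} \big[\mathbb{E}_{Y \sim \mathbb{Q}}[\mathcal{W}(Y_{0:T})] + D_{\mathrm{KL}}(\mathbb{Q}|\mathbb{P})\big]. \tag{132}$$
Proof. The proof relies on a change of measure and Jensen's inequality:
$$-\log \mathbb{E}_{X \sim \mathbb{P}}\big[e^{-\mathcal{W}(X_{0:T})}\big] = -\log \mathbb{E}_{Y \sim \mathbb{Q}}\Big[e^{-\mathcal{W}(Y_{0:T})}\frac{d\mathbb{P}}{d\mathbb{Q}}(Y_{0:T})\Big] \tag{133}$$
$$\le \mathbb{E}_{Y \sim \mathbb{Q}}\Big[\mathcal{W}(Y_{0:T}) - \log \frac{d\mathbb{P}}{d\mathbb{Q}}(Y_{0:T})\Big] \tag{134}$$
$$= \mathbb{E}_{Y \sim \mathbb{Q}}[\mathcal{W}(Y_{0:T})] + D_{\mathrm{KL}}(\mathbb{Q}|\mathbb{P}), \tag{135}$$
where equality holds if and only if
$$\frac{d\mathbb{Q}}{d\mathbb{P}}(Y_{0:T}) = e^{-\log \mathbb{E}_{X \sim \mathbb{P}}[e^{-\mathcal{W}(X_{0:T})}] - \mathcal{W}(Y_{0:T})}. \tag{136}$$
This concludes the proof.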
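The variational principle can be illustrated on a finite state space, where path measures reduce to probability vectors. The following sketch is a toy check rather than part of the paper's method: it evaluates both sides of (132) for an arbitrary reference distribution `p` and cost `w` (both made up for illustration), and confirms that the exponentially tilted measure from (136) attains the minimum while random candidates do not go below it.

```python
import math
import random

def free_energy(p, w):
    """Left-hand side of (132): -log E_P[exp(-W)] on a finite state space."""
    return -math.log(sum(pi * math.exp(-wi) for pi, wi in zip(p, w)))

def dv_objective(q, p, w):
    """Right-hand side of (132) for a candidate Q: E_Q[W] + KL(Q | P)."""
    expected_w = sum(qi * wi for qi, wi in zip(q, w))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)
    return expected_w + kl

def tilted(p, w):
    """Minimizer from (136): dQ/dP = exp(-W) / E_P[exp(-W)]."""
    z = sum(pi * math.exp(-wi) for pi, wi in zip(p, w))
    return [pi * math.exp(-wi) / z for pi, wi in zip(p, w)]

# Illustrative reference measure and cost (assumptions, not from the paper).
p = [0.1, 0.2, 0.3, 0.4]
w = [1.5, -0.3, 0.7, 2.0]

lhs = free_energy(p, w)
gap_at_optimum = dv_objective(tilted(p, w), p, w) - lhs

rng = random.Random(0)

def random_q(n):
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

# Every candidate Q gives an upper bound; the gap equals KL(Q | Q*) >= 0.
min_random_gap = min(dv_objective(random_q(4), p, w) - lhs for _ in range(200))
```

The gap `dv_objective(q, p, w) - lhs` is exactly the KL divergence between the candidate and the tilted measure, which mirrors the role of $D_{\mathrm{KL}}(\mathbb{P}^\alpha|\mathbb{P}^\star)$ in Theorem 3.6.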
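Theorem 3.6 (Tight Variational Bound). Assume that the path measure $\mathbb{P}^\alpha$ induced by (6) satisfies $D_{\mathrm{KL}}(\mathbb{P}^\alpha|\mathbb{P}^\star) < \infty$ for any $\alpha \in \mathbb{A}$. Then, for the cost function $\mathcal{J}$ in (7) and $\mu^\star_0$ in (5), it holds that
$$D_{\mathrm{KL}}(\mathbb{P}^\alpha|\mathbb{P}^\star) = D_{\mathrm{KL}}(\mu_0|\mu^\star_0) + \mathbb{E}_{x_0 \sim \mu_0}[\mathcal{J}(0, x_0, \alpha)] = \mathcal{L}(\alpha) + \log g(\mathcal{H}_{t_k}) \ge 0, \tag{137}$$
where the objective function $\mathcal{L}(\alpha)$ (the negative ELBO) is given by
$$\mathcal{L}(\alpha) = \mathbb{E}_{\mathbb{P}^\alpha}\Big[\int_0^T \tfrac{1}{2}\|\alpha(s, X^\alpha_s)\|^2\, ds - \sum_{i=1}^{k} \log g_i(y_{t_i}|X^\alpha_{t_i})\Big] \ge -\log g(\mathcal{H}_{t_k}). \tag{138}$$
Moreover, assume that $\mathcal{L}(\alpha)$ has a global minimum $\alpha^\star = \arg\min_{\alpha \in \mathbb{A}} \mathcal{L}(\alpha)$. Then equality holds in (16), i.e., $\mathcal{L}(\alpha^\star) = -\log g(\mathcal{H}_{t_k})$, and $\mu^\star_0 = \mu_0$ almost everywhere with respect to $\mu_0$.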
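Proof. We begin by deriving the KL divergence between $\mathbb{P}^\alpha$ and $\mathbb{P}^\star$:
$$D_{\mathrm{KL}}(\mathbb{P}^\alpha|\mathbb{P}^\star) \overset{(i)}{=} D_{\mathrm{KL}}(\mu_0|\mu^\star_0) + \mathbb{E}_{x_0 \sim \mu_0}\big[D_{\mathrm{KL}}\big(\mathbb{P}^\alpha(\cdot|X^\alpha_0)\,|\,\mathbb{P}^\star(\cdot|X^\alpha_0)\big) \,\big|\, X^\star_0 = x_0\big] \tag{139}$$
$$= D_{\mathrm{KL}}(\mu_0|\mu^\star_0) + \mathbb{E}_{x_0 \sim \mu_0}\Big[\mathbb{E}_{\mathbb{P}^\alpha}\Big[\log \frac{d\mathbb{P}^\alpha}{d\mathbb{P}^\star}(X^\alpha_{0:T}) \,\Big|\, X^\alpha_0 = x_0\Big]\Big] \tag{140}$$
$$= D_{\mathrm{KL}}(\mu_0|\mu^\star_0) + \mathbb{E}_{x_0 \sim \mu_0}\Big[\mathbb{E}_{\mathbb{P}^\alpha}\Big[\log \frac{d\mathbb{P}^\alpha}{d\mathbb{P}}(X^\alpha_{0:T}) + \log \frac{d\mathbb{P}}{d\mathbb{P}^\star}(X^\alpha_{0:T}) \,\Big|\, X^\alpha_0 = x_0\Big]\Big], \tag{141}$$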
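where (i) follows from the disintegration theorem (Léonard, 2013). For the second term on the RHS of (141), applying Girsanov's theorem in B.4, we have:
$$\mathbb{E}^{0,x_0}_{\mathbb{P}^\alpha}\Big[\log \frac{d\mathbb{P}^\alpha}{d\mathbb{P}}(X^\alpha_{0:T})\Big] = \mathbb{E}^{0,x_0}_{\mathbb{P}^\alpha}\Big[\int_0^T \alpha_t^\top\, d\tilde{W}_t + \int_0^T \tfrac{1}{2}\|\alpha_t\|^2\, dt\Big] = \mathbb{E}^{0,x_0}_{\mathbb{P}^\alpha}\Big[\int_0^T \tfrac{1}{2}\|\alpha_t\|^2\, dt\Big]. \tag{142}$$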
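Moreover, by the definition of $\mathbb{P}^\star$, we get:
$$\mathbb{E}^{0,x_0}_{\mathbb{P}^\alpha}\Big[\log \frac{d\mathbb{P}}{d\mathbb{P}^\star}(X^\alpha_{0:T})\Big] = \mathbb{E}^{0,x_0}_{\mathbb{P}^\alpha}\Big[-\sum_{i=1}^{k} \log f_i(y_{t_i}|X^\alpha_{t_i})\Big]. \tag{143}$$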
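Combining the results from (142) and (143), we obtain:
$$\mathbb{E}^{0,x_0}_{\mathbb{P}^\alpha}\Big[\log \frac{d\mathbb{P}^\alpha}{d\mathbb{P}}(X^\alpha_{0:T}) + \log \frac{d\mathbb{P}}{d\mathbb{P}^\star}(X^\alpha_{0:T})\Big] \tag{144}$$
$$= \mathbb{E}^{0,x_0}_{\mathbb{P}^\alpha}\Big[\int_0^T \tfrac{1}{2}\|\alpha(t, X^\alpha_t)\|^2\, dt - \sum_{i=1}^{k} \log f_i(y_{t_i}|X^\alpha_{t_i})\Big] \tag{145}$$
$$= \mathcal{J}(0, x_0, \alpha). \tag{146}$$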
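Moreover, by the definition of $\{f_i\}_{i \in [1:k]}$, we get
$$\mathcal{J}(0, x_0, \alpha) = \mathbb{E}_{\mathbb{P}^\alpha}\Big[\int_0^T \tfrac{1}{2}\|\alpha(t, X^\alpha_t)\|^2\, dt - \sum_{i=1}^{k} \log f_i(y_{t_i}|X^\alpha_{t_i}) \,\Big|\, X^\alpha_0 = x_0\Big] \tag{147}$$
$$\overset{(i)}{=} \mathbb{E}_{\mathbb{P}^\alpha}\Big[\int_0^T \tfrac{1}{2}\|\alpha(t, X^\alpha_t)\|^2\, dt - \sum_{i=1}^{k} \log g_i(y_{t_i}|X^\alpha_{t_i}) \,\Big|\, X^\alpha_0 = x_0\Big] + \log Z(\mathcal{H}_{t_k}), \tag{148}$$
where (i) follows from the normalizing property in (44). Hence, it follows that:
$$D_{\mathrm{KL}}(\mathbb{P}^\alpha|\mathbb{P}^\star) = D_{\mathrm{KL}}(\mu_0|\mu^\star_0) + \mathbb{E}_{x_0 \sim \mu_0}[\mathcal{J}(0, x_0, \alpha)] \tag{149}$$
$$= D_{\mathrm{KL}}(\mu_0|\mu^\star_0) + \mathcal{L}(\alpha) + \log Z(\mathcal{H}_{t_k}). \tag{150}$$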
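For the KL-divergence term $D_{\mathrm{KL}}(\mu_0|\mu^\star_0)$, since $d\mu^\star_0(x_0) = h_1(t_0, x_0)\, d\mu_0(x_0)$, we obtain
$$D_{\mathrm{KL}}(\mu_0|\mu^\star_0) = \mathbb{E}_{x_0 \sim \mu_0}\Big[\log \frac{d\mu_0}{d\mu^\star_0}(x_0)\Big] = \mathbb{E}_{x_0 \sim \mu_0}[-\log h_1(0, x_0)] \tag{151}$$
$$= \mathbb{E}_{x_0 \sim \mu_0}[V(0, x_0)] \tag{152}$$
$$= \mathbb{E}_{x_0 \sim \mu_0}[\mathcal{J}(0, x_0, \alpha^\star)] \tag{153}$$
$$\overset{(i)}{=} \mathbb{E}_{x_0 \sim \mu_0}\big[\tilde{\mathcal{J}}(0, x_0, \alpha^\star) + \log Z(\mathcal{H}_{t_k})\big] \tag{154}$$
$$\overset{(ii)}{=} \mathcal{L}(\alpha^\star) + \log Z(\mathcal{H}_{t_k}), \tag{155}$$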
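where (i) follows from:
$$\mathcal{J}(0, x_0, \alpha) = \mathbb{E}_{\mathbb{P}^\alpha}\Big[\int_0^T \tfrac{1}{2}\|\alpha(s, X^\alpha_s)\|^2\, ds - \sum_{i=1}^{k} \log f_i(y_{t_i}|X^\alpha_{t_i}) \,\Big|\, X^\alpha_0 = x_0\Big] \tag{156}$$
$$= \underbrace{\mathbb{E}_{\mathbb{P}^\alpha}\Big[\int_0^T \tfrac{1}{2}\|\alpha(s, X^\alpha_s)\|^2\, ds - \sum_{i=1}^{k} \log g_i(y_{t_i}|X^\alpha_{t_i}) \,\Big|\, X^\alpha_0 = x_0\Big]}_{\tilde{\mathcal{J}}(0, x_0, \alpha)} + \log Z(\mathcal{H}_{t_k}). \tag{157}$$
Thus, we obtain $\alpha^\star = \arg\min_{\alpha \in \mathbb{A}} \mathcal{J}(0, x_0, \alpha) = \arg\min_{\alpha \in \mathbb{A}} \tilde{\mathcal{J}}(0, x_0, \alpha)$, since $\log Z(\mathcal{H}_{t_k})$ is constant. It follows that $\mathcal{J}(0, x_0, \alpha^\star) = \tilde{\mathcal{J}}(0, x_0, \alpha^\star) + \log Z(\mathcal{H}_{t_k})$. Additionally, (ii) follows from
$$\mathbb{E}_{x_0 \sim \mu_0}\Big[\min_{\alpha \in \mathbb{A}} \tilde{\mathcal{J}}(0, x_0, \alpha)\Big] = \min_{\alpha \in \mathbb{A}} \mathbb{E}_{x_0 \sim \mu_0}\big[\tilde{\mathcal{J}}(0, x_0, \alpha)\big] = \min_{\alpha \in \mathbb{A}} \mathcal{L}(\alpha) = \mathcal{L}(\alpha^\star), \tag{158}$$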
since the minimization is independent of the initial condition, as implied by the disintegration theorem (Léonard, 2013). Hence, since the Donsker-Varadhan variational principle in Lemma B.6 with $\mathcal{W}(X^\alpha_{0:T}) = -\sum_{i=1}^{k} \log g_i(y_{t_i}|X^\alpha_{t_i})$ yields
$$\mathbb{E}_{\mathbb{P}^\alpha}\Big[\int_0^T \tfrac{1}{2}\|\alpha(t, X^\alpha_t)\|^2\, dt - \sum_{i=1}^{k} \log g_i(y_{t_i}|X^\alpha_{t_i})\Big] + \log \mathbb{E}_{\mathbb{P}}\Big[\prod_{i=1}^{k} g_i(y_{t_i}|X^\alpha_{t_i})\Big] \ge 0, \tag{159}$$
it implies that $D_{\mathrm{KL}}(\mu_0|\mu^\star_0) = \mathcal{L}(\alpha^\star) + \log Z(\mathcal{H}_{t_k}) = 0$ for the optimal control $\alpha^\star$. In other words, $\mu^\star_0 = \mu_0$ almost everywhere with respect to $\mu_0$, and in the variational bound
$$D_{\mathrm{KL}}(\mathbb{P}^\alpha|\mathbb{P}^\star) = \mathcal{L}(\alpha) + \log Z(\mathcal{H}_{t_k}) \ge 0, \tag{160}$$
equality holds if and only if $\alpha \to \alpha^\star$. This concludes the proof.

Section: B.7 DERIVATION OF AMORTIZED ELBO IN (28).
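Let $o_{0:T}$ be the given time-series data. Then, for an auxiliary variable $y_{0:T} \sim q_\phi(y_{0:T}|o_{0:T})$, the ELBO is given by
$$\log p_\psi(o_{0:T}) \ge \mathbb{E}_{q_\phi(y_{0:T}|o_{0:T})}\Big[\log \frac{\prod_{i=1}^{K} p_\psi(o_{t_i}|y_{t_i})\, g(y_{0:T})}{\prod_{i=1}^{K} q_\phi(y_{t_i}|o_{t_i})}\Big] \tag{161}$$
$$= \mathbb{E}_{q_\phi(y_{0:T}|o_{0:T})}\Big[\sum_{i=1}^{K} \log p_\psi(o_{t_i}|y_{t_i}) + \log g(y_{0:T}) - \log \prod_{i=1}^{K} q_\phi(y_{t_i}|o_{t_i})\Big] \tag{162}$$
$$\ge \mathbb{E}_{q_\phi(y_{0:T}|o_{0:T})}\Big[\sum_{i=1}^{K} \log p_\psi(o_{t_i}|y_{t_i}) - \mathcal{L}(\alpha) - \log \prod_{i=1}^{K} q_\phi(y_{t_i}|o_{t_i})\Big] \tag{163}$$
$$= \mathbb{E}_{q_\phi(y_{0:T}|o_{0:T})}\Big[\sum_{i=1}^{K} \log p_\psi(o_{t_i}|y_{t_i}) - \mathcal{L}(\alpha) - \sum_{i=1}^{K} \log q_\phi(y_{t_i}|o_{t_i})\Big] \tag{164}$$
$$\overset{(i)}{\ge} \mathbb{E}_{q_\phi(y_{0:T}|o_{0:T})}\Big[\sum_{i=1}^{K} \log p_\psi(o_{t_i}|y_{t_i}) - \mathcal{L}(\alpha)\Big], \tag{165}$$
where (i) follows from $\mathbb{E}_{q_\phi(y_{0:T}|o_{0:T})}\big[-\sum_{i=1}^{K} \log q_\phi(y_{t_i}|o_{t_i})\big] = C \ge 0$, since $q_\phi$ is a Gaussian distribution with a constant covariance matrix.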
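For reference, the constant $C$ in step (i) can be written out explicitly. The sketch below assumes, beyond what is stated above, that $q_\phi(y_{0:T}|o_{0:T})$ factorizes over the $K$ observation times and that each factor is $\mathcal{N}(\mu_\phi(o_{t_i}), \Sigma_q)$ on $\mathbb{R}^{d_y}$ with a fixed $\Sigma_q$; the symbols $\mu_\phi$, $\Sigma_q$ and $d_y$ are introduced here only for illustration.

```latex
C = \mathbb{E}_{q_\phi}\Big[-\sum_{i=1}^{K} \log q_\phi(y_{t_i}|o_{t_i})\Big]
  = \sum_{i=1}^{K} \mathrm{H}\big[\mathcal{N}(\mu_\phi(o_{t_i}), \Sigma_q)\big]
  = \frac{K}{2}\Big(d_y \log(2\pi e) + \log\det \Sigma_q\Big).
```

Under these assumptions $C$ does not depend on $\phi$, so dropping it leaves the gradients of the objective unchanged. The sign condition $C \ge 0$ holds whenever $\det \Sigma_q \ge (2\pi e)^{-d_y}$.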
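B.8 PROOF OF THEOREM 3.8
Theorem 3.8 (Simulation-free estimation). Let us consider a sequence of SPD matrices $\{A_i\}_{i \in [1:k]}$ that admit the eigen-decomposition $A_i = E D_i E^\top$ with $E \in \mathbb{R}^{d \times d}$ and $D_i \in \mathrm{diag}(\mathbb{R}^d) \succeq 0$ for all $i \in [1:k]$, control vectors $\{\alpha_i\}_{i \in [1:k]}$, and the following control-affine SDEs for all $i \in [1:k]$:
$$dX_t = [-A_i X_t + \alpha_i]\, dt + \sigma\, dW_t, \quad t \in [t_{i-1}, t_i). \tag{166}$$
Then, with $X_0 \sim \mathcal{N}(m_0, \Sigma_0)$, the solution of (166) is a Gaussian process $\mathcal{N}(m_{t_i}, \Sigma_{t_i})$ with:
$$m_{t_i} = E\Big(e^{-\sum_{j=1}^{i}(t_j - t_{j-1})D_j}\, \hat{m}_{t_0} - \sum_{k=1}^{i} e^{-\sum_{j=k}^{i}(t_j - t_{j-1})D_j}\, D_k^{-1}\big(I - e^{(t_k - t_{k-1})D_k}\big)\hat{\alpha}_k\Big),$$
$$\Sigma_{t_i} = E\Big(e^{-2\sum_{j=1}^{i}(t_j - t_{j-1})D_j}\, \hat{\Sigma}_{t_0} - \frac{1}{2}\sum_{k=1}^{i} e^{-2\sum_{j=k}^{i}(t_j - t_{j-1})D_j}\, D_k^{-1}\big(I - e^{2(t_k - t_{k-1})D_k}\big)\Big)E^\top,$$
where $\hat{m}_{t_i} = E^\top m_{t_i}$, $\hat{\Sigma}_{t_i} = E^\top \Sigma_{t_i} E$ and $\hat{\alpha}_i = E^\top \alpha_i$ for all $i \in [1:k]$.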
Proof. Note that since $\{A_i\}_{i \in [1:k]}$ are SPD matrices, we can apply the transformation outlined in Remark 3.7 and express the original dynamics (166) in the projected form using the eigenbasis $E$. Then, for any $t \in [t_i, t_{i+1})$, the solution to (166) at time $t$ is given by
$$\hat{X}_t = e^{-\Delta_i(t)D_i}\Big(\hat{X}_{t_i} + \int_{t_i}^{t} e^{\Delta_i(s)D_i}\hat{\alpha}_i\, ds + \int_{t_i}^{t} e^{\Delta_i(s)D_i}\, d\hat{W}_s\Big), \quad \Delta_i(t) = \begin{cases} t - t_i, & \text{for } t > t_i, \\ 0, & \text{for } t \le t_i, \end{cases} \tag{167}$$
where $\hat{m}_{t_i} = E^\top m_{t_i}$, $\hat{\Sigma}_{t_i} = E^\top \Sigma_{t_i} E$ and $\hat{\alpha}_i = E^\top \alpha_i$ for all $i \in [1:k]$, and $\hat{W}_t = E^\top W_t$. Given that $X_0 \sim \mathcal{N}(m_0, \Sigma_0)$, $\hat{X}_{t_i} = E^\top X_{t_i}$ is a Gaussian process for any $i \in [1:k]$. The first two moments of this Gaussian process can be computed from (167). First, since $D_i$ is diagonal, the integral can be computed as $\int_{t_i}^{t} e^{\Delta_i(s)D_i}\, ds = -D_i^{-1}(I - e^{\Delta_i(t)D_i})$, and $M_i(t) := \int_{t_i}^{t} e^{\Delta_i(s)D_i}\, d\hat{W}_s$ is a martingale with respect to $\mathbb{P}^\alpha$, i.e., $\mathbb{E}_{\mathbb{P}^\alpha}[M_i(t)] = 0$. Hence, since $\hat{\alpha}_i$ is a time-invariant vector, the mean $\mathbb{E}_{\mathbb{P}}[\hat{X}_t] = \hat{m}_t$ for $t \in [t_i, t_{i+1})$ can be computed as
$$\hat{m}_t = e^{-\Delta_i(t)D_i}\hat{m}_{t_i} - e^{-\Delta_i(t)D_i} D_i^{-1}\big(I - e^{\Delta_i(t)D_i}\big)\hat{\alpha}_i. \tag{168}$$
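Secondly, for the covariance $\mathbb{E}_{\mathbb{P}}[(\hat{X}_t - \hat{m}_t)(\hat{X}_t - \hat{m}_t)^\top] = \hat{\Sigma}_t$, we can compute
$$\hat{\Sigma}_t = \mathbb{E}_{\mathbb{P}^\alpha}\Big[e^{-2\Delta_i(t)D_i}\big(\hat{X}_{t_i} - \hat{m}_{t_i} + M_i(t)\big)\big(\hat{X}_{t_i} - \hat{m}_{t_i} + M_i(t)\big)^\top\Big] \tag{169}$$
$$\overset{(i)}{=} e^{-2\Delta_i(t)D_i}\, \mathbb{E}_{\mathbb{P}^\alpha}\Big[(\hat{X}_{t_i} - \hat{m}_{t_i})(\hat{X}_{t_i} - \hat{m}_{t_i})^\top + \|M_i(t)\|_2^2\Big] \tag{170}$$
$$\overset{(ii)}{=} e^{-2\Delta_i(t)D_i}\hat{\Sigma}_{t_i} - \frac{1}{2} e^{-2\Delta_i(t)D_i} D_i^{-1}\big(I - e^{2\Delta_i(t)D_i}\big), \tag{171}$$
where (i) follows from the fact that $M_i(t)$ is a martingale, so the cross terms vanish, and (ii) uses the Itô isometry:
$$\mathbb{E}_{\mathbb{P}^\alpha}\big[\|M_i(t)\|_2^2\big] = \mathbb{E}_{\mathbb{P}^\alpha}\Big[\int_{t_i}^{t} \big\|e^{\Delta_i(s)D_i}\big\|_2^2\, ds\Big] = -\frac{1}{2} D_i^{-1}\big(I - e^{2\Delta_i(t)D_i}\big). \tag{172}$$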
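Hence, we obtain the Gaussian law $\mathcal{N}(\hat{m}_t, \hat{\Sigma}_t)$ of $\hat{X}_t$ at time $t \in [t_i, t_{i+1})$. Furthermore, given the recurrence forms of the mean (168) and the covariance (171), the first two moments of the Gaussian distribution at each time step $t_i$ can be computed sequentially. For the mean $\hat{m}_{t_i}$, we have
$$\hat{m}_{t_1} = e^{-\Delta_0(t_1)D_1}\hat{m}_{t_0} - e^{-\Delta_0(t_1)D_1} D_1^{-1}\big(I - e^{\Delta_0(t_1)D_1}\big)\hat{\alpha}_1 \tag{173}$$
$$\hat{m}_{t_2} = e^{-\sum_{j=1}^{2}\Delta_{j-1}(t_j)D_j}\hat{m}_{t_0} \tag{174}$$
$$\qquad - e^{-\sum_{j=1}^{2}\Delta_{j-1}(t_j)D_j} D_1^{-1}\big(I - e^{\Delta_0(t_1)D_1}\big)\hat{\alpha}_1 - e^{-\Delta_1(t_2)D_2} D_2^{-1}\big(I - e^{\Delta_1(t_2)D_2}\big)\hat{\alpha}_2 \tag{175}$$
$$\vdots \tag{176}$$
$$\hat{m}_{t_i} = e^{-\sum_{j=1}^{i}\Delta_{j-1}(t_j)D_j}\hat{m}_{t_0} - \sum_{k=1}^{i} e^{-\sum_{j=k}^{i}\Delta_{j-1}(t_j)D_j} D_k^{-1}\big(I - e^{\Delta_{k-1}(t_k)D_k}\big)\hat{\alpha}_k. \tag{177}$$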
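Moreover, for the covariance $\hat{\Sigma}_{t_i}$, a similar calculation yields
$$\hat{\Sigma}_{t_1} = e^{-2\Delta_0(t_1)D_1}\hat{\Sigma}_{t_0} - \frac{1}{2} e^{-2\Delta_0(t_1)D_1} D_1^{-1}\big(I - e^{2\Delta_0(t_1)D_1}\big) \tag{178}$$
$$\hat{\Sigma}_{t_2} = e^{-2\sum_{j=1}^{2}\Delta_{j-1}(t_j)D_j}\hat{\Sigma}_{t_0} \tag{179}$$
$$\qquad - \frac{1}{2} e^{-2\sum_{j=1}^{2}\Delta_{j-1}(t_j)D_j} D_1^{-1}\big(I - e^{2\Delta_0(t_1)D_1}\big) - \frac{1}{2} e^{-2\Delta_1(t_2)D_2} D_2^{-1}\big(I - e^{2\Delta_1(t_2)D_2}\big) \tag{180}$$
$$\vdots \tag{181}$$
$$\hat{\Sigma}_{t_i} = e^{-2\sum_{j=1}^{i}\Delta_{j-1}(t_j)D_j}\hat{\Sigma}_{t_0} - \frac{1}{2}\sum_{k=1}^{i} e^{-2\sum_{j=k}^{i}\Delta_{j-1}(t_j)D_j} D_k^{-1}\big(I - e^{2\Delta_{k-1}(t_k)D_k}\big). \tag{182}$$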
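Now, since $D_i = E^\top A_i E$ and $E$ is orthonormal, we can express the mean and covariance in the original canonical basis. For the mean, we get
$$E\hat{m}_{t_i} = E\Big(e^{-\sum_{j=1}^{i}\Delta_{j-1}(t_j)D_j}\hat{m}_{t_0} - \sum_{k=1}^{i} e^{-\sum_{j=k}^{i}\Delta_{j-1}(t_j)D_j} D_k^{-1}\big(I - e^{\Delta_{k-1}(t_k)D_k}\big)\hat{\alpha}_k\Big) \tag{183}$$
$$\overset{(i)}{=} E\Big(e^{-\sum_{j=1}^{i}\Delta_{j-1}(t_j)D_j} E^\top m_{t_0} - \sum_{k=1}^{i} e^{-\sum_{j=k}^{i}\Delta_{j-1}(t_j)D_j} E^\top A_k^{-1} E\big(I - e^{\Delta_{k-1}(t_k)D_k}\big)E^\top \alpha_k\Big) \tag{184}$$
$$\overset{(ii)}{=} e^{-\sum_{j=1}^{i}\Delta_{j-1}(t_j)A_j} m_{t_0} - \sum_{k=1}^{i} e^{-\sum_{j=k}^{i}\Delta_{j-1}(t_j)A_j} A_k^{-1}\big(I - e^{\Delta_{k-1}(t_k)A_k}\big)\alpha_k \tag{185}$$
$$= m_{t_i}, \tag{186}$$
where (i) follows from $\hat{m}_{t_0} = E^\top m_{t_0}$ and $\hat{\alpha}_i = E^\top \alpha_i$ for all $i \in [1:k]$, and (ii) follows from $D_i^{-1} = E^\top A_i^{-1} E$ and $e^{-D_i^{-1}} = E^\top e^{-A_i^{-1}} E$.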
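Similarly, for the covariance, we get
$$E\hat{\Sigma}_{t_i}E^\top = E\Big(e^{-2\sum_{j=1}^{i}\Delta_{j-1}(t_j)D_j}\hat{\Sigma}_{t_0} - \frac{1}{2}\sum_{k=1}^{i} e^{-2\sum_{j=k}^{i}\Delta_{j-1}(t_j)D_j} D_k^{-1}\big(I - e^{2\Delta_{k-1}(t_k)D_k}\big)\Big)E^\top \tag{187}$$
$$\overset{(i)}{=} E\Big(e^{-2\sum_{j=1}^{i}\Delta_{j-1}(t_j)D_j} E^\top \Sigma_{t_0} E - \frac{1}{2}\sum_{k=1}^{i} e^{-2\sum_{j=k}^{i}\Delta_{j-1}(t_j)D_j} E^\top A_k^{-1} E\big(I - e^{2\Delta_{k-1}(t_k)D_k}\big)\Big)E^\top \tag{188}$$
$$= e^{-2\sum_{j=1}^{i}\Delta_{j-1}(t_j)A_j}\Sigma_{t_0} - \frac{1}{2}\sum_{k=1}^{i} e^{-2\sum_{j=k}^{i}\Delta_{j-1}(t_j)A_j} A_k^{-1}\big(I - e^{2\Delta_{k-1}(t_k)A_k}\big) \tag{189}$$
$$= \Sigma_{t_i}, \tag{190}$$
where (i) follows from $\hat{\Sigma}_{t_0} = E^\top \Sigma_{t_0} E$. By applying the procedure from (168)-(171) to the original SDE (166), we recover the mean and covariance expressions in (185) and (189). This shows that the transformed mean and covariance in (177) and (182) can indeed be projected back into the original basis, completing the proof.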
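The recurrences (168) and (171) can be sanity-checked numerically in one dimension, where the eigenbasis is trivial. The sketch below is illustrative only: it assumes a scalar state, unit diffusion ($\sigma = 1$, as the closed form above carries no explicit $\sigma$), and made-up rates, controls and irregular time stamps. It compares the closed-form moments against an explicit Euler integration of the moment ODEs $\dot{m} = -a m + \alpha$ and $\dot{\Sigma} = -2a\Sigma + 1$.

```python
import math

def closed_form_moments(m0, v0, rates, controls, times):
    """Apply the scalar versions of (168) and (171) interval by interval."""
    m, v = m0, v0
    moments = []
    for a, alpha, t_prev, t_next in zip(rates, controls, times[:-1], times[1:]):
        dt = t_next - t_prev
        decay = math.exp(-dt * a)
        # Mean: m <- e^{-dt a} m - e^{-dt a} a^{-1} (1 - e^{dt a}) alpha
        m = decay * m - decay * (1.0 - math.exp(dt * a)) / a * alpha
        # Variance: v <- e^{-2 dt a} v - (1/2) e^{-2 dt a} a^{-1} (1 - e^{2 dt a})
        v = decay ** 2 * v - 0.5 * decay ** 2 * (1.0 - math.exp(2.0 * dt * a)) / a
        moments.append((m, v))
    return moments

def euler_moments(m0, v0, rates, controls, times, substeps=5000):
    """Reference solution: explicit Euler on dm/dt = -a m + alpha, dv/dt = -2 a v + 1."""
    m, v = m0, v0
    moments = []
    for a, alpha, t_prev, t_next in zip(rates, controls, times[:-1], times[1:]):
        step = (t_next - t_prev) / substeps
        for _ in range(substeps):
            m, v = m + step * (-a * m + alpha), v + step * (-2.0 * a * v + 1.0)
        moments.append((m, v))
    return moments

# Hypothetical irregular grid with piecewise-constant rates and controls.
times = [0.0, 0.3, 0.35, 1.1]
rates = [0.5, 2.0, 1.2]
controls = [1.0, -0.7, 0.3]

exact = closed_form_moments(0.2, 0.4, rates, controls, times)
approx = euler_moments(0.2, 0.4, rates, controls, times)
max_gap = max(max(abs(me - ma), abs(ve - va))
              for (me, ve), (ma, va) in zip(exact, approx))
```

The closed form needs one update per observation interval regardless of its length, whereas the reference integration needs many small steps; this is the sense in which the estimation is simulation-free.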
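Algorithm 1 Parallel Scan for Mean and Covariance
1: Input. Given time stamps $\mathcal{T} = \{t_1, t_2, \ldots, t_K\}$, initial mean $m_{t_0}$ and covariance $\Sigma_{t_0}$, control policies $\{\alpha_i\}_{i \in [1:k]}$, matrices $\{D\}_{i \in [1:k]}$.
2: Compute $\{\Delta_i(t_i), \hat{D}_i, \hat{C}_i, \bar{D}_i, \bar{C}_i\}_{i \in [1:k]}$
3: Set $\{M_i\}_{i \in [1:k]} = \{(\hat{D}_i, \hat{C}_i\hat{\alpha}_i)\}_{i \in [1:k]}$ and $\{S_i\}_{i \in [1:k]} = \{(\bar{D}_i, \bar{C}_i\mathbf{1})\}_{i \in [1:k]}$.
4: Parallel Scan $\{M'_i, S'_i\}_{i \in [1:k]} = \mathrm{ParallelScan}(\{M_i, S_i\}_{i \in [1:k]}, \otimes)$
5: $\Rightarrow$ Algorithm 2 for ParallelScan
6: for $i = 1$ to $K$ do in parallel
7: $m_{t_i} = M'^{(1)}_i m_{t_0} + M'^{(2)}_i$
8: $\Sigma_{t_i} = S'^{(1)}_i \Sigma_{t_0} + S'^{(2)}_i$
9: end for
10: Return $m_{t \in \mathcal{T}}, \Sigma_{t \in \mathcal{T}}$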
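The idea behind Algorithm 1 can be sketched in scalar form: each interval contributes an affine map $m \mapsto \hat{D}_i m + \hat{C}_i\hat{\alpha}_i$, composing affine maps is associative, and so all prefix compositions can be obtained with a scan. The code below is a minimal illustration with made-up interval lengths, rates and controls. For brevity it uses a recursive-doubling (Hillis-Steele) inclusive scan, which is a different scheme from the up-sweep and down-sweep of Algorithm 2, and it runs the rounds serially; only the updates within a round are independent of one another.

```python
import math

def combine(first, second):
    """Compose two affine maps stored as (scale, offset): apply `first`, then `second`."""
    d1, c1 = first
    d2, c2 = second
    return (d2 * d1, d2 * c1 + c2)

def doubling_scan(elems):
    """Inclusive scan in ceil(log2 K) rounds; within a round all updates are independent."""
    out = list(elems)
    shift = 1
    while shift < len(out):
        out = [out[i] if i < shift else combine(out[i - shift], out[i])
               for i in range(len(out))]
        shift *= 2
    return out

# Hypothetical per-interval quantities (scalar state).
dts = [0.3, 0.05, 0.75, 0.2, 0.4]
rates = [0.5, 2.0, 1.2, 0.8, 1.5]
controls = [1.0, -0.7, 0.3, 0.0, 2.0]
m0 = 0.2

elems = []
for dt, a, alpha in zip(dts, rates, controls):
    d_hat = math.exp(-dt * a)                       # scale of the mean recurrence
    c_hat = -d_hat * (1.0 - math.exp(dt * a)) / a   # coefficient of the control
    elems.append((d_hat, c_hat * alpha))

# Scan-based means: m_{t_i} = M'_i^(1) m_{t_0} + M'_i^(2), as in steps 6-8.
scanned = doubling_scan(elems)
means_scan = [d * m0 + c for d, c in scanned]

# Sequential reference: apply the recurrence one interval at a time.
means_loop, m = [], m0
for d_hat, c in elems:
    m = d_hat * m + c
    means_loop.append(m)
```

The covariance recurrence has the same affine structure, so the same `combine` applies to the pairs used for $\Sigma_{t_i}$.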

Section: C PARALLEL SCAN
Given an associative operator $\otimes$ and a sequence of elements $[s_{t_1}, \cdots, s_{t_K}]$, the parallel scan algorithm (Blelloch, 1990) computes the all-prefix-sum, which returns the sequence $[s_{t_1}, (s_{t_1} \otimes s_{t_2}), \cdots, (s_{t_1} \otimes s_{t_2} \otimes \cdots \otimes s_{t_K})]$ in $O(\log K)$ time.
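We have verified that the moments $\{\hat{m}_{t_i}, \hat{\Sigma}_{t_i}\}_{i \in [1:k]}$ of the controlled distributions can be estimated by the recurrences in (168) and (171):
$$\hat{m}_{t_i} = \hat{D}_i\hat{m}_{t_{i-1}} + \hat{C}_i\hat{\alpha}_i \tag{191}$$
$$\hat{\Sigma}_{t_i} = \bar{D}_i\hat{\Sigma}_{t_{i-1}} + \bar{C}_i\mathbf{1}, \tag{192}$$
where we define
$$\hat{D}_i = e^{-\Delta_{i-1}(t_i)D_i}, \quad \hat{C}_i = -e^{-\Delta_{i-1}(t_i)D_i} D_i^{-1}\big(I - e^{\Delta_{i-1}(t_i)D_i}\big), \tag{193}$$
$$\bar{D}_i = e^{-2\Delta_{i-1}(t_i)D_i}, \quad \bar{C}_i = -\frac{1}{2} e^{-2\Delta_{i-1}(t_i)D_i} D_i^{-1}\big(I - e^{2\Delta_{i-1}(t_i)D_i}\big), \tag{194}$$
and $\mathbf{1} = (1, \cdots, 1) \in \mathbb{R}^d$. For a parallel scan, we define the sequences of tuples $\{M_i\}_{i \in [1:k]}$ with $M_i = (\hat{D}_i, \hat{C}_i\hat{\alpha}_i)$ and $\{S_i\}_{i \in [1:k]}$ with $S_i = (\bar{D}_i, \bar{C}_i\mathbf{1})$, for $\{m_{t_i}\}_{i \in [1:k]}$ and $\{\Sigma_{t_i}\}_{i \in [1:k]}$, respectively, together with the binary operator $\otimes$:
$$M_s \otimes M_t = (\hat{D}_t \bullet \hat{D}_s,\ \hat{D}_t \bullet \hat{C}_s\hat{\alpha}_s + \hat{C}_t\hat{\alpha}_t), \tag{195}$$
$$S_s \otimes S_t = (\bar{D}_t \bullet \bar{D}_s,\ \bar{D}_t \bullet \bar{C}_s\mathbf{1} + \bar{C}_t\mathbf{1}). \tag{196}$$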
We can verify that $\otimes$ is an associative operator, since it satisfies $(M_s \otimes M_t) \otimes M_u = M_s \otimes (M_t \otimes M_u)$, and likewise for $S_i$.

Algorithm 2 ParallelScan
1: Input. Sequence of tuples $\{F_1, F_2, \ldots, F_K\}$, associative operator $\otimes$.
2: Up-Sweep Stage.
3: for $d = 0$ to $\lceil \log_2 K \rceil - 1$ do
4: for each sub-tree of height $d$ do in parallel
5: Let $i = 2^{d+1}k + 2^{d+1} - 1$ for $k = 0, 1, \ldots$
6: if $i < K$
7: $F_i = F_{i-2^d} \otimes F_i$
8: end if
9: end for
10: end for
11: Down-Sweep Stage.
12: $F_K = I$, where $I$ is the identity element for $\otimes$.
13: for $d = \lceil \log_2 K \rceil - 1$ to $0$ do
14: for each sub-tree of height $d$ do in parallel
15: Let $i = 2^{d+1}k + 2^{d+1} - 1$ for $k = 0, 1, \ldots$

Human Activity. The Human Activity dataset$^{6}$ consists of time series data collected from five individuals performing different activities. Following the preprocessing steps described in (Rubanova et al., 2019), we obtained 6,554 sequences, each with 211 time points and a fixed sequence length of 50 irregularly sampled time stamps. The time range was rescaled to [0, 1]. The classification task involves assigning each time point to one of seven categories: "walking", "falling", "lying", "sitting", "standing up", "on all fours", or "sitting on the ground". The dataset was split into 4,194 sequences for training, 1,049 for validation, and 1,311 for testing.

Pendulum. The pendulum images were algorithmically generated through numerical simulation as outlined in (Becker et al., 2019). We followed the setup described in (Schirmer et al., 2022), where 4,000 image sequences were generated. Each sequence consists of 50 time stamps, irregularly sampled from $T = 100$, with each image being a 24 × 24 pixel representation. The sequences were further corrupted by a correlated noise process, as detailed in (Becker et al., 2019). For our experiments, we used 2,000 sequences for training and 1,000 sequences each for validation and testing.
USHCN. The USHCN dataset (Menne et al., 2015)$^{7}$ includes daily measurements from 1,218 weather stations across the US, covering five variables: precipitation, snowfall, snow depth, and minimum and maximum temperature. We follow the pre-processing steps outlined in (De Brouwer et al., 2019), but select a subset of 1,168 stations over a four-year period starting from 1990, consistent with (Schirmer et al., 2022). Moreover, we make the time series irregular by subsampling 50% of the time points and randomly removing 20% of the measurements. We normalize the features to lie within the range [0, 1] and split the data into 60% for training, 20% for validation, and 20% for testing.
Physionet. The Physionet dataset (Silva et al., 2012)$^{8}$ contains 8,000 multivariate clinical time series obtained from the intensive care unit (ICU). Each time series includes various clinical features recorded during the first 48 hours after the patient's admission to the ICU. We preprocess the data as in (Rubanova et al., 2019). Although the dataset contains a total of 41 measurements, we eliminate 4 static features, i.e., age, gender, height, and ICU-type, leaving 37 time-varying features. We round the time steps to 6-minute intervals, following (Schirmer et al., 2022). We normalize the features to lie within the range [0, 1] and split the data into 60% for training, 20% for validation, and 20% for testing.

Training. For all experiments, except for human activity classification, we followed the same experimental setup as CRU (Schirmer et al., 2022)$^{9}$. For the human activity classification task, we used the setup described in mTAND (Shukla & Marlin, 2021)$^{10}$. For a fair comparison, we kept the number of model parameters similar to mTAND for the Human Activity dataset and to CRU for the other datasets. The model was trained using the Adam optimizer (Diederik, 2014) in all experiments. For the per-point classification and regression tasks, we applied a weight decay of $1 \times 10^{-2}$, and we additionally applied gradient clipping for the classification task; no weight decay was used for the other tasks. Additionally, to prevent overfitting, we limited the training epochs to 200 for the Physionet extrapolation experiments. The remaining training hyper-parameters are detailed in Table 4.
To estimate the objective function (28), for the decoder $p_\psi(\cdot|y) = \mathcal{N}(p_\psi(y), \Sigma_p)$, we used a Gaussian likelihood with a fixed variance of $\Sigma_p = 0.01 \cdot I$, following the approach outlined in (Rubanova et al., 2019) for the Pendulum, USHCN, and Physionet datasets. For the Human Activity dataset, we employed a categorical likelihood. Additionally, for $\mathcal{L}(\alpha)$, the initial conditions $(m_0, \Sigma_0)$ of the latent state were trainable parameters, initialized randomly. The covariance $\Sigma_0$ was set using an exponential transformation, with a small constant ($\epsilon = 10^{-6}$) added to ensure positivity. The latent state

Section: ACKNOWLEDGMENTS
This work was partly supported by Institute of Information & communications Technology Planning & Evaluation(IITP) grant funded by the Korea government(MSIT) (No.RS-2019-II190075, Artificial Intelligence Graduate School Program(KAIST), No.RS-2022-II220713, Meta-learning Applicable to Real-world Problems, No.RS-2024-00509279, Global AI Frontier Lab).

Section: Attention
The mask of the Full assimilation scheme only blocks the attention scores corresponding to unseen time stamps. Using these masks, masked attention is calculated. For the History assimilation scheme, the latent variables $z_{t_i}$ include information up to time $t_i$, while for the Full assimilation scheme, $z_{t_i}$ incorporates all available information. Finally, latent variables corresponding to unseen time stamps are filled with the nearest past latent variable value.
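The two masking schemes and the fill step can be sketched with boolean masks. This is a hedged illustration, not the paper's implementation: the Full mask follows the description above, while the History mask is assumed to be the Full mask restricted to keys at or before the query time (a causal restriction inferred from "information up to time $t_i$"). The function names and the example `observed` pattern are hypothetical.

```python
def build_masks(observed):
    """observed[j] is True when time stamp j carries an observation.

    Returns (full, history), where mask[i][j] is True if query i may attend to key j.
    """
    n = len(observed)
    # Full scheme: block only the keys at unseen time stamps.
    full = [[observed[j] for j in range(n)] for _ in range(n)]
    # History scheme (assumed causal form): additionally block keys after the query time.
    history = [[observed[j] and j <= i for j in range(n)] for i in range(n)]
    return full, history

def fill_from_past(values, observed):
    """Replace entries at unseen time stamps with the nearest earlier observed entry."""
    filled, last = [], None
    for value, is_observed in zip(values, observed):
        if is_observed:
            last = value
        filled.append(last)  # stays None if nothing has been observed yet
    return filled

observed = [True, False, True, False, False, True]
full_mask, history_mask = build_masks(observed)
latents = fill_from_past(["z0", None, "z2", None, None, "z5"], observed)
```

In an attention layer, positions where the mask is False would typically have their scores set to a large negative value before the softmax; that step is omitted here to keep the sketch dependency-free.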

Section: Algorithm 3 Training ACSSM
Input. Time series $o_{t \in \mathcal{T}'}$ observed over the entire time stamps, with the observed time stamps $\mathcal{T}$ and unseen time stamps $\mathcal{T}_u$, encoder neural network $q_\phi$, decoder neural network $p_\psi$, trainable latent parameters $(m_0, \Sigma_0)$ and neural networks $f_\theta$ and $T_\theta$.
Compute $q_\phi(y_{t \in \mathcal{T}}|o_{t \in \mathcal{T}})$ by using (25) on the observed time stamps $\mathcal{T}$.
Sample latent observations $y_{t \in \mathcal{T}} \sim q_\phi(y_{t \in \mathcal{T}}|o_{t \in \mathcal{T}})$.
Parallel computation of the objective function $\mathcal{L}(\alpha)$ in (16):
if history assimilation then Estimate latent variables
Sample latent predictions $\tilde{y}_{t \in \mathcal{T}'} \sim g(y_{t \in \mathcal{T}'}|X^\alpha_{t \in \mathcal{T}'})$ on the entire time stamps $\mathcal{T}'$.
Compute $p_\psi(o_{t \in \mathcal{T}'}|\tilde{y}_{t \in \mathcal{T}})$ by using (26) on the entire time stamps $\mathcal{T}'$.
Optimize $\mathrm{ELBO}(\psi, \phi, \theta)$ by using (28) with gradient descent.
end for

transition matrix $A(\cdot)$ was initialized as $A = \sum_{l=1}^{L} E D_l E^\top$, where $E$ was initialized orthonormally following (Lezcano Casado, 2019), and the diagonal matrices $\{D_l\}_{l \in [1:L]}$ were initialized randomly and passed through a negative exponential to keep the values negative. For the potentials
Architecture. In all experiments except for the Pendulum dataset, the time series $o_{0:T}$ was provided with the observation mask concatenated. We used a dropout rate of 0.2 for Human Activity, while no dropout was used for the other experiments. For our method, the networks used for each dataset are listed below, where $d$ is the dimension of the latent space $\mathbb{R}^d$ as described in Table 4.
• Human Activity (Input size, I=12)
• Pendulum (Input size, I=1×24×24) -Encoder network q ϕ : Input(I) → Conv2d(1, 12, kernel size=5, stride=4, padding=2) → ReLU() → MaxPool2d(kernel size=2, stride=2 → Conv2d(12,12, kernel size=3, stride=2, padding=1 


References:
[b1] Abdul Fatir Ansari; Alvin Heng; Andre Lim; Harold Soh (2023). Neural continuous-discrete state space models for irregularly-sampled time series. 
[b2] Elizabeth Louise Baker; Gefan Yang; Michael L Severinsen; Christy Anna Hipsley; Stefan Sommer (2024). Conditioning non-linear and infinite-dimensional diffusion processes. 
[b3] P Baldi (2017). Stochastic Calculus: An Introduction Through Theory and Exercises. Springer International Publishing
[b4] Philipp Becker; Harit Pandya; Gregor Gebhardt; Cheng Zhao; C James Taylor; Gerhard Neumann (2019). Recurrent kalman networks: Factorized inference in high-dimensional deep feature spaces. PMLR
[b5] Julius Berner; Lorenz Richter; Karen Ullrich (2024). An optimal control perspective on diffusion-based generative modeling. Transactions on Machine Learning Research
[b6] Guy E Blelloch (1990). Prefix sums and their applications. 
[b7] Michelle Boué; Paul Dupuis (1998). A variational representation for certain functionals of brownian motion. The Annals of Probability
[b8] René Carmona (2016). Lectures on BSDEs, stochastic control, and stochastic differential games with financial applications. SIAM
[b9] Zhengping Che; Sanjay Purushotham; Kyunghyun Cho; David Sontag; Yan Liu (2018). Recurrent neural networks for multivariate time series with missing values. Scientific reports
[b10] Ricky T Q Chen; Yulia Rubanova; Jesse Bettencourt; David K Duvenaud (2018). Neural ordinary differential equations. Advances in neural information processing systems
[b11] Raphaël Chetrite; Hugo Touchette (2015). Nonequilibrium markov processes conditioned on large deviations. Annales Henri Poincaré
[b12] Nicolas Chopin; Omiros Papaspiliopoulos (2020). An introduction to sequential Monte Carlo. Springer
[b13] Nicolas Chopin; Andras Fulop; Jeremy Heng; Alexandre H Thiery (2023). Computational doob's h-transforms for online filtering of discretely observed diffusions. 
[b14] Junyoung Chung; Caglar Gulcehre; Kyunghyun Cho; Yoshua Bengio (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. 
[b15] Emmanuel De Bézenac; Syama Sundar Rangapuram; Konstantinos Benidis; Michael Bohlke-Schneider; Richard Kurle; Lorenzo Stella; Hilaf Hasson; Patrick Gallinari; Tim Januschowski (2020). Normalizing kalman filters for multivariate time series analysis. Advances in Neural Information Processing Systems
[b16] Edward De Brouwer; Jaak Simm; Adam Arany; Yves Moreau (2019). Gru-ode-bayes: Continuous modeling of sporadically-observed time series. Advances in neural information processing systems
[b17] P Del Moral (2011). Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Probability and Its Applications. Springer
[b18] Wei Deng; Weijian Luo; Yixin Tan; Marin Biloš; Yu Chen; Yuriy Nevmyvaka; Ricky T Q Chen (2024). Variational Schrödinger diffusion models. 
[b19] Alexander Denker; Francisco Vargas; Shreyas Padhy; Kieran Didi; Simon Mathis; Vincent Dutordoir; Riccardo Barbano; Emile Mathieu; Urszula Julia Komorowska; Pietro Lio (2024). Deft: Efficient finetuning of conditional diffusion models by learning the generalised h-transform. 
[b20] Diederik P Kingma; Jimmy Ba (2014). Adam: A method for stochastic optimization. 
[b21] Andreas Doerr; Christian Daniel; Martin Schiegg; Duy Nguyen-Tuong; Stefan Schaal; Marc Toussaint; Sebastian Trimpe (2018). Probabilistic recurrent state-space models. PMLR
[b22] Joseph L Doob (1957). Conditional brownian motion and the boundary limits of harmonic functions. Bulletin de la Société mathématique de France
[b23] Wendell H Fleming; Halil Mete Soner (2006). Controlled Markov processes and viscosity solutions. Springer Science & Business Media
[b24] Marco Fraccaro; Simon Kamronn; Ulrich Paquet; Ole Winther (2017). A disentangled recognition and nonlinear dynamics model for unsupervised learning. Advances in neural information processing systems
[b25] Albert Gu; Tri Dao (2023). Mamba: Linear-time sequence modeling with selective state spaces. 
[b26] Albert Gu; Karan Goel; Christopher Ré (2021). Efficiently modeling long sequences with structured state spaces. 
[b27] Pieralberto Guarniero; Adam M Johansen; Anthony Lee (2017). The iterated auxiliary particle filter. Journal of the American Statistical Association
[b28] Jiequn Han; Ruimeng Hu (2020). Deep fictitious play for finding markovian nash equilibrium in multi-agent games. PMLR
[b29] Carsten Hartmann; Lorenz Richter; Christof Schütte; Wei Zhang (2017). Variational characterization of free energy: Theory and algorithms. Entropy
[b30] Jeremy Heng; Adrian N Bishop; George Deligiannidis; Arnaud Doucet (2020). Controlled sequential monte carlo. The Annals of Statistics
[b31] Jeremy Heng; Valentin De Bortoli; Arnaud Doucet; James Thornton (2021). Simulating diffusion bridges with score matching. 
[b32] Jonathan Ho; Ajay Jain; Pieter Abbeel (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems
[b33] Valerii Iakovlev; Cagatay Yildiz; Markus Heinonen; Harri Lähdesmäki (2023). Latent neural ODEs with sparse bayesian multiple shooting. 
[b34] Andrew H Jazwinski (2007). Stochastic processes and filtering theory. Courier Corporation
[b35] Patrick Kidger; James Morrill; James Foster; Terry Lyons (2020). Neural controlled differential equations for irregular time series. Advances in Neural Information Processing Systems
[b36] P E Kloeden; E Platen (2013). Numerical Solution of Stochastic Differential Equations. Springer
[b39] Alexej Klushyn; Richard Kurle; Maximilian Soelch; Botond Cseke; Patrick Van Der Smagt (2021). Latent matters: Learning deep state-space models. Advances in Neural Information Processing Systems
[b40] Dieterich Lawson; Allan Raventós; Andrew Warrington; Scott Linderman (2022). Sixo: Smoothing inference with twisted objectives. Advances in Neural Information Processing Systems
[b41] Christian Léonard (2011). Stochastic derivatives and generalized h-transforms of markov processes. 
[b42] Christian Léonard (2013). A survey of the Schrödinger problem and some of its connections with optimal transport. 
[b43] Alexander K Lew; Tan Zhi-Xuan; Gabriel Grand; Vikash K Mansinghka (2023). Sequential monte carlo steering of large language models using probabilistic programs. 
[b44] Mario Lezcano Casado (2019). Trivializations for gradient-based optimization on manifolds. Advances in Neural Information Processing Systems
[b45] Xuechen Li; Ting-Kam Leonard Wong; Ricky T Q Chen; David Duvenaud (2020). Scalable gradients for stochastic differential equations. 
[b46] Guan-Horng Liu; Yaron Lipman; Maximilian Nickel; Brian Karrer; Evangelos Theodorou; Ricky T Q Chen (2024). Generalized schrödinger bridge matching. 
[b47] Xingchao Liu; Lemeng Wu; Mao Ye; Qiang Liu (2023). Learning diffusion bridges on constrained domains. 
[b48] Jianfeng Lu; Yuliang Wang (2024). Guidance for twisted particle filter: a continuous-time perspective. 
[b49] M J Menne; C N Williams; R S Vose (2015). Long-term daily climate records from stations across the contiguous united states. 
[b50] Bernt Oksendal (1992). Stochastic Differential Equations : An Introduction with Applications. Springer-Verlag
[b51] Byoungwoo Park; Jungwon Choi; Sungbin Lim; Juho Lee (2024). Stochastic optimal control for diffusion bridges in function spaces. 
[b52] Stefano Peluchetti (2023). Diffusion bridge mixture transports, schrödinger bridge problems and generative modeling. Journal of Machine Learning Research
[b53] Angus Phillips; Hai-Dang Dau; Michael John Hutchinson; Valentin De Bortoli; George Deligiannidis; Arnaud Doucet (2024). Particle denoising diffusion sampler. 
[b54] Syama Sundar Rangapuram; Matthias W Seeger; Jan Gasthaus; Lorenzo Stella; Yuyang Wang; Tim Januschowski (2018). Deep state space models for time series forecasting. Advances in neural information processing systems
[b55] Lorenz Richter; Julius Berner (2022). Robust sde-based variational formulations for solving linear pdes via deep learning. PMLR
[b56] H Risken; T Frank (2012). The Fokker-Planck Equation: Methods of Solution and Applications. Springer
[b57] L C G Rogers; David Williams (2000). Diffusions, Markov processes and martingales. Cambridge university press
[b58] Yulia Rubanova; Ricky Tq Chen; David K Duvenaud (2019). Latent ordinary differential equations for irregularly-sampled time series. Advances in neural information processing systems
[b59] S Särkkä (2013). Bayesian Filtering and Smoothing. Bayesian Filtering and Smoothing. Cambridge University Press
[b60] S Särkkä; A Solin (2019). Applied Stochastic Differential Equations. Cambridge University Press
[b61] Simo Särkkä; Ángel F García-Fernández (2020). Temporal parallelization of bayesian smoothers. IEEE Transactions on Automatic Control
[b62] Mona Schirmer; Mazin Eltayeb; Stefan Lessmann; Maja Rudolph (2022). Modeling irregular time series with continuous recurrent units. PMLR
[b63] Yuyang Shi; Valentin De Bortoli; Andrew Campbell; Arnaud Doucet (2024). Diffusion schrödinger bridge matching. Advances in Neural Information Processing Systems
[b64] Satya Narayan Shukla; Benjamin M Marlin (2021). Multi-time attention networks for irregularly sampled time series. 
[b65] Ikaro Silva; George Moody; Daniel J Scott; Leo A Celi; Roger G Mark (2012). Predicting in-hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. IEEE
[b66] Jimmy T H Smith; Andrew Warrington; Scott Linderman (2023). Simplified state space layers for sequence modeling. 
[b67] Yang Song; Jascha Sohl-Dickstein; Diederik P Kingma; Abhishek Kumar; Stefano Ermon; Ben Poole (2020). Score-based generative modeling through stochastic differential equations. 
[b68] Belinda Tzen; Maxim Raginsky (2019). Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit. 
[b69] Ramon Van Handel (2007). Stochastic calculus, filtering, and stochastic control. 
[b70] Francisco Vargas; Andrius Ovsianas; David Fernandes; Mark Girolami; Neil D Lawrence; Nikolas Nüsken (2023). Bayesian learning via neural schrödinger-föllmer flows. Statistics and Computing
[b71] Mao Ye; Lemeng Wu; Qiang Liu (2022). First hitting diffusion models for generating manifold, graph and categorical data. 
[b72] Sebastian Zeng; Florian Graf; Roland Kwitt (2023). Latent SDEs on homogeneous spaces. 
[b73] Qinsheng Zhang; Yongxin Chen (2022). Path integral sampler: A stochastic control approach for sampling. 
[b74] Stephen Zhao; Rob Brekelmans; Alireza Makhzani; Roger Grosse (2024). Probabilistic inference in language models via twisted sequential monte carlo. 

Figures:
Figure fig_0: 1
Type: figure
Caption: Figure 1: Conceptual illustration. Given the observed time stamps $\mathcal{T} = \{t_i\}_{i \in [1:4]}$ and the unseen time stamps $\mathcal{T}_u$ (× in figure), the encoder maps the input time series $\{o_t\}_{t \in \mathcal{T}}$ into auxiliary variables $\{y_t\}_{t \in \mathcal{T}}$. These auxiliary variables are then utilized to compute the control policies $\{\alpha\}_{i \in [1:5]}$ through a masked attention mechanism that relies on two different assimilation schemes. The computed policies $\{\alpha\}_{i \in [1:5]}$ control the prior dynamics $\mathbb{P}$ over the interval $[0, T]$ to approximate the posterior $\mathbb{P}^\star$ in the latent space. Finally, the sample paths $X^\alpha_{0:T} \sim \mathbb{P}^\alpha$ are decoded to generate predictions across the complete time stamps $\mathcal{T}' = \mathcal{T} \cup \mathcal{T}_u$ (• in figure), over the entire interval $[0, T]$.
Data: 

Figure fig_1: 
Type: figure
Caption: Furthermore, since observations are updated only at discrete time steps {t i } i∈[1:k] , the latent variables z t remain constant within any interval t ∈ [t i-1 , t i ) for all i ∈ [1 : k], making A i and α i constant as well. As a result, the dynamics (21) remain linear over local intervals. This structure enables us to derive a closed-form solution for the intermediate latent states. Theorem 3.8 (Simulation-free estimation). Let us consider sequences of SPD matrices {A i } i∈[1:k]
Data: 

Figure fig_2: 
Type: figure
Caption: )Definition B.2 (Path Measure). Let us consider a path-sequence of random variables {X ti } i∈[1:N ] over an interval 0 = t i ≤ • • • ≤ t N = T , where each X ti taking values in measurable space (R d , B(R d )) with a Borel σ-algebra B(R d ) . Then, the set of random variables {X ti } i∈[1:N ] is given by the probability measure P ∈ C([0, T ], R d ):
Data: 

Figure fig_3: 
Type: figure
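Caption: and $\mathbf{1} = (1, \cdots, 1) \in \mathbb{R}^d$. For a parallel scan, we will define the sequence of tuples $\{M_i\}_{i \in [1:k]}$, such that each element is $M_i = (\hat{D}_i, \hat{C}_i\hat{\alpha}_i)$, and $\{S_i\}_{i \in [1:k]}$, such that each element is $S_i = (\bar{D}_i, \bar{C}_i\mathbf{1})$, for $\{m_{t_i}\}_{i \in [1:k]}$ and $\{\Sigma_{t_i}\}_{i \in [1:k]}$, respectively. Now, let us define a binary operator $\otimes$: $M_s \otimes M_t = (\hat{D}_t \bullet \hat{D}_s,\ \hat{D}_t \bullet \hat{C}_s\hat{\alpha}_s + \hat{C}_t\hat{\alpha}_t)$ (195), $S_s \otimes S_t = (\bar{D}_t \bullet \bar{D}_s,\ \bar{D}_t \bullet \bar{C}_s\mathbf{1} + \bar{C}_t\mathbf{1})$ (196).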
Data: 

Figure fig_4: 
Type: figure
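Caption: $(M_s \otimes M_t) \otimes M_u = (\hat{D}_t \bullet \hat{D}_s,\ \hat{D}_t \bullet \hat{C}_s\hat{\alpha}_s + \hat{C}_t\hat{\alpha}_t) \otimes (\hat{D}_u, \hat{C}_u\hat{\alpha}_u)$ (197) $= (\hat{D}_u \bullet \hat{D}_t \bullet \hat{D}_s,\ \hat{D}_u \bullet (\hat{D}_t \bullet \hat{C}_s\hat{\alpha}_s + \hat{C}_t\hat{\alpha}_t) + \hat{C}_u\hat{\alpha}_u)$ (198) $= (\hat{D}_u \bullet \hat{D}_t \bullet \hat{D}_s,\ \hat{D}_u \bullet \hat{D}_t \bullet \hat{C}_s\hat{\alpha}_s + \hat{D}_u \bullet \hat{C}_t\hat{\alpha}_t + \hat{C}_u\hat{\alpha}_u)$ (199) $= M_s \otimes (M_t \otimes M_u)$ (200). Thus we get $(M_s \otimes M_t) \otimes M_u = M_s \otimes (M_t \otimes M_u)$. Moreover, we can get similar results for $S_i$: $(S_s \otimes S_t) \otimes S_u = (\bar{D}_t \bullet \bar{D}_s,\ \bar{D}_t \bullet \bar{C}_s\mathbf{1} + \bar{C}_t\mathbf{1}) \otimes (\bar{D}_u, \bar{C}_u\mathbf{1})$ (201) $= (\bar{D}_u \bullet \bar{D}_t \bullet \bar{D}_s,\ \bar{D}_u \bullet (\bar{D}_t \bullet \bar{C}_s\mathbf{1} + \bar{C}_t\mathbf{1}) + \bar{C}_u\mathbf{1})$ (202) $= (\bar{D}_u \bullet \bar{D}_t \bullet \bar{D}_s,\ \bar{D}_u \bullet \bar{D}_t \bullet \bar{C}_s\mathbf{1} + \bar{D}_u \bullet \bar{C}_t\mathbf{1} + \bar{C}_u\mathbf{1})$ (203) $= S_s \otimes (S_t \otimes S_u)$ (204). Now, both the means $\{m_{t_i}\}_{i \in [1:k]}$ and the covariances $\{\Sigma_{t_i}\}_{i \in [1:k]}$ along the interval can be computed (in parallel over the $K$ time stamps) using the parallel scan algorithm described in Algorithm 1. Algorithm 2 ParallelScan 1: Input. Sequence of tuples $\{F_1, F_2, \ldots, F_K\}$, associative operator $\otimes$. 2: Up-Sweep Stage. 3: for $d = 0$ to $\lceil \log_2 K \rceil - 1$ do 4: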
Data: 

Figure fig_6: 
Type: figure
Caption: Fi-2 d = F i-2 d ⊗ F i 18: F i = F i-2 d for 22: for i = 1 to K do in parallel 23: F ′ i = F 1 ⊗ F 2 ⊗ • • • ⊗ F i 24: end for 25: Return Scanned sequence {F ′ 1 , F ′ 2 , . . . , F ′ K }.
Data: 
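The scan over affine updates can be illustrated for a single diagonal coordinate. The sketch below is only an illustration under stated assumptions: each scan element is taken to be a pair (decay, offset) with decay = exp(-Δ_i D_i) and offset = D_i^{-1}(1 - exp(-Δ_i D_i)) α_i, the function names `combine` and `inclusive_scan` are hypothetical, and the scan uses a simple doubling (Hillis-Steele) schedule rather than the up-sweep/down-sweep stages of Algorithm 2. All numeric values are made up.

```python
import math

def combine(first, second):
    """Associative operator: compose affine maps x -> d*x + c, `first` applied before `second`."""
    d1, c1 = first
    d2, c2 = second
    return (d2 * d1, d2 * c1 + c2)

def inclusive_scan(elems, op):
    """Doubling-style inclusive scan; the inner loop over i is the part that could run in parallel."""
    out = list(elems)
    step = 1
    while step < len(out):
        prev = list(out)  # snapshot so every update reads the previous pass
        for i in range(step, len(out)):
            out[i] = op(prev[i - step], prev[i])
        step *= 2
    return out

# Hypothetical interval lengths, diagonal decay rates, and controls (one coordinate).
deltas = [0.3, 1.1, 0.2, 0.7, 0.5]
rates = [0.8, 1.5, 0.4, 2.0, 1.0]
controls = [0.5, -1.0, 2.0, 0.0, 0.3]
m0 = 1.5

elems = []
for dt, d, a in zip(deltas, rates, controls):
    decay = math.exp(-dt * d)
    elems.append((decay, (1.0 - decay) / d * a))

# Means from the scanned prefixes: m_i = D'_i * m0 + c'_i.
scanned = inclusive_scan(elems, combine)
scan_means = [d * m0 + c for d, c in scanned]

# Sequential reference: m_i = decay_i * m_{i-1} + offset_i.
seq_means = []
m = m0
for decay, offset in elems:
    m = decay * m + offset
    seq_means.append(m)
```

Because the operator is associative, the prefixes can be grouped in any order, which is what allows the logarithmic-depth schedule; the sequential loop is kept only as a reference.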

Figure fig_7: 3
Type: figure
Caption: Figure 3: Example of the pendulum sequence. (Top) The input image sequences $\{o_t\}_{t\in\mathcal{T}}$ observed at irregular time stamps. (Bottom) The angular values $\sin(\theta_t)$ and $\cos(\theta_t)$, where $\theta_t$ represents the angle of the pendulum at time $t \in [0, 100]$, are used as regression targets.
Data: 

Figure tab_1: 2
Type: table
Caption: Test MSE ($\times 10^{-3}$).
Data:
Model | MSE
Latent-ODE † | 15.70 ± 0.29
CRU † | 4.63 ± 1.07
Latent-SDEH ‡ | 3.84 ± 0.35
S5 ⋆ | 3.41 ± 0.27
mTAND ‡ | 3.20 ± 0.60

Figure tab_2: 3
Type: table
Caption: Test MSE ($\times 10^{-2}$) for inter/extra-polation on USHCN and Physionet.
Data: 4.2 SEQUENCE INTERPOLATION & EXTRAPOLATION
Datasets. We benchmark the models on two real-world datasets, USHCN and Physionet. The USHCN dataset (Menne et al., 2015) contains 1,218 daily measurements from weather stations across

Figure tab_3: 
Type: table
Caption: , X t )Y t dt + e -T A t h t dt + e -T t f (s,Xs)ds ∇ x h(t, X t ) ⊤ dW t .
Data: t f (s,Xs)ds∂h ∂t+

Figure tab_4: 
Type: table
Caption: Published as a conference paper at ICLR 2025. Algorithm 1 Parallel Scan for Mean and Covariance. 1: Input. Given time stamps $\mathcal{T} = \{t_1, t_2, \ldots, t_K\}$, initial mean $m_{t_0}$ and covariance $\Sigma_{t_0}$, control policies {α
Data: 

Figure tab_5: 4
Type: table
Caption: Training Hyper-parameters
Data:
Dataset | Learning Rate | Train Epoch | Time Scale | Batch Size | $\mathbb{R}^d$ | # of base matrices (L) | # of parameters
Human Activity | $1 \times 10^{-3}$ | 400 | 1/221 | 256 | 288 | 256 | 1.65M
Pendulum | $1 \times 10^{-3}$ | 500 | 0.1 | 50 | 20 | 15 | 19.6K
USHCN | $1 \times 10^{-3}$ | 500 | 0.2 | 50 | 20 | 20 | 18.5K
Physionet | $1 \times 10^{-3}$ | 500 | 0.3 | 100 | 24 | 20 | 28.5K
D.2 MASKING SCHEME
Figure D.2 provides a detailed illustration of the masked attention mechanism described in the assimilation schemes introduced in Section 3.3.
D.3 TRAINING DETAILS


Formulas:
Formula formula_0: P [•] = E P [•|X t = x],

Formula formula_1: t := (X (•) t ) # P (•) with marginal density p (•)

Formula formula_2: (•) t (x) = p (•)

Formula formula_3: (Prior State) dX t = b(t, X t )dt + dW t ,(1)

Formula formula_4: (Posterior Dist.) $d\mathbb{P}^\star(X_{0:T} \mid H_{t_k}) = \frac{1}{Z(H_{t_k})} \prod_{i=1}^{K} g_i(y_{t_i} \mid X_{t_i})\, d\mathbb{P}(X_{0:T})$ (2)

Formula formula_5: $f_i(y_{t_i} \mid x_{t_i}) = \frac{g_i(y_{t_i} \mid x_{t_i})}{L_i(g_i)}$, (3)

Formula formula_6: $\{h_i\}_{i\in[1:k]}$, where each $h_i : [t_{i-1}, t_i) \times \mathbb{R}^d \to \mathbb{R}_+$, for all $i \in [1:k]$, is a conditional expectation $h_i(t, x_t) := \mathbb{E}_{\mathbb{P}}\big[\prod_{j \ge i}^{k} f_j(y_{t_j} \mid X_{t_j}) \mid X_t = x_t\big]$, where $\{f_i\}_{i\in[1:k]}$ is defined in (3). Now, we define a function $h : [0, T] \times \mathbb{R}^d \to \mathbb{R}_+$ by integrating the functions $\{h_i\}_{i\in[1:k]}$, $h(t, x) := \sum_{i=1}^{k} h_i(t, x)\,\mathbf{1}_{[t_{i-1}, t_i)}(t)$.

Formula formula_7: (Conditioned State) dX ⋆ t = [b(t, X ⋆ t ) + ∇ x log h(t, X ⋆ t )] dt + dW t (5)

Formula formula_8: [0, T ] × R d → R d : (Controlled State) dX α t = [b(t, X α t ) + α(t, X α t )] dt + d Wt ,(6)

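Formula formula_9: $J(t, x_t, \alpha) = \mathbb{E}_{\mathbb{P}^\alpha}\big[\int_t^T \frac{1}{2}\|\alpha(s, X^\alpha_s)\|^2 ds - \sum_{i:\{t \le t_i\}} \log f_i(y_{t_i} \mid X^\alpha_{t_i}) \mid X^\alpha_t = x_t\big]$. (7)

Formula formula_10: $V_i \in C^{1,2}([t_{i-1}, t_i) \times \mathbb{R}^d)$, $V_i(t, x_t) := \min_{\alpha\in\mathbb{A}} \mathbb{E}_{\mathbb{P}^\alpha}\big[\int_{t_{i-1}}^{t_i} \frac{1}{2}\|\alpha_s\|^2 ds - \log f_i(y_{t_i} \mid X^\alpha_{t_i}) + V_{i+1}(t_i, X^\alpha_{t_i}) \mid X_t = x_t\big]$, (8)

Formula formula_11: $V(t, x_t) = \min_{\alpha\in\mathbb{A}} \mathbb{E}_{\mathbb{P}^\alpha}\big[\int_t^{t_{I(u)}} \frac{1}{2}\|\alpha_s\|^2 ds - \sum_{i:\{t \le t_i \le t_{I(u)}\}} \log f_i + V_{I(u)+1}(t_{I(u)}, X^\alpha_{t_{I(u)}}) \mid X^\alpha_t = x_t\big]$, (9)

Formula formula_12: $I(u) = \max\{i \in [1:k] \mid t_i \le u\}$ and $f_i = f_i(y_{t_i} \mid X^\alpha_{t_i})$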

Formula formula_13: V i (t, x) ∈ C 1,2 ([t i-1 , t i ), R d ), for all i ∈ [1 : k],

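Formula formula_14: $\partial_t V_{i,t} + \mathcal{A}_t V_{i,t} + \min_{\alpha\in\mathbb{A}}\big[(\nabla_x V_{i,t})^\top \alpha_{i,t} + \frac{1}{2}\|\alpha_{i,t}\|^2\big] = 0$, $t_{i-1} \le t < t_i$ (10); $V_i(t_i, x) = -\log f_i(y_{t_i} \mid x) + V_{i+1}(t_i, x)$, $t = t_i$, $\forall i \in [1:k]$, (11)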

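Formula formula_15: $\alpha^\star_i(t, x) = -\nabla_x V_i(t, x)$. Now, define a function $\alpha^\star : [0, T] \times \mathbb{R}^d \to \mathbb{R}^d$ by integrating the optimal controls $\{\alpha^\star_i\}_{i\in[1:k]}$, $\alpha^\star(t, x) := \sum_{i=1}^{k} \alpha^\star_i(t, x)\,\mathbf{1}_{[t_{i-1}, t_i)}(t)$ (12). Then, $V(t, x_t) = J(t, x_t, \alpha^\star) \le J(t, x_t, \alpha)$ holds for any $(t, x_t) \in [0, T] \times \mathbb{R}^d$ and $\alpha \in \mathbb{A}$.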

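Formula formula_16: $\partial_t h_{i,t} + \mathcal{A}_t h_{i,t} = 0$, $t_{i-1} \le t < t_i$ (13); $h_i(t_i, x) = f_i(y_{t_i} \mid x)\, h_{i+1}(t_i, x)$, $t = t_i$, $\forall i \in [1:k]$. (14)

Formula formula_17: $D_{KL}(\mathbb{P}^\alpha \mid \mathbb{P}^\star) = D_{KL}(\mu_0 \mid \mu^\star_0) + \mathbb{E}_{x_0\sim\mu_0}[J(0, x_0, \alpha)] = L(\alpha) + \log Z(H_{t_k}) \ge 0$, (15)

Formula formula_19: $L(\alpha) = \mathbb{E}_{\mathbb{P}^\alpha}\big[\int_0^T \frac{1}{2}\|\alpha(s, X^\alpha_s)\|^2 ds - \sum_{i=1}^{k} \log g_i(y_{t_i} \mid X^\alpha_{t_i})\big] \ge -\log Z(H_{t_k})$. (16)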

Formula formula_20: $dX_t = [-AX_t + \alpha]\,dt + dW_t$, where $X_0 \sim \mathcal{N}(m_0, \Sigma_0)$, (17)

Formula formula_21: $m_t = e^{-At} m_0 - A^{-1}(e^{-At} - I)\alpha$ (18)

Formula formula_22: $\Sigma_t = e^{-At}\,\Sigma_0\, e^{-A^\top t} + \int_0^t e^{-A(t-s)}\, e^{-A^\top (t-s)}\, ds$. (19)
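As a sanity check on the closed forms (18) and (19), the scalar case can be compared against a direct numerical integration of the moment equations dm/dt = -a m + α and dΣ/dt = -2aΣ + 1. This is a minimal sketch assuming a scalar drift coefficient a > 0 and unit diffusion; the function names and all numeric values are hypothetical.

```python
import math

def closed_form_moments(a, alpha, m0, s0, t):
    """Scalar versions of (18) and (19)."""
    decay = math.exp(-a * t)
    mean = decay * m0 - (decay - 1.0) * alpha / a
    var = decay * decay * s0 + (1.0 - decay * decay) / (2.0 * a)
    return mean, var

def integrate_moment_odes(a, alpha, m0, s0, t, n_steps=2000):
    """Classical RK4 on dm/dt = -a*m + alpha and dS/dt = -2*a*S + 1."""
    def rhs(m, s):
        return -a * m + alpha, -2.0 * a * s + 1.0

    h = t / n_steps
    m, s = m0, s0
    for _ in range(n_steps):
        k1 = rhs(m, s)
        k2 = rhs(m + 0.5 * h * k1[0], s + 0.5 * h * k1[1])
        k3 = rhs(m + 0.5 * h * k2[0], s + 0.5 * h * k2[1])
        k4 = rhs(m + h * k3[0], s + h * k3[1])
        m += h * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        s += h * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
    return m, s

# Hypothetical parameters for the check.
mean_cf, var_cf = closed_form_moments(a=1.3, alpha=0.7, m0=2.0, s0=0.5, t=1.5)
mean_num, var_num = integrate_moment_odes(a=1.3, alpha=0.7, m0=2.0, s0=0.5, t=1.5)
```

The closed form needs no time stepping at all, which is the property the simulation-free estimation relies on.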

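Formula formula_23: $d\hat X_t = [-D\hat X_t + \hat\alpha]\,dt + d\hat W_t$, where $\hat X_0 \sim \mathcal{N}(\hat m_0, \hat\Sigma_0)$, (20)

Formula formula_24: $\hat X_t = E^\top X_t$, $\hat\alpha = E^\top \alpha$, $\hat W_t = E^\top W_t$, $\hat m_t = E^\top m_t$ and $\hat\Sigma_t = E^\top \Sigma_t E$. Note that $\hat W_t \overset{d}{=} E^\top W_t$ for any $t \in [0, T]$

Formula formula_25: $m_t = E\,\hat m_t$, $\Sigma_t = E\,\hat\Sigma_t\, E^\top$.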
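The change of basis can be checked numerically on a small example. The sketch below assumes a 2×2 rotation matrix for E and a positive diagonal D, builds A = E D E^⊤, evaluates the mean coordinate-wise in the projected basis, maps it back with E, and compares against RK4 integration of dm/dt = -A m + α in the original coordinates. All values and helper names are hypothetical.

```python
import math

def matvec(M, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

theta = 0.6  # hypothetical rotation angle defining the orthogonal basis E
E = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
E_T = [list(col) for col in zip(*E)]
D = [0.7, 2.0]  # hypothetical positive eigenvalues

# A = E diag(D) E^T
A = [[sum(E[i][k] * D[k] * E[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]

alpha = [1.0, -0.5]
m0 = [0.3, 1.2]
t = 0.9

# Projected coordinates: each entry evolves independently.
m0_hat = matvec(E_T, m0)
alpha_hat = matvec(E_T, alpha)
m_hat = [math.exp(-d * t) * mh - (math.exp(-d * t) - 1.0) * ah / d
         for d, mh, ah in zip(D, m0_hat, alpha_hat)]
m_closed = matvec(E, m_hat)  # map back to the original basis

def rhs(m):
    Am = matvec(A, m)
    return [-Am[i] + alpha[i] for i in range(2)]

# Reference: RK4 on the coupled mean ODE in the original coordinates.
n_steps = 2000
h = t / n_steps
m_ref = list(m0)
for _ in range(n_steps):
    k1 = rhs(m_ref)
    k2 = rhs([m_ref[i] + 0.5 * h * k1[i] for i in range(2)])
    k3 = rhs([m_ref[i] + 0.5 * h * k2[i] for i in range(2)])
    k4 = rhs([m_ref[i] + h * k3[i] for i in range(2)])
    m_ref = [m_ref[i] + h * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]) / 6.0
             for i in range(2)]
```

Working in the projected basis replaces a matrix exponential with elementwise exponentials, which is the practical benefit of the shared eigenbasis.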

Formula formula_26: $dX^\alpha_t = [-A_t X_t + \alpha_t]\,dt + d\tilde W_t$, (21)

Formula formula_27: $A_t = \sum_{l=1}^{L} w^{(l)}_\theta(z_t)\, A^{(l)}$, $\alpha_t = B_\theta z_t$. (22)

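Formula formula_29: $A_i = E D_i E^\top$ with $E \in \mathbb{R}^{d\times d}$ and $D_i \in \mathrm{diag}(\mathbb{R}^d) \succeq 0$ for all $i \in [1:k]$, control vectors $\{\alpha_i\}_{i\in[1:k]}$

Formula formula_30: $dX_t = [-A_i X_t + \alpha_i]\,dt + \sigma\, dW_t$, $t \in [t_{i-1}, t_i)$.

Formula formula_31: $m_{t_i} = E\Big[e^{-\sum_{j=1}^{i}(t_j - t_{j-1})D_j}\,\hat m_{t_0} - \sum_{k=1}^{i} e^{-\sum_{j=k}^{i}(t_j - t_{j-1})D_j}\, D_k^{-1}\big(I - e^{(t_k - t_{k-1})D_k}\big)\hat\alpha_k\Big]$, $\Sigma_{t_i} = E\Big[e^{-2\sum_{j=1}^{i}(t_j - t_{j-1})D_j}\,\hat\Sigma_{t_0} - \frac{1}{2}\sum_{k=1}^{i} e^{-2\sum_{j=k}^{i}(t_j - t_{j-1})D_j}\, D_k^{-1}\big(I - e^{2(t_k - t_{k-1})D_k}\big)\Big]E^\top$, where $\hat m_{t_i} = E^\top m_{t_i}$, $\hat\Sigma_{t_i} = E^\top \Sigma_{t_i} E$ and $\hat\alpha_i = E^\top \alpha_i$ for all $i \in [1:k]$.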
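The summation formulas above can be checked against interval-by-interval propagation. The following sketch evaluates both for a single diagonal coordinate in the projected basis, assuming σ = 1 and strictly positive diagonal entries; the function names and all numeric values are hypothetical.

```python
import math

def summed_moments(deltas, rates, controls, m0, s0):
    """Evaluate the closed-form sums for one diagonal coordinate (sigma = 1 assumed)."""
    means, variances = [], []
    for i in range(1, len(deltas) + 1):
        total = sum(deltas[j] * rates[j] for j in range(i))
        mean = math.exp(-total) * m0
        var = math.exp(-2.0 * total) * s0
        for k in range(i):
            tail = sum(deltas[j] * rates[j] for j in range(k, i))
            step = deltas[k] * rates[k]
            mean -= math.exp(-tail) * (1.0 - math.exp(step)) / rates[k] * controls[k]
            var -= 0.5 * math.exp(-2.0 * tail) * (1.0 - math.exp(2.0 * step)) / rates[k]
        means.append(mean)
        variances.append(var)
    return means, variances

def stepwise_moments(deltas, rates, controls, m0, s0):
    """Propagate exactly over one interval at a time."""
    means, variances = [], []
    m, s = m0, s0
    for dt, d, a in zip(deltas, rates, controls):
        decay = math.exp(-dt * d)
        m = decay * m + (1.0 - decay) / d * a
        s = decay * decay * s + (1.0 - decay * decay) / (2.0 * d)
        means.append(m)
        variances.append(s)
    return means, variances

# Hypothetical irregular interval lengths, decay rates, and controls.
deltas = [0.4, 1.3, 0.1, 0.8]
rates = [0.9, 0.5, 2.5, 1.2]
controls = [0.2, -0.7, 1.5, 0.4]

sum_means, sum_vars = summed_moments(deltas, rates, controls, m0=1.0, s0=0.3)
step_means, step_vars = stepwise_moments(deltas, rates, controls, m0=1.0, s0=0.3)
```

Each marginal in the summed form depends only on the inputs up to index i, so all k marginals can be evaluated independently of one another.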

Formula formula_32: [s t1 , • • • s t K ],

Formula formula_33: [s t1 , (s t1 ⊗ s t2 ), • • • , (s t1 ⊗ s t2 ⊗ • • • ⊗ s t K )](24

Formula formula_34: θ m+1 = θ m -∇ θ L(α θ m ) yields L(α ⋆ ) ≈ L(α θ m →θ M ).

Formula formula_35: $q_\phi(y_{0:T} \mid o_{0:T}) = \prod_{i=1}^{k} q_\phi(y_{t_i} \mid o_{t_i}) = \prod_{i=1}^{k} \mathcal{N}(y_{t_i} \mid q_\phi(o_{t_i}), \Sigma_q)$ (25)

Formula formula_36: $p_\psi(o_{0:T} \mid y_{0:T}) = \prod_{i=1}^{k} p_\psi(o_{t_i} \mid y_{t_i})$, (26)

Formula formula_37: $\log p_\psi(o_{0:T}) \ge \mathbb{E}_{H_T \sim q_\phi(y_{0:T} \mid o_{0:T})}\Big[\log \frac{\prod_{i=1}^{K} p_\psi(o_{t_i} \mid y_{t_i})\, g(y_{0:T})}{\prod_{i=1}^{K} q_\phi(y_{t_i} \mid o_{t_i})}\Big]$ (27) $\ge \mathbb{E}_{H_T \sim q_\phi(y_{0:T} \mid o_{0:T})}\Big[\sum_{i=1}^{K} \log p_\psi(o_{t_i} \mid y_{t_i}) - L(\theta)\Big] = \mathrm{ELBO}(\psi, \phi, \theta)$ (28)

Formula formula_38: 4.1 PER TIME POINT CLASSIFICATION & REGRESSION
Table 1: Test Accuracy (%).
Model | Acc
Latent-ODE † | 87.0 ± 2.8
Latent-SDEH ‡ | 90.6 ± 0.4
mTAND † | 91.1 ± 0.2
ACSSM (Ours) | 91.4 ± 0.4
† result from (Shukla & Marlin, 2021). ‡ result from

Formula formula_39: T u = {t i } i∈[k+1:N ] , i.e., T ′ = T ∪ T u .

Formula formula_40: (Latent transition) X ti ∼ p i (x ti-1 , dx ti ), X 0 ∼ p 0 (X 0 ) (Observation) y ti ∼ g ti (y ti |X ti ).

Formula formula_41: p(X 0:T |H t k ) = 1 Z(H t k ) p(H t k |X 0:T )p(X 0:T ),(30)

Formula formula_42: Z(H t k ) = p(H t k |X 0:T )p 0 (X 0 ) k i=1 p i (X i-1 , X i )dX 0:T . (31

Formula formula_43: )

Formula formula_44: p(X t |H t k ) ∝ p(X t |H t ) p(H t:t k |X t:T )dX t:T (32) = p(H t |X t )p(X t ) p(H t:t k |X t:T )dX t:T .(33)

Formula formula_45: X t k ∈ A k ) for any A i ∈ B(R d ) and i ∈ [1 : k].

Formula formula_46: |b(t, w, x) -b(t, w, x ′ )| ≤ c 0 |x -x ′ |.

Formula formula_47: $x \in \mathbb{R}^d$, the $\mathcal{F}_t$-progressively measurable processes $b(t, x)_{t\in[0,T]}$ satisfy $\mathbb{E}\big[\int_0^T |b_s|^2 ds\big] < \infty$ and $|b(t, x)| \le c_1(1 + |x|)$ for $t \in [0, T]$ and $c_1 > 0$. • (Control function): For any $t \in [0, T]$, $w \in \Omega$, $x \in \mathbb{R}^d$, and $\theta, \theta' \in \Theta$, the control function α is an L-Lipschitz function, $|\alpha(t, x, \theta) - \alpha(t, x, \theta')| \le L|\theta - \theta'|$. Moreover, it satisfies $\mathbb{E}\big[\int_0^T |\alpha_s|^2 ds\big] < \infty$. Definition B.1 (Infinitesimal Generator).

Formula formula_48: dX t = b(t, X t )dt + σ(t) ⊤ dW t ,(34)

Formula formula_49: $\mathcal{A}_t f = \lim_{t \downarrow 0^+} \frac{\mathbb{E}[f(X_t)] - f(x)}{t} = \nabla_x f^\top b + \frac{1}{2}\mathrm{Trace}\big[\sigma\sigma^\top \nabla_{xx} f\big]$. (35)

Formula formula_50: P(X t0 ∈ dx t0 , • • • , X t N ∈ dx t N ) = P(dx 0 ) N i=1 P i (x ti-1 , dx ti ),(36)

Formula formula_51: {P i } N i=0 is a sequence of probability kernels from (R d , B(R d )) to (R d , B(R d )), for any event A ∈ B(R d ), P i (x ti-1 , A) = A p i (x ti-1 , x ti )dx ti , where p i (x ti-1 , x ti ) := p(t i , x ti |t i-1 , x ti-1 )

Formula formula_52: ∂ t p t (x t ) = A ⋆ t p t = -∇ x • (bp t ) + 1 2 Trace σσ ⊤ ∇ xx p t ,(37)

Formula formula_53: dv(t, X t ) = [∂ t v(t, X t ) + A t v(t, X t )] dt + ∇ x v(t, X t ) ⊤ σ(t)dW t . (38

Formula formula_54: )

Formula formula_55: dX t = b(t, X t )dt + σ(t, X t ) ⊤ dW t , t ∈ [0, T ],(39)

Formula formula_56: dY t = b(t, Y t )dt + σ(t, Y t ) ⊤ dW t , t ∈ [0, T ],(40)

Formula formula_57: $M_t := \exp\big(\int_0^t H_s^\top dW_s - \frac{1}{2}\int_0^t \|H_s\|^2 ds\big)$ (41) satisfies $\mathbb{E}_{\mathbb{P}}[M_T] = 1$.

Formula formula_58: dY t = b(t, Y t )dt + σ(t, Y t ) ⊤ d Wt , t ∈ [0, T ]. (42

Formula formula_59: )

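Formula formula_60: $\{f_i\}_{i\in[1:k]}$ in (3). By definition, it holds that $\prod_{i=1}^{k} L_i(g_i) = \prod_{i=1}^{k} \int_{\mathbb{R}^d} g_i(y_{t_i} \mid x_{t_i})\, d\mathbb{P}(x_{0:T}) \overset{(i)}{=} \mathbb{E}_{\mathbb{P}}\big[\prod_{i=1}^{k} g_i(y_{t_i} \mid x_{t_i})\big] = Z(H_{t_k})$, (43)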

Formula formula_61: E P k i=1 f i (x ti ) = E P k i=1 g i (y ti |x ti ) k i=1 L ti (g ti ) = 1 Z(H t k ) E P k i=1 g i (y ti |x ti ) = 1.(44)

Formula formula_62: {h i } i∈[1:k] ,

Formula formula_63: h i : [t i-1 , t i ) × R d → R + , for all i ∈ [1 : k], is a conditional expectation h i (t, x t ) := E P k j≥i f j (y tj |X tj )|X t = x t , where {f i } i∈[1:k] is defined in (3). Now, we define a function h : [0, T ] × R d → R + by integrating the functions {h i } i∈[1:k] , h(t, x) := k i=1 h i (t, x)1 [ti-1,ti) (t). (45

Formula formula_64: )

Formula formula_65: (Conditioned State) dX ⋆ t = [b(t, X ⋆ t ) + ∇ x log h(t, X ⋆ t )] dt + dW t (46)

Formula formula_66: P hi i (X ti ∈ A|X t = x) := P hi i (x t , A) = h i (t i , X ti ) h i (t, x t ) P ti (x t , dx ti ). (47

Formula formula_67: )

Formula formula_68: R d P hi i (x t , dx ti ) = R d h i (t i , X ti ) h i (t, x t ) P i (x t , dx ti ) (48) = R d h i (t i , X ti )P i (x t , dx ti ) h i (t, x t ) (49

Formula formula_69: ) (i) = R d f i (y ti |X ti )h i+1 (t i , X ti )P i (x t , dx ti ) E P k j=i f j (y tj |X tj )|X t = x t (50) = E P k j=i f j (y tj |X tj )|X t = x t E P k j=i f j (y tj |X tj )|X t = x t = 1,(51)

Formula formula_70: h i (t i , x ti ) = f i (y ti |X ti )h i+1 (t i , x ti ), ∀i ∈ [1 : k -1].(52)

Formula formula_71: A hi t φ t = lim s↓0 E P h [φ(t s , X t+s )|X t = x] -φ(t, x) s (53) = lim s↓0 E P [φ(t s , X t+s ) -φ(t, x)] P h i i (x,dxt+s) Pi(x,dxt+s) |X t = x s (54) = lim s↓0 E P [φ(t s , X t+s ) -φ(t, x)] hi(t+s,Xt+s) hi(t,x) |X t = x s (55) = lim s↓0 E P [φ(t s , X t+s ) -φ(t, x)] hi(t+s,Xt+s)-hi(t,x) hi(t,x) + 1 |X t = x s (56

Formula formula_72: ) (i) = A t φ t + lim s↓0 E P [[φ(t s , X t+s ) -φ(t, x)] [h i (t + s, X t+s ) -h i (t, x)] |X t = x] sh i (t, x) ,(57)

Formula formula_73: φ t+s h i,t+s = φ 0 h i,0 + t+s 0 φ u dh i,u + t+s 0 h i,u dφ u + [φ, h i ] t+s(58)

Formula formula_74: φ t h i,t = φ 0 h i,0 + t 0 φ u dh i,u + t 0 h i,u dφ u + [φ, h i ] t ,(59)

Formula formula_75: φ t+s h i,t+s -φ t h i,t = t+s t φ u dh i,u + t+s t h i,u dφ u + [φ, h i ] t+s -[φ, h i ] t(60)

Formula formula_76: (φ t+s -φ t )(h i,t+s -h i,t ) = φ t+s h i,t+s -φ t h i,t -φ t (h i,t+s -h i,t ) -h i,t (φ t+s -φ t ) (61) = t+s t (φ u -φ t )dh i,u + t+s t (h i,u -h i,t )dφ u + [φ, h i ] t+s -[φ, h i ] t ,

Formula formula_77: dφ t = A t φ t dt + (∇ x φ) ⊤ dW t , dh i,t = A t h i,t dt + (∇ x h i,t ) ⊤ dW t .(62)

Formula formula_78: E P [(φ t+s -φ t )(h i,t+s -h i,t )|X t = x] (63) = E P t+s s (φ u -φ t )A u h i,u du + t+s s (h i,u -h i,t )A u φ u du + [φ, h i ] t+s -[φ, h i ] t |X t = x (64) = E t,x P t+s t (φ u -φ t )A u h i,u du (A) + E t,x P t+s t (h i,u -h i,t )A u φ u du (B) + E t,x P [[φ, h i ] t+s -[φ, h i ] t ] (C)

Formula formula_79: E t,x P t+s t (φ u -φ t )A u h i,u du ≤ E t,x P t+s t |φ u -φ t | q du 1/q E t,x P t+s t |A t h i,u | p du 1/p (65) = E t,x P t+s t |φ u -φ t | q du 1/q t+s t E t,x P [|A t h i,u | p ] du 1/p (66)

Formula formula_80: P t+s t |φ u -φ t | q du ) 1/q = (E t,x P lim s↓0 t+s t

Formula formula_81: h i ∈ C 1,2 ([t i-1 , t i ), R d

Formula formula_82: |A u h i,u | p ≤ |∂ t h i,t | p + |(∇ x h T i,t )b| p + | 1 2 Trace [∇ xx h i,t ] | p < ∞,(67)

Formula formula_83: sup u∈[t,t+s] E t,x P [|A u h i,u | p ] < ∞, ∀t ∈ [0, T ],

Formula formula_84: (h i,u -h i,t )A u φ u du = 0.

Formula formula_85: E t,x P [[φ, h i ] t+s -[φ, h i ] t ] = E t,x P t+s t dφ u dh i,u = E t,x P t+s t (∇ x φ u ) ⊤ ∇ x h i,u du(68)

Formula formula_86: lim s↓0 E P [[φ(t s , X t+s ) -φ(t, x)] [h i (t + s, X t+s ) -h i (t, x)] |X t = x] sh i (t, x) (69) = lim s↓0 E P t+s t (∇ x φ(u, X u )) ⊤ ∇ x h i (u, X u )du|X t = x sh i (t, x) (70) = (∇ x φ(t, X t )) ⊤ ∇ x log h i (t, X t ),(71)

Formula formula_87: A hi t φ t = A t φ t + (∇ x φ t ) ⊤ ∇ x log h i,t(72)

Formula formula_88: dX h t = [b(t, X t ) + ∇ x log h i (t, X t )] dt + dW t (73)

Formula formula_89: A h t φ = A t φ t + k i=1 (∇ x φ t ) ⊤ ∇ x log h i,t 1 [ti-1,ti) (t) (74) = A t φ t + (∇ x φ t ) ⊤ ∇ x log h t . (75

Formula formula_90: )

Formula formula_91: dX h t = [b(t, X t ) + ∇ x log h(t, X t )] dt + dW t .(76)

Formula formula_92: dP h (x 0:T ) = dµ ⋆ 0 (x 0 ) k i=1   N j=1 P hi i(j) (x t i(j-1) , dx ti(j) )   (77) = dµ ⋆ 0 (x 0 ) k i=1 h i (t i , x ti ) h i (t i-1 , x ti-1 )   N j=1 P i(j) (x t i(j-1) , dx ti(j) )   (78) = dµ ⋆ 0 (x 0 ) k i=1 h i+1 (t i , x ti )f i (y ti |x ti ) h i (t i-1 , x ti-1 )   N j=1 P i(j) (x t i(j-1) , dx ti(j) )   (79) N ↑∞ = dµ ⋆ 0 dµ 0 (x 0 ) h k+1 (t k , x t k ) h 1 (t 0 , x 0 ) k i=1 f i (y ti |x ti )dP(x 0:T ) (80)

Formula formula_93: ) = i -1, i(1) = i -1 + 1 N and i(N ) = i. Hence, for a dµ ⋆ 0 (x 0 ) = h 1 (t 0 , x 0 )dµ 0 (x 0 ) and h k+1 = 1 yields dP h (x 0:T ) = k i=1 f i (y ti |x ti )dP(x 0:T ) (81) = 1 Z(H t k ) k i=1 g i (y ti |x ti )dP(x 0:T ) (82) = dP ⋆ (x 0:T ).(83)

Formula formula_94: V i ∈ C 1,2 ([t i-1 , t i ) × R d ) V i (t, x t ) := min α∈A E P α ti ti-1 1 2 ∥α s ∥ 2 ds -log f i (y ti |X α ti ) + V i+1 (t i , X α ti )|X t = x t ,(84)

Formula formula_95: V(t, x t ) = min α∈A E P α   t I(u) t 1 2 ∥α s ∥ 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α ti ) + V I(u)+1 (t I(u) , X α t I(u) )|X α t = x t   , (85) with the indexing function I(u) = max{i ∈ [1 : k]|t i ≤ u}.

Formula formula_96: J i (t, x t , α) := E t,xt P α ti ti-1 1 2 ∥α s ∥ 2 ds -log f i (y ti |X α ti ) + J i+1 (t i , X α ti , α) ,(86)

Formula formula_97: E t,x P [•] = E P [•|X t = x]

Formula formula_98: J (t, x t , α) = E t,xt P α   t I(u) t 1 2 ∥α s ∥ 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α ti ) + J I(u)+1 (t I(u) , X α t I(u) , α)   ,(87)

Formula formula_99: I(u) = max{i ∈ [1 : k]|t i ≤ u}.

Formula formula_100: ′ ∈ A[t, T ] such that V(t, x) + ϵ ≥ J (t, x, α ′ ) (88) = E t,xt P α ′   t I(u) t 1 2 ∥α ′ s ∥ 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α ′ ti ) + J I(u)+1 (t I(u) , X α ′ t I(u) , α ′ )   (89) ≥ E t,xt P α ′   t I(u) t 1 2 ∥α ′ s ∥ 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α ′ ti ) + V I(u)+1 (t I(u) , X α ′ t I(u) )  (90)

Formula formula_101: α ′ ∈A[t,T ] E t,xt P α ′   t I(u) t 1 2 ∥α ′ s ∥ 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α ′ ti ) + V I(u)+1 (t I(u) , X α ′ t I(u) )  (91)

Formula formula_102: V(t, x) ≥ min α ′ ∈A[t,T ] E t,xt P α ′   t I(u) t 1 2 ∥α ′ s ∥ 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α ′ ti ) + V I(u)+1 (t I(u) , X α ′ t I(u) )   .

Formula formula_103: αs := α 1 s , s ∈ [t, t I(u) ) α 2 s s ∈ [t I(u) , T ].(93)

Formula formula_104: J (t, x, α) ≥ min α 2 ∈A[t I(u) ,T ] J (t, x, α)(94)

Formula formula_105: = E t,xt P α 1   t I(u) t 1 2 α 1 s 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α 1 ti ) + V I(u)+1 (t I(u) , X α 1 t I(u) )  (95)

Formula formula_106: α 1 ∈A[t,t I(u) ) E t,xt P α 1   t I(u) t 1 2 α 1 s 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α 1 ti ) + V I(u)+1 (t I(u) , X α 1 t I(u) )   (96) = min α∈A[t,T ] E t,xt P α   t I(u) t 1 2 ∥α s ∥ 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α ti ) + V I(u)+1 (t I(u) , X α t I(u) )   (97) ≥ V(t, x).(98)

Formula formula_107: V(t, x) = min α∈A E P α   t I(u) t 1 2 ∥α s ∥ 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α ti ) + V I(u)+1 (t I(u) , X α t I(u) )|X α t = x t   .

Formula formula_108: V i (t, x) ∈ C 1,2 ([t i-1 , t i ), R d ), for all i ∈ [1 : k],

Formula formula_109: ∂ t V i,t + A t V i,t + min α∈A (∇ x V i,t ) ⊤ α i,t + 1 2 ∥α i,t ∥ 2 = 0, t i-1 ≤ t < t i (100) V i (t i , x) = -log f i (y ti |x) + V i+1 (t i , x), t = t i , ∀i ∈ [1 : k],(101)

Formula formula_110: ⋆ i (t, x) = ∇ x V i (t, x). Now, define a function α : [0, t k ]×R d → R d by integrating the optimal controls {α i } i∈{1,••• ,k} , α ⋆ (t, x) := k i=1 α ⋆ i (t, x)1 [ti-1,ti) (t)(102)

Formula formula_111: E t,xt P α V i (t i , X α ti ) = V i (t, x) + E t,xt P α ti t ∂ t V i,s + A t V i,s + (∇ x V i,s ) ⊤ α i,s ds ,(103)

Formula formula_112: E t,x P [•] = E P [•|X t = x]

Formula formula_113: P α ti t 1 2 ∥α i,s ∥ 2 ds

Formula formula_114: LHS = E t,xt P α V i (t i , X α ti ) + E t,xt P α ti t 1 2 ∥α i,s ∥ 2 ds (104) = E t,xt P α V i (t i , X α ti ) + ti t 1 2 ∥α i,s ∥ 2 ds (105

Formula formula_115: ) (i) = E t,xt P α ti t 1 2 ∥α i,s ∥ 2 ds -log f i (y ti |X α ti ) + V i+1 (t i , X α ti )(106)

Formula formula_116: = J i (t, x, α),(107) where (i

Formula formula_117: RHS = V i (t, x) + E t,xt P α ti t ∂ t V i,s + A t V i,s + (∇ x V i,s ) ⊤ α i,s ds + E t,xt P α ti t 1 2 ∥α i,s ∥ 2 ds (108) = V i (t, x) + E t,xt P α ti t ∂ t V i,s + A t V i,s + (∇ x V i,s ) ⊤ α i,s + 1 2 ∥α i,s ∥ 2 ds (109

Formula formula_118: ) (i) = V i (t, x) + E t,xt P α ti t (∇ x V i,s ) ⊤ α i,s + 1 2 ∥α i,s ∥ 2 -min α∈A (∇ x V i,s ) ⊤ α i,s + 1 2 ∥α i,s ∥ 2 ds ,(110)

Formula formula_119: J i (t, x, α) = V i (t, x) + E t,xt P α ti t (∇ x V i,s ) ⊤ α i,s + 1 2 ∥α i,s ∥ 2 -min α∈A (∇ x V i,s ) ⊤ α i,s + 1 2 ∥α i,s ∥ 2 ds . (111) Due to the fact that ti t (∇ x V i,s ) ⊤ α i,s + 1 2 ∥α i,s ∥ 2 -min α∈A (∇ x V i,s ) ⊤ α i,s + 1 2 ∥α i,s ∥ 2 ds ≥ 0,

Formula formula_120: holds for $\alpha^\star_{i,t} = \arg\min_{\alpha\in\mathbb{A}}\big[(\nabla_x V_{i,t})^\top \alpha_{i,t} + \frac{1}{2}\|\alpha_{i,t}\|^2\big] = -\nabla_x V_{i,t}$. Additionally, it implies that $J_i(t, x, \alpha^\star) = V_i(t, x)$.

Formula formula_121: V(t, x t ) = min α∈A E t,xt P α   t I(u) t 1 2 ∥α s ∥ 2 ds - i:{t≤ti≤t I(u) } log f i (y ti |X α ti ) + V I(u)+1 (t I(u) , X α t I(u) )   (112) = min α∈A E t,xt P α ti t 1 2 ∥α i,s ∥ 2 ds -log f i (y ti |X α ti ) + V i+1 (t i , X α ti ) (113) = V i (t, x t ).(114)

Formula formula_122: [t i-1 , t i ) is α ⋆ i . Finally, V in (

Formula formula_123: V(t, x t ) = k i=1 V i (t, x t )1 [ti-1,ti) .(115)

Formula formula_124: = k i=1 α ⋆ i (t, x)1 [ti-1,ti) (t).

Formula formula_125: Lemma B.5 (The Feynman-Kac formula). Let us define f ∈ C 2 (R d ) and g ∈ C(R d ). Then, a function h(t, x t ) = E P e -T t f (s,Xs)ds g(X T )|X t =

Formula formula_126: ∂ t h t + A t h t -f h t = 0, 0 ≤ t < T, (116) h(t, x) = g(X T ), t = T.(117)

Formula formula_127: dY t = -f (t, X t )e -T t f (s,Xs)ds h(t, X t )dt + e -T t f (s,Xs)ds dh(t, X t ).(118)

Formula formula_128: dh(t, X t ) = ∂h ∂t + A t h t dt + ∇ x h(t, X t ) ⊤ dW t ,(119)

Formula formula_129: dY t = -f (t

Formula formula_130: ∂h ∂t (t, X t ) + A t h(t, X t ) -f (t, X t )h(t, X t ) = 0,(121)

Formula formula_131: ∂ t h i,t + A t h i,t = 0, t i-1 ≤ t < t i (122) h i (t i , x) = f i (y ti |x)h i+1 (t i , x), t = t i , ∀i ∈ [1 : k].(123)

Formula formula_132: E P k j≥i-1 f j (X tj )|X t = x t .

Formula formula_133: ∂ t h i,t = -h i,t ∂ t V i,t , ∇ x h i,t = -h i,t ∇ x V i,t , ∇ xx h i,t = h i,t (∥∇ x V i,t ∥ 2 -∇ xx V i,t ). (124)

Formula formula_134: h i,t ∂ t V i,t = -∂ t h i,t (i) = A t h i,t(125)

Formula formula_135: = (∇ x h i,t ) ⊤ b t + 1 2 Trace [∇ xx h i,t ] (126) = (-h i,t ∇ x V i,t ) ⊤ b t + 1 2 Trace h i,t (∥∇ x V i,t ∥ 2 -h i,t ∇ xx V i,t ) (127) = (-h i,t ∇ x V i,t ) ⊤ b t + 1 2 Trace h i,t ∥∇ x V i,t ∥ 2 - 1 2 Trace [h i,t ∇ xx V i,t ] ,(128)

Formula formula_136: ∂ t V i,t = (-∇ x V i,t ) ⊤ b t + 1 2 Trace ∥∇ x V i,t ∥ 2 - 1 2 Trace [∇ xx V i,t ] (129) = -A t V i,t + 1 2 ∥∇ x V i,t ∥ 2 (130)

Formula formula_137: ∂ t V i,t + A t V i,t - 1 2 ∥∇ x V i,t ∥ 2 = 0, V i (t i , x) = -log f i (y ti |x) + V i+1 (t i , x).(131)

Formula formula_138: $\min_{\alpha\in\mathbb{A}}\big[(\nabla_x V_{i,t})^\top \alpha_{i,t} + \frac{1}{2}\|\alpha_{i,t}\|^2\big] = -\frac{1}{2}\|\nabla_x V_{i,t}\|^2$
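The value of this minimum follows from completing the square in the control. A short worked step, written with a generic control vector $\alpha$ (the shorthand $\nabla_x V_{i,t}$ is the one used above):

```latex
\begin{aligned}
(\nabla_x V_{i,t})^\top \alpha + \tfrac{1}{2}\|\alpha\|^2
  &= \tfrac{1}{2}\,\|\alpha + \nabla_x V_{i,t}\|^2 - \tfrac{1}{2}\,\|\nabla_x V_{i,t}\|^2 \\
  &\ge -\tfrac{1}{2}\,\|\nabla_x V_{i,t}\|^2 ,
\end{aligned}
\qquad \text{with equality if and only if } \alpha = -\nabla_x V_{i,t}.
```

Substituting this minimum into the Hamilton-Jacobi-Bellman equation removes the explicit minimization and leaves the quadratic gradient term $-\tfrac{1}{2}\|\nabla_x V_{i,t}\|^2$.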

Formula formula_139: Q≪P [E Y∼Q [W(Y 0:T )] + D KL (Q|P)](132)

Formula formula_140: ≤ E Y∼Q W(Y 0:T ) -log dP dQ (Y 0:T ) (134) = E Y∼Q [W(Y 0:T )] + D KL (Q|P),(135)

Formula formula_141: D KL (P α |P ⋆ ) = D KL (µ 0 |µ ⋆ 0 ) + E x0∼µ0 [J (0, x 0 , α)] = L(α) + log g(H t k ) ≥ 0,(137)

Formula formula_142: L(α) = E P α T 0 1 2 ∥α(s, X α s )∥ 2 ds - k i log g i (y ti |X α ti ) ≥ -log g(H t k ).(138)

Formula formula_143: D KL (P α |P ⋆ ) (i) = D KL (µ 0 |µ ⋆ 0 ) + E x0∼µ0 [D KL (P α (•|X α 0 )|P ⋆ (•|X α 0 ))|X ⋆ 0 = x 0 ] (139) =D KL (µ 0 |µ ⋆ 0 ) + E x0∼µ0 E P α log dP α dP ⋆ (X α 0:T )|X α 0 = x 0 (140) =D KL (µ 0 |µ ⋆ 0 ) + E x0∼µ0 E P α log dP α dP (X α 0:T ) + log dP dP ⋆ (X α 0:T )|X α 0 = x 0 ,(141)

Formula formula_144: E 0,x0 P α log dP α dP (X α 0:T ) = E 0,x0 P α T 0 α t d Wt + T 0 1 2 ∥α t ∥ 2 dt = E 0,x0 P α T 0 1 2 ∥α t ∥ 2 dt .(142)

Formula formula_145: E 0,x0 P α log dP dP ⋆ (X α 0:T ) = E 0,x0 P α - k i=1 log f i (y ti |X α ti ) . (143

Formula formula_146: )

Formula formula_147: E 0,x0 P α log dP α dP (X α 0:T ) + log dP dP ⋆ (X α 0:T ) = E 0,x0 P α T 0 1 2 ∥α(t, X α t )∥ 2 dt - k i=1 log f i (y ti |X α ti ) (144) = E 0,x0 P α T 0 1 2 ∥α(t, X α t )∥ 2 dt - k i=1 log f i (y ti |X α ti ) (145) = J (0, x 0 , α). (146

Formula formula_148: )

Formula formula_149: J (0, x 0 , α) = E P α T 0 1 2 ∥α(t, X α t )∥ 2 dt - k i=1 log f i (y ti |X ti )|X α 0 = x 0 (147

Formula formula_150: ) (i) = E P α T 0 1 2 ∥α(t, X α t )∥ 2 dt - k i=1 log g i (y ti |X ti )|X α 0 = x 0 + log Z(H t k ),(148)

Formula formula_151: D KL (P α |P ⋆ ) = D KL (µ 0 |µ ⋆ 0 ) + E x0∼µ0 [J (0, x 0 , α)] (149) = D KL (µ 0 |µ ⋆ 0 ) + L(α) + log Z(H t k ).(150)

Formula formula_152: D KL (µ 0 |µ ⋆ 0 ) = E x0∼µ0 log dµ 0 dµ ⋆ 0 (x 0 ) = E x0∼µ0 [-log h 1 (0, x 0 )] (151) = E x0∼µ0 [V(0, x 0 )] (152) = E x0∼µ0 [J (0, x 0 , α ⋆ )] (153

Formula formula_153: ) (i) = E x0∼µ0 J (0, x 0 , α ⋆ ) + log Z(H t k ) (154

Formula formula_154: ) (ii) = L(α ⋆ ) + log Z(H t k ),(155)

Formula formula_155: J (0, x 0 , α) = E P α T 0 1 2 ∥α(s, X α s )∥ 2 ds - k i log f i (y ti |X α ti )|X α 0 = x 0 (156) = E P α T 0 1 2 ∥α(s, X α s )∥ 2 ds - k i log g i (y ti |X α ti )|X α 0 = x 0 J (0,x0,α) + log Z(H t k ).(157)

Formula formula_156: E x0∼µ0 min α∈A J (0, x 0 , α) = min α∈A E x0∼µ0 J (0, x 0 , α) = min α∈A L(α) = L(α ⋆ )(158)

Formula formula_157: E P α T 0 1 2 ∥α(t, X α t )∥ 2 dt - k i=1 log g i (y ti |X α ti ) + log E P k i=1 g i (y ti |X α ti ) ≥ 0,(159)

Formula formula_158: D KL (P α |P ⋆ ) = L(α) + log Z(H t k ) ≥ 0,(160)

Formula formula_159: log p ψ (o 0:T ) ≥ E q ϕ (y 0:T |o 0:T ) log K i=1 p ψ (o ti |y ti )g(y 0:T ) K i=1 q ϕ (y ti |o ti ) (161) = E q ϕ (y 0:T |o 0:T ) K i=1 log p ψ (o ti |y ti ) + log g(y 0:T ) -log K i=1 q ϕ (y ti |o ti ) (162) ≥ E q ϕ (y 0:T |o 0:T ) K i=1 log p ψ (o ti |y ti ) -L(α) -log K i=1 q ϕ (y ti |o ti ) (163) = E q ϕ (y 0:T |o 0:T ) K i=1 log p ψ (o ti |y ti ) -L(α) - K i=1 log q ϕ (y ti |o ti ) (164

Formula formula_160: ) (i) ≥ E q ϕ (y 0:T |o 0:T ) K i=1 log p ψ (o ti |y ti ) -L(α) ,(165)

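Formula formula_161: $A_i = E D_i E^\top$ with $E \in \mathbb{R}^{d\times d}$ and $D_i \in \mathrm{diag}(\mathbb{R}^d) \succeq 0$ for all $i \in [1:k]$, control vectors $\{\alpha_i\}_{i\in[1:k]}$

Formula formula_162: $dX_t = [-A_i X_t + \alpha_i]\,dt + \sigma\, dW_t$, $t \in [t_{i-1}, t_i)$.

Formula formula_163: $m_{t_i} = E\Big[e^{-\sum_{j=1}^{i}(t_j - t_{j-1})D_j}\,\hat m_{t_0} - \sum_{k=1}^{i} e^{-\sum_{j=k}^{i}(t_j - t_{j-1})D_j}\, D_k^{-1}\big(I - e^{(t_k - t_{k-1})D_k}\big)\hat\alpha_k\Big]$, $\Sigma_{t_i} = E\Big[e^{-2\sum_{j=1}^{i}(t_j - t_{j-1})D_j}\,\hat\Sigma_{t_0} - \frac{1}{2}\sum_{k=1}^{i} e^{-2\sum_{j=k}^{i}(t_j - t_{j-1})D_j}\, D_k^{-1}\big(I - e^{2(t_k - t_{k-1})D_k}\big)\Big]E^\top$, where $\hat m_{t_i} = E^\top m_{t_i}$, $\hat\Sigma_{t_i} = E^\top \Sigma_{t_i} E$ and $\hat\alpha_i = E^\top \alpha_i$ for all $i \in [1:k]$.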

Formula formula_165: E P [( Xt -mt )( Xt -mt ) ⊤ ] = Σt , we can compute Σt = E P α e -2∆i(t)Di Xti -mti + M i (t) Xti -mti + M i (t) ⊤ (169

Formula formula_166: ) (i) = e -2∆i(t)Di E P α ( Xti -mti )( Xti -mti ) ⊤ + ∥M i (t)∥ 2 2 (170) (ii) = e -2∆i(t)Di Σti - 1 2 e -2∆i(t)Di D -1 i (I -e 2∆i(t)Di ),(171)

Formula formula_167: E P α ∥M i (t)∥ 2 2 = E P α t ti e ∆i(s)Di 2 2 ds = - 1 2 e -2∆i(t)Di D -1 i (I -e 2∆i(t)Di ).(172)

Formula formula_168: mt1 = e -∆0(t1)D1 mt0 -e -∆0(t1)D D -1 1 (I -e ∆0(t1)D1 )α 1 (173) mt2 = e -2 j=1 ∆j-1(tj )Dj mt0 (174) -e -2 j=1 ∆j-1(tj )Dj D -1 1 (I -e ∆0(t1)D1 ) α1 -e -∆1(t2)D2 D -1 2 (I -e ∆1(t2)D2 ) α2 (175) . . . (176

Formula formula_169: ) mti = e -i j=1 ∆j-1(tj )Dj mt0 - i k=1 e -i j=k ∆j-1(tj )Dj D -1 k I -e ∆ k-1 (t k )D k αk(177)

Formula formula_170: Σt1 = e -2∆0(t1)D1 Σt0 - 1 2 e -2∆0(t1)D D -1 1 (I -e 2∆0(t1)D1 ) (178) Σt2 = e -2 2 j=1 ∆j-1(tj )Dj Σt0 (179) - 1 2 e -2 2 j=1 ∆j-1(tj )Dj D -1 1 (I -e 2∆0(t1)D1 ) - 1 2 e -2∆1(t2)D2 D -1 2 (I -e 2∆1(t2)D2 ) (180) . . . (181

Formula formula_171: ) Σti = e -2 i j=1 ∆j-1(tj )Dj Σt0 - 1 2 i k=1 e -2 i j=k ∆j-1(tj )Dj D -1 k I -e 2∆ k-1 (t k )D k (182)

Formula formula_172: E mti = E e -i j=1 ∆j-1(tj )Dj mt0 - i k=1 e -i j=k ∆j-1(tj )Dj D -1 k I -e ∆ k-1 (t k )D k αk (183

Formula formula_173: ) (i) = E e -i j=1 ∆j-1(tj )Dj E ⊤ m t0 - i k=1 e -i j=k ∆j-1(tj )Dj E ⊤ A -1 k E I -e ∆ k-1 (t k )D k E ⊤ α k (184

Formula formula_174: ) (ii) = e -i j=1 ∆j-1(tj )Aj m t0 - i k=1 e -i j=k ∆j-1(tj )Aj A -1 k I -e ∆ k-1 (t k )A k α k (185) = m ti ,(186)

Formula formula_175: D -1 i = E ⊤ A -1 i E and e -D -1 i = E ⊤ e -A -1 i E.

Formula formula_176: E Σti E ⊤ = E e -2 i j=1 ∆j-1(tj )Dj Σt0 - 1 2 i k=1 e -2 i j=k ∆j-1(tj )Dj D -1 k I -e 2∆ k-1 (t k )D k E ⊤ (187

Formula formula_177: ) (i) = E e -2 i j=1 ∆j-1(tj )Dj E ⊤ Σ t0 E - 1 2 i k=1 e -2 i j=k ∆j-1(tj )Dj E ⊤ A -1 k E I -e 2∆ k-1 (t k )D k E ⊤ (188) = e -2 i j=1 ∆j-1(tj )Aj Σ t0 - 1 2 i k=1 e -2 i j=k ∆j-1(tj )Aj A -1 k I -e 2∆ k-1 (t k )A k (189) = Σ ti ,(190)

Formula formula_178: i } i∈[1:k] , matrices {D} i∈[1:k] . 2: Compute {∆ i (t i ), Di , Ĉi , Di , Ci } i∈[1:k] 3: Set {M i } i∈[1:k] = { Di , Ĉi αi } i∈[1:k] and {S i } i∈[1:k] = { Di , Ci 1 } i∈[1:k] . 4: Parallel Scan {M ′ i , S ′ i } i∈[1:k] = ParallelScan({M i , S i } i∈[1:k] , ⊗) 5: ⇒ Algorithm 2 for ParallelScan 6: for i = 1 to K do in parallel 7: m ti = M ′ (1) i m t0 + M ′ i (2) 8: Σ ti = S ′ (1) i Σ t0 + S ′ i (2)

Formula formula_179: t2 ), • • • , (s t1 ⊗ s t2 ⊗ • • • ⊗ s t K )] in O(log K) time.

Formula formula_180: mti = Di mti-1 + Ĉi αi (191) Σti = Di Σti-1 + Ci 1,(192)

Formula formula_183: F i = F i-2 d ⊗ F i 8:

